## Section Headings Curation and sentences spliter of Chilean Policies

In this notebook there are a series of dictionaries and methods to curate section headings of El Salvador policies. Policies from El Salvador have a rather definite structure, so that the law text is organized under section headings. There are two kinds of sections, the ones that are general and that can be often found in many policies, and the ones which are specific. The sections headings which are more general often come with a whole range of name variants which makes the task of machine recognition difficult.

The goal of this notebook is to group all pretreatment methods that would harmonize sections heading to make the further processing machine friendly.

In [2]:
from pathlib import Path
import boto3, json, operator, os, re, string
import numpy as np

### Dictionaries of particular vocabularies to help in the curation of section headings

In [66]:
# Most policies come with the final signatures. This is a piece of text that we want to be able to recognize. To make the
# detection of signatures easier, this dictionary contain the most common terms that can be found in these lines of text.
official_positions = {"ALCALDE" : 0,
"Alcalde" : 0,
"MINISTRA" : 0,
"Ministra" : 0,
"MINISTRO" : 0,
"Ministro" : 0,
"PRESIDENTA" : 0,
"Presidenta" : 0,
"PRESIDENTE" : 0,
"Presidente" : 0,
"REGIDOR" : 0,
"Regidor"  : 0,
"REGIDORA" : 0,
"regidora" : 0,
"SECRETARIA" : 0,
"Secretaria" : 0,
"SECRETARIO" : 0,
"Secretario" : 0,
"SINDICA" : 0,
"Sindica" : 0,
"SINDICO" : 0,
"Sindico" : 0,
"VICEPRESIDENTA" : 0,
"Vicepresidenta" : 0,
"VICEPRESIDENTE" : 0,
"Vicepresidente" : 0
}

end_of_file_tags = {
    "Anótese" : 0,
    "Publíquese" : 0
}
# This dictionary contains some correspondences among different text headings. This is under development and needs further
# improvement.The idea is to merge in a single name all the headings that point to the same conceptual concept. For example,
# "Definiciones" is a heading that can come alone or together with other terms so it can appear as "Definiciones básicas" or
# "Definiciones generales". With the dictionary we can fetch all headings that contain the word "Definiciones" and change the
# heading to "Definiciones".
merges = {
    "CONCEPTOS" : "DISPOSICIONES GENERALES",
    "Considerando:" : "CONSIDERANDO",
    "DEFINICIONES" : "DISPOSICIONES GENERALES",
    "DISPOSICIONES FINALES" : "DISPOSICIONES GENERALES",
    "DISPOSICIONES GENERALES" : "DISPOSICIONES GENERALES",
    "DISPOSICIONES PRELIMINARES" : "DISPOSICIONES GENERALES",
    "DISPOSICIONES REGULADORAS" : "DISPOSICIONES ESPECIALES",
    "DISPOSICIONES RELATIVAS" : "DISPOSICIONES ESPECIALES",
    "DISPOSICIONES ESPECIALES" : "DISPOSICIONES ESPECIALES",
    "DISPOSICIONES TRANSITORIAS" : "DISPOSICIONES GENERALES",
    "INCENTIVOS" : "INCENTIVOS",
    "INFRACCIONES" : "INFRACCIONES",
    "INFRACCION ES" : "INFRACCIONES",
    "OBJETIVO" : "OBJETO",
    "OBJETO" : "OBJETO",
    "DERECHOS" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "DEBERES" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "OBLIGACIONES" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "OBLIGACIONE" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "OBLIGACION" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "OBLIGATORIEDAD" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "PROHIBICIONES" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "PROHIBICION" : "DERECHOS, OBLIGACIONES Y PROHIBICIONES",
    "POR TANTO" : "POR TANTO",
    "POR LO TANTO" : "POR TANTO",
    "Decreto:" : "RESUELVO",
    "Resuelvo:" : "RESUELVO",
    "Se resuelve" : "RESUELVO",
    "S e  r e s u e l v e:" : "RESUELVO",
    "Visto:" : "VISTO",
    "Vistos:" : "VISTO",
    "Vistos estos antecedentes:" : "VISTO",
    "--------------" : "HEADING"
}
merges_lower = {}
for key, value in merges.items():
    merges_lower[key.lower()] = value
# Eventhough the general gramar rule in Spanish is not to accent uppercase, there are many cases where a word in a heding might
# appear accented. This is a dictionary to armonize all headings without accents. The list is rather comprehensive, but there is
# still room for improvement.
# If we find some bug beyond simple misspelling which will be solved by spell checker, we can include it here. The example is in
# the first row with "ACTIVIDADESUSOS" which was found several times in headings.
bugs = {"ACTIVIDADESUSOS" : "ACTIVIDADES DE USOS"}

### Connection to the AWS S3 bucket
To effectively run this cell you need Omdena's credentials. Please keep them local and do not sync them in GitHub repos nor cloud drives. Before doing anything with this json file, please think of security!!

In [4]:
json_folder = Path("C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/")
# json_folder = Path("C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/")
filename = "Omdena_key.json"
file = json_folder / filename

with open(file, 'r') as f:
    cred = json.load(f) 

for key in cred:
    KEY = key
    SECRET = cred[key]

s3 = boto3.resource(
    service_name = 's3',
    region_name = 'us-east-2',
    aws_access_key_id = KEY,
    aws_secret_access_key = SECRET
)

### Functions and regular expressions

In [51]:

# Function to calculate the uppercase ratio in a string. It is used to detect section headings
def uppercase_ratio(string):
    if len(re.findall(r'[a-z]',string)) == 0:
        return 1
    else:
        return(len(re.findall(r'[A-Z]',string))/len(re.findall(r'[a-z]',string)))

def end_of_heading(line):
    if "URL" in line and "https:" in line:
        return True
    else:
        return False
    
def end_of_document(line):
    end_of_file = False
    for key in end_of_file_tags:
        if key in line:
            end_of_file = True
            break
    return end_of_file
# Regular expression to clear html tags (here is basically to remove the page tags)
cleanr = re.compile(r'<.*?>')
# Te function to clear html tags
def clean_html_tags(string):
  return cleanr.sub('', string)

def is_section(line):
    for key in merges:
        if key in line:
            line = merges[key]
            break
    return line
    
def is_por_tanto(line):
    if "POR TANTO" in line:
        return True
    else:
        return False

# Function to remove the last lines of a document, the ones that contain the signatures of the officials. It depends on the
# dictionary "official_positions"
def remove_signatures(line):
    signature = False
    for key in official_positions:
        if key in line:
            signature = True
            break
    return signature

# Function to change accented words by non-accented counterparts. It depends on the dictionary "accent_marks_bugs" 
accents_out = re.compile(r'[áéíóúÁÉÍÓÚ]')
accents_dict = {"á":"a","é":"e","í":"i","ó":"o","ú":"u","Á":"A","É":"E","Í":"I","Ó":"O","Ú":"U"}
def remove_accents(string):
    for accent in accents_out.findall(string):
        string = string.replace(accent, accents_dict[accent])
    return string

# Function to merge headlines expressing the same concept in different words. It depends on the dictionary "merges"
def merge_concepts(line):
    for key in merges:
        if key in line:
            line = merges[key]
            break
    return line

def clean_bugs(line):
    for key in bugs:
        if key in line:
            line = line.replace(key, bugs[key])
    return line

clean_special_char = re.compile(r'(\*\.)|(\”\.)')  
def clean_special_characters(line):
    char = clean_special_char.findall(line)
    for item in char:
        for character in item:
            if character != '':
                line = line.replace(character, "")
    return line

clean_acron = re.compile(r'(A\s*\.M\s*\.)|(\bart\s*\.)|(\bArt\s*\.)|(\bART\s*\.)|(\bArts\s*\.)|(\bAV\s*\.)|(\bDr\s*\.)|(\bIng\s*\.)|(\bLic\s*\.)|(\bLicda\s*\.)|(\bLIC\s*\.)|(mts\s*\.)|(\bNo\s*\.)|(P\s*\.M\s*\.)|(prof\s*\.)|(profa\s*\.)|(sp\s*\.)|(ssp\s*\.)|(to\s*\.)|(ta\s*\.)|(var\s*\.)')  
def clean_acronyms(line):
    acro = clean_acron.findall(line)
    for item in acro:
        for acronym in item:
            if acronym != '':
                line = line.replace(acronym, clean_punct.sub('', acronym))
    return line

whitespaces = re.compile(r'[ ]{2,}')
def clean_whitespace(line):
    return whitespaces.sub(' ', line).rstrip().lstrip()

decimal_points = re.compile(r'(\b\d+\s*\.\s*\d+)')
def change_decimal_points(line):
    dec = decimal_points.findall(line)
    for decimal in dec:
        if decimal != '':
#             print(decimal)
            line = line.replace(decimal, clean_punct.sub(',', decimal))
    return line
                
# Regular expression to clear punctuation from a string
clean_punct = re.compile('[%s]' % re.escape(string.punctuation))
# Regular expression to clear words that introduce unnecessary variability to headings. Some regex still not work 100% we need
# to improve them.
clean_capitulo = re.compile(r'(APARTADO \S*)|(APARTADO\s)|(^ART\.\s*\S*)|(^ART\.\s*)|(^Art\.\s*\S*)|(^Art\.\s*)|(^Arts\.\s*\S*)|(Capítulo \S*)|(CAPITULO \S*)|(CAPITULO\S*)|(CAPÍTULO \S*)|(CAPITULÓ \S*)|(CAPITULOS \S*)|(CAPITUO \S*)|(CATEGORIA\b)|(CATEGORÍA\b)|(SUBCATEGORIA\b)|(SUBCATEGORÍA\b)|(TITULO\s\S*)|(TÍTULO\s\S*)')
clean_bullet_char = re.compile(r'\b[A-Za-z]\s*\.|\b[A-Za-z]\s*\.\s*|\b[A-Za-z]\s*\-\s*|\b[A-Za-z]\s*\)\s*|\.\s*\b[B-Za-z]\b|\b[A-Z]{1,4}\s*\.|^\d+\s*\.\s*\D+|\d+\)')
clean_bullet_point = re.compile(r'^-\s*')
# Function sentence
def clean_sentence(string):
    string = clean_capitulo.sub('', string)
    string = clean_bullet_char.sub('', string).rstrip().lstrip()
    string = clean_bullet_point.sub('', string).rstrip().lstrip()
#     string = clean_punct.sub('', string).rstrip().lstrip()
    if string != "":
        return string
    else:
        return None    
    
# points = re.compile(r'(\b\w+\s*\.\s*\b[^\d\W]+)')
# def check_points(line):
#     return points.findall(line)
#     print(points.findall(line))

points = re.compile(r'(\b\w+\b\s*){3,}')
def check_sentence(line):
    if points.findall(line):
        return True
    else:
        return False

def split_into_sentences(line):
    sentence_list = []
    for sentence in line.split("."):
        if check_sentence(sentence):
            sentence = sentence.rstrip().lstrip()
            sentence_list.append(sentence)
    return sentence_list

# Function to add items to the dictionary with duplicate removal
def add_to_dict(string, dictionary, dupl_dict):
    if string in dupl_dict or string == None:
        pass
    else:
        dupl_dict[string] = 0
        if string in dictionary:
            dictionary[string] = dictionary[string] + 1
        else:
            dictionary[string] = 1
    return dictionary
def full_cleaning(line):
    line = clean_html_tags(line)
    line = remove_accents(line)
    line = clean_special_characters(line)
    line = clean_bugs(line)
    line = clean_acronyms(line)
    line = clean_whitespace(line)
    line = clean_sentence(line)
    return line

In [30]:
test_string = "Que el Art. 204 Ordinal 3*. y 5”. de la Constitución, regula. A. Hola, em dic Jordi. B. No sé massa perquè l'Art *. 22 conté 22.34€. Tanmateix sembla que la Licda. una cosa. voldria  55.22. no fotis"
# test_string = "Prova senzilleta per veure què passa si no hi ha punt"
test_string = clean_sentence(test_string)
test_string = clean_special_characters(test_string)
test_string = clean_acronyms(test_string)
test_string = clean_whitespace(test_string)
test_string = change_decimal_points(test_string)
print(test_string)
sentences = []
[sentences.append(sentence) for sentence in split_into_sentences(test_string)]
print(sentences)
# print(sentences)

# if check_sentence(test_string):
#     

Que el Art 204 Ordinal 3 y 5 de la Constitución, regula. Hola, em dic Jordi. No sé massa perquè l'Art 22 conté 22,34€. Tanmateix sembla que la Licda una cosa. voldria 55,22. no fotis
['Que el Art 204 Ordinal 3 y 5 de la Constitución, regula', 'Hola, em dic Jordi', "No sé massa perquè l'Art 22 conté 22,34€", 'Tanmateix sembla que la Licda una cosa']


### Pipeline to process files from S3 bucket
By executing this cell you will go through all policies in El Salvador and process section headings that will be saved in a dictionary. This should be merged with the notebook that builds up the final json files out of plain txt files.

In [17]:
in_folder = "text-extraction/"
out_folder = "JSON/"
counter = 0
name4 = {}
name5 = {}
name6 = {}
name7 = {}
for obj in s3.Bucket('wri-latin-talent').objects.all().filter(Prefix='text-extraction'):
    if in_folder in obj.key and obj.key.replace(in_folder, "") != "":# and filename in obj.key   # Un comment the previous string to run the code just in one sample document.
        file = obj.get()['Body'].read().decode('utf-8')  #get the file from S3 and read the body content
        lines = file.split("\n") # Split by end of line and pipe lines into a list
        file_name = obj.key.replace(in_folder, "").replace('.pdf.txt', '')        
        name4[file_name[0:4]] = 0
        name5[file_name[0:5]] = 0
        name6[file_name[0:6]] = 0
        name7[file_name[0:7]] = 0

        counter += 1

In [None]:
print(counter)
print(len(name4))
print(len(name5))
print(len(name6))
print(len(name7))

In [79]:
in_folder = "text-extraction/"
out_folder = "JSON/"
filename = "00a55afe4f55256567397a68df5d7f97e642480b" # This is only if you want to test on a single file
# bag_of_words = {}
# sentences = []
# sentences_dict = {}
json_file = {}
for obj in s3.Bucket('wri-latin-talent').objects.all().filter(Prefix='text-extraction'):
    if in_folder in obj.key and obj.key.replace(in_folder, "") != "":# and filename in obj.key   # Un comment the previous string to run the code just in one sample document.
        file = obj.get()['Body'].read().decode('utf-8')  #get the file from S3 and read the body content
        lines = file.split("\n") # Split by end of line and pipe lines into a list
        key = obj.key.replace(in_folder, out_folder).replace('pdf.txt', 'json')
        file_name = key.replace('.json', '').replace(out_folder, '')
#         print(file_name)
        json_file[file_name] = {}
        duplicates_dict = {} #Sometimes the same heading can be found more than once in a document. This will help on removing them
        section = ""
        line_counter = 0
        i = 0
        is_title = False
        for line in lines:
            if is_section(line, i) and is_title is False or is_por_tanto(line):
                if remove_signatures(line):
                    break
                else:
                    line = clean_html_tags(line)
                    line = remove_accents(line)
                    line = clean_bugs(line)
                    line = clean_sentence(line)
                    if line == None:
                        continue
                    section = merge_concepts(line)
#                     print("** Section:", section)
                    json_file[file_name][section] = {"tags" : [], "sentences" : {}}
                    is_title = is_por_tanto(section)
    #                     bag_of_words = add_to_dict(line, bag_of_words, duplicates_dict)

            else:
                
                line = clean_html_tags(line)
                line = remove_accents(line)
                line = clean_special_characters(line)
                line = clean_bugs(line)
                line = clean_acronyms(line)
                line = clean_whitespace(line)
                line = clean_sentence(line)
                if line == None:
                    continue
                if is_title:
                    line_counter += 1
                    sentence_id = file_name[0:7] + '_' + str(line_counter)
                    json_file[file_name][section]["sentences"][sentence_id] = {"text" : line, "labels" : []}
                    
                else:
                    for sentence in split_into_sentences(line):
                        line_counter += 1
                        sentence_id = file_name[0:7] + '_' + str(line_counter)
                        json_file[file_name][section]["sentences"][sentence_id] = {"text" : sentence, "labels" : []}
                is_title = False
            i += 1 
#         s3.Object('wri-latin-talent', key).put(Body = str(json.dumps(json_file)))#This will save all the contents in the string variable "content" into a txt file in the Pre-processed folder
        


In [83]:
out_folder = Path("C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/Data/Processed/")
filename = "ElSalvador.json"
file = out_folder / filename
with open(file, 'w') as fp:
    json.dump(json_file, fp, indent=4)

In [None]:
print(len(sentences_dict))
for k in sorted(sentences_dict):
    print(k, ":", sentences_dict[k])

#### Short summary

In [None]:
print("After preprocessing there are {} different headings in El Salvador policies".format(len(bag_of_words)))
print("{} documents have been processed".format(i))
print("There are {} lines of text as sentences".format(len(sentences)))

#### Dictionary items sorted by occurrence

In [None]:
dict( sorted(bag_of_words.items(), key=operator.itemgetter(1),reverse=True))

#### Dictionary items sorted by heading text

In [None]:
for k in sorted(bag_of_words):
    print(k, ":", bag_of_words[k])

#### Saving sentences as csv

In [None]:
print(sentences[0:2])

In [None]:
# path = Path("C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/Data/")
path = Path("C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/Data/")
filename = "sentences.npy"
file = path / filename
np_sentences = np.array(sentences)
with open(file, 'wb') as f:
    np.save(f, np_sentences)

#### Pipeline to process one file from HD folder
This is a pipeline to process a test file in a local folder.

In [3]:
path = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Documents_de_mostra/Chile/"
files = os.listdir(path)
print(arr)

['002c53058e85d383b057fa4cc25a6eb8e7d401e3', '0031d55c90473158c09acded547d67d44be22325', '01203a974410a65782afca6ff2c3bdb24a84b158', '019ae0595cb8d53ae0316cf46564755b211cdc9f', '6cef0d1b7182adadd6fe887a5e76e90324c503a1', '74c55bda33a822e06e04492d5f01b9fd864e5ba6', '7546484f6dac7941d25ec5d834ce8666497290c7', '75ffb099a140f6e837c429236ce6f0b33d31a666', '76f84f42d18d124006755dd3a7f17d41c23224a5', 'aa87f53a385577ac05df15c1df9f7876e85a8661', 'aa8d47340d977381c929e5a1737bb7e94333db8f', 'f27de0a7e5242d0c32df1d637e85f2f68f394497', 'f33fed112e8a030bd07a13e9fc02f7b68e50f3e8', 'ff6cdccf62923f94bac18add8875f4b799f0adb0']


In [61]:
path = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Documents_de_mostra/Chile/"
data_folder = Path(path)
filename = "00a55afe4f55256567397a68df5d7f97e642480b.pdf.txt"


files = os.listdir(path)

bag_of_words = {}
json_file = {}

i = 0
for filename in files:
    file_ = data_folder / filename
    with open(file_, 'r', encoding = 'utf-8') as file:
        lines = file.readlines()
        json_file[filename] = {}
        duplicates_dict = {} #Sometimes the same heading can be found more than once in a document. This will help on removing them
        section_counter = 0
        line_counter = 0
        heading_flag = True
        heading_content = False
        json_file[filename]["HEADING"] = {"tags" : [], "sentences" : {}}
        for line in lines:
            # Processing document heading
            if end_of_heading(line):
                heading_flag = False
                heading_content = False
            if heading_flag:
                if "Tipo Norma" in line:
                    heading_content = True
                if heading_content:
                    line = full_cleaning(line)
                    if ":" in line:
                        line_counter += 1
                        sentence_id = filename[0:7] + '_' + str(line_counter)
                        json_file[filename]["HEADING"]["sentences"][sentence_id] = {"text" : line, "labels" : []}
                    else:
                        json_file[filename]["HEADING"]["sentences"][sentence_id]["text"] = json_file[filename]["HEADING"]["sentences"][sentence_id]["text"] + " " + line
                
                
            line = clean_whitespace(line)
            if uppercase_ratio(line) == 1 and len(line) > 60:
                if remove_signatures(line):
                    break
                else:
                    line = remove_accents(line)
                    line = clean_bugs(line)
                    line = clean_sentence(line)
                    if line == None:
                        continue
                    section = merge_concepts(line)
    #                 print("** Section:", section)
                    json[section] = {"tags" : [], "sentences" : []}
                    bag_of_words = add_to_dict(line, bag_of_words, duplicates_dict)

    #         else:
    #             line = clean_html_tags(line)
    #             line = remove_accents(line)
    #             line = clean_special_characters(line)
    #             line = clean_bugs(line)
    #             line = clean_acronyms(line)
    #             line = clean_whitespace(line)
    #             line = clean_sentence(line)
    #             if line == None:
    #                 continue
    #             for sentence in split_into_sentences(line):
    #                  json[section]["sentences"].append({"text" : sentence , "tags" : []})
            i += 1
    #     data = file.read().replace('\n', '')

In [62]:
json_file

{'002c53058e85d383b057fa4cc25a6eb8e7d401e3': {'HEADING': {'tags': [],
   'sentences': {'002c530_1': {'text': 'Tipo Norma :Decreto 3157 EXENTO',
     'labels': []},
    '002c530_2': {'text': 'Fecha Publicacion :16-09-2016', 'labels': []},
    '002c530_3': {'text': 'Fecha Promulgacion :18-08-2016', 'labels': []},
    '002c530_4': {'text': 'Organismo :MUNICIPALIDAD DE PANQUEHUE',
     'labels': []},
    '002c530_5': {'text': 'Titulo :APRUEBA "ORDENANZA PARA LA EXTRACCION DE ARIDOS EN CAUCES Y ALVEOS DE CURSOS NATURALES DE AGUA QUE CONSTITUYEN BIENES NACIONALES DE USO PUBLICO Y EN POZOS LASTREROS DE PROPIEDAD PARTICULAR EN LA COMUNA DE PANQUEHUE" Y SUS RESPECTIVOS ANEXOS',
     'labels': []},
    '002c530_6': {'text': 'Tipo Version :Unica De : 16-SEP-2016',
     'labels': []},
    '002c530_7': {'text': 'Inicio Vigencia :16-09-2016', 'labels': []},
    '002c530_8': {'text': 'Id Norma :1094879', 'labels': []}}}},
 '0031d55c90473158c09acded547d67d44be22325': {'HEADING': {'tags': [],
   'sente

### Dictionary items sorted by occurrence

In [40]:
dict( sorted(bag_of_words.items(), key=operator.itemgetter(1),reverse=True))

{'APRUEBA "ORDENANZA PARA LA EXTRACCION DE ARIDOS EN CAUCES Y ALVEOS DE CURSOS NATURALES DE AGUA QUE CONSTITUYEN BIENES NACIONALES DE USO PUBLICO Y EN POZOS LASTREROS DE PROPIEDAD PARTICULAR EN LA COMUNA DE PANQUEHUE" Y SUS RESPECTIVOS ANEXOS': 1,
 'ORDENANZA PARA LA EXTRACCION DE ARIDOS EN CAUCES Y ALVEOS DE CURSOS NATURALES DE AGUA QUE CONSTITUYEN BIENES NACIONALES DE USO PUBLICO Y EN POZOS LASTREROS DE PROPIEDAD PARTICULAR, DE LA COMUNA DE PANQUEHUE': 1,
 'DE LOS PERMISOS DE EXTRACCION ARTESANAL DE SUBSISTENCIA': 1,
 'DEL TERMINO DEL PERMISO, DE LAS SANCIONES Y DE LOS PROCEDIMIENTOS DE FISCALIZACION': 1,
 'ANEXO N° 1 - ORDENANZA EXTRACCION DE ARIDOS COMUNA DE PANQUEHUE': 1,
 'ANEXO N° 2 - ORDENANZA EXTRACCION DE ARIDOS COMUNA DE PANQUEHUE': 1,
 'ACOGE A TRAMITACION ESTUDIO DE IMPACTO AMBIENTAL DEL PROYECTO SISTEMA DE TRATAMIENTO INTEGRAL DE LAS AGUAS SERVIDAS DE PUERTO MONTT, SEGUNDA PARTE': 1,
 'MODIFICA DECRETO Nº 355 EXENTO, DE 2008, DE LA SUBSECRETARIA DE AGRICULTURA QUE AUTORIZ

### Dictionary items sorted by heading text

In [None]:
for k in sorted(bag_of_words):
    print(k, ":", bag_of_words[k])