## Section Headings Curation of El Salvador Policies

This notebook collects dictionaries and functions to curate the section headings of El Salvador policies. These policies have a fairly regular structure in which the legal text is organized under section headings. There are two kinds of sections: general ones, which appear in many policies, and specific ones. The general section headings often come in a whole range of name variants, which makes machine recognition difficult.

The goal of this notebook is to group all the preprocessing methods that harmonize section headings so that further processing is machine friendly.

In [1]:
from pathlib import Path
import re, boto3, json, string, operator
import numpy as np

### Dictionaries of particular vocabularies to help in the curation of section headings

In [2]:
# Most policies end with the signatures of the officials. This is a piece of text that we want to be able to recognize. To make the
# detection of signatures easier, this dictionary contains the most common terms that can be found in these lines of text.
official_positions = {"ALCALDE" : 0,
"Alcalde" : 0,
"MINISTRA" : 0,
"Ministra" : 0,
"MINISTRO" : 0,
"Ministro" : 0,
"PRESIDENTA" : 0,
"Presidenta" : 0,
"PRESIDENTE" : 0,
"Presidente" : 0,
"REGIDOR" : 0,
"Regidor"  : 0,
"REGIDORA" : 0,
"Regidora" : 0,
"SECRETARIA" : 0,
"Secretaria" : 0,
"SECRETARIO" : 0,
"Secretario" : 0,
"SINDICA" : 0,
"Sindica" : 0,
"SINDICO" : 0,
"Sindico" : 0,
"VICEPRESIDENTA" : 0,
"Vicepresidenta" : 0,
"VICEPRESIDENTE" : 0,
"Vicepresidente" : 0
}
# This dictionary contains correspondences among different heading texts. It is under development and needs further
# improvement. The idea is to merge under a single name all the headings that point to the same concept. For example,
# "Definiciones" is a heading that can come alone or together with other terms, so it can appear as "Definiciones básicas" or
# "Definiciones generales". With the dictionary we can fetch all headings that contain the word "Definiciones" and change the
# heading to "Definiciones".
merges = {
    "CONCEPTOS" : "DEFINICIONES",
    "DEFINICIONES" : "DEFINICIONES",
    "DISPOSICIONES FINALES" : "DISPOSICIONES GENERALES",
    "DISPOSICIONES GENERALES" : "DISPOSICIONES GENERALES",
    "DISPOSICIONES PRELIMINARES" : "DISPOSICIONES PRELIMINARES",
    "DISPOSICIONES REGULADORAS" : "DISPOSICIONES ESPECIALES",
    "DISPOSICIONES RELATIVAS" : "DISPOSICIONES ESPECIALES",
    "DISPOSICIONES ESPECIALES" : "DISPOSICIONES ESPECIALES",
    "DISPOSICIONES TRANSITORIAS" : "DISPOSICIONES GENERALES",
    "INFRACCIONES" : "INFRACCIONES",
    "INFRACCION ES" : "INFRACCIONES",
    "OBJETIVO" : "OBJETO",
    "OBJETO" : "OBJETO",
    "OBLIGACIONES" : "OBLIGACIONES Y PROHIBICIONES",
    "OBLIGACIONE" : "OBLIGACIONES Y PROHIBICIONES",
    "OBLIGACION" : "OBLIGACIONES Y PROHIBICIONES",
    "OBLIGATORIEDAD" : "OBLIGACIONES Y PROHIBICIONES",
    "POR TANTO" : "POR TANTO",
    "POR LO TANTO" : "POR TANTO",
    "PROHIBICIONES" : "OBLIGACIONES Y PROHIBICIONES",
    "PROHIBICION" : "OBLIGACIONES Y PROHIBICIONES"
}
# Even though accents are often omitted on uppercase letters in Spanish, there are many cases where a word in a heading
# appears accented, so all headings are harmonized to their unaccented form. The accent dictionary used for this
# ("accents_dict", defined with the functions below) is rather comprehensive, but there is still room for improvement.
# If we find a bug beyond a simple misspelling (misspellings will be handled by a spell checker), we can include it here.
# An example is the first entry, "ACTIVIDADESUSOS", which was found several times in headings.
bugs = {"ACTIVIDADESUSOS" : "ACTIVIDADES DE USOS"}
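
The way a dictionary such as `merges` collapses heading variants can be sketched as follows. This is a minimal illustration, assuming substring matching where the first matching key wins; `merges_sample` and `merge_heading` are hypothetical names, and the sample holds only a few entries of the full dictionary.

```python
# Hypothetical subset of the "merges" dictionary, for illustration only.
merges_sample = {
    "DEFINICIONES": "DEFINICIONES",
    "CONCEPTOS": "DEFINICIONES",
    "POR LO TANTO": "POR TANTO",
}

def merge_heading(heading, mapping):
    """Return the canonical name of the first key contained in `heading`."""
    for key, canonical in mapping.items():
        if key in heading:
            return canonical
    # No key matched: the heading is specific and is kept unchanged.
    return heading

print(merge_heading("DEFINICIONES BASICAS", merges_sample))
print(merge_heading("CONCEPTOS GENERALES", merges_sample))
print(merge_heading("DE LAS TASAS", merges_sample))
```

Because the keys are tested in insertion order, a heading containing two different keys is mapped according to whichever key comes first in the dictionary.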

### Connection to the AWS S3 bucket
To run this cell you need Omdena's credentials. Please keep them local and do not sync them to GitHub repos or cloud drives. Before doing anything with this JSON file, think about security!

In [4]:
# json_folder = Path("C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/")
json_folder = Path("C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/")
filename = "Omdena_key.json"
file = json_folder / filename

with open(file, 'r') as f:
    key_dict = json.load(f) 

for key in key_dict:
    KEY = key
    SECRET = key_dict[key]

s3 = boto3.resource(
    service_name = 's3',
    region_name = 'us-east-2',
    aws_access_key_id = KEY,
    aws_secret_access_key = SECRET
)
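
The loop in the cell above keeps the last key-value pair of the JSON file, which suggests the file holds a single `{access key id: secret}` entry. Under that assumption, a slightly more defensive loader could look like the sketch below; `load_key_pair` is a hypothetical helper, not part of the original notebook.

```python
import json

def load_key_pair(path):
    """Return (access_key_id, secret) from a JSON file assumed to hold exactly one entry."""
    with open(path, "r", encoding="utf-8") as f:
        key_dict = json.load(f)
    if len(key_dict) != 1:
        # Fail loudly instead of silently picking the last entry.
        raise ValueError(f"expected exactly one key pair, found {len(key_dict)}")
    access_key_id, secret = next(iter(key_dict.items()))
    return access_key_id, secret

# Usage (the path is the one defined in the cell above):
# KEY, SECRET = load_key_pair(json_folder / "Omdena_key.json")
```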

### Functions and regular expressions

In [None]:

# Function to calculate the uppercase ratio in a string. It is used to detect section headings.
# Empty strings return 0 to avoid a division by zero on blank lines.
def uppercase_ratio(string):
    if not string:
        return 0
    return len(re.findall(r'[A-Z]', string)) / len(string)

# Regular expression to clear HTML tags (here it basically removes the page tags)
cleanr = re.compile(r'<.*?>')
# The function to clear HTML tags
def clean_html_tags(string):
    return cleanr.sub('', string)

# Function to detect the last lines of a document, the ones that contain the signatures of the officials, so that the
# pipeline can stop there. It returns True if the line contains any key of the dictionary "official_positions"
def remove_signatures(line):
    signature = False
    for key in official_positions:
        if key in line:
            signature = True
            break
    return signature

# Function to replace accented vowels with their non-accented counterparts. It depends on the dictionary "accents_dict"
accents_out = re.compile(r'[áéíóúÁÉÍÓÚ]')
accents_dict = {"á":"a","é":"e","í":"i","ó":"o","ú":"u","Á":"A","É":"E","Í":"I","Ó":"O","Ú":"U"}
def remove_accents(string):
    for accent in accents_out.findall(string):
        string = string.replace(accent, accents_dict[accent])
    return string

# Function to merge headlines expressing the same concept in different words. It depends on the dictionary "merges"
def merge_concepts(line):
    for key in merges:
        if key in line:
            line = merges[key]
            break
    return line

# Function to fix known bugs in headings. It depends on the dictionary "bugs"
def clean_bugs(line):
    for key in bugs:
        if key in line:
            line = line.replace(key, bugs[key])
    return line

# Function to count a heading at most once per document. "dupl_dict" records the headings already seen in the current
# document, and "dictionary" holds the number of documents in which each heading appears.
def add_to_dict(string, dictionary, dupl_dict):
    if string is None or string in dupl_dict:
        return dictionary
    dupl_dict[string] = 0
    dictionary[string] = dictionary.get(string, 0) + 1
    return dictionary

# Regular expression to clear punctuation from a string
clean_punct = re.compile('[%s]' % re.escape(string.punctuation))
# Regular expression to clear words that introduce unnecessary variability to headings. Some patterns still do not work
# 100% and need to be improved.
clean_capitulo = re.compile(r'(APARTADO \S*)|(APARTADO\s)|(\bART. \S*)|(\bART. \d*)|(\bArt. \S*)|(Capítulo \S*)|(CAPITULO \S*)|(CAPITULO\S*)|(CAPÍTULO \S*)|(CAPITULÓ \S*)|(CAPITULOS \S*)|(CAPITUO \S*)|(CATEGORIA\b)|(CATEGORÍA\b)|(SUBCATEGORIA\b)|(SUBCATEGORÍA\b)|(TITULO\s\S*)|(TÍTULO\s\S*)')
# Function to remove 1. unwanted words; 2. punctuation; 3. leading and trailing white space.
# It returns None when nothing is left after cleaning.
def clean_headings(string):
    string = clean_capitulo.sub('', string)
    clean_string = clean_punct.sub('', string).strip()
    if clean_string != "":
        return clean_string
    return None
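
Two of the helpers above can also be written with built-in string methods. The sketch below is an alternative under stated assumptions, not the notebook's own code: `remove_accents_translate` and `uppercase_ratio_unicode` are hypothetical names. The first maps the same ten accented vowels as `accents_dict`; the second also counts uppercase letters outside A-Z (such as Ñ), which the `[A-Z]` pattern does not.

```python
# Translation table covering the same accented vowels as "accents_dict".
ACCENT_TABLE = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")

def remove_accents_translate(text):
    """Replace accented vowels in a single pass; other characters (e.g. Ñ) are kept."""
    return text.translate(ACCENT_TABLE)

def uppercase_ratio_unicode(text):
    """Share of characters that are uppercase letters; 0.0 for an empty string."""
    if not text:
        return 0.0
    return sum(ch.isupper() for ch in text) / len(text)

print(remove_accents_translate("DISPOSICIÓN PEQUEÑA"))
print(uppercase_ratio_unicode("AÑO"))
```

Whether Ñ and accented capitals should count towards the heading threshold is a design choice; counting them raises the ratio of headings that contain such letters.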


### Pipeline to process files from S3 bucket
By executing this cell you will go through all El Salvador policies and process their section headings, which are counted in a dictionary. This should be merged with the notebook that builds the final JSON files out of plain txt files.

In [None]:
folder = "text-extraction/"
filename = "00a55afe4f55256567397a68df5d7f97e642480b" # This is only if you want to test on a single file
bag_of_words = {}
sentences = []

i = 0
for obj in s3.Bucket('wri-latin-talent').objects.filter(Prefix='text-extraction'):
    if folder in obj.key and obj.key.replace(folder, "") != "":  # and filename in obj.key  # Uncomment the previous condition to run the code on a single sample document.
#         print(i, "**", obj.key)
        file = obj.get()['Body'].read().decode('utf-8')  #get the file from S3 and read the body content
        lines = file.split("\n") # Split by end of line and pipe lines into a list
        duplicates_dict = {} #Sometimes the same heading can be found more than once in a document. This will help on removing them
        for line in lines:
            if uppercase_ratio(clean_html_tags(line)) > 0.6 and len(line) > 6:
                if remove_signatures(line):
                    break
                else:
                    line = clean_html_tags(line)
                    line = remove_accents(line)
                    line = clean_bugs(line)
                    line = clean_headings(line)
                    if line is None:
                        continue
                    line = merge_concepts(line)
                    bag_of_words = add_to_dict(line, bag_of_words, duplicates_dict)
            else:
                sentences.append(line)
#                 print("--", line)
#             s3.Object('wri-latin-talent', key).put(Body = content)#This will save all the contents in the string variable "content" into a txt file in the Pre-processed folder
        i += 1
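
The control flow of the loop above can be tried without S3 access on a few in-memory lines. The following is a simplified sketch, assuming the same 0.6 uppercase threshold and minimum length of 6; `extract_headings` and the sample text are hypothetical, and the cleaning and merging steps are left out for brevity.

```python
import re

TAG_RE = re.compile(r"<.*?>")

def upper_ratio(text):
    """Share of uppercase characters; 0.0 for an empty string."""
    return sum(ch.isupper() for ch in text) / len(text) if text else 0.0

def extract_headings(lines, signature_terms, threshold=0.6, min_len=6):
    """Split one document into unique headings and ordinary sentences."""
    headings, sentences, seen = [], [], set()
    for raw in lines:
        line = TAG_RE.sub("", raw).strip()
        if upper_ratio(line) > threshold and len(line) > min_len:
            if any(term in line for term in signature_terms):
                break  # signatures reached: ignore the rest of the document
            if line not in seen:  # count each heading once per document
                seen.add(line)
                headings.append(line)
        else:
            sentences.append(line)
    return headings, sentences

sample = [
    "<page>1</page>",
    "DISPOSICIONES GENERALES",
    "El presente reglamento tiene por objeto regular el uso del suelo.",
    "DISPOSICIONES GENERALES",
    "JUAN PEREZ, ALCALDE MUNICIPAL",
    "Texto posterior a las firmas.",
]
headings, sentences = extract_headings(sample, {"ALCALDE", "SECRETARIO"})
print(headings)
print(sentences)
```

Note that the `break` discards every line after the first detected signature, so a mostly uppercase line containing one of the terms earlier in a document would truncate it.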

#### Short summary

In [None]:
print("After preprocessing there are {} different headings in El Salvador policies".format(len(bag_of_words)))
print("{} documents have been processed".format(i))
print("There are {} lines of text as sentences".format(len(sentences)))

#### Dictionary items sorted by occurrence

In [None]:
dict( sorted(bag_of_words.items(), key=operator.itemgetter(1),reverse=True))
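
The same ordering is available from `collections.Counter`, which is a dictionary subclass. A minimal sketch with made-up counts (the values below are illustrative, not results from the corpus):

```python
from collections import Counter

# Hypothetical heading counts, for illustration only.
heading_counts = Counter({"OBJETO": 12, "DEFINICIONES": 9, "POR TANTO": 15})

# most_common(n) returns the n most frequent items, sorted by descending count.
for heading, n_docs in heading_counts.most_common(2):
    print(heading, n_docs)
```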

#### Dictionary items sorted by heading text

In [None]:
for k in sorted(bag_of_words):
    print(k, ":", bag_of_words[k])

#### Saving sentences as a NumPy file

In [None]:
print(sentences[0:2])

In [None]:
path = Path("C:/Users/user/Google Drive/Els_meus_documents (1)/projectes/CompetitiveIntelligence/WRI/Notebooks/Data/")
filename = "sentences.npy"
file = path / filename
np_sentences = np.array(sentences)
with open(file, 'wb') as f:
    np.save(f, np_sentences)
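
The cell above stores the sentences in NumPy's binary `.npy` format. If a plain CSV file is preferred, a sketch using the standard `csv` module could look like this; `save_sentences_csv`, the `sentence` column name, and the temporary output folder are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

def save_sentences_csv(sentences, path):
    """Write one sentence per row under a single 'sentence' column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence"])
        writer.writerows([s] for s in sentences)

# Example run in a temporary folder; replace with the real data folder.
out_file = Path(tempfile.mkdtemp()) / "sentences.csv"
save_sentences_csv(["Primera oración.", "Segunda, con coma."], out_file)
print(out_file.read_text(encoding="utf-8"))
```

The `csv` writer quotes fields that contain commas, so sentences with commas are read back as single fields.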

#### Pipeline to process one file from HD folder
This is a pipeline to process a test file in a local folder.

In [None]:
data_folder = Path("../Documents_de_mostra/")
filename = "00a55afe4f55256567397a68df5d7f97e642480b.pdf.txt"
bag_of_words = {}

i = 0
file = data_folder / filename
with open(file, 'r', encoding = 'utf-8') as f:
    lines = f.readlines()
    duplicates_dict = {}
    for line in lines:
        line = clean_html_tags(line)
        if uppercase_ratio(line) > 0.6 and len(line) > 6:
            if remove_signatures(line):
                break
            else:
#                 print(line)
                line = remove_accents(line)
                clean_line = clean_headings(line)
                bag_of_words = add_to_dict(clean_line, bag_of_words, duplicates_dict)
#                 print(clean_line)
        i += 1
#     data = file.read().replace('\n', '')

In [None]:
bag_of_words