# **Exploratory Data Analysis**
### CAPSTONE PROJECT - MNA

#### **Team:**
- Rafael J. Mateo C. - A01793054

**0. Description**

In this notebook we process eleven (11) of the 300+ laws included in the dataset to keep the analysis manageable. This EDA focuses on Mexican environmental laws, although our model will be trained on many other laws as well.

**1. Libraries and Helper Functions**

First, we import the libraries we'll be using.

In [57]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import spacy
import numpy as np

Also, we define some helper functions that we'll use for this project

In [58]:

#The files are read as latin1, so UTF-8 text arrives garbled; we re-encode it and decode it as UTF-8
def decode_text(text: str) -> str:
    try:
        return text.encode("latin_1").decode("utf-8")
    except UnicodeDecodeError:
        #The text really was latin1, so it is already correct and we keep it unchanged
        return text

#This method drops empty lines and non-string entries so it's easier to analyse the corpus
def remove_new_lines(lines: list) -> list:
    clean_text = []
    for line in lines:
        if not isinstance(line, str):
            continue
        line = line.strip(' ')
        if line == '':
            continue

        clean_text.append(decode_text(line))
    
    return clean_text
    
#This method reads a file and applies the methods above
def read_without_new_lines(file_name: str) -> tuple:
    with open(file_name, encoding='latin_1') as f:
        text = f.read().splitlines()
    clean_text = remove_new_lines(text)
    #The first non-empty line is the title; fall back to '' for empty files
    title = clean_text[0] if clean_text else ''
    return (title, " ".join(clean_text))
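
To see why `decode_text` re-encodes before decoding, the following sketch reproduces the round trip on a single word. It assumes text that was saved as UTF-8 but read as Latin-1. The sample word is illustrative and not taken from the corpus, and the `repair_mojibake` helper with its keep-the-original fallback is a hypothetical variant, not the notebook's function.

```python
# Illustrative only: the sample word is not taken from the corpus.
original = "Prevención"

# Reading UTF-8 bytes as Latin-1 turns each multi-byte character into two.
misread = original.encode("utf-8").decode("latin_1")

def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; keep the text if it was fine."""
    try:
        return text.encode("latin_1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake, so return it untouched.
        return text

repaired = repair_mojibake(misread)     # garbled input is restored
untouched = repair_mojibake(original)   # correct input is left alone
print(repr(misread), repr(repaired), repr(untouched))
```

Falling back to the unchanged text, rather than decoding with `errors="replace"`, avoids replacement characters in files that were Latin-1 all along.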
    

The next methods will help us interpret the corpus by tokenizing it and extracting POS tags and named entities.

In [59]:
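from spacy.tokens import Doc

nlp = spacy.load('es_core_news_sm')

# Converts the text to a spaCy Doc
def preprocess(text: str) -> Doc:
    return nlp(text)

# Extracts the text of every token
def get_tokens(doc: Doc) -> list:
    return [token.text for token in doc]

# Extracts the POS tag of every token
def get_pos(doc: Doc) -> list:
    return [token.pos_ for token in doc]

# Counts how many times each POS tag appears in the document
def get_pos_tags(doc: Doc) -> dict:
    store = {}
    # count_by returns {attribute ID: count}; we map each ID back to its tag name
    pos_count = doc.count_by(spacy.attrs.POS)
    for tag, count in sorted(pos_count.items()):
        store[doc.vocab[tag].text] = count

    return store

# Extracts the label of every named entity
def extract_entities(doc: Doc) -> list:
    return [ent.label_ for ent in doc.ents]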
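
`get_pos_tags` tallies tags with `Doc.count_by`, which keys the counts by integer attribute IDs that are then mapped back to strings through `doc.vocab`. As a rough sketch of the same tally without loading a spaCy model, the block below counts a hand-written list of tag strings with `collections.Counter`. The `count_tags` helper is hypothetical, and it sorts by tag name rather than by attribute ID.

```python
from collections import Counter

def count_tags(tags: list) -> dict:
    """Return tag -> count, sorted by tag name for a stable display order."""
    return dict(sorted(Counter(tags).items()))

# Tags of the five tokens in "Ley Federal de Responsabilidad Ambiental"
sample_tags = ["PROPN", "PROPN", "ADP", "PROPN", "PROPN"]
tag_counts = count_tags(sample_tags)
print(tag_counts)  # {'ADP': 1, 'PROPN': 4}
```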

**2. Loading the files**

Let's start by loading the files located in the directory.

In [60]:
#Directory where the files are located
directory = 'env_data'
corpus = []

# Let's get the content from every file in the directory
for path, folders, files in os.walk(directory):
    for file in files:
        if file.endswith('.txt'):
            title, text = read_without_new_lines(os.path.join(path, file))
            
            corpus.append({
                'Title': title,
                'Filename': file,
                'Text': text
            })          
        
  
#We put them in a dataframe (one row per file)
df_original = pd.DataFrame(corpus)
# Let's make a copy and leave the other as backup
df = df_original.copy()

df.head(5)

        
    

|   | Title | Filename | Text |
|---|-------|----------|------|
| 0 | Ley Federal de Responsabilidad Ambiental | ley_responsabilidad_ambiental.txt | Ley Federal de Responsabilidad Ambiental LEY F... |
| 1 | LEY DE AGUAS NACIONALES | ley_aguas.txt | LEY DE AGUAS NACIONALES Nueva Ley publicada en... |
| 2 | Ley de Desarrollo Rural Sustentable | ley_desarrollo_sustentable.txt | Ley de Desarrollo Rural Sustentable LEY DE DES... |
| 3 | Ley General de Pesca y Acuacultura Sustentables | ley_pesca.txt | Ley General de Pesca y Acuacultura Sustentable... |
| 4 | Ley General de Vida Silvestre | ley_vida_silvestre.txt | Ley General de Vida Silvestre LEY GENERAL DE V... |
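
`os.walk` does not guarantee the order in which files are returned, so the row order of the dataframe may differ between machines. The sketch below shows one way to make the traversal reproducible by sorting the collected paths. The `list_text_files` helper and the temporary files are assumptions made for illustration; they are not part of the notebook or the dataset.

```python
import os
import tempfile

def list_text_files(directory: str) -> list:
    """Return the paths of all .txt files under directory, sorted for a stable order."""
    paths = []
    for path, _folders, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                paths.append(os.path.join(path, file))
    return sorted(paths)

# Build a throwaway directory tree to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for name in ("b.txt", "notes.md", os.path.join("sub", "a.txt")):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("LEY DE EJEMPLO\n")
    found = [os.path.relpath(p, tmp) for p in list_text_files(tmp)]

print(found)  # only the .txt files, including the one in the subfolder
```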


**3. Extracting the information**

**3.1. Getting tokens and POS Tags**

Let's start by extracting the tokens and POS tags from the corpus.

In [61]:
df['Document'] = df['Text'].apply(preprocess)
df['Tokens'] = df['Document'].apply(get_tokens)
df['POS'] = df['Document'].apply(get_pos)

df.head(5)



|   | Title | Filename | Text | Document | Tokens | POS |
|---|-------|----------|------|----------|--------|-----|
| 0 | Ley Federal de Responsabilidad Ambiental | ley_responsabilidad_ambiental.txt | Ley Federal de Responsabilidad Ambiental LEY F... | (Ley, Federal, de, Responsabilidad, Ambiental,... | [Ley, Federal, de, Responsabilidad, Ambiental,... | [PROPN, PROPN, ADP, PROPN, PROPN, PROPN, PROPN... |
| 1 | LEY DE AGUAS NACIONALES | ley_aguas.txt | LEY DE AGUAS NACIONALES Nueva Ley publicada en... | (LEY, DE, AGUAS, NACIONALES, Nueva, Ley, publi... | [LEY, DE, AGUAS, NACIONALES, Nueva, Ley, publi... | [PROPN, ADP, PROPN, PROPN, PROPN, PROPN, ADJ, ... |
| 2 | Ley de Desarrollo Rural Sustentable | ley_desarrollo_sustentable.txt | Ley de Desarrollo Rural Sustentable LEY DE DES... | (Ley, de, Desarrollo, Rural, Sustentable, LEY,... | [Ley, de, Desarrollo, Rural, Sustentable, LEY,... | [PROPN, ADP, PROPN, PROPN, PROPN, PROPN, ADP, ... |
| 3 | Ley General de Pesca y Acuacultura Sustentables | ley_pesca.txt | Ley General de Pesca y Acuacultura Sustentable... | (Ley, General, de, Pesca, y, Acuacultura, Sust... | [Ley, General, de, Pesca, y, Acuacultura, Sust... | [PROPN, PROPN, ADP, PROPN, CCONJ, PROPN, PROPN... |
| 4 | Ley General de Vida Silvestre | ley_vida_silvestre.txt | Ley General de Vida Silvestre LEY GENERAL DE V... | (Ley, General, de, Vida, Silvestre, LEY, GENER... | [Ley, General, de, Vida, Silvestre, LEY, GENER... | [PROPN, PROPN, ADP, PROPN, PROPN, PROPN, PROPN... |


In [63]:
#Let's make a DF with the title and document columns only
#We take a copy so that adding columns later does not trigger a SettingWithCopyWarning
df_analysis = df[['Title', 'Document']].copy()

Now, we count every POS tag and store it in a column

In [64]:
df_analysis['Count'] = df_analysis['Document'].apply(get_pos_tags)
df_analysis['Count'].head(5)

0    {'ADJ': 828, 'ADP': 1339, 'ADV': 132, 'AUX': 1...
1    {'ADJ': 6127, 'ADP': 10093, 'ADV': 1505, 'AUX'...
2    {'ADJ': 3040, 'ADP': 5513, 'ADV': 268, 'AUX': ...
3    {'ADJ': 3758, 'ADP': 5704, 'ADV': 810, 'AUX': ...
4    {'ADJ': 3346, 'ADP': 5470, 'ADV': 699, 'AUX': ...
Name: Count, dtype: object

Next, we create a dataframe containing only the counts of every tag for every file we are analyzing.

In [65]:
count_list = df_analysis['Count'].to_list()
pos_count_df =  pd.DataFrame(count_list)

pos_count_df.head(5)

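|   | ADJ | ADP | ADV | AUX | CCONJ | DET | INTJ | NOUN | NUM | PRON | PROPN | PUNCT | SCONJ | SYM | VERB | SPACE | PART |
|---|-----|-----|-----|-----|-------|-----|------|------|-----|------|-------|-------|-------|-----|------|-------|------|
| 0 | 828 | 1339 | 132 | 163 | 357 | 1180 | 1.0 | 1640 | 91 | 324 | 777 | 711 | 131 | 1.0 | 478 | NaN | NaN |
| 1 | 6127 | 10093 | 1505 | 411 | 3812 | 7513 | 2.0 | 13692 | 858 | 2175 | 12408 | 6664 | 694 | 114.0 | 3788 | 101.0 | NaN |
| 2 | 3040 | 5513 | 268 | 189 | 1705 | 4679 | NaN | 6753 | 326 | 837 | 3740 | 3125 | 191 | NaN | 1566 | 354.0 | NaN |
| 3 | 3758 | 5704 | 810 | 217 | 2197 | 4124 | 14.0 | 7617 | 368 | 1281 | 7546 | 3342 | 338 | 48.0 | 2103 | 18.0 | NaN |
| 4 | 3346 | 5470 | 699 | 234 | 1690 | 3821 | 4.0 | 6945 | 558 | 1268 | 7640 | 3056 | 353 | 77.0 | 1876 | 211.0 | NaN |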
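
When a dataframe is built from a list of dictionaries, its columns are the union of all keys, and a tag that a document never produced shows up as a missing value. That is why columns such as INTJ, SYM and SPACE contain gaps and are displayed as floats. The plain-Python sketch below performs the same alignment on a small subset of the counts shown above, filling the gaps with 0 instead; the variable names are illustrative.

```python
# Subset of the per-document tag counts shown in the table above
count_list = [
    {"ADJ": 828, "INTJ": 1, "SYM": 1},  # Ley Federal de Responsabilidad Ambiental
    {"ADJ": 3040, "SPACE": 354},        # Ley de Desarrollo Rural Sustentable
]

# Union of all tags, in first-seen order
all_tags = list(dict.fromkeys(tag for counts in count_list for tag in counts))

# Fill tags that a document never produced with 0 instead of leaving a gap
aligned_rows = [[counts.get(tag, 0) for tag in all_tags] for counts in count_list]

print(all_tags)
for row in aligned_rows:
    print(row)
```

In pandas, a comparable result could be obtained by calling `fillna(0)` on the counts dataframe before plotting or comparing documents.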


To make it more readable, let's replace every POS tag with its descriptive label.

In [66]:
columns = pos_count_df.columns.to_list()
column_names = [spacy.explain(tag).title() for tag in columns]
pos_count_df.columns = column_names
pos_count_df.head(5)

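|   | Adjective | Adposition | Adverb | Auxiliary | Coordinating Conjunction | Determiner | Interjection | Noun | Numeral | Pronoun | Proper Noun | Punctuation | Subordinating Conjunction | Symbol | Verb | Space | Particle |
|---|-----------|------------|--------|-----------|--------------------------|------------|--------------|------|---------|---------|-------------|-------------|---------------------------|--------|------|-------|----------|
| 0 | 828 | 1339 | 132 | 163 | 357 | 1180 | 1.0 | 1640 | 91 | 324 | 777 | 711 | 131 | 1.0 | 478 | NaN | NaN |
| 1 | 6127 | 10093 | 1505 | 411 | 3812 | 7513 | 2.0 | 13692 | 858 | 2175 | 12408 | 6664 | 694 | 114.0 | 3788 | 101.0 | NaN |
| 2 | 3040 | 5513 | 268 | 189 | 1705 | 4679 | NaN | 6753 | 326 | 837 | 3740 | 3125 | 191 | NaN | 1566 | 354.0 | NaN |
| 3 | 3758 | 5704 | 810 | 217 | 2197 | 4124 | 14.0 | 7617 | 368 | 1281 | 7546 | 3342 | 338 | 48.0 | 2103 | 18.0 | NaN |
| 4 | 3346 | 5470 | 699 | 234 | 1690 | 3821 | 4.0 | 6945 | 558 | 1268 | 7640 | 3056 | 353 | 77.0 | 1876 | 211.0 | NaN |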


Let's rearrange the column order to improve readability.

In [67]:
pos_count_df['Title'] = df_analysis['Title']
# We want the title to be the first column
pos_count_df = pos_count_df[['Title'] + column_names ]
pos_count_df.head(5)

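|   | Title | Adjective | Adposition | Adverb | Auxiliary | Coordinating Conjunction | Determiner | Interjection | Noun | Numeral | Pronoun | Proper Noun | Punctuation | Subordinating Conjunction | Symbol | Verb | Space | Particle |
|---|-------|-----------|------------|--------|-----------|--------------------------|------------|--------------|------|---------|---------|-------------|-------------|---------------------------|--------|------|-------|----------|
| 0 | Ley Federal de Responsabilidad Ambiental | 828 | 1339 | 132 | 163 | 357 | 1180 | 1.0 | 1640 | 91 | 324 | 777 | 711 | 131 | 1.0 | 478 | NaN | NaN |
| 1 | LEY DE AGUAS NACIONALES | 6127 | 10093 | 1505 | 411 | 3812 | 7513 | 2.0 | 13692 | 858 | 2175 | 12408 | 6664 | 694 | 114.0 | 3788 | 101.0 | NaN |
| 2 | Ley de Desarrollo Rural Sustentable | 3040 | 5513 | 268 | 189 | 1705 | 4679 | NaN | 6753 | 326 | 837 | 3740 | 3125 | 191 | NaN | 1566 | 354.0 | NaN |
| 3 | Ley General de Pesca y Acuacultura Sustentables | 3758 | 5704 | 810 | 217 | 2197 | 4124 | 14.0 | 7617 | 368 | 1281 | 7546 | 3342 | 338 | 48.0 | 2103 | 18.0 | NaN |
| 4 | Ley General de Vida Silvestre | 3346 | 5470 | 699 | 234 | 1690 | 3821 | 4.0 | 6945 | 558 | 1268 | 7640 | 3056 | 353 | 77.0 | 1876 | 211.0 | NaN |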


**3.2. Extracting Named Entities**

Let's extract the named entities from the corpus. Below is the description of every entity label:

| Entity | Description                                                                     |
|--------|---------------------------------------------------------------------------------|
| PER    | People, including fictional                                                     |
| ORG    | Companies, agencies, institutions, etc.                                         |
| LOC    | Non-GPE locations, mountain ranges, bodies of water                             |
| MISC   | Miscellaneous entities, e.g., events, nationalities, products, or works of art. |



In [68]:
df_analysis['Entities'] = df_analysis['Document'].apply(extract_entities)
df_analysis['Entities'].head(5)

0    [ORG, ORG, MISC, LOC, MISC, LOC, LOC, MISC, MI...
1    [ORG, LOC, MISC, MISC, MISC, PER, MISC, LOC, L...
2    [ORG, ORG, LOC, MISC, ORG, ORG, LOC, LOC, MISC...
3    [PER, LOC, ORG, LOC, MISC, MISC, PER, MISC, LO...
4    [LOC, LOC, MISC, MISC, PER, MISC, LOC, LOC, MI...
Name: Entities, dtype: object

Now we create a dataframe with each entity label and its number of occurrences per law.

In [69]:
ner_count_df = pd.DataFrame()
ner_count_df['Title'] = df_analysis['Title']
#Entities are in a list, so we join them so it's easier to count
ner_count_df['Entities'] = df_analysis['Entities'].apply(lambda x: ' '.join(x))
# We collect the unique entity labels across all documents (not only the first one)
ents = sorted({label for labels in df_analysis['Entities'] for label in labels})

# str.count treats each label as a pattern and counts its matches in the joined string
for ent in ents:
    ner_count_df[f'{ent}_Count'] = ner_count_df['Entities'].str.count(ent)

ner_count_df.drop(columns='Entities', inplace=True)
ner_count_df

|   | Title | LOC_Count | MISC_Count | ORG_Count | PER_Count |
|---|-------|-----------|------------|-----------|-----------|
| 0 | Ley Federal de Responsabilidad Ambiental | 119 | 260 | 63 | 69 |
| 1 | LEY DE AGUAS NACIONALES | 1101 | 1593 | 244 | 5268 |
| 2 | Ley de Desarrollo Rural Sustentable | 619 | 632 | 229 | 302 |
| 3 | Ley General de Pesca y Acuacultura Sustentables | 441 | 1139 | 200 | 2869 |
| 4 | Ley General de Vida Silvestre | 408 | 995 | 133 | 2939 |
| 5 | Ley de Vertimientos en las Zonas Marinas Mexic... | 106 | 226 | 34 | 115 |
| 6 | Ley General de Desarrollo Forestal Sustentable | 640 | 1021 | 95 | 2985 |
| 7 | Ley Federal del Mar | 192 | 132 | 55 | 56 |
| 8 | Ley General de Cambio Climático | 639 | 651 | 249 | 280 |
| 9 | Ley General para la Prevención y Gestión Integ... | 318 | 719 | 92 | 2212 |
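
Counting labels by searching a joined string works here because no label is a substring of another, but it depends on the list of labels being complete. As a hedged alternative, the sketch below counts labels per document with `collections.Counter` and builds the columns from the union of labels across all documents. The sample label lists and the `count_entity_labels` helper are made up for illustration.

```python
from collections import Counter

def count_entity_labels(label_lists: list) -> tuple:
    """Return (sorted labels, one {label_Count: n} row per document)."""
    counters = [Counter(labels) for labels in label_lists]
    # Union over every document, so a label missing from the first one is still counted
    all_labels = sorted(set().union(*counters))
    rows = [
        {f"{label}_Count": counter.get(label, 0) for label in all_labels}
        for counter in counters
    ]
    return all_labels, rows

# Made-up label lists: the first document contains no PER entity
sample_entities = [["ORG", "ORG", "MISC", "LOC"], ["PER", "LOC"]]
labels, ner_rows = count_entity_labels(sample_entities)
print(labels)
for row in ner_rows:
    print(row)
```

The resulting rows can be passed to `pd.DataFrame` to obtain a table with the same shape as the one above.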
