# Data Cleaning for CCA21 Project
# Working with El Corpus del Español

This notebook builds the text preprocessing pipeline for:

1. Topic Models
2. Dynamic Topic Models
3. Word2Vec
4. Diachronic Word Embeddings

`1` and `2` require the data in a different format from `3` and `4`. 
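
As a rough illustration of that difference, the sketch below assumes that the topic models take one flat list of tokens per document, while the embedding models take one token list per sentence. That split is an assumption, not something this notebook fixes, and the helper names `to_document_tokens` and `to_sentence_tokens` are hypothetical.

```python
import re

def to_document_tokens(text):
    """One flat list of lowercase tokens per document (topic-model style)."""
    return re.findall(r"\w+", text.lower())

def to_sentence_tokens(text):
    """One list of lowercase tokens per sentence (embedding style)."""
    # Sentences are split naively on periods, which the cleaning step keeps.
    sentences = [s for s in text.split(".") if s.strip()]
    return [re.findall(r"\w+", s.lower()) for s in sentences]

sample = "Quema 300 calorías. Jugar frisbee en la playa."
doc_tokens = to_document_tokens(sample)
sent_tokens = to_sentence_tokens(sample)
```

Here `doc_tokens` is a single list for the whole text, while `sent_tokens` holds one list per sentence.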

Notebook index:
1. Libraries
2. Helper functions
3. Pipeline
4. Test

# 1. Libraries 

In [1]:
import re
import zipfile
import os
import sys
import spacy
import pandas as pd

# 2. Helper functions

In [4]:
def loadcorpus(corpus_name, corpus_style="text"):
    '''
    Iterates through the zip files in the corpus folder and reads
    them, storing the contents in a dictionary that maps each file
    inside a zip archive to the list of its lines.
    
    Input:
        corpus_name (str): path to the folder that contains the corpus
        corpus_style (str): substring a zip file name must contain
                        to be loaded (default "text")

    Output:
        texts_raw (dict):
            key - name of the file inside the zip archive
            value - list of raw (bytes) lines read from that file
    '''
    texts_raw = {}
    for zip_name in os.listdir(corpus_name):
        if corpus_style in zip_name:
            print(zip_name)
            # each zip archive holds one or more text files
            with zipfile.ZipFile(os.path.join(corpus_name, zip_name)) as zfile:
                for member in zfile.namelist():
                    texts_raw[member] = []
                    with zfile.open(member) as f:
                        for line in f:
                            texts_raw[member].append(line)
    return texts_raw
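
To try the zip-reading logic without the real corpus, here is a minimal sketch that builds a tiny archive in memory and reads it back. The helper `read_zip_lines` and the archive contents are hypothetical; only the standard `zipfile` and `io` modules are used.

```python
import io
import zipfile

def read_zip_lines(zip_bytes):
    """Map each member name of a zip archive to its list of raw byte lines."""
    texts = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zfile:
        for member in zfile.namelist():
            with zfile.open(member) as f:
                texts[member] = [line for line in f]
    return texts

# Build a tiny in-memory archive (made-up contents) to exercise the reader.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("MX-B-0.txt", "1 primer texto\n2 segundo texto\n")

demo = read_zip_lines(buffer.getvalue())
```

As in `loadcorpus`, the lines come back as `bytes`, so they still need to be decoded before any text processing.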

In [5]:
def att_sources(corpus_name, source_name):
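    '''
    Returns the list of source (website) metadata lines stored in the
    source_name zip file
    
    Input:
        corpus_name (str): path to the folder that contains the corpus
        source_name (str): name of the zip file with the sources
    
    Output:
        list_source_name (list): raw (bytes) lines, one per source
    '''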
    zfile = zipfile.ZipFile(corpus_name + "/" + source_name)
    list_source_name = []

    for file in zfile.namelist():
        with zfile.open(file) as f:
            for line in f:
                list_source_name.append(line)

    return list_source_name

In [40]:
def clean_raw_text(raw_texts):
    '''
    Decodes the texts and removes some punctuation characters
    Characters removed: [¡!@#$:);,¿?&]
    Note that dots (.) are kept so that sentences can still be marked
    
    Input:
        raw_texts (list): raw (bytes) texts
    
    Output:
        clean_texts (list): list with the clean texts (str)
    '''
    clean_texts = []
    for text in raw_texts:
        try:
            text = text.decode("utf-8")
            text = re.sub('[¡!@#$:);,¿?&]', '', text)
            clean_texts.append(text)
        except AttributeError:
            print("ERROR CLEANING", "Text:")
            print(text)
            continue
        except UnicodeDecodeError:
            print("Unicode Error, Skip")
            continue
    return clean_texts
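
Because the dots are kept, a later step can split each clean text into sentences. The following is a minimal sketch of one way to do that, assuming a naive split on a period followed by whitespace; `split_sentences` is a hypothetical helper, not part of this notebook.

```python
import re

def split_sentences(clean_text):
    """Split on periods followed by whitespace and drop empty pieces."""
    parts = re.split(r"\.\s+", clean_text.strip())
    return [p.rstrip(".").strip() for p in parts if p.strip(". ")]

# Same character class that clean_raw_text removes.
raw = "Quema 300 calorías. Jugar frisbee, en la playa. ¡Es divertido!"
cleaned = re.sub('[¡!@#$:);,¿?&]', '', raw)
sentences = split_sentences(cleaned)
```

Since `?` and `!` are removed during cleaning, a sentence that ends with one of them is merged with the sentence that follows it under this approach.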

In [51]:
def dic_match_key_text(raw_dic_texts, max_num_loops, max_texts, max_onetext_length):
    '''
    Creates a dictionary that maps each text id to its text, so that
    texts can be matched with their sources
    
    Input:
        raw_dic_texts (dict):
            key - name of the file inside the zip archive
            value - list of raw (bytes) texts in that file

        max_num_loops (int): maximum number of keys (files) to loop over
        
        max_texts (int): stop looping over files once more than this
                        many texts have been collected
        
        max_onetext_length (int): texts with this many characters
                                 or more are not included

    Output:
        websites_text (dict):
            key - id that matches the text and the source
            value - (str) text
    '''
    websites_text = {}
    i=0
    
    for key in raw_dic_texts:
        i += 1  # count the keys processed so far

        if len(websites_text) > max_texts:
            break
        texts_for_key = clean_raw_text(raw_dic_texts[key])
        for one_text in texts_for_key:
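            if len(one_text) >= max_onetext_length:
                continue  # skip overly long texts
            try:
                # the first token of each line is the text id
                key_text = one_text.split()[0]
            except IndexError:
                continue  # blank line, no id to extract
            websites_text[key_text] = one_text[7:]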
        if i == max_num_loops:
            break
    return websites_text
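
The slice `one_text[7:]` assumes a fixed-width prefix (a six-digit id plus one space, as in `747390`), but longer ids such as `1616129` also appear in the corpus. A minimal sketch of a width-independent alternative follows; `split_id_and_text` is a hypothetical helper, and it assumes every line has the shape `<id> <text>`.

```python
def split_id_and_text(line):
    """Return (text_id, body) for a line shaped like '<id> <text...>'."""
    parts = line.split(maxsplit=1)
    if not parts:
        return None  # blank line: nothing to index
    text_id = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    return text_id, body

pair = split_id_and_text("747390 10 Actividades que sirven para bajar de peso")
long_id_pair = split_id_and_text("1616129 Blogroll Categorías Comentarios")
```

Splitting once on whitespace keeps the body intact regardless of how many digits the id has.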

In [58]:
def merge_text_sources(source_list, websites_text, span_df, max_num_loops):
    '''
    Merges the list of sources and the texts coming from those sources 
    into a pandas dataframe
    
    Input:
        source_list (list): list of raw (bytes) source lines
        websites_text (dict): 
            key (str) - id of the source
            value (str) - text 
        span_df (pandas df): pandas dataframe that has only the names of the columns
        max_num_loops (int): number of sources to process before breaking
                             the loop, to get smaller pandas dataframes
                             
    Output:
        span_df (pandas df): the input dataframe with one row per matched text
    '''
    i = 0
    # loops over the list of url sources, skipping the first three lines
    for website in source_list[3:]:
        i += 1
        try:
            textID, Number_of_words, Genre, Country, \
                Website, URL, Title = website.decode("utf-8").split("\t")
        except UnicodeDecodeError:
            continue
        try:
            span_df.loc[textID.strip()] = \
                        [Title.strip(), Genre.strip(), Country.strip(), 
                        Website.strip(), URL.strip(), Number_of_words.strip(),  
                        websites_text[textID.strip()]]
        except KeyError:
            continue
        if i == max_num_loops:
            break

    # return only after the loop has finished
    return span_df
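
Adding rows one at a time with `.loc` can get slow on large source lists. As a sketch of an alternative, the metadata lines can first be collected as plain dictionaries. It assumes the same seven tab-separated fields, in the same order, that `merge_text_sources` unpacks; `parse_source_line`, `build_records`, and the sample line are hypothetical.

```python
FIELDS = ["textID", "Number of words", "Genre", "Country",
          "Website", "URL", "Title"]

def parse_source_line(raw_line):
    """Decode one tab-separated metadata line; return None if malformed."""
    try:
        values = raw_line.decode("utf-8").split("\t")
    except UnicodeDecodeError:
        return None
    if len(values) != len(FIELDS):
        return None
    return {name: value.strip() for name, value in zip(FIELDS, values)}

def build_records(source_lines, websites_text):
    """Keep only the sources whose textID has a matching text."""
    records = []
    for raw_line in source_lines:
        meta = parse_source_line(raw_line)
        if meta is None or meta["textID"] not in websites_text:
            continue
        meta["Text"] = websites_text[meta["textID"]]
        records.append(meta)
    return records

# Made-up sample: one well-formed line and one malformed line.
lines = [
    b"389972\t294\tb\tEC\t0latitud.blogspot.com\thttp://example.org/a\tTitulo A\r\n",
    b"bad line\n",
]
records = build_records(lines, {"389972": "Páginas martes"})
```

The resulting list could then be turned into a dataframe in one step, for example with `pd.DataFrame(records).set_index("textID")`.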

# 3. Pipeline

In [None]:
raw_span['MX-B-0.txt'][1]

In [29]:
test1 = raw_span['MX-B-0.txt'][0]
test1_re = test1.decode("utf-8")
test1_re = re.sub('[¡!@#$:).;,¿?&]', '', test1_re)
#test1_re = re.sub("\d+", "", test1_re)
test1_re

'747390 10 Actividades que sirven para bajar de peso 1 0 Actividades que sirven para bajar de peso  Si quieres bajar de peso pero no te gusta hacer ejercicio  es momento que practiques algunas actividades para quemar calorías de forma divertida y sin que sientas el esfuerzo que realizas  De acuerdo con información publicada en mother nature network ( mnn   el número de calorías que el cuerpo elimina depende de factores como edad  peso  sexo y actividades  Nuestro organismo utiliza las calorías como energía para realizar todas sus funciones desde la digestión hasta generar pensamientos  sin embargo  para prevenir el sobrepeso  necesitamos realizar actividades vigorosas y divertidas   Quema 300 calorías  Las siguientes actividades te ayudarán a quemar hasta 300 calorías en sólo algunos minutos  Conócelas y aprovecha para practicar las en estas vacaciones de verano  1  Jugar frisbee  Lo ideal es que lo practiques durante 80 minutos  ya sea en la playa o en un día de campo  Además  te ayud

In [34]:
test1_re
first_word = test1_re.split()[0]
first_word

'747390'

# 4. Test

In [6]:
# loads corpus as a dictionary

raw_span = loadcorpus("data/SPAN")

text_EC-jss.zip
text_CU-rag.zip
text_MX-vzo.zip
text_AR-tez.zip
text_CR-jfy.zip
text_HN-paj.zip
text_GT-miv.zip
text_PR-epz.zip
text_CL-wts.zip
text_PE-tae.zip
text_PY-ukd.zip
text_UY-nde.zip
text_NI-exu.zip
text_DO-egn.zip
text_ES-sbo.zip
text_PA-qlz.zip
text_US-ufh.zip
text_SV-xkl.zip
text_CO-pem.zip
text_BO-teh.zip
text_VE-wsc.zip


In [7]:
# loads the url where the texts come from as a list

source_list = att_sources("data/SPAN", "span_sources.zip")

In [8]:
span_df = pd.DataFrame(columns=["Title", "Genre", "Country",
                                    "Website", "URL", "Number of words",
                                    "Text"])

In [52]:
websites_text = dic_match_key_text(raw_span, max_num_loops=10, max_texts=1000, max_onetext_length=10000)

In [59]:
span_df = merge_text_sources(source_list, websites_text, span_df, max_num_loops=10)

In [60]:
span_df.shape

Unnamed: 0,Title,Genre,Country,Website,URL,Number of words,Text
389972,descubrecuador: Las Cascadas Verdes y la Casca...,b,EC,0latitud.blogspot.com,http://0latitud.blogspot.com/2010/01/las-casca...,294,Páginas martes 19 de enero de 2010 A unas dos...
390030,Acoso textual: Vargas Llosa oye cantar el gall...,b,EC,acoso-textual.blogspot.com,http://acoso-textual.blogspot.com/2011/04/varg...,2114,( Fragmento de la obra Tardes de lluvia en el ...
390040,MANABI ES....ECUADOR: LA CASA DE LOS CACHOS,b,EC,actividadesculturalesmanabi.blogspot.com,http://actividadesculturalesmanabi.blogspot.co...,245,En nuestro recorrido semanal por el basto terr...
390051,hoja almendro muy beneficiosa para nuestro acu...,b,EC,acuariovalhallafish.blogspot.com,http://acuariovalhallafish.blogspot.com/2011/0...,1699,acuario valhalla fish pone a su disposicion di...
390052,ACUARIO VALHALLA FISH: carbon activo propiedad...,b,EC,acuariovalhallafish.blogspot.com,http://acuariovalhallafish.blogspot.com/2012/0...,1922,acuario valhalla fish pone a su disposición di...
...,...,...,...,...,...,...,...
1616129,Eloy Alfaro | AfroEcuatorianos,g,EC,afros.wordpress.com,http://afros.wordpress.com/historia/eloy-alfaro/,1351,Blogroll Categorías Comentarios ALFARO Y LOS ...
1616139,Ritos Funerarios | AfroEcuatorianos,g,EC,afros.wordpress.com,http://afros.wordpress.com/religiosidad-afroec...,510,Blogroll Categorías Comentarios El hecho de l...
1616149,Las 5 cosas de las que nos arrepentimos antes ...,g,EC,agustinsaga.com,http://agustinsaga.com/personas/las-5-cosas-de...,1001,Las 5 cosas de las que nos arrepentimos antes...
1616169,Democratizar la palabra - Alai,g,EC,alainet.org,http://alainet.org/publica/democom/,327,Democratizar la palabra Movimientos convergen...
