# Argentine Election Analysis

## Introduction
In this notebook I analyze a Spanish dataset set up during the [Argentine legislative election](https://en.wikipedia.org/wiki/Argentine_legislative_election,_2017) of 2017. 
This dataset contains the data of 9 facebook bots, crawled over a period of 16 days, following 45 sources.

__Note__: If you haven't done it already, go through the set up in the *README* of [this repo](https://github.com/rugantio/nlp_fbtrex/).

### Roadmap
Download dataset -> cast JSON to txt -> tokenization -> normalization -> phrase modeling -> topic mining -> burst the bubble -> word2vec algebra & predictive analysis

## Dataset
The dataset was prepared by the [__Facebook Tracking Exposed__](https://facebook.tracking.exposed/) project and can be retrieved in a convenient JSON format from the specific GitHub [__repo__](https://github.com/tracking-exposed/experiments-data/tree/master/silver).
There are two separate files that we'll try to breakdown:
* __fbtrex-data-\*.json__ - Contains all impressions relative to single users
* __semantic-entities.json__ - Contains all available metadata regarding posts

The text field of every posts is enclosed in *semantic-entities.json*, while I can use *fbtrex-data-\*.json* to correlate which user has visualized this content, thus providing an easy way to investigate the Facebook filter bubble.
Given a ready working environment, as explained is the *README* of this repo, just go ahead and download the files:

In [None]:
%%bash
#Download Argentine dataset in a data subdir
mkdir data && cd data
wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/fbtrex-data-1.json.zip
wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/semantic-entities.json.zip

__Note__: This commands are supposed to be executed in a bash environment, not in the notebook itself. The operation may fail due to permissions.

Extract the content from the zip archive:

In [None]:
%%bash
#Extract JSON from zipped archives
cd data
unzip fbtrex-data-1.json.zip
unzip semantic-entities.json.zip

__Note__: To try out this notebook I made a shorter version of the JSON, I highly recommend to do the same

## Data preprocessing


Now that we have the dataset in JSON format, we can use the [JSON Python library](https://docs.python.org/3/library/json.html) to decode its content and store it in a Python variable. The variable type depends on the actual content of the provided file, by [default](https://docs.python.org/3/library/json.html#json-to-py-table) a JSON object is decoded to a dict and an arrays to a list. The recommended approach for working with encoded text files, is to use the [codecs Python library](https://docs.python.org/3/library/codecs.html):

In [1]:
import codecs
import json

with codecs.open('data/semantic-entities.json',encoding='utf-8') as data_json:    
    data = json.load(data_json)

To print to stdout the content of the parsed JSON file just use [pprint](https://docs.python.org/3/library/pprint.html), the data pretty printer:

In [None]:
import pprint
pprint.pprint(data)

It's useful to check if the casting was performed correctly before proceding, the resulting decoded type can be inspected with:

In [3]:
print(type(data))

<class 'list'>


__Note__: If you are using Spyder IDE you can keep track of variable simply looking at the variable explorer window.

So the JSON is now a list. How many entities do we have?

In [4]:
data_len = len(data)
print('There are {} total elements to analyze.'.format(data_len+1))

There are 438 total elements to analyze.


Let's go deeper. We decoded the JSON to a list, but what kind of list is it? What happened to JSON objects?

In [None]:
for i in range(data_len):
    print(type(data[i]))

Of course, *data* is not a simple list, it's a nested list of dictionaries! Let's print the *dict_keys*:

In [6]:
for i in range(data_len):
    print(data[i].keys())

dict_keys(['_id', 'id', 'original', 'time', 'annotations', 'lang', 'langConfidence', 'text', 'url', 'timestamp', 'publicationTime'])
dict_keys(['_id', 'id', 'original', 'time', 'annotations', 'lang', 'langConfidence', 'text', 'url', 'timestamp', 'publicationTime'])
dict_keys(['_id', 'id', 'original', 'message', 'code', 'data', 'error', 'timestamp'])
dict_keys(['_id', 'id', 'original', 'message', 'code', 'data', 'error', 'timestamp'])
dict_keys(['_id', 'id', 'original', 'time', 'annotations', 'lang', 'langConfidence', 'text', 'url', 'timestamp', 'publicationTime'])
dict_keys(['_id', 'id', 'original', 'time', 'annotations', 'lang', 'langConfidence', 'text', 'url', 'timestamp', 'publicationTime'])
dict_keys(['_id', 'id', 'original', 'time', 'annotations', 'lang', 'langConfidence', 'text', 'url', 'timestamp', 'publicationTime'])
dict_keys(['_id', 'id', 'original', 'time', 'annotations', 'lang', 'langConfidence', 'text', 'url', 'timestamp', 'publicationTime'])
dict_keys(['_id', 'id', 'origi

This is interesting: in the provided dataset there are some entities that don't have a *text* field. So let's first take only the elements that have a text field and put them in a new non-nested list:

In [7]:
tex = []
for i in range(data_len):
    if 'text' in data[i]:
        tex.append(data[i]['text'])

This is better. We now have an actual working list. Again, how many entities do we have?

In [8]:
tex_len = len(tex)
print('There are actually {} text elements to analyze!'.format(tex_len+1))

There are actually 430 text elements to analyze!


This is good enough for now, later we can make a deeper analysis, associating each *text* key with its *id* key and its *time* key to correlate which user visualizes which entity and when.  

It's good practice to have a new txt file for every step in NLP processing. So let's create a new txt file populated with the *text keys* of the *tex list*, __one per line__. 

Since some of the text values are made of more than one paragraphs, we need to substitute linebreaks (newline character) with a space character. Some caution is needed because some paragraphs have a double linebreak.  

In [9]:
#Swap linebreaks with a space
for i in range(tex_len):
    tex[i] = tex[i].replace('\n\n','\n')
    tex[i] = tex[i].replace('\n',' ')

#Create new txt with text keys (one per line)
with codecs.open('data/text.txt','w',encoding='utf-8') as text:
    for i in range(tex_len):
        text.write('%s\n' % tex[i])

To view the file and check that everything was executed as it should you don't need another editor:

In [10]:
#Print the first 5000 characters
with codecs.open('data/text.txt',encoding='utf-8') as text:    
    print(text.read(5000))

Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle. Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macripara que lo ayude a sostener el emprendimiento. El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia. Allí, le prometió ayuda. Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana. Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio. Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo. No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamiento de adicciones, con asistentes sociales. Macri me dijo que lo hagamos.Los tiempos de Nación no so

Data preprocessing is over, we now have a txt ready to feed our NLP modules!
## Language processing with SpaCy

Text mining tasks have become incredibly easy thanks to [spaCy](http://alpha.spacy.io/), a NLP Python module which provides:
* Non-destructive tokenization
* Syntax-driven sentence segmentation
* Pre-trained word vectors
* Part-of-speech tagging
* Named entity recognition
* Labelled dependency parsing
* A built-in visualizer 

...and much more, all with just one function!

SpaCy also provides some already trained [models](https://alpha.spacy.io/models/) which you can use out-of-the-box to process different languages. SpaCy's core is written in pure C (via Cython), it's currently the [fastest](https://alpha.spacy.io/usage/facts-figures) parser available and makes [multithreading](https://explosion.ai/blog/multithreading-with-cython) profitable by virtue of Cython.

Follow the *README* of this repo and install the Spanish language model. Now import the model, open text.txt file and store content in a single string:

In [11]:
%%time 
import spacy

#Initialize SpaCy's pipeline
nlp = spacy.load('es_core_web_sm')

#Open text file and store content in a single str
with codecs.open('data/text.txt',encoding='utf-8') as text:
    full_txt = text.read()

CPU times: user 802 ms, sys: 400 ms, total: 1.2 s
Wall time: 2.51 s


Now that we have a processing pipeline, we can call a *nlp* instance as if it were a function on a string of text. This will produce a [Doc](https://alpha.spacy.io/api/doc) object, a special container that holds all linguistic annotations of the text fed in.

Let's first explore how SpaCy processes a single entity, before diving into the dataset:

In [12]:
%%time
#Snip single line of text
with codecs.open('data/text.txt',encoding='utf-8') as text:
    line_txt = text.readline()

#Standard way of processing text 
doc = nlp(line_txt)

CPU times: user 446 ms, sys: 744 ms, total: 1.19 s
Wall time: 601 ms


In [13]:
print(doc)

Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle. Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macripara que lo ayude a sostener el emprendimiento. El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia. Allí, le prometió ayuda. Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana. Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio. Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo. No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamiento de adicciones, con asistentes sociales. Macri me dijo que lo hagamos.Los tiempos de Nación no so

Looks exactly the same! But what happened under the hood? Have a look at how [spaCy's pipeline](https://alpha.spacy.io/usage/processing-pipelines) is made:

Text -> tokenizer -> tagger -> parser -> ner -> Doc

Text analysis is built in from bottom-up. The *tokenizer* creates a Doc data structure, breaking the text in tokens and storing their metadata in a tensor. The *tagger* takes these tokens (and their context) and uses the information to make predictions of the part-of-speech tags. The *parser* assigns dependency labels between tokens and segments text in sentences. The *ner*, named entity recognizer, detects and labels named entities.
### Sentence detection and segmentation

In [14]:
for num, sent in enumerate(doc.sents):
    print ('Sentence {}:'.format(num + 1),sent,end='\n')

Sentence 1: Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle.
Sentence 2: Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macripara que lo ayude a sostener el emprendimiento. 
Sentence 3: El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia.
Sentence 4: Allí, le prometió ayuda. 
Sentence 5: Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana.
Sentence 6: Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio.
Sentence 7: Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo.
Sentence 8: No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamient

### Part-of-speech (POS) tagging and grammar analysis
Using [Pandas](http://pandas.pydata.org/), Python Data Analysis library, we can have a clean table visualization.
- Text: The original word text.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag, with full morphology!
- Dep: Syntactic dependency, i.e. the relation between tokens.

In [15]:
import pandas as pd

token_text = [token.orth_ for token in doc]
token_pos = [token.pos_ for token in doc]
token_tag = [token.tag_ for token in doc]
token_dep = [token.dep_ for token in doc]

pd.DataFrame(list(zip(token_text,token_pos,token_tag,token_dep)), columns=['Text', 'POS','Tag','Dep'])

Unnamed: 0,Text,POS,Tag,Dep
0,Cerró,VERB,VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past...,ROOT
1,el,DET,DET__Definite=Def|Gender=Masc|Number=Sing|Pron...,det
2,comedor,NOUN,NOUN__Gender=Masc|Number=Sing,nsubj
3,cordobés,ADJ,ADJ__Number=Sing,amod
4,al,ADP,ADP__AdpType=Preppron|Gender=Masc|Number=Sing,case
5,que,PRON,PRON__PronType=Rel,obj
6,Macri,PROPN,PROPN___,nsubj
7,le,PRON,PRON__Case=Dat|Number=Sing|Person=3|PronType=Prs,obj
8,había,AUX,AUX__Mood=Ind|Number=Sing|Person=3|Tense=Imp|V...,aux
9,prometido,VERB,VERB__Gender=Masc|Number=Sing|Tense=Past|VerbF...,acl


### Navigating the parse tree
SpaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep\_.
- Text: The original token text.
- Dep: The syntactic relation connecting child to head.
- Head text: The original text of the token head.
- Head POS: The part-of-speech tag of the token head.
- Children: The immediate syntactic dependents of the token.

In [16]:
token_text = [token.text for token in doc]
token_head_pos = [token.head.pos_ for token in doc]
token_head_text = [token.head.text for token in doc]
token_dep = [token.dep_ for token in doc]
token_children = [[child for child in token.children] for token in doc]
pd.DataFrame(list(zip(token_text,token_dep,token_head_text,token_head_pos,token_children)), columns=['Text','Dep','Head text','Head POS','Children'])


Unnamed: 0,Text,Dep,Head text,Head POS,Children
0,Cerró,ROOT,Cerró,VERB,"[comedor, ytenía, .]"
1,el,det,comedor,NOUN,[]
2,comedor,nsubj,Cerró,VERB,"[el, cordobés]"
3,cordobés,amod,comedor,NOUN,[prometido]
4,al,case,que,PRON,[]
5,que,obj,prometido,VERB,[al]
6,Macri,nsubj,prometido,VERB,[]
7,le,obj,prometido,VERB,[]
8,había,aux,prometido,VERB,[]
9,prometido,acl,cordobés,ADJ,"[que, Macri, le, había, ayuda, Luis]"


### Named entity recognition (NER)

In [17]:
for num, ent in enumerate(doc.ents):
    print ('Entity {}:'.format(num + 1),ent,'-', ent.label_,end='\n')

Entity 1: Macri - PER
Entity 2: Luis Almadaes - PER
Entity 3: Desesperado - LOC
Entity 4: Mauricio Macripara - PER
Entity 5: Allí - PER
Entity 6: Yo Te Ayudo Amigo del Alma - MISC
Entity 7: Almada - LOC
Entity 8: La Naciónque - LOC
Entity 9: Puse - PER
Entity 10: No llamé al Presidente - MISC
Entity 11: Macri me dijo - MISC
Entity 12: Los tiempos de Nación - MISC
Entity 13: Puse - PER
Entity 14: No pedí nada para mí - MISC
Entity 15: Vino el Presidente - PER
Entity 16: Almada - LOC
Entity 17: Provincia - LOC
Entity 18: Municipalidad - LOC
Entity 19: Hace - ORG
Entity 20: Carolina Stanley - PER
Entity 21: Presidente - PER
Entity 22: Ahora - MISC
Entity 23: Almada - LOC
Entity 24: Cómo - MISC
Entity 25: No tengo más plata - MISC
Entity 26: Laburo - MISC
Entity 27: El día de la visita presidencial - MISC
Entity 28: Talleres - ORG
Entity 29: Sin embargo - MISC
Entity 30: Almada - LOC
Entity 31: Estado Nacional - LOC
Entity 32: Ayer - LOC
Entity 33: Ello nos obliga a seguirlos - MISC
Entity

### Visualization with displaCy
SpaCy has an integrated visualization library that can display the content in two styles: *dep* and *ent*.
The *dep* style shows the dependency between words using arcs, the *ent* style prints out the text with colored NER labels wrapped around words.

The method *.serve()* launches a local web server for visualization while the method *.render()* provides an image.

__Note__: Style *dep* is not working well in Spanish because *tag* is used instead of *POS* for annotating words, but the *tag* field is much larger than *POS* thus causing overlapping. 

__Note2__: Style *ent* can't be viewed in Github, but in Jupyter is great.

In [18]:
from spacy import displacy
#displacy.serve(doc, style='dep')
options = {'distance':425, 'arrow_spacing':6}
displacy.render(doc,style='dep', jupyter=True, options=options)

In [19]:
#displacy.serve(doc, style='dep')
displacy.render(doc,style='ent', jupyter=True)

### Text normalization: stemming, lemmatization and shape analysis
Let's now move on to single token analysis. *Normalization* is a way of processing text that involves changing the words to make them less unique. We talk about *stemming* when we take the words and we remove the end, producing a new word that is not in the language dictionary. *Lemmatization* takes inflected words as input and tries to give the root word as output, so in some way is similar to stemming, but it produces meaningful (actually existing) words. The token *shape* is the decapitalization char mask that gets applied to the original (orthodox) token.

In [20]:
token_lemma = [token.lemma_ for token in doc]
token_shape = [token.shape_ for token in doc]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,Cerró,cerró,Xxxxx
1,el,el,xx
2,comedor,comedor,xxxx
3,cordobés,cordobés,xxxx
4,al,al,xx
5,que,que,xxx
6,Macri,macri,Xxxxx
7,le,le,xx
8,había,había,xxxx
9,prometido,prometido,xxxx


Lemmatization is actually not supported for the Spanish language model. We still have some normalization, as seen from the shape mask applied to every word.

### Token-level entity analysis
The standard way to access entity annotations is the *doc.ents* property, but you can also access token entity annotations using the *token.ent_iob* and *token.ent_type* attributes; *token.ent_iob* indicates whether an entity starts, continues or ends on the tag.

IOB Scheme:
- *I* - Token is *inside* an entity.
- *O* - Token is *outside* an entity.
- *B* - Token is the *beginning* of an entity.

In [21]:
token_entity_type = [token.ent_type_ for token in doc]
token_entity_iob = [token.ent_iob_ for token in doc]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)), columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,Cerró,,O
1,el,,O
2,comedor,,O
3,cordobés,,O
4,al,,O
5,que,,O
6,Macri,PER,B
7,le,,O
8,había,,O
9,prometido,,O


### Token-level attributes
Other useful metadata is provided, such as the relative frequency of tokens, and whether or not a token matches any of these categories:
- stopword
- punctuation
- whitespace
- number
- url

and many more token [attributes](https://alpha.spacy.io/api/token#attributes)!


In [22]:
token_attributes = [(token.orth_,token.prob,token.is_stop,token.is_punct,token.is_space,token.like_num,token.like_url) for token in doc]

df = pd.DataFrame(token_attributes,columns=['text','log_probability','stop?','punctuation?','whitespace?','number?','url?'])

df.loc[:, 'stop?':'url?'] = (df.loc[:, 'stop?':'url?'].applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,url?
0,Cerró,-20.0,,,,,
1,el,-20.0,Yes,,,,
2,comedor,-20.0,,,,,
3,cordobés,-20.0,,,,,
4,al,-20.0,Yes,,,,
5,que,-20.0,Yes,,,,
6,Macri,-20.0,,,,,
7,le,-20.0,Yes,,,,
8,había,-20.0,Yes,,,,
9,prometido,-20.0,,,,,


The relative frequency is not stored in the model, but that's not important since we don't intend to rely on it anyways.
### Text normalization, lemmatization and sentence segmentation
Now that we have explored all that spaCy can do for us, we can use it to parse our *text.txt* and generate a new *parsed_text.txt* that has the same text, normalized, lemmatized and segmented in sentences. I decided not to remove stop words because they can be part of composite words, as explained in the phrase modeling section of this notebook. 

We first define a helper function that constructs a generator to loop over the *text.txt* and yield the review one-by-one. A generator is similar to an iterator but it can be used only once because its content is generated on the fly and not stored in memory, saving precious computation.

Then pass on the reviews to spaCy using the *.pipe()* method via a generator function to parse the reviews, lemmatize the text, and yield segmantized sentences. The standard way to initialize spaCy would be to call *nlp(text.txt)* on each review, but I will make use instead of the *.pipe()* method which allows efficient [multi-threading](https://spacy.io/docs/usage/processing-text#multithreading). Two [arguments](https://alpha.spacy.io/api/language#pipe) are given to *.pipe()*: *batch_size* is the number of texts to buffer and *n_threads* which is the number of worker threads to use (default is 2, if -1 OpenMP will decide how many to use at run time, like in our case). You can also pass a *disable* option to turn off some components of the pipeline that is not needed to further optimize the processing. Note that all processing algorithms are linear-time in the length of the string.

Finally, we write the sentences to a new txt file, *parsed_text.txt*.

In [24]:
%%time
#Helper function that yields all reviews via generator 
def get_review(filename):
    with codecs.open(filename, encoding='utf_8') as textfile:
        for review in textfile:
            yield review
            
#Generator function to parse reviews, lemmatize the text, and yield sentences
def lemmatize_corpus(filename):
    for parsed_review in nlp.pipe(get_review(filename),batch_size=10, n_threads=-1):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent if not (token.is_punct or token.is_space)])

with codecs.open('data/new_text.txt', 'w', encoding='utf_8') as f:
    for sent in lemmatize_corpus('data/text.txt'):
        f.write(sent + '\n')

CPU times: user 38.4 s, sys: 44.5 s, total: 1min 22s
Wall time: 24.8 s


## Phrase Modeling with Gensim


Phrase modeling is a form of text manipulation that consists in producing new one token words from two or more tokens. As we saw in named entity recognition, there are groups of words that represent things that have nothing to do with the single words themselves that make up the group. For example *New York* is supposed to be different in meaning from *New* and *York*. We would like to have these single token words joined together in a single word, with an underscore instead of a space. 

There is a very simple formula for measuring the co-occurrence of these composite words in the corpus, meaning the frequency these words appear together in sequence, compared to the frequency they appear alone. We will use gensim to achieve this task. 