# Argentine Election Analysis

## Introduction
In this notebook I analyze a Spanish-language dataset collected during the [Argentine legislative election](https://en.wikipedia.org/wiki/Argentine_legislative_election,_2017) of 2017.
This dataset contains the data of 9 Facebook bots, crawled over a period of 16 days, following 45 sources.

__Note__: If you haven't done so already, go through the setup in the *README* of [this repo](https://github.com/rugantio/nlp_fbtrex/).

## Dataset
The dataset was prepared by the [__Facebook Tracking Exposed__](https://facebook.tracking.exposed/) project and can be retrieved in a convenient JSON format from the specific GitHub [__repo__](https://github.com/tracking-exposed/experiments-data/tree/master/silver).
There are two separate files that we'll try to break down:
* __fbtrex-data-\*.json__ - Contains all impressions relative to single users
* __semantic-entities.json__ - Contains all available metadata regarding posts

The text field of every post is stored in *semantic-entities.json*, while I can use *fbtrex-data-\*.json* to work out which user has seen this content, thus providing an easy way to investigate the Facebook filter bubble.
Given a working environment, as explained in the *README* of this repo, just go ahead and download the files:

In [1]:
%%bash
#Download Argentine dataset in a data subdir
mkdir -p data && cd data
wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/fbtrex-data-1.json.zip
wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/semantic-entities.json.zip

--2017-11-03 22:56:43--  https://github.com/tracking-exposed/experiments-data/raw/master/silver/fbtrex-data-1.json.zip
Loaded CA certificate "/etc/ssl/certs/ca-certificates.crt"

Resolving github.com... 192.30.253.113, 192.30.253.112
Connecting to github.com|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tracking-exposed/experiments-data/master/silver/fbtrex-data-1.json.zip [following]
--2017-11-03 22:56:44--  https://raw.githubusercontent.com/tracking-exposed/experiments-data/master/silver/fbtrex-data-1.json.zip
Resolving raw.githubusercontent.com... 151.101.112.133
Connecting to raw.githubusercontent.com|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7873135 (7.5M) [application/zip]
Saving to: "fbtrex-data-1.json.zip"

     0K .......... .......... .......... .......... ..........  0%  375K 20s
    50K .......... .......... .....

__Note__: These commands are meant to be executed in a bash environment, not in the notebook itself. The operation may fail due to permissions.

Extract the content from the zip archive:

In [2]:
%%bash
#Extract JSON from zipped archives
cd data
unzip fbtrex-data-1.json.zip
unzip semantic-entities.json.zip
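
If a shell is not at hand, the extraction step can also be sketched in pure Python with the standard `zipfile` module. This is only a minimal sketch: the helper name `extract_archive` is hypothetical, and it assumes the archives were already downloaded into the *data* subdirectory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    target = Path(target_dir)
    # Create the target directory if it is missing (hypothetical convenience)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Usage, assuming the files were downloaded as shown above:
# extract_archive('data/fbtrex-data-1.json.zip', 'data')
# extract_archive('data/semantic-entities.json.zip', 'data')
```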

## Data preprocessing


Now that we have the dataset in JSON format, we can use the [JSON Python library](https://docs.python.org/3/library/json.html) to decode its content and store it in a Python variable. The variable type depends on the actual content of the provided file: by [default](https://docs.python.org/3/library/json.html#json-to-py-table) a JSON object is decoded to a dict and an array to a list. A common approach for working with encoded text files is to use the [codecs Python library](https://docs.python.org/3/library/codecs.html) (in Python 3 the built-in `open` accepts the same `encoding` argument):

In [5]:
import codecs
import json

with codecs.open('data/semantic-entities.json',encoding='utf-8') as data_json:    
    data = json.load(data_json)
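
To see the default decoding rules in isolation, here is a small sketch on an inline string rather than on the real file. The sample content is invented for illustration; only the *id* and *text* key names come from the dataset.

```python
import json

# A JSON array of objects: a tiny, made-up stand-in for the real file
sample = '[{"id": "a1", "text": "hola"}, {"id": "b2"}]'
decoded = json.loads(sample)

print(type(decoded))     # the top-level JSON array is decoded to a list
print(type(decoded[0]))  # each JSON object is decoded to a dict
```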

To print to stdout the content of the parsed JSON file just use [pprint](https://docs.python.org/3/library/pprint.html), the data pretty printer:

In [6]:
import pprint
pprint.pprint(data)

It's useful to check that the decoding was performed correctly before proceeding. The resulting type can be inspected with:

In [None]:
type(data)

__Note__: If you are using the Spyder IDE you can keep track of variables simply by looking at the variable explorer window.

So the JSON is now a list. How many entities do we have?

In [None]:
data_len = len(data)
print('There are {} total elements to analyze'.format(data_len))

Let's go deeper. We decoded the JSON to a list, but what kind of list is it? What happened to JSON objects?

In [8]:
for i in range(data_len):
    print(type(data[i]))

Of course, *data* is not a simple list: it's a list of dictionaries, one for each JSON object! Let's print the *dict_keys*:

In [None]:
for i in range(data_len):
    print(data[i].keys())
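
Printing the keys of every element gets long quickly. A more compact view, sketched here on an invented sample rather than on the real *data* list, is to count how many entries share each combination of keys with `collections.Counter`:

```python
from collections import Counter

# Hypothetical stand-in for the decoded list; values are invented
sample = [
    {'id': 'a1', 'time': '2017-10-20', 'text': 'hola'},
    {'id': 'b2', 'time': '2017-10-21'},
    {'id': 'c3', 'time': '2017-10-22', 'text': 'chau'},
]

# Count how many entries share each (sorted) combination of keys
key_sets = Counter(tuple(sorted(entry.keys())) for entry in sample)
for keys, count in key_sets.items():
    print(count, keys)
```

Entries without a *text* field then show up as a separate key combination with its own count.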

This is interesting: in the provided dataset there are some entities that don't have a *text* field. So let's first take only the elements that have a text field and put them in a new non-nested list:

In [None]:
tex = []
for i in range(data_len):
    if 'text' in data[i]:
        tex.append(data[i]['text'])

This is better. We now have an actual working list. Again, how many entities do we have?

In [None]:
tex_len = len(tex)
print('There are actually {} text elements to analyze'.format(tex_len))

This is good enough for now. Later we can make a deeper analysis, associating each *text* key with its *id* key and its *time* key to work out which user sees which entity and when.
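
As a rough preview of that later step, the sketch below keeps *id* and *time* next to each *text* instead of discarding them. The sample entries and their values are invented, and `dict.get` is used on the assumption that *id* or *time* might be missing from some entries.

```python
# Hypothetical stand-in for the decoded list; values are invented
sample = [
    {'id': 'a1', 'time': '2017-10-20T10:00:00Z', 'text': 'hola'},
    {'id': 'b2', 'time': '2017-10-21T11:30:00Z'},
]

# Keep id and time alongside each text so later steps can join on them
records = [
    (entry.get('id'), entry.get('time'), entry['text'])
    for entry in sample
    if 'text' in entry
]
print(records)
```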

It's good practice to have a new txt file for every step in NLP processing. So let's create a new txt file populated with the *text values* of the *tex list*, __one per line__.

Since some of the text values are made of more than one paragraph, we need to substitute line breaks (newline characters) with a space character. Some caution is needed because some paragraphs are separated by a double line break.

In [None]:
#Swap linebreaks with a space
for i in range(tex_len):
    tex[i] = tex[i].replace('\n\n','\n')
    tex[i] = tex[i].replace('\n',' ')

#Create new txt with text keys (one per line)
with codecs.open('data/text.txt','w',encoding='utf-8') as text:
    for i in range(tex_len):
        text.write('%s\n' % tex[i])
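
The two `replace` calls cover single and double line breaks, but a run of three or more would still leave consecutive spaces. A minimal alternative sketch, assuming it is acceptable to collapse every run of whitespace (spaces and tabs included) into one space; the helper name `flatten_whitespace` is hypothetical:

```python
def flatten_whitespace(value):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no argument splits on any whitespace run
    return ' '.join(value.split())


# Invented sample with a triple line break and a Windows-style line ending
sample = 'Primer párrafo.\n\n\nSegundo párrafo.\r\nTercer párrafo.'
print(flatten_whitespace(sample))
```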

To view the file and check that everything ran as it should, you don't need another editor:

In [None]:
#Print the first two lines
with codecs.open('data/text.txt',encoding='utf-8') as text:    
    print(text.readline())
    print(text.readline())
    
#Print the first 5000 characters
with codecs.open('data/text.txt',encoding='utf-8') as text:    
    print(text.read(5000))
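
For a reusable preview, `itertools.islice` can take the first few lines of any file-like object without reading the rest. This is a sketch only: the helper name `head` is hypothetical, and an in-memory `io.StringIO` stands in for *data/text.txt*.

```python
import io
from itertools import islice


def head(stream, n=2):
    """Return the first n lines of a file-like object, without trailing newlines."""
    return [line.rstrip('\n') for line in islice(stream, n)]


# In-memory stand-in for the text file, with invented content
sample = io.StringIO('uno\ndos\ntres\n')
print(head(sample))
```

With the real file, the same helper could be called on the object returned by `codecs.open('data/text.txt', encoding='utf-8')`.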

Data preprocessing is over: we now have a txt file ready to feed our NLP modules!

## Language processing

Text mining tasks have become incredibly easy thanks to [spaCy](http://alpha.spacy.io/), an NLP Python module which provides:
* Non-destructive tokenization
* Syntax-driven sentence segmentation
* Pre-trained word vectors
* Part-of-speech tagging
* Named entity recognition
* Labelled dependency parsing
* A built-in visualizer

...and much more, all with just one function call!

SpaCy also provides some pre-trained [models](https://alpha.spacy.io/models/) which you can use out of the box to process different languages. SpaCy's core is written in Cython, which compiles to C. At the time of writing, its own benchmarks report it as the [fastest](https://alpha.spacy.io/usage/facts-figures) parser available, and Cython is also what makes [multithreading](https://explosion.ai/blog/multithreading-with-cython) worthwhile.

Follow the *README* of this repo and install the Spanish language model.