# Argentine Election Analysis

## Introduction
In this notebook I analyze a Spanish-language dataset collected during the [Argentine legislative election](https://en.wikipedia.org/wiki/Argentine_legislative_election,_2017) of 2017.

The dataset contains the data gathered by 9 Facebook bots that followed 45 sources over a period of 16 days.

## Dataset
The dataset was prepared by the [__Facebook Tracking Exposed__](https://facebook.tracking.exposed/) project and can be retrieved in a convenient JSON format from the project's GitHub [__repo__](https://github.com/tracking-exposed/experiments-data/tree/master/silver).
There are two separate files that we'll try to break down:
* __fbtrex-data-\*.json__ - Contains all the impressions recorded for each user
* __semantic-entities.json__ - Contains all the available metadata about the posts

The text field of every post is stored in "semantic-entities.json", while "fbtrex-data-\*.json" can be used to work out which user has seen that content, thus providing an easy way to investigate the Facebook filter bubble.
Given a working environment, set up as explained in the README.md of this repo, just go ahead and download the files:

In [3]:
%%bash
# Download the Argentine dataset into a data subdirectory
mkdir -p data && cd data
# Use the /raw/ URLs: the /blob/ URLs return the GitHub HTML page, not the archive
wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/fbtrex-data-1.json.zip
wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/semantic-entities.json.zip

In [None]:
__Note__: These commands are meant to be executed in a bash environment rather than in the notebook itself. The operation may fail due to permissions.
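
When no bash environment is at hand, the same download could be sketched in Python with the standard library. This is only an illustration, not part of the original workflow: the helper name `download_all` is hypothetical, and the `/raw/` form of the GitHub URLs is an assumption about where the binary archives are served.

```python
import os
import urllib.request

# Assumption: the "/raw/" form of the repository URL serves the binary archives.
BASE_URL = "https://github.com/tracking-exposed/experiments-data/raw/master/silver/"
FILES = ["fbtrex-data-1.json.zip", "semantic-entities.json.zip"]


def download_all(dest_dir="data", fetch=urllib.request.urlretrieve):
    """Download every archive in FILES into dest_dir, skipping existing files."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the directory already exists
    paths = []
    for name in FILES:
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):
            fetch(BASE_URL + name, target)
        paths.append(target)
    return paths
```

Calling `download_all()` from a notebook cell would place both archives in `data/`; the `fetch` parameter exists only so that the download step can be swapped out when testing offline.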

Extract the contents of the zip archives:

In [2]:
%%bash
#Extract JSON from zipped archives
cd data
unzip fbtrex-data-1.json.zip
unzip semantic-entities.json.zip
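
The extraction step can also be done without `unzip`, using Python's `zipfile` module. The sketch below assumes the archives sit in a `data` directory as above; the helper name `extract_archives` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir="data"):
    """Extract every .zip file found in data_dir and return the member names."""
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)  # write the members next to the archive
            extracted.extend(zf.namelist())
    return extracted
```

The returned list of names gives a quick way to confirm which JSON files were actually unpacked.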

## Data preprocessing


Now that we have the dataset in JSON format, we can use the [JSON Python library](https://docs.python.org/3/library/json.html) to decode its content and store it in a Python variable. The type of that variable depends on the actual content of the file: by [default](https://docs.python.org/3/library/json.html#json-to-py-table), a JSON object is decoded to a dict and an array to a list. One way to read encoded text files is the [codecs Python library](https://docs.python.org/3/library/codecs.html); in Python 3 the built-in `open()` accepts the same `encoding` argument:

In [5]:
import codecs
import json
with codecs.open('data/semantic-entities.json', encoding='utf_8') as data_json:
    data = json.load(data_json)
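
The default JSON-to-Python mapping mentioned above can be checked on a couple of small inline strings. These are toy values made up for illustration, unrelated to the real dataset.

```python
import json

# Toy JSON strings, unrelated to the real dataset.
decoded_object = json.loads('{"text": "hola", "tags": ["a", "b"], "seen": true, "extra": null}')
decoded_array = json.loads('[{"id": 1}, {"id": 2}]')

print(type(decoded_object))  # a JSON object becomes a dict
print(type(decoded_array))   # a JSON array becomes a list
print(type(decoded_object["seen"]), decoded_object["extra"])  # true -> bool, null -> None
```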

To print the content of the parsed JSON file to stdout, use [pprint](https://docs.python.org/3/library/pprint.html), the data pretty printer:

In [6]:
import pprint
pprint.pprint(data)
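
Printing the whole dataset can flood the notebook output. A possible alternative is to preview only a slice and limit the nesting depth with the `depth` argument of `pprint`. The `sample` list and its keys below are hypothetical stand-ins for the decoded data.

```python
import pprint

# Toy stand-in for the decoded dataset; the keys are hypothetical.
sample = [
    {"id": 1, "text": "primer texto", "meta": {"lang": "es", "tags": ["a", "b"]}},
    {"id": 2, "meta": {"lang": "es", "tags": []}},
    {"id": 3, "text": "tercer texto", "meta": {"lang": "es", "tags": ["c"]}},
]

# Show only the first two entries and collapse anything nested deeper than two levels.
preview = pprint.pformat(sample[:2], depth=2, width=80)
print(preview)
```

With the real data, `pprint.pprint(data[:2], depth=2)` would give the same kind of compact preview.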

It's useful to check that the decoding was performed correctly before proceeding. The resulting type can be inspected with:

In [None]:
type(data)

So the JSON is now a list. How many entities do we have?

In [None]:
data_len = len(data)
print('There are {} total elements to analyze'.format(data_len))

Let's go deeper. We decoded the JSON to a list, but what kind of list is it? What happened to JSON objects?

In [8]:
for item in data:
    print(type(item))
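
Printing one line per element gets long quickly. As a sketch, the same check can be condensed into a single tally with `collections.Counter`; `sample` is a toy stand-in for the decoded list.

```python
from collections import Counter

# Toy stand-in for the decoded dataset.
sample = [{"id": 1}, {"id": 2, "text": "hola"}, {"id": 3}]

# Count how many elements there are of each Python type.
type_counts = Counter(type(item).__name__ for item in sample)
print(type_counts)
```

If every element is a dictionary, the counter holds a single `dict` entry whose count equals the length of the list.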

It's a list of dictionaries! Let's print the keys of each one:

In [None]:
for item in data:
    print(item.keys())
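
Instead of reading through every set of keys, one option is to count how often each key occurs, which makes missing fields easy to spot. The entries and key names in `sample` are hypothetical.

```python
from collections import Counter

# Toy stand-in for the decoded dataset; the keys are hypothetical.
sample = [
    {"id": 1, "text": "hola"},
    {"id": 2},
    {"id": 3, "text": "chau"},
]

# Iterating over a dict yields its keys, so this tallies every key across all entries.
key_counts = Counter(key for item in sample for key in item)
missing_text = len(sample) - key_counts["text"]

print(key_counts)
print("entries without a text field:", missing_text)
```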

This is interesting: some entities in the provided dataset don't have a "__text__" field. So let's first take only the elements that have a text field and put their text in a new flat list:

In [None]:
# Keep the text of every entity that has a 'text' field
tex = [item['text'] for item in data if 'text' in item]
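
A possible variation, sketched here on toy data, keeps the whole dictionary for each entry so that the other metadata stays attached to its text. Note that, unlike the filter above, `item.get("text")` also drops entries whose text is an empty string; whether that is desirable for this dataset is an assumption.

```python
# Toy stand-in for the decoded dataset; the keys are hypothetical.
sample = [
    {"id": 1, "text": "hola"},
    {"id": 2},
    {"id": 3, "text": ""},
]

# item.get("text") is None when the key is missing and "" when the text is empty;
# both are falsy, so only entries with non-empty text are kept.
with_text = [item for item in sample if item.get("text")]
texts = [item["text"] for item in with_text]

print(len(with_text), "entries kept out of", len(sample))
```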

This is better. We now have a flat list of strings. Again, how many entities do we have?

In [None]:
tex_len = len(tex)
print('There are actually {} text elements to analyze'.format(tex_len))