# NLP with Python for Facebook Mining

## Introduction
In this review I present a non-exhaustive list of tools for analyzing a text corpus using Python.

### Requirements (tools and libs)
* The code uses __python3__. It can be backported to python2 if needed, but that's up to you.
* This notebook can be viewed on GitHub and easily cloned, although to run the code you will need to install [__jupyter__](https://jupyter.org/).
* The standard Python scientific stack is needed. You can install it through the [conda](https://conda.io/) package manager, through your distribution's package manager, or using __pip__, Python's own package manager.
The modules used are [__Pandas__](http://pandas.pydata.org/) for data analysis and visualization, and [__scikit-learn__](http://scikit-learn.org/) for comparative machine learning.
* NLP libraries: mainly [__spaCy__](https://spacy.io/) and [__gensim__](https://radimrehurek.com/gensim/). In the future I may try Stanford's [__CoreNLP__](https://stanfordnlp.github.io/CoreNLP/). I will avoid using the mainstream [__NLTK__](https://www.nltk.org/) for now.

### Datasets
The datasets are generously provided by the [__Facebook Tracking Exposed__](https://facebook.tracking.exposed/) project and can be easily retrieved from their GitHub [repo](https://github.com/tracking-exposed/experiments-data/). In this specific notebook we will use the Argentinian Election dataset (in Spanish), which contains two JSON files crawled by 9 bots over the course of more than two weeks!

To learn how to make bots that crawl Facebook and build your own dataset, refer to the main fbtrex project [backend](https://github.com/tracking-exposed/facebook) (UPDATE: I'm currently working on an fbtrex guide for researchers).
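
Before any NLP, it helps to take a quick look at what a downloaded JSON file contains. The following is only a minimal sketch using the standard library: the file name is hypothetical (use the actual files from the cloned repo), and it assumes the top-level JSON value is a list of objects, which may not match the real layout.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and report how many records it holds and their keys.

    Assumes the top-level value is a list of objects; any other shape
    is reported by type name only.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)

    if isinstance(data, list):
        keys = set()
        for record in data:
            if isinstance(record, dict):
                keys.update(record)  # collect every key seen in any record
        return {"type": "list", "records": len(data), "keys": sorted(keys)}
    return {"type": type(data).__name__, "records": None, "keys": []}


if __name__ == "__main__":
    # Hypothetical file name: replace it with a file from the cloned repo.
    example = Path("experiments-data/argentina.json")
    if example.exists():
        print(summarize_json(example))
    else:
        print(f"{example} not found; clone the repo and adjust the path.")
```

Once the keys are known, the same file can be handed to Pandas for the actual analysis.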

### Setting up
Working in a virtual environment is recommended. Python makes venvs easy to create and manage, which lets you keep a clean global environment while still running a bleeding-edge development branch.
##### Virtual Environment
Python 3.3+ comes with a module called [__venv__](https://docs.python.org/3/library/venv.html). For applications that require an older version of Python, [virtualenv](https://virtualenv.pypa.io/en/stable/) must be used.

__Note__: Run pip install from your console rather than from the notebook, for permission reasons. In fact, none of the bash snippets are meant to be executed inside this notebook.


In [12]:
%%bash
#Create an nlp virtual environment for efficient sandboxing
python -m venv nlp

Running this command creates an nlp directory inside this notebook's directory and places a pyvenv.cfg file in it, with a home key pointing to the Python installation from which the command was run. It also creates a bin subdirectory containing a copy of (or a symlink to) the python binary, and an initially empty lib/pythonX.Y/site-packages subdirectory, where modules installed with pip will end up.
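
The same layout can be inspected from Python itself. This is a minimal sketch using the standard library venv module: it builds a throwaway environment in a temporary directory (not the nlp one above) and reads back pyvenv.cfg. The helper name create_env is only illustrative.

```python
import configparser
import tempfile
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment and return the parsed pyvenv.cfg."""
    # with_pip=False keeps the sketch fast; "python -m venv" bootstraps pip by default.
    venv.create(env_dir, with_pip=False)

    # pyvenv.cfg has no section header, so prepend a dummy one for configparser.
    cfg_text = (Path(env_dir) / "pyvenv.cfg").read_text(encoding="utf-8")
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("[venv]\n" + cfg_text)
    return dict(parser["venv"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "nlp"
        config = create_env(env_dir)
        print("home =", config["home"])
        # Top-level entries of the new environment (bin, lib, pyvenv.cfg, ...)
        print(sorted(entry.name for entry in env_dir.iterdir()))
```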

In [13]:
%%bash
#Activate nlp venv
source nlp/bin/activate

__Note__: You don't specifically need to activate an environment; activation just prepends the virtual environment's binary directory to your path, so that "python" invokes the virtual environment's Python interpreter and you can run installed scripts without having to use their full path. However, all scripts installed in a virtual environment should be runnable without activating it, and run with the virtual environment's Python automatically.
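
To confirm which interpreter a notebook kernel or script is actually using, a small check can help. This is a sketch that assumes an environment created with venv on Python 3.3+; the function name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Running this with nlp/bin/python directly, without activating anything, should report the venv interpreter.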

##### Install deps
As mentioned, most of these tools can be installed globally via your Linux distribution's package manager. If not, now that you are in the venv you can use pip to install the individual modules.

In [15]:
%%bash
#Install minimal scientific python stack
pip install pandas scikit-learn
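
After installing, it can be worth checking that the modules are visible to the interpreter you are running. The sketch below uses only the standard library; missing_modules is an illustrative helper, not part of any package.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names, not PyPI names: scikit-learn is imported as "sklearn".
    required = ["pandas", "sklearn"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```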

##### Install NLP modules
Let's begin this analysis using only __spaCy__ for data analysis. I suggest trying the [__alpha__](http://alpha.spacy.io/) version, which is provided via the [__spacy-nightly__](https://pypi.python.org/pypi/spacy-nightly) module. Basic and API documentation can be found on the alpha spaCy [subdomain](https://alpha.spacy.io/usage/). For the soon-to-be-legacy documentation, refer to the [main domain](https://spacy.io/docs/usage/).

In [17]:
%%bash
#Install the alpha spacy2.0 instead of 1.9
pip install spacy-nightly

Once spaCy is installed, we can move on to downloading the spaCy language models. In this review we will only use the __Spanish__ model, [es_core_web_sm](https://alpha.spacy.io/models/es), for text mining purposes. I will also load the __English__ model, [en_core_web_sm](https://alpha.spacy.io/models/en), to show off spaCy's wonderful skills.

In [2]:
%%bash
#Download Spanish and English models
python -m spacy download es_core_web_sm
python -m spacy download en_core_web_sm
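
Once the download finishes, a quick way to confirm that a model is usable is to load it and tokenize a sentence. This is a hedged sketch: load_model is an illustrative helper, the sample sentences are made up, and the model names are the ones used in this notebook, which may differ in other spaCy releases.

```python
def load_model(name):
    """Return a loaded spaCy pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy  # imported lazily so the check also works without spaCy
    except ImportError:
        return None
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the model package cannot be found.
        return None


if __name__ == "__main__":
    # Made-up sample sentences, one per model.
    samples = {
        "es_core_web_sm": "Las elecciones se celebran en octubre.",
        "en_core_web_sm": "The elections take place in October.",
    }
    for model_name, sentence in samples.items():
        nlp = load_model(model_name)
        if nlp is None:
            print(f"{model_name}: not available, run the download step first")
            continue
        doc = nlp(sentence)
        print(model_name, [token.text for token in doc])
```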