## Twitter account classification

We're going to try to guess someone's political views based on their tweets, more specifically on the hashtags and URLs they use.

To do so, we extracted tweets from US representatives of the Democratic and Republican parties. For each tweet, we have the list of hashtags and URLs. As a first step, we will process these tweets.

The second step will be to transform the text features into a numerical representation that the machine can work with. We will use TF-IDF encoding here (term frequency-inverse document frequency).

And finally, we're going to use a Support Vector Machine (SVM) to classify our data.

You will see that these steps can be enough to get very good classification performance on a real-world dataset.

In [1]:
import json
import csv
import pandas as pd
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

## 1. Load the data and preprocess

Load the dataset using `with open(...)` (see the [doc](https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python)).

Create a variable called `tweets_raw` as an empty list. You will then iterate over the lines of the text files and add the data we want.

You have to write two successive `for` loops: one for the Democrat file and another for the Republican file. Here is what you could do in each loop:

- Once the file is open, iterate over its lines
- Decode each line using `json.loads(line)`
- For each line, append to `tweets_raw` a list containing the label (0 for Democrat or 1 for Republican), the value of the `account` field in the decoded line and the value of the `hashtags` field in the decoded line.
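The steps above can be sketched as follows. This is only one possible way to do it: the helper `load_tweets`, the file names and the two sample records are made up for illustration, and the helper is called twice instead of writing the two loops out by hand.

```python
import json
import os
import tempfile


def load_tweets(path, label):
    """Read one JSON object per line and return [label, account, hashtags] rows."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            rows.append([label, record["account"], record["hashtags"]])
    return rows


# Tiny stand-in files so the sketch runs on its own (hypothetical names and content).
tmp_dir = tempfile.mkdtemp()
dem_path = os.path.join(tmp_dir, "democrats.json")
rep_path = os.path.join(tmp_dir, "republicans.json")
with open(dem_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"account": "RepDarrenSoto", "hashtags": ["DQAwards"]}) + "\n")
    f.write(json.dumps({"account": "RepDarrenSoto", "hashtags": []}) + "\n")
with open(rep_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"account": "SomeRepublican", "hashtags": ["taxreform"]}) + "\n")

# Label 0 for the Democrat file, 1 for the Republican file.
tweets_raw = load_tweets(dem_path, 0) + load_tweets(rep_path, 1)
print(tweets_raw)
```

With the real files, only the two paths passed to `load_tweets` would change.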

In [2]:
# Your code here


To check that you have the right format, here are the first few lines you should have in `tweets_raw`:

`[[0, 'RepDarrenSoto', ['DQAwards']],
 [0, 'RepDarrenSoto', ['FlashbackFriday']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['Dreamers']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],`
 
In addition, the length of `tweets_raw` should be 244503.

In [5]:
tweets_raw

[[0, 'RepDarrenSoto', ['DQAwards']],
 [0, 'RepDarrenSoto', ['FlashbackFriday']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['Dreamers']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['PuertoRico']],
 [0, 'RepDarrenSoto', ['TaxScamBill', 'Sayfie']],
 [0, 'RepDarrenSoto', ['NowOrNeverglades']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['windenergy', 'taxreform']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['Everglades']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['RussiaTrump', 'sayfie']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['GOPTaxScam']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['GOPTaxScam', 'PuertoRico', 'sayfie']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['Sayfie']],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', []],
 [0, 'RepDarrenSoto', ['TheHillNewsmak

In [6]:
len(tweets_raw)

244503

If this looks good, convert this list into a DataFrame with the following column names: `"label", "account", "hashtags"`.
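A minimal sketch of that conversion, using three rows copied from the expected output above. The names `rows` and `tweets_df` are chosen for this example only; in the notebook the result is apparently stored back into `tweets_raw`.

```python
import pandas as pd

rows = [
    [0, "RepDarrenSoto", ["DQAwards"]],
    [0, "RepDarrenSoto", ["FlashbackFriday"]],
    [0, "RepDarrenSoto", []],
]

# Each inner list becomes one row; `columns` names the three positions.
tweets_df = pd.DataFrame(rows, columns=["label", "account", "hashtags"])
print(tweets_df.head())
```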

In [7]:
# Your code here


In [9]:
tweets_raw.head()

|   | label | account | hashtags |
|---|-------|---------|----------|
| 0 | 0 | RepDarrenSoto | [DQAwards] |
| 1 | 0 | RepDarrenSoto | [FlashbackFriday] |
| 2 | 0 | RepDarrenSoto | [] |
| 3 | 0 | RepDarrenSoto | [Dreamers] |
| 4 | 0 | RepDarrenSoto | [] |


Now, create a variable `tweets`. It is a version of `tweets_raw` that concatenates the hashtags of each account, so that there is one row per account. Remember that the `sum` of two lists, like the sum of two strings, is their concatenation.
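One way this could look, assuming `tweets_raw` is already a DataFrame whose `hashtags` column holds Python lists. The four rows below are toy data (the second account name is invented), and grouping on both `label` and `account` is a choice made here so that the label survives the aggregation.

```python
import pandas as pd

tweets_raw = pd.DataFrame(
    [
        [0, "RepDarrenSoto", ["DQAwards"]],
        [0, "RepDarrenSoto", []],
        [0, "RepDarrenSoto", ["Dreamers"]],
        [1, "SomeRepublican", ["taxreform"]],
    ],
    columns=["label", "account", "hashtags"],
)

# Summing lists concatenates them, so each account ends up with one long list.
tweets = (
    tweets_raw.groupby(["label", "account"])["hashtags"]
    .sum()
    .reset_index()
)
print(tweets)
```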

In [10]:
# Your code here


Here is what you should get running `tweets.head()`:

<img src="output2.png">

Now, do some cleaning:

- Create a new column named `hashtags_cleaned` containing the hashtags from the `hashtags` column in lowercase
- Create a new column named `document` converting the lists from `hashtags_cleaned` to a string with words separated by a space

For instance, the first row would be `hanukkah netneutrality stockmanvtrump netneutr...`.
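The two cleaning steps could be sketched like this. The two-row DataFrame is toy data with invented account names; it only assumes that `hashtags` holds lists of strings.

```python
import pandas as pd

tweets = pd.DataFrame(
    {
        "label": [0, 1],
        "account": ["AccountA", "AccountB"],
        "hashtags": [["Hanukkah", "NetNeutrality"], ["TaxReform", "taxreform"]],
    }
)

# Lowercase every hashtag inside each list.
tweets["hashtags_cleaned"] = tweets["hashtags"].apply(
    lambda tags: [tag.lower() for tag in tags]
)

# Join each cleaned list into a single space-separated string.
tweets["document"] = tweets["hashtags_cleaned"].apply(lambda tags: " ".join(tags))
print(tweets[["account", "document"]])
```

Lowercasing before joining means `TaxReform` and `taxreform` end up as the same word for the vectorizer.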


In [12]:
# Your code here


Then, try to look at the most common hashtags. What are the top 3 for democrats and republicans?
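A possible approach uses the `Counter` imported at the top of the notebook. The helper `top_hashtags` and the four toy rows are assumptions made for this sketch, so the counts shown say nothing about the real dataset.

```python
from collections import Counter

import pandas as pd

tweets = pd.DataFrame(
    {
        "label": [0, 0, 1, 1],
        "hashtags_cleaned": [
            ["goptaxscam", "dreamers"],
            ["goptaxscam", "puertorico"],
            ["taxreform", "windenergy"],
            ["taxreform"],
        ],
    }
)


def top_hashtags(df, label, n=3):
    """Count hashtags over all accounts with the given label; return the n most common."""
    counts = Counter()
    for tags in df.loc[df["label"] == label, "hashtags_cleaned"]:
        counts.update(tags)
    return counts.most_common(n)


top_dem = top_hashtags(tweets, 0)
top_rep = top_hashtags(tweets, 1)
print("Democrats:", top_dem)
print("Republicans:", top_rep)
```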


In [14]:
# Your code here


## 2. Vectorizer

To use our data as input to machine learning algorithms, we need to convert the text into numbers. We're going to use the scikit-learn utility class [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which is already imported at the top of the notebook.

Try to use these parameters:

- `min_df`: eliminate document-specific words by removing words that appear in too few documents (let's say fewer than 2%)
- `max_df`: eliminate dataset-specific stop words by removing words that appear in too many documents (let's say more than 50%)

If needed, have a look at the documentation to see how to use these.
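As a rough illustration of these two parameters, here is a sketch on a four-document toy corpus (the documents are invented). Float values are read as proportions of documents, while integer values would be absolute counts. On a corpus this small `min_df=0.02` removes nothing, so only the effect of `max_df` is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "goptaxscam dreamers sayfie",
    "goptaxscam puertorico sayfie",
    "taxreform sayfie",
    "taxreform windenergy sayfie",
]

# Floats are proportions of documents: keep words present in at least 2%
# and at most 50% of the documents.
vectorizer = TfidfVectorizer(min_df=0.02, max_df=0.5)
X = vectorizer.fit_transform(documents)

# "sayfie" appears in every document, so max_df=0.5 drops it.
print(sorted(vectorizer.vocabulary_))
print(X.shape)
```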

In [18]:
# Your code here


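Calling `get_feature_names_out()` on your vectorizer (`get_feature_names()` in scikit-learn releases older than 1.0) gives you the list of the words that were kept. Each word corresponds to one feature, i.e. one column of the encoded matrix, whose value is the weight of that word in the document (0 when the word is absent).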

Get the number of features you're working on.


In [55]:
# Your code here


## 3. Train test split

Now that we are satisfied with the representation of our documents, we're going to split the dataset into training and testing sets. Use the `train_test_split` function on our new representation of the accounts. In addition, look at the shape of your dataset to check your split.
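A small sketch of the split, on eight invented documents. The `test_size`, `stratify` and `random_state` values are illustrative choices, not values prescribed by this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

documents = [
    "goptaxscam dreamers",
    "goptaxscam netneutrality",
    "goptaxscam puertorico",
    "goptaxscam dreamers netneutrality",
    "taxreform jobs",
    "taxreform economy",
    "taxreform veterans",
    "taxreform jobs economy",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = TfidfVectorizer().fit_transform(documents)

# stratify keeps the same proportion of each label in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)
print(X_train.shape, X_test.shape)
```

Both subsets keep the same number of columns; only the number of rows changes.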

In [57]:
# Your code here


## 4. Parameters optimization with Grid Search

Now that you have your training and test sets, you will use an SVM to classify the accounts. Use the grid search algorithm to find a good combination of parameters. You should search over the following parameters: `C`, `kernel`.

You can also try the convenient `Pipeline` from scikit-learn: look at the documentation to find out how it works! With a pipeline, you can also tune the TF-IDF parameters, such as `min_df`, in the same search.
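Here is a minimal sketch of a pipeline combined with a grid search. The step names `tfidf` and `svm`, the grid values, `cv=2` and the toy documents are all assumptions for illustration; on the real data you would pass the raw `document` strings of the training set and a larger number of folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

documents = [
    "goptaxscam dreamers",
    "goptaxscam netneutrality",
    "goptaxscam puertorico",
    "goptaxscam dreamers netneutrality",
    "taxreform jobs",
    "taxreform economy",
    "taxreform veterans",
    "taxreform jobs economy",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "tfidf__min_df": [1, 2],
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(documents, labels)
print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every fold, so the validation part never influences the vocabulary or the IDF weights.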

In [None]:
# Your code here


You should be able to reach more than 95% accuracy!

## 5. Train with parameters and test

Now that you have your best estimator, you can either directly use it, or re-train a final classifier over your whole training set. 

**Re-create a final pipeline and train it over the whole training set, this time with the best hyperparameters you got from your grid search**

In [78]:
# Your code here


And finally, re-evaluate your final accuracy on the testing set.
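The refit and the evaluation could be sketched as below. The `best_params` dictionary is a hypothetical placeholder for whatever your own grid search returned, and the documents are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

train_docs = [
    "goptaxscam dreamers",
    "goptaxscam netneutrality",
    "goptaxscam puertorico",
    "taxreform jobs",
    "taxreform economy",
    "taxreform veterans",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = ["goptaxscam dreamers netneutrality", "taxreform jobs economy"]
test_labels = [0, 1]

# Hypothetical values: replace them with the best parameters of your own search.
best_params = {"tfidf__min_df": 1, "svm__C": 1, "svm__kernel": "linear"}

final_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])
final_pipeline.set_params(**best_params)
final_pipeline.fit(train_docs, train_labels)

# Score on documents the model has never seen.
predictions = final_pipeline.predict(test_docs)
accuracy = accuracy_score(test_labels, predictions)
print(accuracy)
```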

In [29]:
# Your code here


NB: This problem is very easy, and you're testing your accuracy on very few samples, hence the overwhelmingly good performance on this problem.

## 6. Get most important features from new document

As a reminder, every observation in our testing set represents a Twitter profile, described by all of its hashtags.

To better understand and explain our classification model, we can try to extract the most important features according to the tf-idf metric.

**First, extract one sample of the testing set you're gonna be working with**

In [31]:
# Your code here


**Second, you can transform this document into its tf-idf encoding**

In [33]:
# Your code here


This matrix is a sparse matrix, so you can't directly read its entries in a convenient way.

**You can use its coordinate version by calling [tocoo](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csc_matrix.tocoo.html)**
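To illustrate what the coordinate format gives you, here is a sketch on a three-document toy corpus (invented data). A COO matrix exposes three parallel arrays, `row`, `col` and `data`, with one entry per non-zero value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "goptaxscam dreamers goptaxscam",
    "taxreform windenergy",
    "dreamers puertorico",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(documents)

sample = documents[0]
encoded = vectorizer.transform([sample])  # sparse matrix with a single row

coo = encoded.tocoo()
print(coo.col)   # column index of each non-zero entry
print(coo.data)  # tf-idf value of each non-zero entry
```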

In [35]:
# Your code here


**Now you can order the entries of the matrix based on their tf-idf values**

- Hint: use `zip` function to merge the original column index & data value in a tuple
- Hint 2: use `sorted` function with a lambda function as a key to sort

In [76]:
# Your code here


Now that we have our sorted tf-idf entries, we can see which ones rank highest and which hashtags they correspond to.

**Using your sorted items and their corresponding column indexes, get the top 10 hashtags that have the most impact on the document representation**
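Putting the two hints and this last step together, a sketch might look like the following. The corpus is toy data, and the names `sorted_items`, `index_to_term` and `top_hashtags` are chosen for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "goptaxscam dreamers goptaxscam",
    "taxreform windenergy",
    "dreamers puertorico",
]
vectorizer = TfidfVectorizer().fit(documents)
coo = vectorizer.transform([documents[0]]).tocoo()

# Pair each column index with its tf-idf value, highest value first.
sorted_items = sorted(
    zip(coo.col, coo.data), key=lambda item: item[1], reverse=True
)

# vocabulary_ maps word -> column index; invert it to go from index to word.
index_to_term = {int(index): term for term, index in vectorizer.vocabulary_.items()}

top_n = 10
top_hashtags = [
    (index_to_term[int(col)], round(float(score), 3))
    for col, score in sorted_items[:top_n]
]
for term, score in top_hashtags:
    print(term, score)
```

Here `goptaxscam` comes first because it appears twice in the sample and in no other document, which raises both its term frequency and its inverse document frequency.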

In [77]:
# Your code here


You can now print the top keywords for this document

In [None]:
# Your code here
