# Intro to NLP Lab

In this lab, you'll classify randomly selected tweets from political officials as either partisan or neutral. The `read_csv` call below loads only the two columns needed here (the label and the tweet text), but the dataset may contain other useful features. Feel free to explore.

In [1]:
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_union, make_pipeline

import string

In [2]:
# Load only the label (`bias`) and tweet text columns, selected by position.
df = pd.read_csv('datasets/political_media.csv',
                 usecols=[7, 20])
df.head()

Unnamed: 0,bias,text
0,partisan,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,partisan,VIDEO - #Obamacare: Full of Higher Costs and ...
2,neutral,Please join me today in remembering our fallen...
3,neutral,RT @SenatorLeahy: 1st step toward Senate debat...
4,partisan,.@amazon delivery #drones show need to update ...


## Set up

Please convert the `bias` column (the target) into 0s and 1s, then split the dataset into a training set and a test set.

In [3]:
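# Encode the target: 1 = partisan, 0 = neutral.
df['bias'] = (df['bias'] == 'partisan').astype(int)

# Default split: 75% train, 25% test (no fixed random_state, so results vary per run).
X_train, X_test, y_train, y_test = train_test_split(df['text'].values,
                                                    df['bias'].values)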

## Modeling

Please try the following techniques to transform the data. For each technique, do the following:

1. Transform the training data
2. Fit a `RandomForestClassifier` to the transformed training data
3. Transform the test data
4. Evaluate the model on the test data and discuss its performance, using a classification report and a confusion matrix
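Since the four steps above repeat for every technique, one option is to wrap them in a small helper. The sketch below is only an illustration: `evaluate_vectorizer` is a hypothetical name, the fixed `random_state` is an assumption added for reproducibility, and the toy corpus is made up and is not the lab data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix


def evaluate_vectorizer(vectorizer, X_train, X_test, y_train, y_test,
                        random_state=0):
    """Hypothetical helper: run steps 1-4 for one vectorizer."""
    # Step 1: learn the vocabulary on the training text only, then transform it.
    X_train_vec = vectorizer.fit_transform(X_train)

    # Step 2: fit the classifier on the transformed training data.
    rf = RandomForestClassifier(random_state=random_state)
    rf.fit(X_train_vec, y_train)

    # Step 3: transform the test text with the already-fitted vocabulary.
    X_test_vec = vectorizer.transform(X_test)

    # Step 4: report accuracy, confusion matrix, and per-class metrics.
    predictions = rf.predict(X_test_vec)
    print(rf.score(X_test_vec, y_test))
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))
    return rf, predictions


# Tiny made-up corpus, only to show the call; not the lab data.
toy_train = ["vote against this terrible bill", "join me at the town hall",
             "the other party is failing you", "happy veterans day to all"]
toy_test = ["this bill is terrible", "happy to join the town hall"]
model, toy_predictions = evaluate_vectorizer(
    CountVectorizer(), toy_train, toy_test, [1, 0, 1, 0], [1, 0])
```

With the lab data, the call would look like `evaluate_vectorizer(CountVectorizer(), X_train, X_test, y_train, y_test)`. Fitting the vectorizer on the training text only keeps test-set vocabulary from leaking into the model.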

### 1. `CountVectorizer()`

In [4]:
cv = CountVectorizer()
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_cv, y_train)
print(rf.score(X_test_cv, y_test))
predictions = rf.predict(X_test_cv)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.7176
[[848  47]
 [306  49]]
             precision    recall  f1-score   support

          0       0.73      0.95      0.83       895
          1       0.51      0.14      0.22       355

avg / total       0.67      0.72      0.65      1250
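To help with the discussion step, the class-1 (partisan) figures in the report can be recomputed by hand from the confusion matrix printed above. This is a small worked check, not part of the original lab; it relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[848, 47],
               [306, 49]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of tweets predicted partisan, the share that were
recall = tp / (tp + fn)      # of truly partisan tweets, the share that were found
baseline = cm[0].sum() / cm.sum()  # accuracy of always predicting neutral (0)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} baseline={baseline:.4f}")
# accuracy=0.7176 precision=0.51 recall=0.14 baseline=0.7160
```

The accuracy of 0.7176 is barely above the 0.716 obtained by always predicting neutral, and the recall of 0.14 shows that most partisan tweets are missed. Accuracy alone is therefore a misleading summary for this imbalanced dataset.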



### 2. `CountVectorizer()` with your choice of `min_df` and `max_df`

In [5]:
cv = CountVectorizer(min_df=0.10, max_df=0.90)
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_cv, y_train)
print(rf.score(X_test_cv, y_test))
predictions = rf.predict(X_test_cv)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.6808
[[796  99]
 [300  55]]
             precision    recall  f1-score   support

          0       0.73      0.89      0.80       895
          1       0.36      0.15      0.22       355

avg / total       0.62      0.68      0.63      1250
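To see what `min_df` and `max_df` actually do, it can help to compare vocabularies on a tiny corpus. The sketch below uses a made-up four-document corpus and arbitrary thresholds chosen for illustration. When given as floats, both parameters are proportions of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up four-document corpus, only for illustration.
docs = ["tax cuts now", "tax reform now", "tax bill vote", "health care vote"]

full = CountVectorizer().fit(docs)
# Keep terms appearing in at least 50% and at most 70% of the documents.
trimmed = CountVectorizer(min_df=0.5, max_df=0.7).fit(docs)

print(sorted(full.vocabulary_))
# ['bill', 'care', 'cuts', 'health', 'now', 'reform', 'tax', 'vote']
print(sorted(trimmed.vocabulary_))
# ['now', 'vote']
```

Here "tax" is dropped by `max_df` (it appears in 3 of 4 documents) and the single-occurrence words are dropped by `min_df`. In the lab run above, `min_df=0.10` keeps only words that occur in at least 10% of the tweets, which for short texts is likely a very small vocabulary and may help explain the lower accuracy. Checking `len(cv.vocabulary_)` after fitting would confirm this.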



### 3. `CountVectorizer()` with English stop words

In [6]:
cv = CountVectorizer(stop_words='english')
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_cv, y_train)
print(rf.score(X_test_cv, y_test))
predictions = rf.predict(X_test_cv)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.7104
[[815  80]
 [282  73]]
             precision    recall  f1-score   support

          0       0.74      0.91      0.82       895
          1       0.48      0.21      0.29       355

avg / total       0.67      0.71      0.67      1250



### 4. `TfidfVectorizer()` 

In [7]:
tfidf = TfidfVectorizer()
tfidf.fit(X_train)

X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_tfidf, y_train)
print(rf.score(X_test_tfidf, y_test))
predictions = rf.predict(X_test_tfidf)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.7272
[[851  44]
 [297  58]]
             precision    recall  f1-score   support

          0       0.74      0.95      0.83       895
          1       0.57      0.16      0.25       355

avg / total       0.69      0.73      0.67      1250



### 5. `TfidfVectorizer()` with English stop words

In [8]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(X_train)

X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_tfidf, y_train)
print(rf.score(X_test_tfidf, y_test))
predictions = rf.predict(X_test_tfidf)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.716
[[822  73]
 [282  73]]
             precision    recall  f1-score   support

          0       0.74      0.92      0.82       895
          1       0.50      0.21      0.29       355

avg / total       0.68      0.72      0.67      1250



### Moving forward

With the remainder of your time, try to find the best model and data transformation for predicting partisan tweets. This is a challenging dataset and can be approached in a number of ways.

Some techniques to try are:

1. Different types of data transformation 
2. Custom preprocessors for `CountVectorizer`
3. Custom stopword lists
4. Use of a dimensionality reduction technique (like `TruncatedSVD`)
5. Optimizing hyperparameters using `GridSearchCV`
6. Trying a different modeling technique such as `KNeighborsClassifier` or `LogisticRegression`
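As one possible starting point, several of the ideas in the list above can be combined in a single pipeline: TF-IDF features, `TruncatedSVD` for dimensionality reduction, `LogisticRegression` as the model, and `GridSearchCV` for tuning. The sketch below is an assumption-laden illustration, not a recommended solution: the parameter grid, `n_components=2`, `cv=2`, the `f1` scoring choice, and the toy tweets are all made up so that the block runs on its own.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=2, random_state=0),  # tiny value for the toy data
    LogisticRegression(max_iter=1000),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],
    'logisticregression__C': [0.1, 1.0, 10.0],
}

# Score on F1 for the partisan class rather than accuracy (an assumption).
search = GridSearchCV(pipe, param_grid, cv=2, scoring='f1')

# Made-up toy tweets (1 = partisan, 0 = neutral); not the lab data.
toy_text = [
    "the other party wrecked our economy",
    "join me at the county fair saturday",
    "their reckless bill will raise taxes",
    "congratulations to the local graduates",
    "democrats keep blocking real reform",
    "thank you veterans for your service",
    "republicans refuse to fund schools",
    "office hours are open this friday",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

search.fit(toy_text, toy_labels)
print(search.best_params_)
```

On the lab data, the equivalent call would be `search.fit(X_train, y_train)`, with a larger `n_components` (for example 100) and more folds. Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the cross-validation scores are not inflated by vocabulary from the held-out fold.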