# NLP Fun
In this notebook I'll be practicing NLP techniques. The goal for the provided dataset is to classify randomly selected tweets from political officials as either partisan or neutral.

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

%matplotlib inline

### Import and transform

In [17]:
df = pd.read_csv('datasets/political_media.csv')
print(df.shape)
df.head()

(5000, 21)


Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,audience,audience:confidence,bias,bias:confidence,message,...,orig__golden,audience_gold,bias_gold,bioid,embed,id,label,message_gold,source,text
0,766192484,False,finalized,1,8/4/15 21:17,national,1.0,partisan,1.0,policy,...,,,,R000596,"<blockquote class=""twitter-tweet"" width=""450"">...",3.83249e+17,From: Trey Radel (Representative from Florida),,twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,766192485,False,finalized,1,8/4/15 21:20,national,1.0,partisan,1.0,attack,...,,,,M000355,"<blockquote class=""twitter-tweet"" width=""450"">...",3.11208e+17,From: Mitch McConnell (Senator from Kentucky),,twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,766192486,False,finalized,1,8/4/15 21:14,national,1.0,neutral,1.0,support,...,,,,S001180,"<blockquote class=""twitter-tweet"" width=""450"">...",3.39069e+17,From: Kurt Schrader (Representative from Oregon),,twitter,Please join me today in remembering our fallen...
3,766192487,False,finalized,1,8/4/15 21:08,national,1.0,neutral,1.0,policy,...,,,,C000880,"<blockquote class=""twitter-tweet"" width=""450"">...",2.98528e+17,From: Michael Crapo (Senator from Idaho),,twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,766192488,False,finalized,1,8/4/15 21:26,national,1.0,partisan,1.0,policy,...,,,,U000038,"<blockquote class=""twitter-tweet"" width=""450"">...",4.07643e+17,From: Mark Udall (Senator from Colorado),,twitter,.@amazon delivery #drones show need to update ...


For this project I'm really just concerned with the bias and text columns for predicting the outcomes.

In [18]:
# Redefine df
df = pd.read_csv('datasets/political_media.csv',usecols=[7,20])
df.head()

Unnamed: 0,bias,text
0,partisan,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,partisan,VIDEO - #Obamacare: Full of Higher Costs and ...
2,neutral,Please join me today in remembering our fallen...
3,neutral,RT @SenatorLeahy: 1st step toward Senate debat...
4,partisan,.@amazon delivery #drones show need to update ...


In [19]:
# Binarize the bias column: 1 = partisan, 0 = neutral
df['bias'] = df['bias'].apply(lambda x: 1 if x == 'partisan' else 0)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(df['text'].values,
                                                    df['bias'].values,
                                                    random_state=42)

In [20]:
# Share of partisan tweets in the training set
# (a model that always predicts neutral would be right 1 minus this share of the time)
np.mean(y_train)

0.27226666666666666
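
Since most tweets are neutral, the accuracy to beat is the one you get by always predicting the majority class. Here is a minimal sketch of that arithmetic, assuming a hypothetical label array `y_toy` with roughly the same class balance as `y_train` (it is not the real data):

```python
import numpy as np

# Hypothetical labels with roughly the class balance seen above (1 = partisan)
y_toy = np.array([1] * 27 + [0] * 73)

# Share of partisan labels, same calculation as np.mean(y_train)
partisan_share = y_toy.mean()

# Accuracy of a model that always predicts the majority class
majority_baseline = max(partisan_share, 1 - partisan_share)

print(partisan_share, majority_baseline)
```

Running the same two lines on `y_train` and `y_test` would give the baselines for the real splits.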

### Modeling Time
I'll try a couple of different techniques here, starting with CountVectorizer. We'll be using my favorite model, the Random Forest Classifier, to make predictions on this dataset and compare how the way we set up the sparse matrix affects accuracy.

#### CountVectorizer()
CountVectorizer takes a collection of documents and splits them up into one column per word, with each row holding the count of that word in the corresponding document.
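
To make that concrete, here is a small sketch on a made-up two-document corpus (the sentences are hypothetical, not taken from the tweet dataset). It shows the learned vocabulary and the resulting count matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the tweet dataset
docs = ["vote for the bill", "the bill is the worst bill"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocab)
print(counts.toarray())  # one row per document, one column per word
```

Each row is a document and each column a word, so the second row records that "bill" and "the" both appear twice in the second sentence.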

In [21]:
cv = CountVectorizer()
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rfc = RandomForestClassifier()
rfc.fit(X_train_cv, y_train)

predictions = rfc.predict(X_test_cv)

print('Test prediction score: ', rfc.score(X_test_cv, y_test))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Test prediction score:  0.7648
[[918  42]
 [252  38]]
             precision    recall  f1-score   support

          0       0.78      0.96      0.86       960
          1       0.47      0.13      0.21       290

avg / total       0.71      0.76      0.71      1250



Our model classified 96% of the neutral tweets correctly, which is pretty awesome, but it only caught 13% of the partisan tweets. Its partisan precision of 47% is still well above the 27% partisan share we saw in the training set. Accuracy is a different story: 960 of the 1,250 test tweets are neutral, so always guessing neutral would score 76.8%, a hair above our 76.48%.
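
The numbers in the classification report can be recomputed by hand from the confusion matrix printed above. A short sketch, using scikit-learn's convention that rows are true labels and columns are predictions:

```python
import numpy as np

# Confusion matrix from the output above: rows = true label, columns = prediction
cm = np.array([[918, 42],
               [252, 38]])
tn, fp, fn, tp = cm.ravel()

recall_neutral = tn / (tn + fp)      # 918 / 960
recall_partisan = tp / (tp + fn)     # 38 / 290
precision_partisan = tp / (tp + fp)  # 38 / 80
accuracy = (tn + tp) / cm.sum()      # 956 / 1250

print(recall_neutral, recall_partisan, precision_partisan, accuracy)
```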

Let's take a peek at what the sparse matrix looks like.

In [23]:
print('X_train shape: ', X_train_cv.shape)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names_out()).head()

X_train shape:  (3750, 15366)


Unnamed: 0,00,000,000th,0017a43b2370,001a4bcf6878,00am,00kq8qqlaa,00pm,01,010914,...,űşm,űşneil,űşre,űşs,űşshistorymonth,űşt,űşve,űť,űťimproper,űź
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Overall a pretty massive matrix, which is expected when count vectorizing words in NLP.

Let's run the model once more, this time setting the min_df and max_df parameters. These 2 parameters ignore terms that have a document frequency strictly lower (min_df) or higher (max_df) than the given threshold.
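
As a quick illustration of how the two thresholds prune the vocabulary, here is a sketch on a hypothetical four-document corpus. When the values are floats they are read as proportions of documents; integers would be read as absolute document counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of four documents
docs = ["tax cut now", "tax plan vote", "tax vote today", "tax reform vote"]

# Keep terms that appear in at least 50% and at most 90% of the documents
toy_cv = CountVectorizer(min_df=0.5, max_df=0.9)
toy_cv.fit(docs)

kept = sorted(toy_cv.vocabulary_)
print(kept)
```

"tax" is in every document, so max_df drops it. The words that appear only once fall below min_df. Only "vote", which is in three of the four documents, survives.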

In [24]:
cv = CountVectorizer(min_df=0.10, max_df=0.90)
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rfc = RandomForestClassifier()
rfc.fit(X_train_cv, y_train)

predictions = rfc.predict(X_test_cv)

print('Test prediction score: ', rfc.score(X_test_cv, y_test))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Test prediction score:  0.7176
[[841 119]
 [234  56]]
             precision    recall  f1-score   support

          0       0.78      0.88      0.83       960
          1       0.32      0.19      0.24       290

avg / total       0.68      0.72      0.69      1250



So these results are a bit interesting. By tuning the parameters, our model got a small boost in catching partisan tweets: partisan recall rose from 13% to 19%. There was a tradeoff, however: neutral recall fell from 96% to 88%, partisan precision dropped from 47% to 32%, and overall accuracy slipped to 71.76%.

Still, with only 19% of the partisan tweets classified correctly, I think we can make more adjustments to give our model a good boost. Let's try removing stop words next!

Some words are very common and provide little information about the text content, so we can remove them with the stop_words parameter.
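
Here is a minimal sketch of the effect, using a made-up sentence (hypothetical, not from the dataset) and scikit-learn's built-in English stop word list:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentence
docs = ["this is the plan for the vote"]

# Vocabulary with and without the built-in English stop word list
with_stop = sorted(CountVectorizer().fit(docs).vocabulary_)
without_stop = sorted(CountVectorizer(stop_words='english').fit(docs).vocabulary_)

print(with_stop)
print(without_stop)
```

Only the content words remain in the second vocabulary, which shrinks the matrix the model has to learn from.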

In [25]:
cv = CountVectorizer(stop_words='english')
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rfc = RandomForestClassifier()
rfc.fit(X_train_cv, y_train)

predictions = rfc.predict(X_test_cv)

print('Test prediction score: ', rfc.score(X_test_cv, y_test))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Test prediction score:  0.7432
[[860 100]
 [221  69]]
             precision    recall  f1-score   support

          0       0.80      0.90      0.84       960
          1       0.41      0.24      0.30       290

avg / total       0.71      0.74      0.72      1250



From an overall accuracy standpoint this model performed slightly worse than our no-parameter count vectorizer model (74.32% vs. 76.48%), but partisan recall nearly doubled, from 13% to 24%. Choosing between the 2 models comes down to which error you deem more costly: type 1 errors (neutral tweets flagged as partisan) or type 2 errors (partisan tweets missed)?
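
To put numbers on that tradeoff, the error counts can be pulled straight out of the two confusion matrices printed above. A small sketch, treating partisan as the positive class (the dictionary keys are just labels chosen for this example):

```python
import numpy as np

# Confusion matrices from the outputs above (rows = true label, columns = prediction)
models = {
    "default CountVectorizer": np.array([[918, 42], [252, 38]]),
    "stop_words='english'": np.array([[860, 100], [221, 69]]),
}

errors = {}
for name, cm in models.items():
    tn, fp, fn, tp = cm.ravel()
    # Type 1 = false positive (neutral tweet flagged as partisan)
    # Type 2 = false negative (partisan tweet missed)
    errors[name] = {"type_1": int(fp), "type_2": int(fn)}
    print(name, errors[name])
```

Removing stop words trades 58 extra type 1 errors for 31 fewer type 2 errors.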

#### TfidfVectorizer()

Let's take a step away from count vectorizer and try to classify these tweets through TF-IDF, or term frequency-inverse document frequency. That's fancy jargon for weighting words that occur a lot in one document but don't occur as often in other documents.

Breaking it down further, term frequency is the frequency of a certain term in a document. Inverse document frequency is the log of the total number of documents in the corpus divided by the number of documents that contain that term, so rarer terms get a larger weight. The TF-IDF weight of a term is the product of the two.

These can be defined as,
$$ tf(t, d) = \frac{N_{\text{occurrences of term t in document d}}}{N_{\text{terms in document d}}} $$

$$ idf(t, D) = \log\left(\frac{N_{\text{documents}}}{N_{\text{documents that contain term t}}}\right) $$
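
The two formulas are easy to work through by hand. Below is a minimal sketch in plain Python on a hypothetical, pre-tokenized corpus. The helper names `tf`, `idf` and `tfidf` are made up for this example. Note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will not match this textbook version exactly.

```python
import math

# Hypothetical toy corpus, already split into tokens
corpus = [
    ["tax", "cut"],
    ["tax", "vote"],
    ["tax", "plan"],
    ["jobs", "report"],
]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing the term);
    # assumes the term appears in at least one document
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_tax = tfidf("tax", corpus[0], corpus)
score_cut = tfidf("cut", corpus[0], corpus)
print(score_tax, score_cut)
```

Both words make up half of the first document, but "cut" appears in only one document while "tax" appears in three, so "cut" ends up with the larger weight.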

Enough talking, let's model!

In [26]:
tvec = TfidfVectorizer(stop_words='english')
# We'll use stop words because.. it just makes too much sense.
tvec.fit(X_train)

X_train_tvec = tvec.transform(X_train)
X_test_tvec = tvec.transform(X_test)

rfc = RandomForestClassifier()
rfc.fit(X_train_tvec, y_train)
print(rfc.score(X_test_tvec, y_test))
predictions = rfc.predict(X_test_tvec)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.764
[[890  70]
 [225  65]]
             precision    recall  f1-score   support

          0       0.80      0.93      0.86       960
          1       0.48      0.22      0.31       290

avg / total       0.72      0.76      0.73      1250



Our initial model with CountVectorizer yielded an accuracy of 76.48%, nearly identical to the 76.4% of the TF-IDF model. The precision scores for both neutral and partisan were slightly higher than in the original model (0.80 vs. 0.78 and 0.48 vs. 0.47), so when this model predicted neutral or partisan it was correct slightly more often. Partisan recall also rose from 13% to 22%.

## Moving Forward

Honestly, all of our models were very similar in accuracy scores. Choosing between them would be a question of whether you deem type 1 or type 2 errors more important. As next steps, several things I could try are custom stop word lists, a dimensionality reduction technique (like TruncatedSVD), hyperparameter optimization (maybe grid search), and different classification models like logistic regression or K nearest neighbors.
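
As a rough sketch of how a few of those next steps might fit together, the block below chains TfidfVectorizer, TruncatedSVD and logistic regression in a Pipeline and tunes two hyperparameters with GridSearchCV. Everything here is an assumption for illustration: the tiny `texts`/`labels` lists are made up and stand in for `X_train`/`y_train`, and the grid values are arbitrary, not tuned recommendations.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train (1 = partisan)
texts = [
    "repeal this disastrous law now",
    "their reckless plan hurts families",
    "the other party failed again",
    "stop their job killing agenda",
    "join me at the town hall",
    "happy birthday to our veterans",
    "office hours are open today",
    "congratulations to the local team",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorize, reduce dimensionality, then classify
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(random_state=42)),
    ("clf", LogisticRegression()),
])

# Arbitrary illustrative grid; step names prefix the parameter names
param_grid = {
    "svd__n_components": [2, 3],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the real data the grid would need far larger `n_components` values, and since the classes are imbalanced a `scoring` argument such as "f1" may be a better target than the default accuracy.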