![Cover photo][1]

## Naive Bayes Trump/Hillary tweet classifier ##

By: Ignacio Chavarria ([@ignacio_chr][2])

Ever wonder which candidate you tweet like? Follow these steps to find out:

 1. Fork this notebook
 2. Replace the text string in the last cell with one of your own tweets
 3. Run notebook!

Model test set accuracy score: 93%

Credits: 

 - Spam filter: http://radimrehurek.com/data_science_python/
 - Cover photo: http://www.wsj.com


  [1]: http://si.wsj.net/public/resources/images/OG-AH736_Twitte_G_20160718163322.jpg
  [2]: http://www.twitter.com/ignacio_chr

----------

**Part 1:** 

 - Importing dataset and libraries
 - Cleaning and exploring the data
 - Setting up dataframe for prediction model

In [None]:
import re
import csv
import calendar
from datetime import datetime
from math import ceil

import numpy as np
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob

import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
# grid_search, cross_validation and learning_curve were merged into
# sklearn.model_selection (scikit-learn 0.18+; the old modules were removed in 0.20)
from sklearn.model_selection import (GridSearchCV, StratifiedKFold, cross_val_score,
                                     train_test_split, learning_curve)

In [None]:
df1 = pd.read_csv('../input/tweets.csv', encoding="utf-8")
df1 = df1[['handle','text','is_retweet']]

df = df1.loc[df1['is_retweet'] == False]
df = df.copy().reset_index(drop=True)

Create and apply function with RegEx to filter and extract **mentions** from tweets:

In [None]:
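def all_mentions(tw):
    # Raw string avoids an invalid-escape warning; '@' needs no backslash.
    # The class has no digits, so a handle like @user123 is captured as @user.
    return re.findall(r'@[A-Za-z_]+', tw)  # empty list when nothing matches

df['top_mentions'] = df['text'].apply(all_mentions)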

mention_list_trump = []
mention_list_clinton = []
for n in range(len(df['top_mentions'])):
    if df['handle'][n] == 'realDonaldTrump':
        mention_list_trump += df['top_mentions'][n]
    elif df['handle'][n] == 'HillaryClinton':
        mention_list_clinton += df['top_mentions'][n]
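
To see what the mention pattern matches, here is a minimal standalone sketch on a made-up tweet (the sample text and handles are assumptions, not taken from the dataset). It also shows that the character class has no digits, so a handle containing numbers is cut short.

```python
import re

# Hypothetical sample tweet, for illustration only
sample = "Thank you @foxandfriends and @user_123 for the kind words!"

# Same character class as all_mentions: letters and underscore only
mentions = re.findall(r'@[A-Za-z_]+', sample)
print(mentions)  # ['@foxandfriends', '@user_']

# Adding 0-9 to the class keeps the digits of a handle
mentions_with_digits = re.findall(r'@[A-Za-z0-9_]+', sample)
print(mentions_with_digits)  # ['@foxandfriends', '@user_123']
```

Whether to include digits is a design choice: it changes which handles are counted as the same account.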

Graph **mentions** most used by candidates:

In [None]:
data1 = Series(mention_list_trump).value_counts().head(n=7)
sns.set_style("white")
plt.figure(figsize=(12, 2))
sns.barplot(x=data1, y=data1.index, orient='h', palette="Reds_r").set_title("Trump's most used mentions")

data2 = Series(mention_list_clinton).value_counts().head(n=7)
sns.set_style("white")
plt.figure(figsize=(12, 2))
sns.barplot(x=data2, y=data2.index, orient='h', palette="Blues_r").set_title("Clinton's most used mentions")
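
`value_counts().head(n=7)` keeps the seven most frequent items. The same tally can be sketched with the standard library's `collections.Counter`; the mention list below is invented for illustration and only stands in for `mention_list_trump`.

```python
from collections import Counter

# Hypothetical flattened mention list
mention_list = ['@a', '@b', '@a', '@c', '@a', '@b']

# most_common(n) returns the n most frequent items as (item, count) pairs
top = Counter(mention_list).most_common(2)
print(top)  # [('@a', 3), ('@b', 2)]
```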

It seems one thing both candidates had in common was their **frequent mention of Trump**. Surprisingly, while Trump was Hillary's second most-mentioned account, she was not on Trump's list.

Create and apply function with RegEx to filter and extract **hashtags** from tweets:

In [None]:
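def get_hashtags(tw):
    # Raw string avoids an invalid-escape warning; '#' needs no backslash.
    # The class has no digits, so a tag like #Trump2016 is captured as #Trump.
    return re.findall(r'#[A-Za-z_]+', tw)  # empty list when nothing matches

df['top_hashtags'] = df['text'].apply(get_hashtags)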

hashtags_list_trump = []
hashtags_list_clinton = []
for n in range(len(df['top_hashtags'])):
    if df['handle'][n] == 'realDonaldTrump':
        hashtags_list_trump += df['top_hashtags'][n]
    elif df['handle'][n] == 'HillaryClinton':
        hashtags_list_clinton += df['top_hashtags'][n]

Graph **hashtags** most used by candidates:

In [None]:
data3 = Series(hashtags_list_trump).value_counts().head(n=7)
sns.set_style("white")
plt.figure(figsize=(11.5, 2))
sns.barplot(x=data3, y=data3.index, orient='h', palette="Reds_r").set_title("Trump's most used hashtags")

data4 = Series(hashtags_list_clinton).value_counts().head(n=7)
sns.set_style("white")
plt.figure(figsize=(12.2, 2))
sns.barplot(x=data4, y=data4.index, orient='h', palette="Blues_r").set_title("Clinton's most used hashtags")

It seems Trump was much fonder of #hashtags than Hillary: he used his favorite hashtag (#*Trump*) almost **9x more** often than she used her favorite (#*DemsInPhilly*).

In [None]:
# Keep only the lowercased text before the first URL ('http...')
df['message'] = df['text'].apply(lambda x: x.lower().split('http')[0])
# Length of that URL-free text
df['length_no_url'] = df['message'].apply(len)

def candidate_code(x):
    if x == 'HillaryClinton':
        return 'Hillary'
    elif x == 'realDonaldTrump':
        return 'Trump'
    else:
        return ''

df['label'] = df['handle'].apply(lambda x: candidate_code(x))
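
The `message` column keeps only the lowercased text that precedes the first `http`. A quick sketch on an invented tweet shows the effect, including the side effect that any words after the first link are dropped as well.

```python
# Hypothetical tweet text, for illustration only
tweet = "Big rally tonight! https://t.co/abc123 See you there"

# Lowercase, then keep everything before the first occurrence of 'http'
message = tweet.lower().split('http')[0]
print(repr(message))  # 'big rally tonight! '
print(len(message))   # 19
```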

Create new dataframe for prediction model with only *candidate name* (as **label**) and *tweets* (as **message**):

In [None]:
messages = df[['label','message']]

In [None]:
print(messages[:5])

In [None]:
def split_into_tokens(message):
    # TextBlob's .words splits the text into word tokens, excluding punctuation
    return TextBlob(message).words

In [None]:
messages.message.head()

In [None]:
messages.message.head().apply(split_into_tokens)

In [None]:
def split_into_lemmas(message):
    message = message.lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma 
    return [word.lemma for word in words]

messages.message.head().apply(split_into_lemmas)

In [None]:
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])
print(len(bow_transformer.vocabulary_))
print(bow_transformer.get_feature_names_out()[:5])  # get_feature_names() was removed in scikit-learn 1.2
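
Conceptually, a bag-of-words model gives each distinct token a column index and represents a message by its token counts. The pure-Python sketch below illustrates the idea on two invented messages; it splits on whitespace rather than using TextBlob lemmas, and it is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

# Hypothetical toy corpus
docs = ["make america great", "great people great country"]

# Vocabulary: each distinct token gets a column index (sorted for a stable order)
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d.split()}))}
print(vocab)  # {'america': 0, 'country': 1, 'great': 2, 'make': 3, 'people': 4}

def to_counts(doc):
    """Return a dense count vector for one document."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

bow = [to_counts(d) for d in docs]
print(bow)  # [[1, 0, 1, 1, 0], [0, 1, 2, 0, 1]]
```

Most entries in such a matrix are zero once the vocabulary grows, which is why scikit-learn stores it in sparse form.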

In [None]:
messages_bow = bow_transformer.transform(messages['message'])
print('sparse matrix shape:', messages_bow.shape)
print('number of non-zeros:', messages_bow.nnz)
# share of non-zero entries, i.e. the density of the matrix (sparsity is 100% minus this)
print('density: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1])))

In [None]:
tfidf_transformer = TfidfTransformer().fit(messages_bow)

In [None]:
# Compare the IDF of two words: the rarer a word is across tweets, the higher its weight
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['the']])
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['hannity']])
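
With its default `smooth_idf=True`, scikit-learn's `TfidfTransformer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The sketch below applies that formula to invented document frequencies (the corpus size and counts are assumptions) to show why a common word receives a lower weight than a rare one.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed form used by TfidfTransformer's defaults."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000                            # hypothetical corpus size
idf_common = smoothed_idf(n_docs, 900)   # word appearing in 900 documents
idf_rare = smoothed_idf(n_docs, 5)       # word appearing in 5 documents

print(round(idf_common, 3), round(idf_rare, 3))

# A word present in every document gets the minimum weight of exactly 1
print(smoothed_idf(10, 10))  # 1.0
```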

In [None]:
messages_tfidf = tfidf_transformer.transform(messages_bow)
print(messages_tfidf.shape)

In [None]:
%time spam_detector = MultinomialNB().fit(messages_tfidf, messages['label'])
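
For intuition about what `MultinomialNB` fits, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1`, which is also scikit-learn's default). It works on raw token counts from an invented four-tweet corpus rather than on a TF-IDF matrix, and the helper names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # class -> token counts
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_docs[c] / len(docs))
        # Laplace-smoothed log likelihood of each vocabulary token given class c
        log_lik = {t: math.log((word_counts[c][t] + alpha) / (total + alpha * len(vocab)))
                   for t in vocab}
        model[c] = (log_prior, log_lik)
    return model

def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    scores = {c: prior + sum(lik[t] for t in tokens if t in lik)
              for c, (prior, lik) in model.items()}
    return max(scores, key=scores.get)

# Hypothetical toy training data
docs = [["crooked", "rigged"], ["great", "rigged"],
        ["families", "together"], ["stronger", "together"]]
labels = ["Trump", "Trump", "Hillary", "Hillary"]

model = train_nb(docs, labels)
print(predict_nb(model, ["rigged", "system"]))      # Trump
print(predict_nb(model, ["stronger", "families"]))  # Hillary
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer messages.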

In [None]:
all_predictions = spam_detector.predict(messages_tfidf)

In [None]:
print('Training accuracy:', accuracy_score(messages['label'], all_predictions))

In [None]:
fig, ax = plt.subplots(figsize=(3.5,2.5))
sns.heatmap(confusion_matrix(messages['label'], all_predictions), annot=True, linewidths=.5, ax=ax, cmap="Blues", fmt="d").set(xlabel='Predicted Value', ylabel='Expected Value')
plt.title('Training Set Confusion Matrix')  # sns.plt is no longer available in current seaborn
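
A confusion matrix simply counts (expected, predicted) pairs, with expected labels as rows and predicted labels as columns. The sketch below rebuilds one by hand for six invented labels; the helper name `confusion` is hypothetical.

```python
from collections import Counter

def confusion(expected, predicted, labels):
    """Rows are expected labels, columns are predicted labels."""
    pairs = Counter(zip(expected, predicted))
    return [[pairs[(e, p)] for p in labels] for e in labels]

# Hypothetical labels for six tweets
expected  = ["Hillary", "Hillary", "Hillary", "Trump", "Trump", "Trump"]
predicted = ["Hillary", "Hillary", "Trump",   "Trump", "Trump", "Hillary"]

matrix = confusion(expected, predicted, ["Hillary", "Trump"])
print(matrix)  # [[2, 1], [1, 2]]

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = sum(matrix[i][i] for i in range(2)) / len(expected)
print(accuracy)
```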

In [None]:
print(classification_report(messages['label'], all_predictions))
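
The precision, recall and F1 columns of the classification report follow directly from true-positive, false-positive and false-negative counts. A small worked example with invented counts for one class:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)   # of the tweets predicted as this class, how many were right
    recall = tp / (tp + fn)      # of the tweets truly in this class, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the 'Trump' class
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, round(f1, 3))  # 0.9 0.75 0.818
```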

In [None]:
msg_train, msg_test, label_train, label_test = \
    train_test_split(messages['message'], messages['label'], test_size=0.2)

print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))
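
The idea behind a random 80/20 split can be sketched in a few lines of plain Python. This is only an illustration with a hypothetical helper, `simple_split`, and it is not scikit-learn's implementation.

```python
import random

def simple_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items and hold out the last test_size fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(round(len(shuffled) * test_size))
    return shuffled[:cut], shuffled[cut:]

train, test = simple_split(range(10), test_size=0.2)
print(len(train), len(test))  # 8 2
```

Because the `train_test_split` call passes no `random_state`, its split, and therefore the reported scores, will vary from run to run.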

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

In [None]:
scores = cross_val_score(pipeline,  # steps to convert raw messages into models
                         msg_train,  # training data
                         label_train,  # training labels
                         cv=10,  # 10-fold cross validation: 9 parts for training, 1 for scoring
                         scoring='accuracy',  # which scoring metric?
                         n_jobs=-1,  # -1 = use all cores = faster
                         )
print(scores)

In [None]:
print('Mean score:', scores.mean(), '\n')
print('Stdev:', scores.std())
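
In k-fold cross validation every sample is used for scoring exactly once and for training in the remaining folds. The sketch below generates plain contiguous folds with a hypothetical helper, `k_fold_indices`. With a classifier and an integer `cv`, scikit-learn uses stratified folds that preserve the class ratio, so this shows only the simpler unstratified idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```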

**We can ask: What is the effect of IDF weighting on accuracy? Does the extra processing cost of lemmatization (vs. just plain words) really help?
Let's find out:**

In [None]:
params = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': (split_into_lemmas, split_into_tokens),
}

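# GridSearchCV lives in sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipeline,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=5,  # 5-fold cross validation; stratified by default for classifiers
)

%time nb_detector = grid.fit(msg_train, label_train)
# mean cross-validated accuracy for each parameter combination
results = nb_detector.cv_results_
for combo, score in zip(results['params'], results['mean_test_score']):
    print(combo, score)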
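
A grid search tries every combination of the listed parameter values. The enumeration itself can be sketched with `itertools.product`; the grid below uses placeholder strings in place of the analyzer functions and is only an illustration.

```python
from itertools import product

# Hypothetical grid mirroring the two switches tuned above
param_grid = {
    'use_idf': [True, False],
    'analyzer': ['lemmas', 'tokens'],
}

# Every combination of values, as a list of dicts
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
for combo in combos:
    print(combo)
print(len(combos))  # 4
```

With 5-fold cross validation, four combinations mean 20 model fits, plus one final refit on the best combination.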

In [None]:
top_h = {}  # word -> P(Hillary | word) for words leaning Hillary
top_t = {}  # word -> P(Hillary | word) for words leaning Trump

# Score every vocabulary word on its own. Column 0 of predict_proba is 'Hillary'
# because the class labels are sorted alphabetically.
# (get_feature_names_out() requires scikit-learn 1.0+.)
for w in bow_transformer.get_feature_names_out():
    if len(w) <= 3:  # skip very short words
        continue
    p = nb_detector.predict_proba([w])[0][0]
    if p > 0.5:
        top_h[w] = p
    elif p < 0.5:
        top_t[w] = p

# six strongest words for each candidate
top_t_words = sorted(top_t, key=top_t.get)[:6]
top_h_words = sorted(top_h, key=top_h.get, reverse=True)[:6]

top_df_t = pd.DataFrame({'word': top_t_words, 'hillary_prob': [top_t[w] for w in top_t_words]})
top_df_t['trump_prob'] = 1 - top_df_t['hillary_prob']
top_df_t = top_df_t[['word', 'trump_prob', 'hillary_prob']]

top_df_h = pd.DataFrame({'word': top_h_words, 'hillary_prob': [top_h[w] for w in top_h_words]})
top_df_h['trump_prob'] = 1 - top_df_h['hillary_prob']

In [None]:
sns.set_context({"figure.figsize": (10, 2.5)})
top_df_t.plot(kind='barh', stacked=True, color=["#E91D0E","#08306B"]).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.yticks(range(len(top_df_t['word'])), list(top_df_t['word']))
plt.title('Words with highest probability of indicating a Trump tweet')
plt.xlabel('Probability')

In [None]:
sns.set_context({"figure.figsize": (10, 2.5)})
top_df_h.plot(kind='barh', stacked=True, color=["#08306B","#E91D0E"]).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.yticks(range(len(top_df_h['word'])), list(top_df_h['word']))
plt.title('Words with highest probability of indicating a Hillary tweet')
plt.xlabel('Probability')

In [None]:
print(nb_detector.predict(["flotus"])[0])
print(nb_detector.predict_proba(["flotus"])[0])
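
`predict_proba` returns one probability per class, and the values sum to 1. For Naive Bayes these come from normalizing the per-class log scores. A minimal sketch of that normalization, with invented scores and a hypothetical helper name:

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized log scores into probabilities that sum to 1."""
    m = max(log_scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}

# Hypothetical joint log scores: log P(class) + log P(words | class)
probs = normalize_log_scores({'Hillary': math.log(0.03), 'Trump': math.log(0.01)})
print(probs)
```

Here the Hillary score is three times the Trump score, so the normalized probabilities come out at roughly 0.75 and 0.25.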

In [None]:
print(nb_detector.predict(["america needs an experienced leader who respects women"])[0])
print(nb_detector.predict(["i brush my orange hair with a gold comb, just ask @seanhannity"])[0])

In [None]:
predictions = nb_detector.predict(msg_test)
print(classification_report(label_test, predictions))

In [None]:
fig, ax = plt.subplots(figsize=(3.5,2.5))
sns.heatmap(confusion_matrix(label_test, predictions), annot=True, linewidths=.5, ax=ax, cmap="Blues", fmt="d").set(xlabel='Predicted Value', ylabel='Expected Value')
plt.title('Test Set Confusion Matrix')  # sns.plt is no longer available in current seaborn