# Trudeau or Trump?
## Using Naive Bayes to classify tweets as written by either Trump or Trudeau

This project was inspired by the DataCamp project "Who's Tweeting? Trump or Trudeau", which taught me how to apply Naive Bayes to text classification. The main skills I practiced here were text vectorization and Naive Bayes modeling. The dataset was collected from the Twitter API.

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn import metrics

In [3]:
# Load data
tweet_df = pd.read_csv('tweets.csv')
print(tweet_df.columns)
display(tweet_df.head())

Index(['id', 'author', 'status'], dtype='object')


Unnamed: 0,id,author,status
0,1,Donald J. Trump,I will be making a major statement from the @W...
1,2,Donald J. Trump,Just arrived at #ASEAN50 in the Philippines fo...
2,3,Donald J. Trump,"After my tour of Asia, all Countries dealing w..."
3,4,Donald J. Trump,Great to see @RandPaul looking well and back o...
4,5,Donald J. Trump,Excited to be heading home to see the House pa...

Since the data was collected via the Twitter API and does not come pre-split into training and test sets, we need to split it ourselves.

In [4]:
# Create target
y = tweet_df['author']

# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    tweet_df['status'], y, test_size=0.33, random_state=53)

The training and testing data are now set up, but Naive Bayes works on numeric features, so we first need to convert each tweet into a vector. I am going to try two vectorizations, raw word counts and TF-IDF (term frequency-inverse document frequency), to see which gives more accurate results.
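
To make the difference between the two vectorizations concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the `toy_*` names are hypothetical and are not taken from `tweets.csv`. It uses the default vectorizer settings rather than the `min_df`/`max_df` values used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus for illustration only (not taken from tweets.csv)
toy_docs = [
    "great jobs great economy",
    "great meeting with canada",
    "canada welcomes new citizens",
]

# Raw term counts: each column is a word, each row is a document
toy_count_vec = CountVectorizer()
toy_counts = toy_count_vec.fit_transform(toy_docs).toarray()

# TF-IDF: counts reweighted so words found in many documents count for less,
# then each row is scaled to unit length (scikit-learn's default L2 norm)
toy_tfidf_vec = TfidfVectorizer()
toy_tfidf = toy_tfidf_vec.fit_transform(toy_docs).toarray()

# Column positions of two words, looked up in each vectorizer's vocabulary
great_c = toy_count_vec.vocabulary_["great"]
meeting_c = toy_count_vec.vocabulary_["meeting"]
great_t = toy_tfidf_vec.vocabulary_["great"]
meeting_t = toy_tfidf_vec.vocabulary_["meeting"]

# In the second document both words occur exactly once ...
print("counts:", toy_counts[1, great_c], toy_counts[1, meeting_c])
# ... but "meeting" appears in fewer documents, so its TF-IDF weight is higher
print("tfidf: ", toy_tfidf[1, great_t], toy_tfidf[1, meeting_t])
```

With counts, "great" and "meeting" look equally important in the second sentence. TF-IDF gives "meeting" more weight because it is rarer across the corpus.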

In [5]:
# Initialize count vectorizer
count_vectorizer = CountVectorizer(stop_words='english', 
                                   min_df=0.05, max_df=0.9)

# Create count train and test variables
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

In [6]:
# Initialize tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', 
                                   min_df=0.05, max_df=0.9)
# Create tfidf train and test variables
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

In [7]:
# Creating and fitting Multinomial Naive Bayes Model with tfidf data
tfidf_nb = MultinomialNB()
tfidf_nb.fit(tfidf_train, y_train)
tfidf_nb_pred = tfidf_nb.predict(tfidf_test)

# Prediction accuracy
tfidf_nb_score = metrics.accuracy_score(y_test, tfidf_nb_pred)
print('NaiveBayes Tfidf Score: ', tfidf_nb_score)

NaiveBayes Tfidf Score:  0.803030303030303


In [38]:
# Add the TF-IDF predictions to a copy of the test rows
test_tweet = tweet_df.loc[y_test.index].copy()
test_tweet['tfidf test result'] = tfidf_nb_pred
test_tweet.head()

Unnamed: 0,id,author,status,tfidf test result
161,162,Donald J. Trump,"Diane Black of Tennessee, the highly respected...",Donald J. Trump
174,175,Donald J. Trump,Ed Gillespie will be a great Governor of Virgi...,Donald J. Trump
199,200,Donald J. Trump,Sen. Corker is the incompetent head of the For...,Donald J. Trump
181,182,Donald J. Trump,RT @IvankaTrump: Working families need #TaxRef...,Justin Trudeau
363,364,Justin Trudeau,RT @SeamusORegan: November 5 - 11 is Veterans'...,Justin Trudeau


In [8]:
# Creating and fitting Multinomial Naive Bayes Model with count data
count_nb = MultinomialNB()
count_nb.fit(count_train, y_train)
count_nb_pred = count_nb.predict(count_test)

# Prediction accuracy
count_nb_score = metrics.accuracy_score(y_test, count_nb_pred)
print('NaiveBayes Count Score: ', count_nb_score)

NaiveBayes Count Score:  0.7954545454545454


In [9]:
tfidf_nb_cm = metrics.confusion_matrix(y_test, tfidf_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])
count_nb_cm = metrics.confusion_matrix(y_test, count_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])
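
As a reading aid for these matrices, the sketch below builds one from a handful of made-up labels. The `toy_*` lists are hypothetical and are not results from this notebook. In scikit-learn, rows are the true authors and columns the predicted authors, both in the order given by `labels`.

```python
from sklearn import metrics

# Hypothetical true and predicted authors, for illustration only
toy_labels = ['Donald J. Trump', 'Justin Trudeau']
toy_true = ['Donald J. Trump', 'Donald J. Trump',
            'Justin Trudeau', 'Justin Trudeau', 'Justin Trudeau']
toy_pred = ['Donald J. Trump', 'Justin Trudeau',
            'Justin Trudeau', 'Justin Trudeau', 'Donald J. Trump']

# Rows follow the true labels, columns follow the predicted labels
toy_cm = metrics.confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(toy_cm)

# The diagonal holds the correct predictions, so accuracy is trace / total
toy_accuracy = toy_cm.trace() / toy_cm.sum()

# Dividing the diagonal by each row sum gives per-author recall
toy_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
print('accuracy:', toy_accuracy)
print('recall per author:', dict(zip(toy_labels, toy_recall)))
```

The off-diagonal cells show which author's tweets get mistaken for the other's, which a single accuracy score hides.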

In [None]:
print('TFIDF Confusion Matrix:', tfidf_nb_cm)