# Trump vs. Drumpf, Elon vs. Bored Elon Musk
### Classifying Tweets Using Natural Language Processing    
[Donald Trump](https://twitter.com/realDonaldTrump)  
[Donald Drumpf](https://twitter.com/RealDonalDrumpf?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)    
[Bored Elon Musk](https://twitter.com/boredelonmusk?lang=en)  
[Elon Musk](https://twitter.com/elonmusk?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)  

### Setup

In [16]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Read in data and preview

In [9]:
# DataFrame of tweets from the real Elon Musk account
# (the Date column is read as a plain string)
elon = pd.read_csv('elon_data.csv')

# DataFrame of tweets from the parody Bored Elon Musk account
bored_elon = pd.read_csv('bored_elon_data.csv')

In [14]:
elon.shape

(3173, 3)

In [15]:
bored_elon.shape

(1470, 3)

In [10]:
elon.head()

| | Tweet | Date | Retweets |
|---|---|---|---|
| 0 | @RanNatanzon @Tesla @Cortica This is completel... | Tue Mar 20 18:47:20 +0000 2018 | 195 |
| 1 | Paid respects to Masada earlier today. Live fr... | Tue Mar 20 02:20:29 +0000 2018 | 844 |
| 2 | Learning how to pour flaming absinthe over a t... | Mon Mar 19 18:09:26 +0000 2018 | 970 |
| 3 | @IraEhrenpreis @Tesla Thanks for your support ... | Sun Mar 18 04:31:53 +0000 2018 | 157 |
| 4 | @TheOnion Your cruel taunts cut me deep. Deep.... | Thu Mar 15 18:46:45 +0000 2018 | 465 |


In [11]:
bored_elon.head()

| | Tweet | Date | Retweets |
|---|---|---|---|
| 0 | People who argue VR is more interesting than A... | Sun Mar 18 15:54:06 +0000 2018 | 278.0 |
| 1 | Dashboard indicator that lets you know when a ... | Sat Mar 17 20:38:42 +0000 2018 | 85.0 |
| 2 | Podcast app that scans your brain waves and pa... | Fri Mar 16 15:52:32 +0000 2018 | 693.0 |
| 3 | Food delivery service that plans out your meal... | Fri Mar 09 14:16:12 +0000 2018 | 590.0 |
| 4 | Vertical buildings that allow multiple people ... | Sat Mar 03 18:26:32 +0000 2018 | 325.0 |


### Clean it up and add labels

In [4]:
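# Remove punctuation from the tweet text.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
elon['Tweet'] = elon['Tweet'].str.replace(r'[^\w\s]', '', regex=True)
bored_elon['Tweet'] = bored_elon['Tweet'].str.replace(r'[^\w\s]', '', regex=True)

# Add a label column to each DataFrame
elon['Label'] = "Elon"
bored_elon['Label'] = "BoredElon"

# Stack elon and bored_elon, renumbering rows so the index is unique
frames = [elon, bored_elon]
df = pd.concat(frames, ignore_index=True)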

In [5]:
df.head()

| | Tweet | Date | Retweets | Label |
|---|---|---|---|---|
| 0 | RanNatanzon Tesla Cortica This is completely f... | Tue Mar 20 18:47:20 +0000 2018 | 195 | Elon |
| 1 | Paid respects to Masada earlier today Live fre... | Tue Mar 20 02:20:29 +0000 2018 | 844 | Elon |
| 2 | Learning how to pour flaming absinthe over a t... | Mon Mar 19 18:09:26 +0000 2018 | 970 | Elon |
| 3 | IraEhrenpreis Tesla Thanks for your support ov... | Sun Mar 18 04:31:53 +0000 2018 | 157 | Elon |
| 4 | TheOnion Your cruel taunts cut me deep Deep Bu... | Thu Mar 15 18:46:45 +0000 2018 | 465 | Elon |


### Split the data into training and testing sets  
##### Why do we need to split the data into training and testing sets?  
The training set is used to fit our classification model. During training, the model learns the relationship between features extracted from the tweets and their labels, e.g., to what extent the words in a tweet reveal whether it was written by Elon Musk or Bored Elon Musk.  
We use the testing set to evaluate the model. We withhold the testing labels when we feed the tweets to the model, capture its predictions, and compare them with the actual labels to gauge how well the model is performing. If we tested the model on the training data, it would likely perform very well, but that would tell us little about how well it generalizes to data it hasn't seen before.  

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Tweet'], df['Label'], test_size=0.25, random_state=42, stratify=df['Label'])
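
As a quick check on what `stratify` does, the sketch below runs the same call on a small made-up DataFrame and counts the labels in each split. `toy_df` and its contents are hypothetical stand-ins for `df`, chosen only so the ratios are easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df: 8 "Elon" rows and 4 "BoredElon" rows (hypothetical data)
toy_df = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(12)],
    "Label": ["Elon"] * 8 + ["BoredElon"] * 4,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_df["Tweet"], toy_df["Label"],
    test_size=0.25, random_state=42, stratify=toy_df["Label"],
)

# With stratify, each split keeps the original 2:1 label ratio
train_counts = toy_y_train.value_counts().to_dict()
test_counts = toy_y_test.value_counts().to_dict()
print("train:", train_counts)
print("test: ", test_counts)
```

Because our real data has roughly twice as many Elon tweets as Bored Elon tweets, stratifying keeps that imbalance the same in both sets, so the test score is not skewed by an unlucky split.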

### Create text features from the tweets  
Working with text data isn't easy. Tweets vary in length, and they are filled with words that probably won't tell us much about the author. For this reason, we will remove stop words (such as the, a, on, at, to) from our data.  

The TfidfVectorizer will take our tweet data and put it into matrix format. Each tweet corresponds to a row, and each feature corresponds to a column. A feature can be a single word or a sequence of words. One way of filling in the matrix is to have each entry hold the frequency of that feature in a given tweet. But raw frequency probably can't capture what is distinctive about an individual's writing style, because words that are common across all tweets get the same weight as rare ones. Here's where tf-idf comes in. Instead of the raw frequency, each entry holds the feature's tf-idf value, which is computed as follows:    
TF(t) = (Number of times term t appears in a tweet) / (Total number of terms in the tweet)    
IDF(t) = log_e(Total number of tweets / Number of tweets with term t in it)  
The final value is TF(t) * IDF(t), so a term scores highly when it is frequent in a tweet but rare across the whole collection.  
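
To make the formulas concrete, here is a minimal hand calculation on a tiny made-up corpus. The three tokenized tweets and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions, not part of the notebook's data or of any library.

```python
import math

# Toy corpus of three already-tokenized "tweets" (hypothetical data)
tweets = [
    ["rocket", "launch", "today"],
    ["rocket", "landing", "rocket"],
    ["coffee", "today"],
]

def tf(term, tweet):
    """Term frequency: share of the tweet's tokens that equal `term`."""
    return tweet.count(term) / len(tweet)

def idf(term, corpus):
    """Inverse document frequency: ln(total tweets / tweets containing `term`).

    Assumes `term` occurs in at least one tweet; otherwise this divides by zero.
    """
    n_containing = sum(1 for tweet in corpus if term in tweet)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tweet, corpus):
    return tf(term, tweet) * idf(term, corpus)

# Scores for two terms in the second tweet
rocket_score = tf_idf("rocket", tweets[1], tweets)    # (2/3) * ln(3/2)
landing_score = tf_idf("landing", tweets[1], tweets)  # (1/3) * ln(3/1)
print(round(rocket_score, 3), round(landing_score, 3))
```

Even though "rocket" appears twice in the second tweet, "landing" gets the higher score because it appears in no other tweet. Note that scikit-learn's TfidfVectorizer uses a smoothed variant of IDF and normalizes each row by default, so its numbers will not match this hand calculation exactly.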

In [12]:
#create Tfidf matrix


#transform X_train and X_test into dataframes



In [2]:
#view the column features


In [3]:
#view the transformed data  


### Fit the data to a logistic regression and output the results  

In [4]:
# Set up logistic regression and score train set



In [6]:
#try out logistic regression on test set


In [7]:
#print classification report
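
The cells in this section could be completed along the following lines. This is a sketch on made-up tweets so that it runs on its own: the `toy_*` variables are hypothetical stand-ins for the notebook's `X_train`, `X_test`, `y_train`, and `y_test`, and `max_iter=1000` is an assumed setting, not a tuned one.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy stand-ins for the train/test split (hypothetical tweets and labels)
toy_X_train = pd.Series([
    "Tesla Model 3 production is ramping up",
    "Falcon Heavy launch was a success",
    "SpaceX landing video coming soon",
    "App that orders coffee when you yawn",
    "Umbrella that texts you when it rains",
    "Fridge that alerts you when milk expires",
])
toy_y_train = pd.Series(["Elon"] * 3 + ["BoredElon"] * 3)
toy_X_test = pd.Series([
    "Falcon launch video from SpaceX",
    "App that alerts you when coffee is ready",
])
toy_y_test = pd.Series(["Elon", "BoredElon"])

# Build tf-idf features, fitting the vectorizer on the training tweets only
vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(toy_X_train)
test_matrix = vectorizer.transform(toy_X_test)

# Set up logistic regression and score the train set
model = LogisticRegression(max_iter=1000)
model.fit(train_matrix, toy_y_train)
train_accuracy = model.score(train_matrix, toy_y_train)
print("train accuracy:", train_accuracy)

# Try out logistic regression on the test set
test_accuracy = model.score(test_matrix, toy_y_test)
predictions = model.predict(test_matrix)
print("test accuracy:", test_accuracy)

# Print the classification report (precision, recall, and F1 per label)
report = classification_report(toy_y_test, predictions, zero_division=0)
print(report)
```

On the real data, a large gap between the train and test accuracy would suggest the model is memorizing the training tweets rather than generalizing.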



#### We're done! You've created your first tweet classifier!  
Now that you're a machine learning pro, how could you bump up the accuracy? Would you try different features, or a different model? Experiment and include your results in the cells below.  
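
As one possible starting point for these experiments, the sketch below compares two feature and model combinations with cross-validation. Everything in it is an illustrative assumption: the toy tweets stand in for `df['Tweet']` and `df['Label']`, and the two candidates (logistic regression on single words, naive Bayes on single words plus word pairs) are examples rather than recommendations.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df['Tweet'] and df['Label'] (hypothetical data)
toy_tweets = pd.Series([
    "Tesla Model 3 production is ramping up",
    "Falcon Heavy launch was a success",
    "SpaceX landing video coming soon",
    "App that orders coffee when you yawn",
    "Umbrella that texts you when it rains",
    "Fridge that alerts you when milk expires",
])
toy_labels = pd.Series(["Elon"] * 3 + ["BoredElon"] * 3)

# Each pipeline refits its vectorizer inside every fold,
# so the held-out tweets never influence the vocabulary
candidates = {
    "logistic regression, single words": make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    ),
    "naive Bayes, words and word pairs": make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
        MultinomialNB(),
    ),
}

results = {}
for name, pipeline in candidates.items():
    # cv=3 gives three stratified folds for a classifier
    scores = cross_val_score(pipeline, toy_tweets, toy_labels, cv=3)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

With only six toy tweets the scores mean very little; the point is the pattern of wrapping each idea in a pipeline and comparing them on the same folds.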

### Experiment using an alternative dataset below