The objective is to implement a Naive Bayes classifier that predicts whether a tweet was posted by a Republican or a Democratic politician. The training data consist of about 13K tweets collected before the 2016 US presidential election. There are roughly equal numbers of Republican and Democratic tweets, and the tweets come from three Republican and three Democratic Twitter accounts.

To represent each tweet, we will use the 'bag of words' model, which is commonly used in natural language processing. A bag of words representation of a document (here, a tweet) consists of the words in the document and their frequencies. The order of the words is ignored.
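
As a minimal sketch of the idea, a bag of words can be built with `collections.Counter`. The example text and the plain white-space split are assumptions for illustration; they are not the tokenizer used in this notebook.

```python
from collections import Counter

# Hypothetical example text (not taken from the data set).
tweet = "vote early vote often and vote today"

# Split on white space and count occurrences; word order is discarded.
bag = Counter(tweet.split())

print(bag["vote"])   # 3
print(bag["today"])  # 1
print(bag["taxes"])  # 0 (words that do not occur have count zero)
```

Two texts with the same words in a different order produce the same bag, which is exactly the information the model keeps.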

There are four main tasks.
1. Tokenization: parsing and converting the tweets to tokens.
2. Feature matrix construction from the training data set.
3. Learning the Naive Bayes parameters (priors and likelihoods) from the feature matrix.
4. Using the learned NB model to predict the labels of the test data set (about 4K tweets).

## Tokenization
This task consists of converting each tweet into a sequence of "tokens" that can be used as features. Tokens are essentially characters and character sequences obtained by using white space as a separator. A lot of these are noise that we want to remove; some are words or other character sequences that are useful features. A Python package called *NLTK* (Natural Language Toolkit) contains several tokenizers, including one for tweets. We use that tokenizer; in addition, we do the following:
- remove stopwords. These are words that are frequently used in a language but carry little semantic information, e.g., the, an, a, this, is, was, etc.
- make all tokens lower case (this is done by the tweet tokenizer)
- remove Twitter handles (again, done by the tweet tokenizer)
- remove punctuation and http links

Finally, we "lemmatize" the tokens. That means we convert different forms of a word to a common base form, so that they can be recognized as the same word. E.g., vote, votes, voted would all be converted to vote; geese would be converted to goose, etc. (There is a less sophisticated alternative to a lemmatizer called a stemmer, which just chops words to reduce them to the same base form; it doesn't work as well as a lemmatizer and we don't use it here.) There is a good description of the NLTK tokenizer [here](https://berkeley-stat159-f17.github.io/stat159-f17/lectures/11-strings/11-nltk..html).

The output of this part is a cleaned up list of tokens for each tweet. 
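
The cleaning steps can be illustrated without NLTK. The sketch below is a simplified stand-in, not the NLTK tweet tokenizer: the regular expressions, the tiny `STOPWORDS` set, the `simple_tokenize` name, and the example tweet (including its link) are all assumptions made for illustration, and lemmatization is omitted.

```python
import re
import string

# Tiny illustrative stopword list (assumption); NLTK ships a much longer one.
STOPWORDS = {"the", "an", "a", "this", "is", "was", "to", "for"}

def simple_tokenize(text):
    """Lower-case, drop links and handles, split on white space, filter noise."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove http links
    text = re.sub(r"@\w+", " ", text)           # remove twitter handles
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)   # trim leading/trailing punctuation
        if token and token not in STOPWORDS:
            tokens.append(token)
    return tokens

print(simple_tokenize("RT @GOPconvention: This is the day to VOTE! https://t.co/abc"))
# ['rt', 'day', 'vote']
```

Note that `rt` survives here; the notebook's tokenizer removes it through an extra list of noise tokens.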


In [17]:
import pandas as pd
import string
import numpy as np

import nltk
#
# you may need to run the following
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\srira\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\srira\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
# The data set has two columns - screen_name and text (which is the raw tweet)

## load tweets
tweets = pd.read_csv("tweets_train.csv", na_filter=False)

## screen_name (accounts)
#  Democrats   - HillaryClinton, timkaine, TheDemocrats
#  Republicans - realDonaldTrump, mike_pence, GOP

In [19]:
tweets['screen_name'].unique()

array(['GOP', 'TheDemocrats', 'HillaryClinton', 'timkaine', 'mike_pence',
       'realDonaldTrump'], dtype=object)

In [20]:
tweets.head()

Unnamed: 0,screen_name,text
0,GOP,RT @GOPconvention: #Oregon votes today. That m...
1,TheDemocrats,RT @DWStweets: The choice for 2016 is clear: W...
2,HillaryClinton,Trump's calling for trillion dollar tax cuts f...
3,HillaryClinton,.@TimKaine's guiding principle: the belief tha...
4,timkaine,Glad the Senate could pass a #THUD / MilCon / ...


In [21]:
tweets.describe()

Unnamed: 0,screen_name,text
count,13000,13000
unique,6,12982
top,realDonaldTrump,MAKE AMERICA GREAT AGAIN!
freq,2217,4


In [22]:
# add labels
# 1 for D's
# 0 for R's
tweets['label'] = tweets['screen_name'].str.contains('TheDemocrats|HillaryClinton|timkaine', regex=True)
tweets.describe()

Unnamed: 0,screen_name,text,label
count,13000,13000,13000
unique,6,12982,2
top,realDonaldTrump,MAKE AMERICA GREAT AGAIN!,False
freq,2217,4,6554


The training data has 13K tweets, and the two classes have about an equal number of tweets.

Now we will define our tokenizer.

In [23]:
from nltk.stem import WordNetLemmatizer
#
#  Input : dataframe with a column named 'text' which contains raw tweets (one per row)
#  Output: a list of lists of tokens corresponding to the 'text' column
#
def tokenize_tweets2(tweets):
    """Given a df with tweets in the 'text' col, return the tokens as a list of lists."""

    # apply the tokenizer to the 'text' column of the tweets df
    tweet_tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
    tokens = tweets['text'].apply(tweet_tokenizer.tokenize)
    
    # filter
    misc = ['rt', '’', '…', '—', 'u', '”', 'w', '“', '...', '️', 'http', 'https']
    # the NLTK stopword file id is lower-case 'english'; a set makes the membership test fast
    to_remove = set(nltk.corpus.stopwords.words('english') + list(string.punctuation) + misc)
    
    lemmatizer = WordNetLemmatizer()
    
    tokens = [[lemmatizer.lemmatize(token) for token in tw if token not in to_remove] for tw in tokens]      
    return tokens

In [24]:
all_tokens = tokenize_tweets2(tweets)
tweets_dem = tweets[tweets['label']]
tweets_rep = tweets[tweets['label'] == 0]
token_dem = tokenize_tweets2(tweets_dem)
token_rep = tokenize_tweets2(tweets_rep)
print(len(token_dem))
print(len(token_rep))
#print(token_dem[:10])
#print(token_rep[:10])
#all_tokens[:10]

6446
6554


Let's find the most common tokens. We will use all tokens that occur more than 25 times as features.

In [25]:
from collections import Counter

counts = Counter([token for tokens in all_tokens for token in tokens])
print(len(counts))
counts.most_common(20)

23459


[('hillary', 1159),
 ('trump', 1144),
 ('great', 749),
 ('clinton', 720),
 ('today', 709),
 ('make', 581),
 ('donald', 576),
 ('president', 564),
 ('day', 552),
 ('thank', 539),
 ('american', 512),
 ('new', 503),
 ('job', 503),
 ('u', 485),
 ('america', 480),
 ('people', 469),
 ('vote', 451),
 ('state', 442),
 ('get', 420),
 ('year', 415)]

In [26]:
top_words = [k for k in counts.keys() if counts.get(k) > 25]
len(top_words)

927

`top_words` are our features.
Now let's construct a feature matrix from these top words.

## Feature Matrix Construction

Compute feature matrix

Now we will extract the features from the training data and construct a feature matrix. The bad news is that this matrix can be very large. In our case it is about 13K x 1K, or about 13M entries x 4 bytes ≈ 52 MB, which easily fits in the RAM of a laptop. However, the training set could easily have been 10x or 100x larger and the number of features 10x larger, in which case a dense matrix might no longer fit. The good news is that this matrix is likely to be very sparse. Each tweet is unlikely to contain more than 10-20 tokens, so even if the matrix becomes very large, we will be fine if we use a sparse representation.

In a sparse representation, only the non-zero entries and their indices are saved. Scipy provides [several formats](https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html) for sparse matrices. 
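
To make the saving concrete, here is a rough sketch using only the standard library. The toy matrix and the `to_triplets` helper are assumptions for illustration; scipy's sparse formats store similar (row, column, value) information in a more compact form.

```python
def to_triplets(dense):
    """Return (row, col, value) triplets for the non-zero entries of a dense matrix."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

# Toy 3 x 6 count matrix: 18 cells, only 4 of them non-zero.
dense = [
    [0, 2, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 3],
]

triplets = to_triplets(dense)
print(triplets)       # [(0, 1, 2), (1, 3, 1), (2, 0, 1), (2, 5, 3)]
print(len(triplets))  # 4 stored entries instead of 18

# Dense size quoted above: 13K rows x 1K columns x 4 bytes per entry.
print(13_000 * 1_000 * 4)  # 52000000 bytes, i.e. about 52 MB
```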

To make it easier to estimate priors and likelihoods, we will construct two feature matrices, one for each of the two classes. For this, we first need to figure out how many data points are in each class.


In [27]:
num_feat = len(top_words)

# number of training points in each class
nTrainR = int((tweets['label'] == 0).sum())  # number of R (0) training points
nTrainD = int(tweets['label'].sum())         # number of D (1) training points

# create sparse feature matrices
from scipy.sparse import csc_matrix

# map each feature token to its column index (constant-time lookup
# instead of repeated top_words.index() calls)
feat_index = {word: j for j, word in enumerate(top_words)}

def build_feature_matrix(token_lists, n_rows):
    """Return an n_rows x num_feat sparse matrix of feature counts."""
    rows, cols, data = [], [], []
    for i, tokens in enumerate(token_lists):
        for token in tokens:
            j = feat_index.get(token)
            if j is not None:      # remember: not all tokens are features
                rows.append(i)
                cols.append(j)
                data.append(1)
    # duplicate (row, col) pairs are summed, so each entry is a token count
    return csc_matrix((data, (rows, cols)), shape=(n_rows, num_feat))

dfmat = build_feature_matrix(token_dem, nTrainD)
rfmat = build_feature_matrix(token_rep, nTrainR)
#rfmat
#print(dfmat)

## Learning Naive Bayes Model Parameters

compute log priors

compute log likelihoods using Laplace smoothing

Now we can compute the model parameters, that is, the likelihoods and priors for the two classes. As we discussed in class, since the probabilities can be very small numbers, we will compute log likelihoods and log priors. Also use Laplace (aka add-one) smoothing.

To sum a matrix column, you can use something like dfmat[:,i].sum()
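
As a small worked example of add-one smoothing, consider a toy vocabulary of three features. The counts are made up for illustration, and the sketch uses the common multinomial form in which the vocabulary size is added to the denominator so that the smoothed likelihoods of a class sum to one; that choice of denominator is an assumption about the intended formula.

```python
import math

# Hypothetical per-class token counts over a 3-word vocabulary.
counts = {"vote": 3, "job": 1, "wall": 0}

total = sum(counts.values())   # 4 tokens observed in this class
vocab_size = len(counts)       # 3 features

# Add-one (Laplace) smoothing: (count + 1) / (total + vocab_size)
log_likelihood = {
    word: math.log((c + 1) / (total + vocab_size))
    for word, c in counts.items()
}

# 'wall' was never seen, yet it gets a finite log likelihood: log(1/7).
print(log_likelihood["wall"])

# The smoothed probabilities still sum to 1 (up to floating-point error).
print(sum(math.exp(v) for v in log_likelihood.values()))
```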

In [28]:
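# compute log priors

import math
log_p_rep = math.log(nTrainR / len(tweets))
log_p_dem = math.log(nTrainD / len(tweets))

# compute log likelihoods with add-one smoothing

# per-feature token counts and the total feature-token count for each class,
# computed once instead of inside a loop
dem_counts = np.asarray(dfmat.sum(axis=0)).ravel()
rep_counts = np.asarray(rfmat.sum(axis=0)).ravel()
dem_total = dem_counts.sum()
rep_total = rep_counts.sum()

# NOTE: the denominator adds 2; the usual multinomial form of add-one smoothing
# adds the number of features (num_feat) instead, which would change the values printed below.
p_likel_dem = [math.log((c + 1) / (dem_total + 2)) for c in dem_counts]
p_likel_rep = [math.log((c + 1) / (rep_total + 2)) for c in rep_counts]

print("prior for democratic:",log_p_dem)
print("prior for republican:",log_p_rep)
print("likelihood for democratic:",p_likel_dem[:30])
print("likelihood for republican:",p_likel_rep[:30])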


prior for democratic: -0.7014895740682907
prior for republican: -0.6848738071849139
likelihood for democratic: [-4.879083404688273, -4.70970883426038, -6.862604824489369, -5.200898483509512, -6.8149767755001145, -6.054388314144637, -7.049816366577516, -5.116591377049518, -6.515733880647257, -5.112345086168067, -6.605884977641555, -6.386522149167251, -5.894045664069457, -7.243972381018473, -5.417121591979148, -7.531654453470254, -8.091270241405677, -5.8487890724813365, -6.8385072729103085, -6.912615245064031, -7.742963547137461, -5.434513334691017, -6.417293807834005, -7.485134437835361, -6.938590731467292, -7.685805133297513, -4.74723127358347, -6.550825200458528, -6.327681649144318, -7.357301066325476]
likelihood for republican: [-5.4962423605355735, -4.65913449105412, -7.59522849828838, -4.716621581971801, -7.320791652586619, -5.9445476273202305, -7.04315991598834, -5.9857905858542795, -6.235602384250651, -6.291172235405461, -6.95614853899871, -6.350012735428395, -5.727483122082403, 

## Prediction on Test Set

Now we have a trained Naive Bayes classifier. We will load the test data set and make the predictions. Note: If a token is not a feature, ignore it. 
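
The decision rule can be sketched on toy numbers. The log likelihoods, the priors, and the `predict` helper below are hypothetical values and names chosen for illustration, not the trained parameters.

```python
import math

# Hypothetical model with two features.
log_prior = {"D": math.log(0.5), "R": math.log(0.5)}
log_likelihood = {
    "D": {"hillary": math.log(0.6), "great": math.log(0.4)},
    "R": {"hillary": math.log(0.3), "great": math.log(0.7)},
}

def predict(tokens):
    """Return the class with the larger log posterior score."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for token in tokens:
            # Tokens that are not features contribute nothing.
            score += log_likelihood[label].get(token, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["hillary", "hillary", "pizza"]))  # D
print(predict(["great", "pizza"]))               # R
```

Working with sums of logs rather than products of probabilities avoids numerical underflow when a tweet contains many feature tokens.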

Load test data and tokenize

Using the trained NB classifier predict the labels

Calculate accuracy, recall, and precision of your predictions
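
For reference, the three metrics follow directly from the confusion-matrix counts, treating Democrat (True) as the positive class. The labels below are made up for illustration and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(actual, predicted):
    """Return (accuracy, recall, precision) for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    tn = sum(not a and not p for a, p in pairs)
    fp = sum(not a and p for a, p in pairs)
    fn = sum(a and not p for a, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn)       # share of actual positives that were found
    precision = tp / (tp + fp)    # share of predicted positives that were right
    return accuracy, recall, precision

actual    = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, True, False, False, False]
print(classification_metrics(actual, predicted))
# (0.625, 0.6666666666666666, 0.5)
```

The sketch assumes there is at least one actual positive and one predicted positive; otherwise the recall or precision denominator is zero.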


In [29]:
#Load test data and tokenize

tweets_test = pd.read_csv("tweets_test.csv", na_filter=False)
test_res = tweets_test['screen_name'].str.contains('TheDemocrats|HillaryClinton|timkaine', regex=True)
test_tokens = tokenize_tweets2(tweets_test)
#tweets_test.describe()

#NB Classifier to predict labels

poster_res = []
for tokens in test_tokens:
    dem_prob = 0
    rep_prob = 0
    for token in tokens:
        # tokens that are not features are ignored
        if token in top_words:
            col2 = top_words.index(token)
            dem_prob = dem_prob + p_likel_dem[col2]
            rep_prob = rep_prob + p_likel_rep[col2]
    # add the log priors to get the (unnormalized) log posteriors
    dem_prob = dem_prob + log_p_dem
    rep_prob = rep_prob + log_p_rep
    # True means the tweet is predicted to be from a Democrat
    poster_res.append(dem_prob > rep_prob)

print("Test labels:",poster_res[:30])

#Accuracy, recall and precision

TP, TN, FP, FN = 0, 0, 0, 0
for actual, res in zip(test_res, poster_res):
    if actual and res:
        TP += 1
    elif actual and not res:
        FN += 1
    elif not actual and res:
        FP += 1
    else:
        TN += 1

# compute the metrics once, after all predictions have been counted
acc = (TP + TN) / (TP + TN + FP + FN)
rec = TP / (TP + FN)    # recall: fraction of actual D tweets predicted as D
prec = TP / (TP + FP)   # precision: fraction of predicted D tweets that are actually D
print("Accuracy:", acc)
print("Recall:", rec)
print("Precision:", prec)

Test labels: [True, False, False, True, True, False, False, True, False, True, False, True, False, True, True, False, True, True, False, True, True, False, True, True, False, True, True, False, True, False]
Accuracy: 0.8120055839925546
Recall: 0.8096101541251133
Precision: 0.8215271389144434


List of features with top ten likelihoods for each of the two classes. Look into things like:
What is the likelihood for 'hillary', that is, P(hillary|class)? 
Is it in the top ten? 
How important is it in this classification problem?

In [30]:
#Democrats

# indices of the ten largest log likelihoods, in ascending order
top_likel_dem = np.argsort(p_likel_dem)[-10:]
top_likel_demtokens = [top_words[j] for j in top_likel_dem]

#Republicans

top_likel_rep = np.argsort(p_likel_rep)[-10:]
top_likel_reptokens = [top_words[j] for j in top_likel_rep]

print("Top likelihoods for republicans:", top_likel_reptokens)
print("Top likelihoods for democrats:", top_likel_demtokens)

#Importance of Hillary (the printed values are log likelihoods)

hillary = top_words.index("hillary")
dem_hillary = p_likel_dem[hillary]
rep_hillary = p_likel_rep[hillary]
print("P(Hillary|democrat) , P(Hillary|republic) ",dem_hillary,rep_hillary)

Top likelihoods for republicans: ['state', 'job', 'indiana', 'day', 'new', 'today', 'thank', 'great', 'hillary', 'clinton']
Top likelihoods for democrats: ['one', 'vote', 'u', 'make', 'american', 'today', 'president', 'donald', 'hillary', 'trump']
P(Hillary|democrat) , P(Hillary|republic)  -4.225291174478937 -4.162940529556193


How important are the priors in this problem?

1. When the likelihood terms of the two classes are nearly equal, so that they barely separate the classes, the priors determine the posterior.
2. When a tweet does not contain any of the top_words (i.e., features), the priors alone determine the predicted class.
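
The second point can be checked with the log priors printed earlier: if a tweet contains no feature tokens, its score for each class is just the log prior, so the class with the larger prior wins. The sketch below hard-codes those two printed values.

```python
import math

# Log priors as printed by the training cell above.
log_p_dem = -0.7014895740682907
log_p_rep = -0.6848738071849139

# A tweet with no feature tokens adds nothing to either score.
dem_score = log_p_dem + 0.0
rep_score = log_p_rep + 0.0

print("predicted Democrat:", dem_score > rep_score)  # predicted Democrat: False

# Converting back to probabilities shows how close the two priors are.
print(round(math.exp(log_p_dem), 4), round(math.exp(log_p_rep), 4))  # 0.4958 0.5042
```

Because the two priors are so close, they only tip the decision when the likelihood terms are nearly tied.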

Compute the accuracy of the test set without Laplace smoothing and compare with the above.
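
Without smoothing, a feature that never occurs in a class has an estimated probability of zero, whose logarithm is negative infinity. The following is a minimal sketch with made-up counts; `unsmoothed_log_likelihood` is a hypothetical helper name. Note that `math.log(0)` raises `ValueError`, so the zero case has to be handled explicitly.

```python
import math

def unsmoothed_log_likelihood(count, total):
    """Log of count/total, with log(0) represented as -infinity."""
    if count == 0:
        return float("-inf")
    return math.log(count / total)

# Hypothetical counts: one feature occurs 3 times out of 4, another never occurs.
total = 4
print(unsmoothed_log_likelihood(3, total))  # log(0.75)
print(unsmoothed_log_likelihood(0, total))  # -inf

# A single unseen token drives the whole class score to -infinity.
score = (math.log(0.5)
         + unsmoothed_log_likelihood(3, total)
         + unsmoothed_log_likelihood(0, total))
print(score)  # -inf
```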

In [31]:
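import math
p_likel_dem_wolaplace = []
p_likel_rep_wolaplace = []

# compute log likelihoods without smoothing
# NOTE: a feature with zero count in a class is given a log likelihood of 0
# (i.e. probability 1) as a placeholder; the unsmoothed estimate is really
# log(0) = -infinity, which would rule that class out instead.

dem_total = dfmat.sum()
rep_total = rfmat.sum()

for index, item in enumerate(top_words):
    dem_count = dfmat[:, index].sum()
    if dem_count == 0:
        p_likel_dem_wolaplace.append(0)
    else:
        p_likel_dem_wolaplace.append(math.log(dem_count / dem_total))

for index, item in enumerate(top_words):
    rep_count = rfmat[:, index].sum()
    if rep_count == 0:
        p_likel_rep_wolaplace.append(0)
    else:
        p_likel_rep_wolaplace.append(math.log(rep_count / rep_total))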
#print("likelihood for democratic without laplace:",p_likel_dem_wolaplace[:30])
#print("likelihood for republican:",p_likel_rep_wolaplace[:30])

poster_res = []
for tokens in test_tokens:
    dem_prob = 0
    rep_prob = 0
    for token in tokens:
        # tokens that are not features are ignored
        if token in top_words:
            col2 = top_words.index(token)
            dem_prob = dem_prob + p_likel_dem_wolaplace[col2]
            rep_prob = rep_prob + p_likel_rep_wolaplace[col2]
    # add the log priors to get the (unnormalized) log posteriors
    dem_prob = dem_prob + log_p_dem
    rep_prob = rep_prob + log_p_rep
    # True means the tweet is predicted to be from a Democrat
    poster_res.append(dem_prob > rep_prob)
print("Test labels:",poster_res[:30])

#Accuracy, recall and precision
TP, TN, FP, FN = 0, 0, 0, 0
for actual, res in zip(test_res, poster_res):
    if actual and res:
        TP += 1
    elif actual and not res:
        FN += 1
    elif not actual and res:
        FP += 1
    else:
        TN += 1

# compute the metrics once, after all predictions have been counted
acc = (TP + TN) / (TP + TN + FP + FN)
rec = TP / (TP + FN)    # recall: fraction of actual D tweets predicted as D
prec = TP / (TP + FP)   # precision: fraction of predicted D tweets that are actually D
print("Accuracy:", acc)
print("Recall:", rec)
print("Precision:", prec)

Test labels: [True, False, False, True, True, True, False, True, False, True, False, True, False, True, True, False, True, True, False, True, True, False, True, True, False, True, True, False, True, False]
Accuracy: 0.6644951140065146
Recall: 0.7307343608340888
Precision: 0.6552845528455284
