# Week 5: A Naive Bayes classifier

Suppose President Trump gets savvy, and realizes that people can use the Android/iPhone distinction to separate his tweets from the the tweets of aides. He starts using an iPhone too. Now, how will we distinguish tweets really authored by the President?

Well, one thing we can do is train a classifier to predict authorship using the text itself.

In [10]:
import os, csv, math, random
import pandas as pd
import numpy as np

from collections import Counter

cwd = os.getcwd()
print('Current working directory: ' + cwd + '\n')
      
relativepath = '/Users/rdubnic2/Documents/lis590dsh/Data/trump.csv'
trump = pd.read_csv(relativepath)
trump.head()

Current working directory: /Users/rdubnic2/Documents/lis590dsh/Code



Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,0,1,My economic policy speech will be carried live...,False,9214,,2016-08-08 15:20:44,False,,762669882571980801,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,3107,False,False,,
1,1,2,"Join me in Fayetteville, North Carolina tomorr...",False,6981,,2016-08-08 13:28:20,False,,762641595439190016,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,2390,False,False,,
2,2,3,"#ICYMI: ""Will Media Apologize to Trump?"" https...",False,15724,,2016-08-08 00:05:54,False,,762439658911338496,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,6691,False,False,,
3,3,4,"Michael Morell, the lightweight former Acting ...",False,19837,,2016-08-07 23:09:08,False,,762425371874557952,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,6402,False,False,,
4,4,5,The media is going crazy. They totally distort...,False,34051,,2016-08-07 21:31:46,False,,762400869858115588,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,11717,False,False,,


Let's simplify the Trump dataframe. We only need three columns:

1. Text of the tweet
2. Source = Android or iphone
3. A random number 0-4 that we'll use to divide the dataset into five 'folds'.


In [11]:
def trump_test(a_data_frame, rowidx):
    if 'iphone' in a_data_frame['statusSource'][rowidx]:
        return 'iphone'
    elif 'android' in a_data_frame['statusSource'][rowidx]:
        return 'android'
    else:
        return 'other'
    
tweet_text = trump['text']

source = []
fold = []
for idx in trump.index:
    source.append(trump_test(trump, idx))
    fold.append(random.sample(list(range(5)), 1)[0])
source = pd.Series(source, index = trump.index)
fold = pd.Series(fold, index = trump.index)

tdf = pd.concat([tweet_text, source, fold], axis = 1)
tdf.columns = ['text', 'source', 'fold']

# limit the dataframe to columns with either android or iphone;
# exclude 'other'
tdf = tdf[(tdf['source'] == 'android') | (tdf['source'] == 'iphone')]
tdf.head()


Unnamed: 0,text,source,fold
0,My economic policy speech will be carried live...,android,3
1,"Join me in Fayetteville, North Carolina tomorr...",iphone,3
2,"#ICYMI: ""Will Media Apologize to Trump?"" https...",iphone,0
3,"Michael Morell, the lightweight former Acting ...",android,1
4,The media is going crazy. They totally distort...,android,0


Now we need to divide the dataset into a training set and a test set. This is easily done using the "folds." We select one fold as our test set and use all the others as a training set.



In [12]:
testset = tdf[tdf['fold'] == 4]
trainingset = tdf[tdf['fold'] != 4]
print('Training set includes ' + str(trainingset.shape[0]))
print('Test set includes ' + str(testset.shape[0]))

Training set includes 1113
Test set includes 277


In [13]:
testset.head()

Unnamed: 0,text,source,fold
6,"Thank you Windham, New Hampshire! #TrumpPence1...",iphone,4
11,"Anybody whose mind ""SHORT CIRCUITS"" is not fit...",android,4
24,President Obama should ask the DNC about how t...,iphone,4
28,Great meeting all of you. This group knocked o...,iphone,4
30,"Thank you Jacksonville, Florida!\n#MakeAmerica...",iphone,4


Now we need some basic text-wrangling functions that we've used before.

In [14]:
def tokenize(astring):
    ''' Breaks a string into words, and counts them.
    Designed so it strips punctuation and lowercases everything,
    but doesn't separate hashtags and at-signs.
    '''
    wordcounts = Counter()
    # create a counter to hold the counts
    
    tokens = astring.split()
    for t in tokens:
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
        
    return wordcounts

def create_vocab(seq_of_strings, n):
    ''' Given a sequence of text snippets, this function
    returns the n most common words. We'll use this to
    create a limited 'vocabulary'.
    '''
    vocab = Counter()
    for astring in seq_of_strings:
        counts = tokenize(astring)
        vocab = vocab + counts
    topn = [x[0] for x in vocab.most_common(n)]
    return topn

Now we can actually write functions that build a Naive Bayes model. ```train_nb_model``` is the central function here. It calls the other two.

In [15]:
def categorize(df, rowidx): # 'df' is 'data frame'
    if df.loc[rowidx, 'source'] == 'android':
        return 'positive' # can change 'positive' or 'negative' to be whatever labels you want (e.g. fiction, nonfiction, iphone, android)
    elif df.loc[rowidx, 'source'] == 'iphone': # pandas function to find the object's row, then check a given column for a classification
        return 'negative'
    else:
        print('error: neither iphone nor android')
        return 'other'

def get_priors(df):
    source_counts = df.groupby('source').count()['text']
    print(source_counts)
    positive_odds = source_counts['android'] / source_counts['iphone']
    negative_odds = source_counts['iphone'] / source_counts['android']
    return math.log(positive_odds), math.log(negative_odds)

def train_nb_model(df, p): # p is the number of words you're using in your model, the portion of the total vocabulary
    vocab = create_vocab(df['text'], p)
    vocabset = set(vocab)
    # we make a set because membership-checking is faster
    # in sets; but we also hold onto the list, which is ordered
    
    positive_prior, negative_prior = get_priors(df)
    
    positive_counts = Counter()
    negative_counts = Counter()
    
    for i in df.index:
        tweet = df['text'][i]
        tweet_counts = tokenize(tweet)
        category = categorize(df, i)
        if category == 'negative':
            negative_counts = negative_counts + tweet_counts
        elif category == 'positive':
            positive_counts = positive_counts + tweet_counts
    
    # Now let's organize these Counters into a DataFrame
    
    negative = pd.Series(1, index = vocab)
    positive = pd.Series(1, index = vocab)
    # notice initializing to 1 -- Laplacian smoothing
    
    for word, count in positive_counts.items():
        if word in vocabset:
            positive[word] += count
    
    for word, count in negative_counts.items():
        if word in vocabset:
            negative[word] += count
    
    all_prob = (negative + positive) / (np.sum(negative) + np.sum(positive))
    
    negative_prob = negative / np.sum(negative)
    positive_prob = positive / np.sum(positive)
    
    # note that when we sum up the negative and positive
    # columns, we are also summing up all the Laplacian 1's
    # we initially added to them
    
    model = pd.concat([negative, positive, all_prob, 
                       negative_prob, positive_prob], axis = 1) 
        
    model.columns = ['neg', 'pos', 'all_prob', 'neg_prob', 'pos_prob']
    
    # The next step is unnecessary, and will not be found in
    # most published versions of naive Bayes. I'm providing it
    # because it may help you understand the logic of the
    # algorithm.
    
    model['neg_norm'] = negative_prob / all_prob
    model['pos_norm'] = positive_prob / all_prob
    
    
    model['log_neg'] = [math.log(x) for x in model['neg_norm']] # using log to control for long floats, which are hard to handle
    model['log_pos'] = [math.log(x) for x in model['pos_norm']]
    return vocab, positive_prior, negative_prior, model

vocab, positive_prior, negative_prior, model = train_nb_model(trainingset, 2350)
model.head() 
        

source
android    600
iphone     513
Name: text, dtype: int64


Unnamed: 0,neg,pos,all_prob,neg_prob,pos_prob,neg_norm,pos_norm,log_neg,log_pos
the,165,497,0.029308,0.019233,0.035477,0.656247,1.210512,-0.421219,0.191043
to,136,262,0.01762,0.015853,0.018702,0.899698,1.061424,-0.105696,0.059611
and,90,300,0.017266,0.010491,0.021415,0.607602,1.240302,-0.498236,0.215355
a,104,210,0.013901,0.012123,0.01499,0.872057,1.078351,-0.136901,0.075433
in,127,180,0.013591,0.014804,0.012849,1.089197,0.945376,0.085441,-0.056172


Notice that we're using logarithms of the probabilities, so that we can just add them up. Our priors are logarithms, too.

In [26]:
testset.head()

Unnamed: 0,text,source,fold
6,"Thank you Windham, New Hampshire! #TrumpPence1...",iphone,4
11,"Anybody whose mind ""SHORT CIRCUITS"" is not fit...",android,4
24,President Obama should ask the DNC about how t...,iphone,4
28,Great meeting all of you. This group knocked o...,iphone,4
30,"Thank you Jacksonville, Florida!\n#MakeAmerica...",iphone,4


In [17]:
print(positive_prior, negative_prior)

0.1566538100453768 -0.15665381004537685


Now let's write a function that applies a given model to a given testset. It will have lots of arguments.

In [18]:
pd.options.mode.chained_assignment = None

def apply_model(vocab, positive_prior, negative_prior, model, testset):
    right = 0
    wrong = 0
    vocabset = set(vocab)
    odds_pos = []
    odds_neg = []

    for i in testset.index:
        odds_positive = positive_prior
        odds_negative = negative_prior
        tweet = testset['text'][i]
        tweet_counts = tokenize(tweet)
        for word, count in tweet_counts.items():
            if word not in vocabset:
                continue
            odds_positive += model.loc[word, 'log_pos']
            odds_negative += model.loc[word, 'log_neg']
            
        if odds_positive > odds_negative:
            prediction = 'positive'
        else:
            prediction = 'negative'
        
        odds_pos.append(odds_positive)
        odds_neg.append(odds_negative)

        reality = categorize(testset, i)
        if reality != 'positive' and reality != 'negative':
            continue
        elif prediction == reality:
            right += 1
        else:
            wrong += 1

    print("Got " + str(right) + " rows right, and " + str(wrong) + " wrong.")
    accuracy = (right / (wrong + right)) * 100
    print("Accuracy was {0:.2f}%".format(accuracy))
    
    resultset = testset.copy()
    resultset['odds_positive'] = odds_pos
    resultset['odds_negative'] = odds_neg
    resultset = resultset.sort_values(by = 'odds_positive')
    
    return resultset, accuracy

newtestset, accuracy = apply_model(vocab, positive_prior, 
                         negative_prior, model, testset)

Got 221 rows right, and 56 wrong.
Accuracy was 79.78%


The ```apply_model``` function returns a version of the test set with two new columns. The dataframe is sorted by the (ascending) odds of being in the positive class, so we can find the "Trumpiest" and "least Trumpy" tweets by saying ```.tail()``` or ```.head()``` respectively.

In [19]:
testset.head()

Unnamed: 0,text,source,fold
6,"Thank you Windham, New Hampshire! #TrumpPence1...",iphone,4
11,"Anybody whose mind ""SHORT CIRCUITS"" is not fit...",android,4
24,President Obama should ask the DNC about how t...,iphone,4
28,Great meeting all of you. This group knocked o...,iphone,4
30,"Thank you Jacksonville, Florida!\n#MakeAmerica...",iphone,4


## Exercise 1.

To start with, just play around with the functions above in order to find a value of ```p``` (number of parameters in the model) that roughly maximizes accuracy on the test set.

What accuracy do you get if you train a model on the whole ```tdf``` data frame, and also apply it to ```tdf``` as a whole?

In [20]:
vocab, positive_prior, negative_prior, model = train_nb_model(trainingset, 2280)
model.head() 

newtestset, accuracy = apply_model(vocab, positive_prior, 
                         negative_prior, model, testset)

source
android    600
iphone     513
Name: text, dtype: int64
Got 221 rows right, and 56 wrong.
Accuracy was 79.78%


## Exercise 2.

Write a function that *cross-validates* a modeling strategy by applying it successively to five different training sets and testing it on five different test sets.

This is called "five-fold crossvalidation."

In [21]:
def five_fold_cross_valid(tdf, p):
    accuracies = []
    for i in range(5):
        tdf_test_set = tdf[tdf['fold'] == i]
        tdf_training_set = tdf[tdf['fold'] != i] 
        vocab, positive_prior, negative_prior, model = train_nb_model(tdf_training_set, p)
        tdf_test_set, accuracy = apply_model(vocab, positive_prior, negative_prior, model, testset)
        accuracies.append(accuracy)
    avg_acc = print('Average accuracy is ', round(sum(accuracies)/len(accuracies),2), '%')
    return avg_acc

In [22]:
five_fold_cross_valid(tdf, 1900)

source
android    599
iphone     510
Name: text, dtype: int64
Got 250 rows right, and 27 wrong.
Accuracy was 90.25%
source
android    622
iphone     497
Name: text, dtype: int64
Got 252 rows right, and 25 wrong.
Accuracy was 90.97%
source
android    594
iphone     488
Name: text, dtype: int64
Got 249 rows right, and 28 wrong.
Accuracy was 89.89%
source
android    633
iphone     504
Name: text, dtype: int64
Got 248 rows right, and 29 wrong.
Accuracy was 89.53%
source
android    600
iphone     513
Name: text, dtype: int64
Got 222 rows right, and 55 wrong.
Accuracy was 80.14%
Average accuracy is  88.16 %


## Exercise 3 (probably, for homework).

Do all this for the poefic dataset, trying to distinguish poetry from fiction. Create a new notebook. Copy functions as needed in order to build a naive Bayes classifier and run five-fold crossvalidation.

How much accuracy do you get? Why do you think that accuracy is higher or lower than it was on the Trump tweet data? (You might want to inspect the data itself, using Excel or a text editor.)