# Week 5: A Naive Bayes classifier

Suppose President Trump gets savvy and realizes that people can use the Android/iPhone distinction to separate his tweets from the tweets of aides. He starts using an iPhone too. Now, how will we distinguish tweets really authored by the President?

Well, one thing we can do is train a classifier to predict authorship using the text itself.

In [43]:
import os, csv, math, random
import pandas as pd
import numpy as np

from collections import Counter

cwd = os.getcwd()
print('Current working directory: ' + cwd + '\n')
      
relativepath = os.path.join('..', 'data', 'weekfour', 'trump.csv')
trump = pd.read_csv(relativepath)
trump.head()

Current working directory: /Users/tunder/Dropbox/courses/2017datasci/code



Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,0,1,My economic policy speech will be carried live...,False,9214,,2016-08-08 15:20:44,False,,762669882571980801,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,3107,False,False,,
1,1,2,"Join me in Fayetteville, North Carolina tomorr...",False,6981,,2016-08-08 13:28:20,False,,762641595439190016,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,2390,False,False,,
2,2,3,"#ICYMI: ""Will Media Apologize to Trump?"" https...",False,15724,,2016-08-08 00:05:54,False,,762439658911338496,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,6691,False,False,,
3,3,4,"Michael Morell, the lightweight former Acting ...",False,19837,,2016-08-07 23:09:08,False,,762425371874557952,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,6402,False,False,,
4,4,5,The media is going crazy. They totally distort...,False,34051,,2016-08-07 21:31:46,False,,762400869858115588,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,11717,False,False,,


Let's simplify the Trump dataframe. We only need three columns:

1. Text of the tweet
2. Source = Android or iPhone
3. A random number 0-4 that we'll use to divide the dataset into five 'folds'.


In [56]:
def trump_test(a_data_frame, rowidx):
    if 'iphone' in a_data_frame['statusSource'][rowidx]:
        return 'iphone'
    elif 'android' in a_data_frame['statusSource'][rowidx]:
        return 'android'
    else:
        return 'other'
    
tweet_text = trump['text']

source = []
fold = []
for idx in trump.index:
    source.append(trump_test(trump, idx))
    # assign each tweet to a random fold, 0 through 4
    fold.append(random.randrange(5))
source = pd.Series(source, index = trump.index)
fold = pd.Series(fold, index = trump.index)

tdf = pd.concat([tweet_text, source, fold], axis = 1)
tdf.columns = ['text', 'source', 'fold']

# limit the dataframe to rows with either android or iphone;
# exclude 'other'
tdf = tdf[(tdf['source'] == 'android') | (tdf['source'] == 'iphone')]
tdf.head()


Unnamed: 0,text,source,fold
0,My economic policy speech will be carried live...,android,1
1,"Join me in Fayetteville, North Carolina tomorr...",iphone,3
2,"#ICYMI: ""Will Media Apologize to Trump?"" https...",iphone,4
3,"Michael Morell, the lightweight former Acting ...",android,2
4,The media is going crazy. They totally distort...,android,2


Now we need to divide the dataset into a training set and a test set. This is easily done using the "folds." We select one fold as our test set and use all the others as a training set.



In [57]:
testset = tdf[tdf['fold'] == 4]
trainingset = tdf[tdf['fold'] != 4]
print('Training set includes ' + str(trainingset.shape[0]))
print('Test set includes ' + str(testset.shape[0]))

Training set includes 1101
Test set includes 289


Now we need some basic text-wrangling functions that we've used before.

In [5]:
def tokenize(astring):
    ''' Breaks a string into words, and counts them.
    Designed so it strips punctuation and lowercases everything,
    but doesn't separate hashtags and at-signs.
    '''
    wordcounts = Counter()
    # create a counter to hold the counts
    
    tokens = astring.split()
    for t in tokens:
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
        
    return wordcounts

def create_vocab(seq_of_strings, n):
    ''' Given a sequence of text snippets, this function
    returns the n most common words. We'll use this to
    create a limited 'vocabulary'.
    '''
    vocab = Counter()
    for astring in seq_of_strings:
        counts = tokenize(astring)
        # add this snippet's counts to the running total in place
        vocab.update(counts)
    topn = [x[0] for x in vocab.most_common(n)]
    return topn
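
As a quick check of what ```tokenize``` does, the sketch below repeats its logic (so that the block runs on its own) and applies it to a made-up sentence that is not taken from the dataset. Hashtags and at-signs survive, while a token made only of punctuation, such as ```--```, is stripped down to an empty string and still counted as a "word."

```python
from collections import Counter

def tokenize(astring):
    # Same logic as the notebook's tokenize(), repeated so this block runs alone.
    wordcounts = Counter()
    for t in astring.split():
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
    return wordcounts

# A hypothetical sentence, invented for illustration.
sample = 'Thank you, @FoxNews! The media -- the MEDIA -- is #Rigged.'
counts = tokenize(sample)

print(counts['the'], counts['media'])         # lowercasing merges 'The'/'the' and 'media'/'MEDIA'
print(counts['@foxnews'], counts['#rigged'])  # at-signs and hashtags are kept
print(counts[''])                             # the two '--' tokens end up as empty strings
```

If the empty string turns out to be common enough, it can land in the vocabulary returned by ```create_vocab```, which is worth keeping in mind when you inspect the model.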

Now we can actually write functions that build a Naive Bayes model. ```train_nb_model``` is the central function here. It calls the other two.

In [88]:
def categorize(df, rowidx):
    if df.loc[rowidx, 'source'] == 'android':
        return 'positive'
    elif df.loc[rowidx, 'source'] == 'iphone':
        return 'negative'
    else:
        print('error: neither iphone nor android')
        return 'other'

def get_priors(df):
    source_counts = df.groupby('source').count()['text']
    print(source_counts)
    positive_odds = source_counts['android'] / source_counts['iphone']
    negative_odds = source_counts['iphone'] / source_counts['android']
    return math.log(positive_odds), math.log(negative_odds)

def train_nb_model(df, p):
    vocab = create_vocab(df['text'], p)
    vocabset = set(vocab)
    # we make a set because membership-checking is faster
    # in sets; but we also hold onto the list, which is ordered
    
    positive_prior, negative_prior = get_priors(df)
    
    positive_counts = Counter()
    negative_counts = Counter()
    
    for i in df.index:
        tweet = df['text'][i]
        tweet_counts = tokenize(tweet)
        category = categorize(df, i)
        if category == 'negative':
            negative_counts.update(tweet_counts)
        elif category == 'positive':
            positive_counts.update(tweet_counts)
    
    # Now let's organize these Counters into a DataFrame
    
    negative = pd.Series(1, index = vocab)
    positive = pd.Series(1, index = vocab)
    # notice initializing to 1 -- Laplacian smoothing
    
    for word, count in positive_counts.items():
        if word in vocabset:
            positive[word] += count
    
    for word, count in negative_counts.items():
        if word in vocabset:
            negative[word] += count
    
    all_prob = (negative + positive) / (np.sum(negative) + np.sum(positive))
    
    negative_prob = negative / np.sum(negative)
    positive_prob = positive / np.sum(positive)
    
    # note that when we sum up the negative and positive
    # columns, we are also summing up all the Laplacian 1's
    # we initially added to them
    
    model = pd.concat([negative, positive, all_prob, 
                       negative_prob, positive_prob], axis = 1) 
        
    model.columns = ['neg', 'pos', 'all_prob', 'neg_prob', 'pos_prob']
    
    # The next step is unnecessary, and will not be found in
    # most published versions of naive Bayes. I'm providing it
    # because it may help you understand the logic of the
    # algorithm.
    
    model['neg_norm'] = negative_prob / all_prob
    model['pos_norm'] = positive_prob / all_prob
    
    
    model['log_neg'] = [math.log(x) for x in model['neg_norm']]
    model['log_pos'] = [math.log(x) for x in model['pos_norm']]
    return vocab, positive_prior, negative_prior, model

vocab, positive_prior, negative_prior, model = train_nb_model(trainingset, 75)
model.head() 
        

source
android    606
iphone     495
Name: text, dtype: int64


Unnamed: 0,neg,pos,all_prob,neg_prob,pos_prob,neg_norm,pos_norm,log_neg,log_pos
the,167,497,0.075558,0.056783,0.085001,0.751525,1.124981,-0.285651,0.117766
to,131,252,0.043582,0.044543,0.043099,1.022039,0.988914,0.0218,-0.011147
and,77,298,0.042672,0.026182,0.050966,0.613556,1.194378,-0.488483,0.177626
in,107,198,0.034706,0.036382,0.033864,1.048284,0.975713,0.047155,-0.024586
a,98,207,0.034706,0.033322,0.035403,0.960111,1.020064,-0.040707,0.019865
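
The 1 that every count starts from matters most for words that never occur in one of the classes. Here is a minimal sketch with a hypothetical three-word vocabulary and made-up counts (none of these numbers come from the Trump data). Without the added 1, an unseen word would get probability 0, and its logarithm would be undefined.

```python
import math
from collections import Counter

# Hypothetical vocabulary and counts, invented for illustration.
vocab = ['crooked', 'thank', 'join']
android_counts = Counter({'crooked': 4, 'thank': 1})  # 'join' never appears

def smoothed_probs(counts, vocab):
    # Start every vocabulary word at 1, then add the observed count.
    smoothed = {word: 1 + counts[word] for word in vocab}
    total = sum(smoothed.values())
    return {word: n / total for word, n in smoothed.items()}

# Unsmoothed estimate: an unseen word gets probability zero,
# and math.log(0.0) would raise a ValueError.
raw_total = sum(android_counts[word] for word in vocab)
raw_join = android_counts['join'] / raw_total

android_prob = smoothed_probs(android_counts, vocab)
print(raw_join)                        # 0.0
print(android_prob['join'])            # 1 / 8 = 0.125
print(math.log(android_prob['join']))  # a finite negative number
```

The smoothed probabilities still sum to 1, because the added 1's are included in the total, just as the comment in ```train_nb_model``` points out.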


Notice that we're using logarithms of the probabilities, so that we can just add them up instead of multiplying the probabilities themselves. Our priors are logarithms, too.
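
To see why adding logarithms is equivalent to multiplying, and why it is safer, here is a small sketch. The first half uses the rounded ```pos_norm``` and ```log_pos``` values printed in the model table for "the" and "and"; the second half uses a made-up list of tiny probabilities to show a product underflowing to zero while the sum of logs stays usable.

```python
import math

# Values copied (rounded) from the model table in this notebook.
pos_norm = {'the': 1.124981, 'and': 1.194378}
log_pos = {'the': 0.117766, 'and': 0.177626}

product = pos_norm['the'] * pos_norm['and']
sum_of_logs = log_pos['the'] + log_pos['and']
# The log of the product matches the sum of the logs, up to rounding.
print(math.log(product), sum_of_logs)

# Hypothetical: 100 words, each with probability 0.00001.
tiny = [1e-5] * 100
tiny_product = math.prod(tiny)                 # too small for a float: becomes 0.0
tiny_log_sum = sum(math.log(p) for p in tiny)  # roughly -1151.3, still comparable
print(tiny_product, tiny_log_sum)
```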

In [89]:
print(positive_prior, negative_prior)

0.20232222350062407 -0.2023222235006242
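
These two numbers can be reproduced by hand from the class counts printed during training (606 Android tweets and 495 iPhone tweets). A minimal sketch using only the standard library; the variable names are chosen for illustration:

```python
import math

# Class counts printed by get_priors() for the training set.
android_count = 606
iphone_count = 495

# get_priors() returns the log of each class's odds against the other.
positive_prior = math.log(android_count / iphone_count)
negative_prior = math.log(iphone_count / android_count)

print(positive_prior, negative_prior)  # roughly 0.2023 and -0.2023
```

The two values mirror each other because log(a/b) = -log(b/a): the training set contains somewhat more Android tweets, so the positive class starts with a small head start before any words are considered.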


Now let's write a function that applies a given model to a given testset. It will have lots of arguments.

In [90]:
pd.options.mode.chained_assignment = None

def apply_model(vocab, positive_prior, negative_prior, model, testset):
    right = 0
    wrong = 0
    vocabset = set(vocab)
    odds_pos = []
    odds_neg = []

    for i in testset.index:
        odds_positive = positive_prior
        odds_negative = negative_prior
        tweet = testset['text'][i]
        tweet_counts = tokenize(tweet)
        for word, count in tweet_counts.items():
            if word not in vocabset:
                continue
            odds_positive += model.loc[word, 'log_pos']
            odds_negative += model.loc[word, 'log_neg']
            
        if odds_positive > odds_negative:
            prediction = 'positive'
        else:
            prediction = 'negative'
        
        odds_pos.append(odds_positive)
        odds_neg.append(odds_negative)

        reality = categorize(testset, i)
        if reality != 'positive' and reality != 'negative':
            continue
        elif prediction == reality:
            right += 1
        else:
            wrong += 1

    print("Got " + str(right) + " rows right, and " + str(wrong) + " wrong.")
    accuracy = (right / (wrong + right)) * 100
    print("Accuracy was {0:.2f}%".format(accuracy))
    
    resultset = testset.copy()
    resultset['odds_positive'] = odds_pos
    resultset['odds_negative'] = odds_neg
    resultset = resultset.sort_values(by = 'odds_positive')
    
    return resultset

newtestset = apply_model(vocab, positive_prior, 
                         negative_prior, model, testset)

Got 205 rows right, and 84 wrong.
Accuracy was 70.93%


The ```apply_model``` function returns a copy of the test set with two new columns. The dataframe is sorted in ascending order by the odds of being in the positive class, so we can find the "Trumpiest" and "least Trumpy" tweets by calling ```.tail()``` or ```.head()```, respectively.

In [87]:
newtestset.tail(20)


Unnamed: 0,text,source,fold,odds_positive,odds_negative
162,The invention of email has proven to be a very...,iphone,4,1.943128,-4.918958
1134,"Obama, and all others, have been so weak, and ...",android,4,1.962053,-5.507154
104,The dishonest media didn't mention that Bernie...,android,4,1.962424,-5.452342
186,Even though Bernie Sanders has lost his energy...,android,4,2.013873,-4.988697
1347,A big fat hit job on @oreillyfactor tonight. A...,android,4,2.014236,-5.99323
1236,Word is I am doing very well in Michigan and M...,android,4,2.050395,-6.183038
1369,The reason that Ted Cruz lost the Evangelicals...,android,4,2.086914,-5.638276
154,"I hate to say it, but the Republican Conventio...",android,4,2.102557,-5.473023
1127,Lyin' Ted Cruz denied that he had anything to ...,android,4,2.114487,-6.330089
679,The Inspector General's report on Crooked Hill...,android,4,2.144761,-6.036284


## Exercise 1.

To start with, just play around with the functions above in order to find a value of ```p``` (number of parameters in the model) that roughly maximizes accuracy on the test set.

What accuracy do you get if you train a model on the whole ```tdf``` data frame, and also apply it to ```tdf``` as a whole?

## Exercise 2.

Write a function that *cross-validates* a modeling strategy by applying it successively to five different training sets and testing it on five different test sets.

This is called "five-fold cross-validation."

## Exercise 3 (probably, for homework).

Do all this for the poefic dataset, trying to distinguish poetry from fiction. Create a new notebook. Copy functions as needed in order to build a naive Bayes classifier and run five-fold cross-validation.

How much accuracy do you get? Why do you think that accuracy is higher or lower than it was on the Trump tweet data? (You might want to inspect the data itself, using Excel or a text editor.)