# Data Smart Sans Excel

[Table of Contents](Data%20Smart%20Sans%20Excel.ipynb)

If you have not yet downloaded the Data Smart files, run the first code block of the main notebook; it downloads the files from the web.

## Chapter 3 - Naïve Bayes

In [47]:
import os
import pandas as pd
import numpy as np
excel_file = os.path.join(os.getcwd(), "data_smart_files", "Mandrill.xlsx")
# usecols replaces the older parse_cols argument, which current pandas releases no longer accept
about_mandrill_df = pd.read_excel(excel_file, 'AboutMandrillApp', usecols="A", index_col=None)
about_other_df = pd.read_excel(excel_file, 'AboutOther', usecols="A", index_col=None)

def clean(s): 
    s = s.lower()
    s = s.replace('. ', ' ')
    s = s.replace(': ', ' ')
    s = s.replace('?', ' ')
    s = s.replace('!', ' ')
    s = s.replace(';', ' ')
    s = s.replace(',', ' ')
    return s

def split_(s):
    return s.split(' ')

#Clean each tweet and then split the tweets into lists of words
about_mandrill_df = about_mandrill_df.applymap(clean)
about_mandrill_df = about_mandrill_df.applymap(split_)
about_other_df = about_other_df.applymap(clean)
about_other_df = about_other_df.applymap(split_)
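
To make the cleaning step concrete, here is a minimal sketch that applies the same replacements to a single made-up string. The name `clean_sketch` and the sample text are assumptions for illustration only; they are not part of the notebook.

```python
def clean_sketch(s):
    # Same replacements as clean() above, in the same order, written as a loop.
    s = s.lower()
    for mark in ('. ', ': ', '?', '!', ';', ','):
        s = s.replace(mark, ' ')
    return s

# Made-up sample text (not from the Mandrill data set).
sample = "Hello, World! Visit http://mandrill.com. Now"
tokens = clean_sketch(sample).split(' ')
print(tokens)
```

Because only `'. '` and `': '` (with a trailing space) are replaced, the URL survives as one token. Splitting on a single space also leaves empty strings wherever two spaces end up adjacent, which is why `''` is removed from the token sets in the next cell.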

Flatten the training tweets into word lists, then reduce each list to a set of unique words (tokens).

In [20]:
import itertools
about_mandrill_words = []
for row in about_mandrill_df.itertuples():
    about_mandrill_words.append(row[1])
about_mandrill_words  = list(itertools.chain(*about_mandrill_words)) #flatten list
amw_unique = set(about_mandrill_words)
if '' in amw_unique:
    amw_unique.remove('')

about_other_words = []
for row in about_other_df.itertuples():
    about_other_words.append(row[1])
about_other_words  = list(itertools.chain(*about_other_words)) #flatten list
aow_unique = set(about_other_words)
if '' in aow_unique:
    aow_unique.remove('')

Count the occurrences of each token within the training sets, adding 1 to every count.

In [52]:
about_mandrill_count_dict = {}
for word in amw_unique:
    about_mandrill_count_dict[word] = about_mandrill_words.count(word) + 1
md_sum = sum(about_mandrill_count_dict.values()) 

about_other_count_dict = {}
for word in aow_unique:
    about_other_count_dict[word] = about_other_words.count(word) + 1
ot_sum = sum(about_other_count_dict.values()) 
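
Calling `list.count` once per unique word rescans the whole word list each time. As a possible alternative, the sketch below uses `collections.Counter` from the standard library to get the same add-one counts in a single pass. The `words` list is hypothetical toy data, not the Mandrill tweets.

```python
from collections import Counter

# Hypothetical token list standing in for about_mandrill_words.
words = ['mandrill', 'email', 'mandrill', '', 'api']

# Count every non-empty token in one pass over the list.
counts = Counter(w for w in words if w != '')

# Add 1 to each count, as the cell above does.
smoothed = {word: n + 1 for word, n in counts.items()}
total = sum(smoothed.values())
print(smoothed, total)
```

For a data set of this size either approach is fast enough; the difference only matters for much larger vocabularies.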

Calculate the probability of each token's occurrence within its training set, as well as the natural log of that probability.

In [38]:
md_df = pd.DataFrame(list(about_mandrill_count_dict.values()),index=list(about_mandrill_count_dict.keys()))
md_df.rename(columns={0: 'Token Count + 1'}, inplace=True)
md_df['P(Token|App)'] = md_df['Token Count + 1'] / md_sum
md_df['LN(P)'] = np.log(md_df['P(Token|App)'])

ot_df = pd.DataFrame(list(about_other_count_dict.values()),index=list(about_other_count_dict.keys()))
ot_df.rename(columns={0: 'Token Count + 1'}, inplace=True)
ot_df['P(Token|Other)'] = ot_df['Token Count + 1'] / ot_sum
ot_df['LN(P)'] = np.log(ot_df['P(Token|Other)'])
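
Written out, the two columns computed above correspond to the following, where $w$ is a token, $c$ is a class (App or Other), and $V_c$ is the set of unique tokens seen in that class. These symbols are notation introduced here for illustration; the denominator is what the code calls `md_sum` or `ot_sum`.

$$
P(w \mid c) = \frac{\operatorname{count}(w, c) + 1}{\sum_{w' \in V_c} \bigl(\operatorname{count}(w', c) + 1\bigr)},
\qquad
\mathrm{LN}(P) = \ln P(w \mid c)
$$

Working with $\ln P$ lets a tweet's score be a sum of logs rather than a product of many small probabilities.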

A function to compute the log conditional probability of a particular tweet under a given model.

In [49]:
def tweet_scorer(tweet, probs_df):
    #tweet is a cleaned string; it is split into individual words below
    prob_total = 0
    s = probs_df['Token Count + 1'].sum()
    for word in tweet.split(' '):
        if len(word) <= 3: continue # no score for short words
        if word in probs_df.index:
            prob_total += probs_df['LN(P)'].loc[word]
        else:
            prob_total += np.log(1/s)
    return prob_total
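
The scoring logic can be traced by hand on a toy model. The sketch below mirrors `tweet_scorer` using plain dictionaries and `math.log` instead of a DataFrame. The function name `score`, the count dictionaries, and the sample tweet are all hypothetical.

```python
import math

def score(words, smoothed_counts):
    # Denominator: sum of all (count + 1) values, like s in tweet_scorer.
    total = sum(smoothed_counts.values())
    log_prob = 0.0
    for word in words:
        if len(word) <= 3:
            continue  # short words contribute nothing, as in tweet_scorer
        # An unseen token is treated as if its smoothed count were 1.
        count = smoothed_counts.get(word, 1)
        log_prob += math.log(count / total)
    return log_prob

# Hypothetical smoothed counts (each already includes the +1).
app_counts = {'mandrill': 5, 'email': 3, 'send': 2}
other_counts = {'mandrill': 2, 'monkey': 6, 'spark': 2}

tweet = 'send email with mandrill'.split(' ')
app_score = score(tweet, app_counts)
other_score = score(tweet, other_counts)
label = 'Mandrill App' if app_score > other_score else 'Other'
print(label)
```

Here the app score is ln(0.2 × 0.3 × 0.1 × 0.5) and the other score is ln(0.1 × 0.1 × 0.1 × 0.2), so the tweet is labelled as the app. Comparing the two sums directly, with no prior term, amounts to treating both classes as equally likely.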

Use the model to predict the class of each test tweet.

In [55]:
test_tweets_df = pd.read_excel(excel_file, 'TestTweets', usecols="C", index_col=None)
test_tweets_df = test_tweets_df.applymap(clean)

print ('Predictions')
print ('-'*13)
num = 1
for row in test_tweets_df.itertuples():
    tweet = row[1]
    if tweet_scorer(tweet, md_df) > tweet_scorer(tweet, ot_df):
        print ('#%i: Mandrill App (%s)'%(num, tweet))
    else:
        print ('#%i: Other (%s)'%(num, tweet))
    num += 1

Predictions
-------------
#1: Mandrill App (just love @mandrillapp transactional email service - http://mandrill.com sorry @sendgrid and @mailjet #timetomoveon)
#2: Mandrill App (@rossdeane mind submitting a request at http://help.mandrill.com with account details if you haven't already  glad to take a look )
#3: Mandrill App (@veroapp any chance you'll be adding mandrill support to vero )
#4: Mandrill App (@elie__ @camj59 jparle de relai smtp 1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet)
#5: Mandrill App (would like to send emails for welcome  password resets  payment notifications  etc what should i use  was looking at mailgun/mandrill)
#6: Mandrill App (from coworker about using mandrill  "i would entrust email handling to a pokemon".)
#7: Mandrill App (@mandrill realised i did that about 5 seconds after hitting send )
#8: Mandrill App (holy shit it’s here http://www.mandrill.com/ )
#9: Mandrill App (our new subscriber profi

We got the same predictions as the book. 