#DATASCI W261: Machine Learning at Scale  

#Vineet Gangwar
**vineet.gangwar@gmail.com **  
**W261-2: Machine Learning at Scale**  
**Assignment #1**   
**Date: Sep - 15 - 2015**

#HW1.0.0  
Define big data. Provide an example of a big data problem in your domain of expertise  
  
**Answer:**  
*Big data definition*  
Big data refers to the following:
- Data so large that traditional application are inadequate for both storage and analysis
- Data that is large enough to require more than one machine to store and process it

*Example of Big Data*  
My domain is Enterprise Monitoring. In this domain, we monitor parameters of running IT systems such as operating systems, networks, applications) and generate alerts based on thresholds. The parameters monitored include: CPU, Memory, Disk, Network, logs and many more. The parameter list can go into hundreds.  
Apart from alerting, we also need to store historical data of all monitored parameters. An organization with a few thousand machines, can generate a few terabytes of data per day. Storing this data and making it available for efficient historical analysis is a big data problem

#HW1.0.1  
In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, the irreduciable error for a test dataset T when using polynomial regression models of degree 1, 2,3, 4,5 are considered. How would you select a model?

**Answer:**  
Statistical models attempt to model reality. So there will be an error between the model and reality (true function). The error has 2 parts - Irreducible and Reducible. The Reducible error also has two parts - squared bias and variance.  

Calculation of Squared Bias and Variance:  
> If we take 100 different training dataset with 50 observations each, then we can estimate 100 different models.  

Error due to squared bias is the expected error (of these 100 models) where the error of each model is calculated as the error between the prediction and the y in the training data

Error due to variance is the expected variance in the predictions of these 100 models against a test dataset.

As the flexibility of a statistical model increases, the variance increases and bias decreases. The bias decreases because the flexible model starts to learn the noise in the training datasets.

> Method to select a polynomial regression models of degree 1, 2, 3, 4, 5

- Choose m different training datasets with n datasets each
- For each of the m training datasets, generate 5 models of degrees 1 to 5
- For each 'degree',  calculate expected squared bias and expected variance
- Choose the 'degree' that generated the least reducible error


#HW1.1
Read through the provided control script (pNaiveBayes.sh) and all of its comments. When you are comfortable with their purpose and function, respond to the remaining homework questions below. A simple cell in the notebook with a print statmement with  a "done" string will suffice here.

In [1]:
print 'Done'

Done


#HW1.2
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.  

To do so, make sure that
   
   - mapper.py counts all occurrences of a single word, and
   - reducer.py collates the counts of the single word.

**Mapper**  
The mapper uses words from just the email message/content. This is as per Jake's suggestion in the Google Groups. As a result it outputs 9 occurrences for the words assistance.

Interface:  
*Input*: The mapper takes a filename and a list of words as inputs  
*Output*: It outputs word, word count, email id and TRUTH for each input word in separate lines  

The map reduce code below can handle single words, multiple words and also \* for all words

In [2]:
%%writefile mapper.py
#!/home/vineetgangwar/anaconda/bin/python
import sys
import re

# Reading file into memory and converting into lowercase
filename = sys.argv[1]
filetxt = str()
with open (filename, "r") as myfile:
    for line in myfile:
        # Splitting each line/email based on tab
        fields = line.strip().split('\t')
        email_id = fields[0]
        truth = fields[1]
        if len(fields) == 4:
            subject = fields[2]
            message = fields[3]
        else:
            subject = ""
            message = fields[2]
        filetxt = filetxt + ' ' + message
# ====

# This function returns words for which to count occurrences of
def wordlist():
    inputlist = sys.argv[2].lower()
    if inputlist == '*':        # If * then using all words from the input file
        file_aplhanum = re.sub('[^a-z]', ' ', filetxt)    # Converting non alpha characters to space
        words = set(file_aplhanum.split())
    else:
    	words = inputlist.split()                         # Splitting list of words input by the user
    return words 

def main():
    for word in wordlist():
        #Reading email file line by line where each line is an email
        with open(filename, 'r') as myfile:
            for line in myfile:
                # Splitting each line/email based on tab
                fields = line.strip().split('\t')
                email_id = fields[0]
                truth = fields[1]
                if len(fields) == 4:
                    subject = fields[2]
                    message = fields[3]
                else:
                    subject = ""
                    message = fields[2]
                # Printing word, count, email_id and Truth
                print word, message.lower().count(word), email_id, truth

if __name__ == '__main__':
        main()

Overwriting mapper.py


**Reducer**  
The reducer first reads all intermediate files and accumulates all the mapper output data in a list object. It then loops through the list and does two things:
- Uses the words in the list to create dictionary keys
- Uses the word_counts in the list to increment the dictionary values where the word == dictionary key

It then prints out the dictionary which gives words and their corresponding counts

In [3]:
%%writefile reducer.py
#!/home/vineetgangwar/anaconda/bin/python
import sys

# The function 'readfiles()' reads all the intermediate files generated by the mappers and returns a list of lists in the following format:
# [word, word_count, email_id, TRUTH]
# e.g.
# [
# ['assistance', '1', '0018.2003-12-18.GP', '1'],
# ['assistance', '3', '0018.2001-07-13.SA_and_HP', '1'],
# ['enlargementwithatypo', '0', '0001.1999-12-10.farmer', '0']
# ]

def readfiles():
    # Opening all files for reading
    filehandles = [open(file, 'r') for file in sys.argv[1:]]
    
    # Reading all files - strip the last newline and then split on new lines 
    all_file_data = [fh.read().strip().split('\n') for fh in filehandles]

    # Closing all files
    retval = [fh.close() for fh in filehandles]
    
    # Flattening and sorting list
    flat_list = [l for sublist in all_file_data for l in sublist]
    flat_list.sort()

    # Splitting and creating list of lists
    key_value_list = [item.split() for item in flat_list] 
    
    return key_value_list

def main():
    # This dict will be used to store counts of terms
    word_count_dict = dict()
    key_value_list = readfiles()

    # Looping through the list of lists created by def readfiles()
    for item in key_value_list:
        if item[0] in word_count_dict.keys():   # If word exists update count
            word_count_dict[item[0]] = int(word_count_dict[item[0]]) + int(item[1])
        else:                                   # If new word then create key and store count
            word_count_dict[item[0]] = int(item[1])

    # Printing dictionary contents as output
    for key, value in word_count_dict.iteritems():
        print  key + '\t' + str(value)

if __name__ == '__main__':
        main()

Overwriting reducer.py


**pNaiveBayes**  
Writing pNaiveBayes out to the filesystem

In [4]:
%%writefile pNaiveBayes.sh
## pNaiveBayes.sh
## Author: Jake Ryland Williams
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.

## collect user input
m=$1 ## the number of parallel processes (maps) to run

wordlist=$2 ## if set to "*", then all words are used

## a test set data of 100 messages
data="enronemail_1h.txt" 

## the full set of data (33746 messages)
# data="enronemail.txt" 

## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.

## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ./mapper.py $datachunk "$wordlist" > $datachunk.counts &
    ####
    ####
done
## wait for the mappers to finish their work
wait

## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
./reducer.py $countfiles > $data.output
####
####

## clean up the data chunks and temporary count files
rm $data.chunk.*
cat $data.output

Overwriting pNaiveBayes.sh


Changing execute permissions and executing pNaiveBayes.sh

In [5]:
!chmod a+x mapper.py
!chmod a+x reducer.py
!chmod a+x pNaiveBayes.sh
!./pNaiveBayes.sh 4 "assistance"

assistance	9


#HW1.3
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by a single, user-specified word using the Naive Bayes Formulation. Examine the word “assistance” and report your results.  

To do so, make sure that
   
   - mapper.py and
   - reducer.py 

that performs a single word Naive Bayes classification.

**Result**  
The algorithm obtained an accurracy of 0.56

**Mapper**  
Same as **HW1.2** above


**Reducer**  
Step 1:  
The reducer first reads all the intermediate files and accumulates all the mapper output data in a list object. It then uses the list object to create a Pandas DataFrame that contains the term frequency per document:
- Index/Row names are the email_ids
- Column headers are words/features. If the user inputs a word or a list of words from the command line then the headers are those words. This includes words such as enlargementWITHATypo. If the user specified * then all unique words from all emails become the column headers
- The DataFrame also contains another column called 'TRUTH' that contains the true class of each email

Step 2:  
Next, the reducer calculates probabilities from the above DataFrame and stores them in new objects. It used Pandas DataFrame methods such as groupby, sum for the calculations:
- It calculates and stores the prior probabilities in a dictionary
- It calculates and stores P(word|class) in a Pandas DataFrame. The index of the DataFrame are words and the columns are 'SPAM' and 'HAM'. Laplace smoothing is applied in this step

Step 3:  
The Reducer then calculates P(email|class) and stores these in yet another Pandas DataFrame. This DataFrame has email_ids as the index and has the following column headings - SPAM, HAM, PREDICT, TRUTH. 'SPAM' stores the log probability of the email given SPAM. 'HAM' stores the log probability of the email given HAM. 'PREDICT' stores the predict class of the email based on the calculated log probabilities. 'TRUTH' contains the true class of the email.
The log probabilities of email given class is calculated as follows:
- For each email, the reducer refers the 1st DataFrame i.e. the one that contains the Term Frequecy per DataFrame. It gets a dictionary of words and counts where the counts are greater than zero.
- It then uses the prior proabilities dictionary and the DataFrame containing P(word|class) to calculate P(email|document)
- It then stores the log probabilities in the DataFrame created in step 3

Step 4:
The Reducer outputs the results in the format - Email_id \t TRUTH \t Predicted_Class. It also calculates the accuracy and prints it also.

In [6]:
%%writefile reducer.py
#!/home/vineetgangwar/anaconda/bin/python
import pandas as pd
import numpy as np
import math
import sys

# The function 'readfiles()' reads all the intermediate files generated by the mappers and returns a list of lists in the following format:
# [word, word_count, email_id, TRUTH]
# e.g.
# [
# ['assistance', '1', '0018.2003-12-18.GP', '1', ['assistance','and','valium']],
# ['assistance', '3', '0018.2001-07-13.SA_and_HP', '1', ['assistance','and','valium']],
# ['enlargementwithatypo', '0', '0001.1999-12-10.farmer', '0', ['assistance','and','valium']]
# ]

def readfiles(filelist):
    # Opening all files for reading
    filehandles = [open(file, 'r') for file in filelist]
    # Reading all files - strip the last newline and then split on new lines
    all_file_data = [fh.read().strip().split('\n') for fh in filehandles]
    # Closing all files
    retval = [fh.close() for fh in filehandles]
    # Flattening and sorting list
    flat_list = [l for sublist in all_file_data for l in sublist]
    flat_list.sort()
    # Splitting and creating list of lists
    key_value_list = [item.split() for item in flat_list]
    return key_value_list

# This function creates a dataframe with email_ids as the index
# and all words as the column headings. This data frame essentially contains the terms frequency per document
# Each cell contains the count of occurrences of each word in each email
# This function returns a tuple of vocab and the DataFrame

def create_dataframe(key_value_list):
    ## Creating dataframe of email id and truth pairs
    # Creating list of email_ids and truths
    email_ids = list()
    truths = list()
    for item in key_value_list:
        email_id = item[2]
        truth = int(item[3])
        if email_id not in email_ids:
            email_ids.append(email_id)
            truths.append(truth)
    # Creating dictionary
    id_truth_dict = dict()
    id_truth_dict['email_id'] = email_ids
    id_truth_dict['TRUTH'] = truths
    # Converting into dataframe
    id_truth = pd.DataFrame(id_truth_dict)

    ## Creating data frame to store word counts and email in matrix
    # Creating words and ids list to create an empty data frame email_id X word list
    set_of_words = set()
    set_of_ids = set()
    for item in key_value_list:
        set_of_words.add(item[0])
        set_of_ids.add(item[2])

    set_of_words = list(set_of_words)
    set_of_ids = list(set_of_ids)
    num_of_ids = len(set_of_ids)

    # Creating dict of zeros to convert into a dataframe
    zeros_dict = dict()
    for i in range(len(set_of_words)):
        zeros_dict[set_of_words[i]] = [0 for x in range(num_of_ids)]
    # Adding ids
    zeros_dict['email_id'] = set_of_ids

    # Converting into dataframe
    id_wordlist = pd.DataFrame(zeros_dict)

    # Merging dataframe to add truth also
    df = pd.merge(id_wordlist, id_truth, on='email_id', how='inner')
    df.set_index('email_id', inplace=True)

    # Updating counts
    for item in key_value_list:
        email_id = item[2]
        word_count = item[1]
        word = item[0]
        df.loc[email_id, word] = int(word_count)

    return set_of_words, df

# This function calcuates the following probabilities:
# Priors in a Dict() called priors
# A DataFrame containing probabilities of all words given class. This Dataframe called 
# word_prob_class has the following structure:
# words X class

def calculating_probs(vocab, df):
    category = {'spam': 1, 'ham': 0}
    ## Calculating probabilities
    # Calculating priors probabilites and storing in a dict
    prob_prior_spam = df.groupby('TRUTH').size()[1].astype(float) / len(df)
    prob_prior_ham = df.groupby('TRUTH').size()[0].astype(float) / len(df)
    priors = {'spam': prob_prior_spam, 'ham': prob_prior_ham}

    # Calculating term count in spam and ham for the given vocab
    term_count_spam = df.groupby('TRUTH').sum().sum(axis=1)[1]
    term_count_ham = df.groupby('TRUTH').sum().sum(axis=1)[0]
    term_count_category = {'spam': term_count_spam, 'ham': term_count_ham}

    # Calculating counts of words in vocab per catergory
    words_per_category = df.groupby('TRUTH').sum().transpose()

    # Calculating word probabilities per class
    word_probs_class = words_per_category.copy()
    for cat_key, cat_value in category.iteritems():
        word_probs_class[cat_value] = word_probs_class[cat_value] / term_count_category[cat_key]
    # Applying laplace smoothing
    # For Spam
    word_probs_class[1][word_probs_class[1] == 0] = float(1) / (term_count_category['spam'] + len(vocab))
    # For ham
    word_probs_class[0][word_probs_class[0] == 0] = float(1) / (term_count_category['ham'] + len(vocab))
    
    return priors, word_probs_class 

def main():
    filelist = sys.argv[1:]
    # Aggregating input from all the mappers
    key_value_list = readfiles(filelist)
    # Creating vocabulary and Dataframe contains counts of terms per document
    vocab, df = create_dataframe(key_value_list)
    
    # Writing term frequency per document matrix to file for use in HW1.6
    df.to_csv('input_for_hw1.6.csv')

    # Getting priors and words given class probabilities
    priors, word_probs_class = calculating_probs(vocab, df)
    
    # Creating Pandas DataFrame to store final probabilities
    # Creating dataframe to store probabilities
    # Structure of DataFrame has email_ds in the index
    # and spam, ham, TRUTH, PREDICT as columns
    df_probs = df.copy(deep=True)
    header_to_remove = list(df_probs.columns.values)
    header_to_remove.remove('TRUTH')
    tokens = header_to_remove
    df_probs.drop(header_to_remove, inplace=True, axis=1)
    df_probs['spam'] = [0 for x in range(df.index.values.shape[0])]
    df_probs['ham'] = [0 for x in range(df.index.values.shape[0])]
    df_probs['PREDICT'] = [0 for x in range(df.index.values.shape[0])]
    
    # Looping through all emails and calculating probabilites of email given class
    # and storing in a DataFrame
    
    category = {'spam': 1, 'ham': 0}
    for email_id in df_probs.index:
        # Creating dict of all words whose count != 0 per email
        words_in_email = dict(df.loc[email_id, df.loc[email_id] != 0])
        # Removing the column 'TRUTH'
        if 'TRUTH' in words_in_email:
            words_in_email.pop('TRUTH')

        for cat_key, cat_value in category.iteritems():
            running_prob = math.log(priors[cat_key])

            for word in words_in_email:
                count = df.loc[email_id, word]
                running_prob += count * math.log(word_probs_class.loc[word, cat_value])

            df_probs.loc[email_id, cat_key] = running_prob

    # Calculating predictions
    df_probs['PREDICT'] = (df_probs['spam'] > df_probs['ham']).astype(int)

    # Printing output
    for email_id in df_probs.index:
        print email_id, '\t', int(df_probs.loc[email_id, 'TRUTH']), '\t', int(df_probs.loc[email_id, 'PREDICT'])
    
    # Calculating and printing accuracy
    correct = df_probs['TRUTH'] == df_probs['PREDICT']
    print 'Accuracy:', float(np.sum(correct.astype(int))) / len(df_probs)

if __name__ == '__main__':
        main()

Overwriting reducer.py


Executing pNaiveBayes with "assistance"

In [7]:
!./pNaiveBayes.sh 4 "assistance"

0010.2003-12-18.GP 	1 	0
0010.2001-06-28.SA_and_HP 	1 	0
0001.2000-01-17.beck 	0 	0
0018.1999-12-14.kaminski 	0 	0
0005.1999-12-12.kaminski 	0 	0
0011.2001-06-29.SA_and_HP 	1 	0
0008.2004-08-01.BG 	1 	0
0009.1999-12-14.farmer 	0 	0
0017.2003-12-18.GP 	1 	0
0011.2001-06-28.SA_and_HP 	1 	0
0015.2001-07-05.SA_and_HP 	1 	0
0015.2001-02-12.kitchen 	0 	0
0009.2001-06-26.SA_and_HP 	1 	0
0017.1999-12-14.kaminski 	0 	0
0012.2000-01-17.beck 	0 	0
0003.2000-01-17.beck 	0 	0
0004.2001-06-12.SA_and_HP 	1 	0
0008.2001-06-12.SA_and_HP 	1 	0
0007.2001-02-09.kitchen 	0 	0
0016.2004-08-01.BG 	1 	0
0015.2000-06-09.lokay 	0 	0
0016.1999-12-15.farmer 	0 	0
0013.2004-08-01.BG 	1 	0
0005.2003-12-18.GP 	1 	0
0012.2001-02-09.kitchen 	0 	0
0011.1999-12-14.farmer 	0 	0
0003.2001-02-08.kitchen 	0 	0
0009.2001-02-09.kitchen 	0 	0
0006.2001-02-08.kitchen 	0 	0
0014.2003-12-19.GP 	1 	0
0010.1999-12-14.farmer 	0 	0
0010.2004-08-01.BG 	1 	0
0014.1999-12-14.kaminski 	0 	0
0006.1999-12-1

#HW1.4
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results  
To do so, make sure that

   - mapper.py counts all occurrences of a list of words, and
   - reducer.py 

performs the multiple-word Naive Bayes classification via the chosen list.

**Mapper**  
Same as **HW1.2** above

**Reducer**  
Same as **HW1.3** above 

The algorithm achieved an accuracy of 0.56. using the terms "assistance valium enlargementWithATypo".  
To Note:  
- The algorithm achieved an accuracy of of 0.59 when the entire email was used rather than just the message body.  
- The algorithm achieved an accuracy of of 0.57 when the term 'him' was added

In [8]:
!./pNaiveBayes.sh 4 "assistance valium enlargementWithATypo"

0010.2003-12-18.GP 	1 	0
0010.2001-06-28.SA_and_HP 	1 	0
0001.2000-01-17.beck 	0 	0
0018.1999-12-14.kaminski 	0 	0
0005.1999-12-12.kaminski 	0 	0
0011.2001-06-29.SA_and_HP 	1 	0
0008.2004-08-01.BG 	1 	0
0009.1999-12-14.farmer 	0 	0
0017.2003-12-18.GP 	1 	0
0011.2001-06-28.SA_and_HP 	1 	0
0015.2001-07-05.SA_and_HP 	1 	0
0015.2001-02-12.kitchen 	0 	0
0009.2001-06-26.SA_and_HP 	1 	0
0017.1999-12-14.kaminski 	0 	0
0012.2000-01-17.beck 	0 	0
0003.2000-01-17.beck 	0 	0
0004.2001-06-12.SA_and_HP 	1 	0
0008.2001-06-12.SA_and_HP 	1 	0
0007.2001-02-09.kitchen 	0 	0
0016.2004-08-01.BG 	1 	0
0015.2000-06-09.lokay 	0 	0
0016.1999-12-15.farmer 	0 	0
0013.2004-08-01.BG 	1 	0
0005.2003-12-18.GP 	1 	0
0012.2001-02-09.kitchen 	0 	0
0011.1999-12-14.farmer 	0 	0
0003.2001-02-08.kitchen 	0 	0
0009.2001-02-09.kitchen 	0 	0
0006.2001-02-08.kitchen 	0 	0
0014.2003-12-19.GP 	1 	0
0010.1999-12-14.farmer 	0 	0
0010.2004-08-01.BG 	1 	0
0014.1999-12-14.kaminski 	0 	0
0006.1999-12-1

The algorithm achieved an accuracy of 0.57 when the term 'him' was added

In [9]:
!./pNaiveBayes.sh 4 "assistance valium enlargementWithATypo him"

0010.2003-12-18.GP 	1 	0
0010.2001-06-28.SA_and_HP 	1 	0
0001.2000-01-17.beck 	0 	0
0018.1999-12-14.kaminski 	0 	0
0005.1999-12-12.kaminski 	0 	0
0011.2001-06-29.SA_and_HP 	1 	0
0008.2004-08-01.BG 	1 	0
0009.1999-12-14.farmer 	0 	0
0017.2003-12-18.GP 	1 	0
0011.2001-06-28.SA_and_HP 	1 	0
0015.2001-07-05.SA_and_HP 	1 	0
0015.2001-02-12.kitchen 	0 	0
0009.2001-06-26.SA_and_HP 	1 	0
0017.1999-12-14.kaminski 	0 	0
0012.2000-01-17.beck 	0 	0
0003.2000-01-17.beck 	0 	0
0004.2001-06-12.SA_and_HP 	1 	0
0008.2001-06-12.SA_and_HP 	1 	0
0007.2001-02-09.kitchen 	0 	0
0016.2004-08-01.BG 	1 	0
0015.2000-06-09.lokay 	0 	0
0016.1999-12-15.farmer 	0 	0
0013.2004-08-01.BG 	1 	0
0005.2003-12-18.GP 	1 	0
0012.2001-02-09.kitchen 	0 	0
0011.1999-12-14.farmer 	0 	0
0003.2001-02-08.kitchen 	0 	0
0009.2001-02-09.kitchen 	0 	0
0006.2001-02-08.kitchen 	0 	0
0014.2003-12-19.GP 	1 	0
0010.1999-12-14.farmer 	0 	0
0010.2004-08-01.BG 	1 	0
0014.1999-12-14.kaminski 	0 	0
0006.1999-12-1

#HW1.5
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by all words present.
To do so, make sure that

   - mapper.py counts all occurrences of all words, and
   - reducer.py performs a word-distribution-wide Naive Bayes classification.

**Mapper**  
Same as **HW1.2** above. Mapper explanation given in section 1.2 above

**Reducer**  
Same as **HW1.3** above. Reducer explaination provided in section 1.3

Executing pNaiveBayes with all words - the algorithm achieved and accuracy of 0.94

In [10]:
!./pNaiveBayes.sh 4 "*"

0010.2003-12-18.GP 	1 	0
0010.2001-06-28.SA_and_HP 	1 	1
0001.2000-01-17.beck 	0 	0
0018.1999-12-14.kaminski 	0 	0
0005.1999-12-12.kaminski 	0 	0
0011.2001-06-29.SA_and_HP 	1 	1
0008.2004-08-01.BG 	1 	0
0009.1999-12-14.farmer 	0 	0
0017.2003-12-18.GP 	1 	1
0011.2001-06-28.SA_and_HP 	1 	1
0015.2001-07-05.SA_and_HP 	1 	1
0015.2001-02-12.kitchen 	0 	0
0009.2001-06-26.SA_and_HP 	1 	1
0018.2001-07-13.SA_and_HP 	1 	1
0012.2000-01-17.beck 	0 	0
0003.2000-01-17.beck 	0 	0
0004.2001-06-12.SA_and_HP 	1 	1
0008.2001-06-12.SA_and_HP 	1 	1
0007.2001-02-09.kitchen 	0 	0
0016.2004-08-01.BG 	1 	1
0015.2000-06-09.lokay 	0 	0
0016.1999-12-15.farmer 	0 	0
0013.2004-08-01.BG 	1 	1
0005.2003-12-18.GP 	1 	1
0012.2001-02-09.kitchen 	0 	0
0011.1999-12-14.farmer 	0 	0
0009.2001-02-09.kitchen 	0 	0
0006.2001-02-08.kitchen 	0 	0
0014.2003-12-19.GP 	1 	1
0010.1999-12-14.farmer 	0 	0
0010.2004-08-01.BG 	1 	1
0014.1999-12-14.kaminski 	0 	0
0006.1999-12-13.kaminski 	0 	0
0005.1999-12

#HW1.6
Benchmark your code with the Python SciKit-Learn implementation of Naive Bayes

It always a good idea to test your solutions against publicly available libraries such as SciKit-Learn, The Machine Learning toolkit available in Python. In this exercise, we benchmark ourselves against the SciKit-Learn implementation of Naive Bayes.  For more information on this implementation see: http://scikit-learn.org/stable/modules/naive_bayes.html more  

Lets define  Training error = misclassification rate with respect to a training set. It is more formally defined here:

Let DF represent the training set in the following:
Err(Model, DF) = |{(X, c(X)) ∈ DF : c(X) != Model(x)}| / |DF|

Where || denotes set cardinality; c(X) denotes the class of the tuple X in DF; and Model(X) denotes the class inferred by the Model “Model”

In this exercise, please complete the following:

- Run the Multinomial Naive Bayes algorithm (using default settings) from SciKit-Learn over the same training data used in HW1.5 and report the Training error (please note some data preparation might be needed to get the Multinomial Naive Bayes algorithm from SkiKit-Learn to run over this dataset)
- Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) over the same training data used in HW1.5 and report the Training error 
- Run the Multinomial Naive Bayes algorithm you developed for HW1.5 over the same data used HW1.5 and report the Training error 
- Please prepare a table to present your results
- Explain/justify any differences in terms of training error rates over the dataset in HW1.5 between your Multinomial Naive Bayes implementation (in Map Reduce) versus the Multinomial Naive Bayes implementation in SciKit-Learn
- Discuss the performance differences in terms of training error rates over the dataset in HW1.5 between the  Multinomial Naive Bayes implementation in SciKit-Learn with the  Bernoulli Naive Bayes implementation in SciKit-Learn

In [14]:
from sklearn.feature_extraction.text import *
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
import time

filename = 'enronemail_1h.txt'
train_label = list()
train_data = list()
with open (filename, "r") as myfile:
    for line in myfile:
        fields = line.strip().split('\t')
        email_id = fields[0]
        truth = fields[1]
        if len(fields) == 4:
            subject = fields[2]
            message = fields[3]
        else:
            subject = ""
            message = fields[2]
        
        # Updating train label and train data list
        train_label.append(truth)
        train_data.append(message)

# CountVectorizer as feature extraction
cv = CountVectorizer()
cv.fit(train_data)
cv_matrix = cv.transform(train_data)

# Calculating sklearn Multinomial Naive Bayes Training Error
mnb = MultinomialNB()
mnb.fit(cv_matrix, train_label)
mnb_train_err = 1 - mnb.score(cv_matrix, train_label)
print 'HW1.6 - i: sklearn Multinomial Naive Bayes Training Error =\t', mnb_train_err

# Calculating skearn Bernoulli Naive Bayes Training Error
bnb = BernoulliNB()
bnb.fit(cv_matrix, train_label)
bnb_train_err = 1 - bnb.score(cv_matrix, train_label)
print '\nHW1.6 - ii: sklearn Bernoulli Naive Bayes Training Error =\t', bnb_train_err

# Calculating HW1.5 Multinomial Naive Bayes Training Error
tempvar = !./pNaiveBayes.sh 4 "*"
hw1_5_train_err = 1 - float(tempvar[-1].split(':')[1].strip())
print '\nHW1.6 - iii: HW1.5 Multinomial Naive Bayes Training Error =\t', hw1_5_train_err

# Printing table
print "\nHW1.6 - iv: Table with Results"
result = pd.DataFrame({'Algorithm':['sklearn Multi', 'sklearn Bernoulli', 'HW1.5 Multi'], 'Training_Error':[mnb_train_err, bnb_train_err, hw1_5_train_err]})
result.set_index('Algorithm', inplace=True)
print result

# Number v
print "\nHW1.6 - v:"
print "Number of tokens created by CountVectorizer:", len(cv.vocabulary_)

# Testing sklearn with DataFrame from HW1.5
# Reading dataframe created in HW1.5 above
filename = 'input_for_hw1.6.csv'
df = pd.read_csv(filename)

def prepare_data_mnb(df):
    # Creating train_data and train_label
    train_data = df.copy()

    train_label = train_data['TRUTH']
    train_label = train_label.as_matrix()

    train_data.drop(['email_id', 'TRUTH'], axis=1, inplace=True)
    train_data = train_data.as_matrix()
    return train_data, train_label

# Multinomial Naive Bayes model
train_data, train_label = prepare_data_mnb(df)
mnb = MultinomialNB()
mnb.fit(train_data, train_label)
print 'sklearn Multinomial Naive Bayes Training Error with DataFrame from HW1.5:', 1 - mnb.score(train_data, train_label)

print "Number of Tokens created in HW1.5:", train_data.shape[1]



HW1.6 - i: sklearn Multinomial Naive Bayes Training Error =	0.02

HW1.6 - ii: sklearn Bernoulli Naive Bayes Training Error =	0.19

HW1.6 - iii: HW1.5 Multinomial Naive Bayes Training Error =	0.06

HW1.6 - iv: Table with Results
                   Training_Error
Algorithm                        
sklearn Multi                0.02
sklearn Bernoulli            0.19
HW1.5 Multi                  0.06

HW1.6 - v:
Number of tokens created by CountVectorizer: 5322
sklearn Multinomial Naive Bayes Training Error with DataFrame from HW1.5: 0.03
Number of Tokens created in HW1.5: 5016


The Multinomial Naive Bayes classifier in HW1.5 has a higher training error rate than sklearn's Multinomial classifier. There is a difference of 0.04.  
One reason is that the tokenization is different. In HW1.5 tokenization resulted in 5016 features while sklearn's CountVectorizer created 5322 tokens. Also, as there are many variations of implementations of Multinomial Naive Bayes, it is possible that sklearn uses a different algorithm that what I had implemented in HW1.5. Infact, as a test I had passed the DataFrame (term frequency per document matrix) of HW1.5 to sklearn (by writing it out to a file, then reading and converting it into a numpy ndarray). Sklearn obtained a training error rate of 0.03

HW1.6 - vi:  
Sklearn Multinomial Naive Bayes algorithm uses term frequencies to determine probabilities. While sklearn Bernoulli uses a binary representation of whether a term exists in a document (ofcourse Bernoulli also includes probabilities of non-occurrence of terms also). This means that Bernoulli Naive Bayes throws away a lot of information. This is the reason why Bernoulli training error rate is higher.