#DATASCI W261: Machine Learning at Scale 

* **Sayantan Satpati**
* **sayantan.satpati@ischool.berkeley.edu**
* **W261**
* **Week-1**
* **Assignment-2**
* **Date of Submission: 07-SEP-2015**

#This notebook implements a Spam Filter backed by a Multinomial Naive Bayes Classifier 

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Import a bunch of libraries.
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# Set the randomizer seed so results are the same each time.
np.random.seed(0)

### HW1.0.0

**Define big data. Provide an example of a big data problem in your domain of expertise.**

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate, and cannot be processed or analyzed in a single computer. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Big Data is also characterized by the 4 V's: Volume, Velocity, Variety, and Veracity

### HW1.0.1



#Map

In [2]:
%%writefile mapper_HW12.py
#!/usr/bin/python
import sys
import re

def strip_special_chars(word):
    return re.sub('[^A-Za-z0-9]+', '', word)

count = 0
filename = sys.argv[1]
wordList = sys.argv[2]
wordList = wordList.split()
wordCountDict = {}
with open (filename, "r") as myfile:
    for line in myfile:
        # Split the line by <TAB> delimeter
        email = re.split(r'\t+', line)
        
        # Check whether Content is present
        if len(email) < 4:
            continue
        
        # Get the content as a list of words
        content = email[len(email) - 1].split()
        
        if len(wordList) == 1 and wordList[0] == '*':
            for w in content:
                w = strip_special_chars(w)
                if w not in wordCountDict:
                    wordCountDict[w] = 1
                else:
                    wordCountDict[w] += 1
        else:
            for w in content:
                w = strip_special_chars(w)
                # Check if word is in word list passed to mapper
                if w in wordList:
                    if w not in wordCountDict:
                        wordCountDict[w] = 1
                    else:
                        wordCountDict[w] += 1
       
# Print count from each mapper
for k,v in wordCountDict.items():
    print "{0}\t{1}".format(k,v)

Overwriting mapper_HW12.py


In [3]:
!chmod a+x mapper_HW12.py

#Reduce

In [4]:
%%writefile reducer_HW12.py
#!/usr/bin/python
import sys
import re
cnt = 0
wordCountDict = {}
for file in sys.argv:
    if cnt == 0:
        cnt += 1
        continue
        
    with open (file, "r") as myfile:
        for line in myfile:
            wc = re.split(r'\t+', line.strip())
            if wc[0] not in wordCountDict:
                wordCountDict[wc[0]] = int(wc[1])
            else:
                wordCountDict[wc[0]] += int(wc[1])
                
# Print count from each mapper
for k,v in wordCountDict.items():
    print "{0}\t{1}".format(k,v)

Overwriting reducer_HW12.py


In [5]:
!chmod a+x reducer_HW12.py

In [6]:
# Remove split files from last runs
! rm License.txt.*

rm: License.txt.*: No such file or directory


# Write control script 'pNaiveBayes.sh' to a file

In [31]:
%%writefile pNaiveBayes.sh
## pNaiveBayes.sh
## Author: Jake Ryland Williams
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.

## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used

## Mapper and Reducer Files are passed to make this script generic
mapper=$3
reducer=$4

## a test set data of 100 messages
data="enronemail_1h.txt" 

## the full set of data (33746 messages)
# data="enronemail.txt" 

## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.

## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ./${mapper} $datachunk "$wordlist" > $datachunk.counts &
    ####
    ####
done
## wait for the mappers to finish their work
wait

## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
./${reducer} $countfiles > $data.output
####
####

## clean up the data chunks and temporary count files
\rm $data.chunk.*

## Display the Output
cat $data.output


Overwriting pNaiveBayes.sh


In [8]:
'''
HW1.1. Read through the provided control script (pNaiveBayes.sh)
'''
print "done"

done


#Run the file

In [32]:
!chmod a+x pNaiveBayes.sh

Usage: usage: pGrepCount filename word chuncksize

In [10]:
# Test the Program
!./pNaiveBayes.sh 4 'the and of' 'mapper_HW12.py' 'reducer_HW12.py'

and	631
of	546
the	1217


In [11]:
'''
HW1.2. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
'''
!./pNaiveBayes.sh 4 'assistance' 'mapper_HW12.py' 'reducer_HW12.py'

assistance	9


In [33]:
%%writefile mapper_HW15.py
#!/usr/bin/python
import sys
import re

def strip_special_chars(word):
    word = word.strip()
    
    if not word or word == '':
        return None
    
    word = re.sub('[^A-Za-z0-9]+', '', word)
    return word.lower()

count = 0
filename = sys.argv[1]
wordList = sys.argv[2]
wordList = wordList.split()

# (Line#, Spam/Ham, Dict of Word|Count)
mapper_output_list = []
line_num = 0
with open (filename, "r") as myfile:
    for line in myfile:
        # Split the line by <TAB> delimeter
        email = re.split(r'\t+', line)
        
        # Check whether Content is present
        if len(email) < 4:
            continue
            
        line_num += 1
        
        # Get the content as a list of words
        content = email[len(email) - 1].split()
        
        wordCountDict = {}
        for w in content:
            w = strip_special_chars(w)
            
            if not w:
                continue
                
            if w not in wordCountDict:
                wordCountDict[w] = 1
            else:
                wordCountDict[w] += 1
                
        mapper_output_list.append((line_num, email[1], wordCountDict))
       
# Print output from each mapper
for (line_num, spam, wordCountDict) in mapper_output_list:
    for word,count in wordCountDict.items():
        print "{0}\t{1}\t{2}\t{3}".format(line_num, spam, word, count)
    

Overwriting mapper_HW15.py


In [34]:
!chmod a+x mapper_HW15.py

In [35]:
%%writefile reducer_HW15.py
#!/usr/bin/python
import sys
import re
import math

# Totals
vocab = 0
vocab_spam = 0
vocab_ham = 0

vocab = {}
word_counts = {
    "1": {},
    "0": {}
}

num_spam = 0
num_ham = 0

cnt = 0
# Calculate the totals in Reducer First Pass
for file in sys.argv:
    if cnt == 0:
        cnt += 1
        continue
        
    with open (file, "r") as myfile:
        last_line_num = -1
        last_spam = -1
        
        for line in myfile:
            tokens = re.split(r'\t+', line.strip())
            line_num = int(tokens[0])
            spam = int(tokens[1])
            word = tokens[2]
            count = float(tokens[3])
            
            # Init
            if last_line_num == -1:
                last_line_num = line_num
                last_spam = spam
            
            # Add Vocab per line
            if word not in vocab:
                vocab[word] = 0.0
            if word not in word_counts[str(spam)]:
                word_counts[str(spam)][word] = 0.0
            vocab[word] += count
            word_counts[str(spam)][word] += count
                    
            if last_line_num != line_num:
                if last_spam == 1:
                    num_spam += 1
                else:
                    num_ham += 1
                
            last_line_num = line_num
            last_spam = spam
            
        # Last Line
        if last_spam == 1:
            num_spam += 1
        else:
            num_ham += 1
                
# At the end of first pass
print 'Num Spam: {0}, Num Ham: {1}'.format(num_spam, num_ham)
print '''Total Vocab: {0},
       Total Unique Vocab: {1},
       Total Spam Vocab: {2}, 
       Total Ham Vocab: {3}'''.format(sum(vocab.values()), 
                                    len(vocab),
                                    sum(word_counts['1'].values()), 
                                    sum(word_counts['0'].values())
                                   )
                                    

prior_spam = (num_spam * 1.0) / (num_spam + num_ham)
prior_ham = (num_ham * 1.0) / (num_spam + num_ham)
print '[Priors] Spam: {0}, Ham: {1}'.format(prior_spam, prior_ham)

spam_likelihood_denom = sum(word_counts['1'].values()) + len(vocab)
ham_likelihood_denom = sum(word_counts['0'].values()) + len(vocab)

# Calculate the Conditionals/Likelihood in Next Pass
reducer_output_list = []
cnt = 0
for file in sys.argv:
    if cnt == 0:
        cnt += 1
        continue
        
    with open (file, "r") as myfile:
        last_line_num = -1
        log_prob_spam = 0
        log_prob_ham = 0
        
        for line in myfile:
            
            tokens = re.split(r'\t+', line.strip())
            line_num = int(tokens[0])
            spam = int(tokens[1])
            word = tokens[2]
            count = int(tokens[3])
            
            # Init
            if last_line_num == -1:
                last_line_num = line_num
            
            if last_line_num != line_num:
                # Calculate the Naive Bayes Scores for Document Classification
                spam_score = log_prob_spam + math.log(prior_spam)
                ham_score = log_prob_ham + math.log(prior_ham)
                reducer_output_list.append((spam, spam_score, ham_score))
                # Reset log prob
                log_prob_spam = 0
                log_prob_ham = 0
            else:
                # Calcuate the log likelihoods Using Laplace Smoothing
                spam_likelihood = (word_counts['1'].get(word, 0.0) + 1) / spam_likelihood_denom
                ham_likelihood = (word_counts['0'].get(word, 0.0) + 1) / ham_likelihood_denom
                log_prob_spam += math.log( spam_likelihood )
                log_prob_ham += math.log( ham_likelihood )
            
            last_line_num = line_num
            
        # Last Line
        spam_score = log_prob_spam + math.log(prior_spam)
        ham_score = log_prob_ham + math.log(prior_ham)
        reducer_output_list.append((spam, spam_score, ham_score))
        
total = 0.0
miscat = 0.0
for (spam, spam_score, ham_score) in reducer_output_list:
        total += 1.0
        pred_class = 'HAM'
        if spam_score > ham_score:
            pred_class = 'SPAM'
        if (spam == 1 and pred_class == 'HAM') or (spam == 0 and pred_class == 'SPAM'):
            miscat += 1.0
            
        print "{0}\t{1}\t{2}\t{3}".format(spam, spam_score, ham_score, pred_class)

error = miscat * 100 / total
print "Accuracy: {0}, Error Rate: {1}, # of Miscats: {2}".format((100 - error), error, miscat)

Overwriting reducer_HW15.py


In [36]:
!chmod a+x reducer_HW15.py

In [37]:
'''
HW1.5. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by all words present.
'''
!./pNaiveBayes.sh 4 '*' 'mapper_HW15.py' 'reducer_HW15.py'

Num Spam: 43, Num Ham: 59
Total Vocab: 30316.0,
       Total Unique Vocab: 5601,
       Total Spam Vocab: 17851.0, 
       Total Ham Vocab: 12465.0
[Priors] Spam: 0.421568627451, Ham: 0.578431372549
0	-9.82787146701	-9.65607518653	HAM
0	-9.31704584324	-8.55746289786	HAM
0	-2191.1588365	-2021.72420184	HAM
0	-281.842510908	-248.630590416	HAM
0	-916.441638904	-878.151478825	HAM
0	-1653.48319844	-1492.50829857	HAM
1	-395.32418664	-381.836278447	HAM
1	-469.326129307	-514.959604259	SPAM
1	-929.769132314	-1010.88945311	SPAM
0	-580.653856258	-614.353545243	SPAM
0	-308.945761337	-276.031549971	HAM
0	-47.2260842694	-40.2045304237	HAM
0	-950.866291476	-848.782383396	HAM
1	-1072.4835333	-960.299272087	HAM
1	-606.745205434	-633.062794683	SPAM
0	-640.316398508	-676.856179047	SPAM
0	-899.109077747	-813.432461852	HAM
0	-639.476372133	-553.677146111	HAM
1	-486.410820816	-462.414233728	HAM
1	-611.928765714	-658.589558436	SPAM
0	-635.74110717	-666.787750109	SPAM
0	-696.43567268

In [17]:
# Load Data into Pandas Dataframe
df = pd.read_csv('enronemail_1h.txt', sep='\t', header=None)
df.columns = ['ID', 'SPAM', 'SUBJECT', 'CONTENT']
df.head()


Unnamed: 0,ID,SPAM,SUBJECT,CONTENT
0,0001.1999-12-10.farmer,0,christmas tree farm pictures,
1,0001.1999-12-10.kaminski,0,re: rankings,thank you.
2,0001.2000-01-17.beck,0,leadership development pilot,"sally: what timing, ask and you shall receiv..."
3,0001.2000-06-06.lokay,0,key dates and impact of upcoming sap implemen...,
4,0001.2001-02-07.kitchen,0,key hr issues going forward,a) year end reviews-report needs generating l...


In [18]:
# Remove missing values
print df.count()
df = df.dropna()
print df.count()

ID         100
SPAM       100
SUBJECT     98
CONTENT     96
dtype: int64
ID         94
SPAM       94
SUBJECT    94
CONTENT    94
dtype: int64


In [19]:
data = df['CONTENT'].values
labels = df['SPAM'].values
print data[:1], labels[:1]
# Split into Train and Test
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size = 0.8)
print train_data.shape, train_labels.shape
print test_data.shape, test_labels.shape

[' thank you.'] [0]
(75,) (75,)
(19,) (19,)


In [20]:
# Extract features from Dataset
cv = CountVectorizer(analyzer='word')
train_counts = cv.fit_transform(data)
print "Shape of training/feature vector", train_counts.shape
print "Size of the Vocabulary", len(cv.vocabulary_)

# Run Multinomial NB (sklearn)
mNB = MultinomialNB()
mNB.fit(train_counts, labels)
print "Multinomial NB Training Accuracy: {0}".format(mNB.score(train_counts, labels))

#Run Bernoulli MB (sklearn)
bNB = BernoulliNB()
bNB.fit(train_counts, labels)
print "Bernoulli NB Training Accuracy: {0}".format(bNB.score(train_counts, labels))

Shape of training/feature vector (94, 5224)
Size of the Vocabulary 5224
Multinomial NB Training Accuracy: 0.989361702128
Bernoulli NB Training Accuracy: 0.765957446809
