## Data Science W261: Machine Learning at Scale
Safyre Anderson

safyre@berkeley.edu

January 27, 2016 8am

W261-3

Week 2 HW

### HW2.0.  
*What is a race condition in the context of parallel computation? Give an example.*

*What is MapReduce?*

*How does it differ from Hadoop?*

*Which programming paradigm is Hadoop based on? Explain and give a simple example in code and show the code running.*

**Race Condition**:

A race condition refers to the issue of timing the completion and sequence of parallel jobs. For example, let's say there is one round of parrallel jobs is running and there is a subsequent round of parallel jobs. If not all the jobs from the first round of parallel jobs complete and the second round begins prematurely, the job will not complete correctly. This is known as a race condition. Hadoop HDFS was designed to abstract the prevention of race conditions and communication between parallel processes from developers.

**MapReduce**

MapReduce in the broadest sense is a functional programming-inpired parallel processing framework. It consists of at least two major steps: map and reduce (though subsequent MapReduce jobs can follow). 

**Hadoop**

Hadoop is a Java-based framework for distributed data storage and processing that implements its own MapReduce framework. However, Hadoop is also a distributed file system that consists of a master 'namenode' as well as at least one 'datanode'. The namenode keeps track of where data are stored as well as the implementation on MapReduce Jobs. The datanode(s) store typically 3 copies of the data in chunks which are scattered between the datanodes for redundancy and disaster recovery.

**Programming Paradigm**

In [23]:
!source .bashrc
!echo $HADOOP_HOME

/bin/sh: 1: source: not found
/usr/local/hadoop


### HW2.1. Sort in Hadoop MapReduce
*Given as input: Records of the form `<integer, “NA”\>`, where integer is any integer, and “NA” is just the empty string.
Output: sorted key value pairs of the form <integer, “NA”> in decreasing order; what happens if you have multiple reducers? Do you need additional steps? Explain.*

*Write code to generate N  random records of the form `<integer, “NA”>`. Let N = 10,000.
Write the python Hadoop streaming map-reduce job to perform this sort. Display the top 10 biggest numbers. Display the 10 smallest numbers*

I used 1 node Hadoop cluster AWS instance:

`alias aws_hadoop_master="ssh -i "~/.ssh/***.pem" ec2-user@ec2-54-213-63-253.us-west-2.compute.amazonaws.com"`


In [14]:
!pwd

/home/ubuntu


In [9]:
%%writefile rand_num.py
#!/usr/bin/python

import random
random.seed(0)
count = 0
N = 10000
while count < N:
    print str(random.randint(0,N)) +"\t" + "NA"
    count +=1
    

Overwriting rand_num.py


In [10]:
!rm random_numbers.txt
!python rand_num.py > random_numbers.txt

As a sanity check, I wanted to print out the top 10 rows and make sure there were only 10000 numbers (one on each line) generated.

In [11]:
!head random_numbers.txt
!wc -l random_numbers.txt

8445	NA
7580	NA
4206	NA
2589	NA
5113	NA
4049	NA
7838	NA
3033	NA
4766	NA
5834	NA
10000 random_numbers.txt


In [12]:
!cp random_numbers.txt $HADOOP_HOME/input
!ls $HADOOP_HOME/input

capacity-scheduler.xml	hdfs-site.xml	 kms-site.xml	     yarn-site.xml
core-site.xml		httpfs-site.xml  mapred-site.xml
hadoop-policy.xml	kms-acls.xml	 random_numbers.txt


Hadoop streaming should automatically sort the keys. Since our integers are the keys, we can just reprint them using the mapper to sort them.

In [1]:
!hdfs namenode && hdfs datanode
!$HADOOP_HOME/sbin/start-yarn.sh

# CD to hadoop root dir
!cd $HADOOP_HOME

16/01/26 09:20:01 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ip-172-31-25-233.us-west-2.compute.internal/172.31.25.233
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.1
STARTUP_MSG:   classpath = /usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:/usr/local/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-api-1.7.10.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/xz-1.0.jar:/usr/local/hadoop/share/ha

In [29]:
# check hdfs status and yarn nodes
!hdfs dfsadmin -report
!yarn node -list

Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
16/01/25 15:47:32 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total Nodes:1
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
ec2-54-213-63-253.us-west-2.compute.amazonaws.com:35267	        RUNNING	ec2-54-213-63-253.us-west-2.compute.amazonaws.com:8042	                           0


For HDFS health:

<http:/ec2-54-213-63-253.us-west-2.compute.amazonaws.com:50070/dfshealth.html#tab-overview>

For YARN cluster/job manager:
http://ec2-54-213-63-253.us-west-2.compute.amazonaws.com:8088/cluster

In [21]:
%%writefile mapper.py
#!/usr/bin/python

import sys

# Simply read and print out key value pairs from input
# note, writing to standard out seems more secure than print
for line in sys.stdin:
    sys.stdout.write(line)
    #key, empty_string = line.split('\t')
    #print str(key) +"\t" + empty_string

Overwriting mapper.py


In [18]:
%%writefile reducer.py
#!/usr/bin/python

import sys

# Simply read and print out key value pairs from input
for line in sys.stdin:
    #key, empty_string = line.split('\t')
    #print str(key) +"\t" + empty_string
    sys.stdout.write(line)

Overwriting reducer.py


In [22]:
!chmod a+x mapper.py
!chmod a+x reducer.py

In [28]:
!./mapper.py <random_numbers.txt| sort -n | ./reducer.py >rand_num.out
!head rand_num.out
!tail rand_num.out

0	NA
1	NA
1	NA
2	NA
5	NA
6	NA
9	NA
10	NA
10	NA
11	NA
9989	NA
9989	NA
9992	NA
9995	NA
9995	NA
9997	NA
9999	NA
9999	NA
9999	NA
10000	NA


In [2]:
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-*streaming*.jar -mapper $HADOOP_HOME/mapper.py -reducer reducer.py -input /user/safyre/input/random_numbers.txt -output /user/safyre/output/SortOutput

packageJobJar: [/tmp/hadoop-unjar1215359286937913206/] [] /tmp/streamjob4040146808467195045.jar tmpDir=null
16/01/26 04:23:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/01/26 04:23:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/01/26 04:24:02 INFO mapred.FileInputFormat: Total input paths to process : 1
16/01/26 04:24:04 INFO mapreduce.JobSubmitter: number of splits:2
16/01/26 04:24:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453781403055_0002
16/01/26 04:24:23 INFO impl.YarnClientImpl: Submitted application application_1453781403055_0002
16/01/26 04:24:27 INFO mapreduce.Job: The url to track the job: http://ip-172-31-25-233:8088/proxy/application_1453781403055_0002/
16/01/26 04:24:27 INFO mapreduce.Job: Running job: job_1453781403055_0002
16/01/26 04:27:48 INFO mapreduce.Job: Job job_1453781403055_0002 running in uber mode : false
16/01/26 04:27:48 INFO mapreduce.Job:  map 0% reduce 0%
16/01/26 04:30:09 INFO ip

### HW2.2.  WORDCOUNT
*Using the Enron data from HW1 and Hadoop MapReduce streaming, write the mapper/reducer job that
will determine the word count (number of occurrences) of each white-space delimitted token 
(assume spaces, fullstops, comma as delimiters). Examine the word “assistance” 
and report its word count results.*

 
CROSSCHECK: `>grep assistance enronemail_1h.txt|cut -d$'\t' -f4| grep assistance|wc -l`    

    `8`    

*#NOTE  "assistance" occurs on 8 lines but how many times does the token occur? 
10 times! This is the number we are looking for!*

In [62]:
%%writefile mapper.py
#!/usr/bin/python
import re, string
import sys
import os
import numpy as np

# store a regex expression into a pattern object
# that seeks words including underscores and single quotes
WORD_RE = re.compile(r"[\w']+")
translate_table = string.maketrans("","") #empty translation

# file input
filename = sys.argv[1]

# list of words argument, '*' means all words
word_list = sys.argv[2]
count = 0

with open(filename, 'rU') as f:
    for line in f.readlines():
        #strip punctuation from line
        line = line.translate(translate_table, string.punctuation)
        
        # if not all words selected,
        # go through each word in word list and count occurances
        if word_list != '*':
            for word in word_list.split():
                counts = [int(1) if (x == word) and (word.isalpha()) else int(0) for x in WORD_RE.findall(line)]
                counts = np.array(counts)
            
                if counts.sum() > 0:
                    print word + "\t" + str(counts.sum())
        else: 
            for word in line.split():
                if word.isalpha():
                    print word + "\t"+ str(1)

Overwriting mapper.py


In [54]:
%%writefile reducer.py
#!/usr/bin/python

# recycled code from W205, sorry
import re, string
import sys
import os
import numpy as np

#if a new word enters the fray, print the current word and its counts
def wcount(prev_word ,counts):
    if prev_word is not None:
            print(prev_word + "\t" + str(counts))

prev_word = None
counts = 0

for line in sys.stdin:
    word, value =line.split("\t",1)
    if word != prev_word:
        wcount(prev_word, counts)
        prev_word = word 
        counts = 0
    counts += eval(value)

# A print just for the final word
wcount(prev_word, counts)


Overwriting reducer.py


In [63]:
!chmod a+x mapper.py
!chmod a+x reducer.py
!./mapper.py enronemail_1h.txt assistance |sort| ./reducer.py 

assistance	10


#### HW2.2.1  

*Using Hadoop MapReduce and your wordcount job (from HW2.2) 
determine the top-10 occurring tokens (most frequent tokens)*



In [69]:
!./mapper.py enronemail_1h.txt \* |sort | ./reducer.py |sort -k 2n >hw221.out

In [71]:
# sorted in ascending order
!tail hw221.out

this	260
for	373
your	391
in	415
you	427
a	529
of	560
and	662
to	961
the	1246


### HW2.3. Multinomial NAIVE BAYES with NO Smoothing
*Using the Enron data from HW1 and Hadoop MapReduce, write  a mapper/reducer job(s) that
   will both learn  Naive Bayes classifier and classify the Enron email messages using the learnt Naive Bayes classifier. Use all white-space delimitted tokens as independent input variables (assume spaces, fullstops, commas as delimiters). Note: for multinomial Naive Bayes, the Pr(X=“assistance”|Y=SPAM) is calculated as follows:*

   `the number of times “assistance” occurs in SPAM labeled documents / the number of words in documents labeled SPAM` 

   *E.g.,   “assistance” occurs 5 times in all of the documents Labeled SPAM, and the length in terms of the number of words in all documents labeled as SPAM (when concatenated) is 1,000. Then $Pr(X=“assistance”|Y=SPAM) = 5/1000$. Note this is a multinomial estimation of the class conditional for a Naive Bayes Classifier. No smoothing is needed in this HW. Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow. Since $log(xy) = log(x) + log(y)$, it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. Please pay attention to probabilites that are zero! They will need special attention. Count up how many times you need to process a zero probabilty for each class and report.* 

   *Report the performance of your learnt classifier in terms of misclassifcation error rate of your multinomial Naive Bayes Classifier. Plot a histogram of the  posterior probabilities (i.e., Pr(Class|Doc)) for each class over the training set. Summarize what you see.* 

   *Error Rate = misclassification rate with respect to a provided set (say training set in this case). It is more formally defined here:*

*Let DF represent the evalution set in the following:*
$Err(Model, DF) = |{(X, c(X)) ∈ DF : c(X) != Model(x)}|   / |DF|$

*Where $||$ denotes set cardinality; c(X) denotes the class of the tuple X in DF; and Model(X) denotes the class inferred by the Model “Model”*

In [121]:
%%writefile mapper.py
#!/usr/bin/python
import re, string
import sys
import os
import numpy as np

# store a regex expression into a pattern object
# that seeks words including underscores and single quotes
WORD_RE = re.compile(r"[\w']+")
TRUTH_RE = re.compile(r"\t(\d)\t")
translate_table = string.maketrans("","") #empty translation


# file input
filename = sys.argv[1]

# for this part, just assume word_list is length 1
word_list = sys.argv[2]

# Avoid KeyError if no data in chunk
#counts_dict = dict.fromkeys(['0', '1'], 0)
counts_dict = {}
doc_len    = 0
spam_count = 0
ham_count  = 0

line_count = 0
with open(filename, 'rU') as f:
    for line in f.readlines():
        
        # Parse out TRUTH
        # truth is the actual label provided in the data
        # 1 = spam, 0 = ham
        #key is the id of the emai
        key = line.split()[0]
        try:
            truth = TRUTH_RE.findall(line)[0]

        except:
            #for some reason line 59 gives problems
            # truth = '1'
            continue
        
        # Remove punctuation
        line = line.translate(translate_table, string.punctuation)
        
        '''
         # define empty dictionaries
        for category in ['0','1']:
            counts_dict[category] = {}
        '''
       
        if word_list != "*":
            for word in word_list.split():
                #doc_len = len(line.split())
                counts = [1 if (x == word) and (x.isalpha()) else 0 for x in WORD_RE.findall(line)]
                counts = np.array(counts)  
                
                # Only pass to reducer if the word is present
                if counts.sum() > 0:
                    count = counts.sum()
                    #counts_dict[truth][word] = counts_dict[truth].get(word, 0) + int(count)
                    print key + "\t" + truth + "\t" + word + "\t" + str(count) #+ "\t" + str(doc_len)

        else:
            for word in list(set(line.split())):
                counts = [1 if (x == word) and (x.isalpha()) else 0 for x in WORD_RE.findall(line)]
                counts = np.array(counts)
                
                count = counts.sum()
                #counts_dict[truth][word] = counts_dict[truth].get(word, 0) + int(count)
                print key +"\t" + truth + "\t" + word + "\t" + str(count) #+ "\t" + str(doc_len)

'''
for category, word_dictionary in counts_dict.iteritems():
    for words, count in counts_dict[category].iteritems():
        print key + category + "\t" + words + "\t" + str(count) + "\t" + str(doc_len)
'''

Overwriting mapper.py


In [164]:
%%writefile reducer.py
#!/usr/bin/python
import re, sys
import numpy as np
from math import log

## training, gather all the counts and calculate corpus-wide priors, etc
## data come in as strings, 
## ID TRUTH WORD COUNT 
# define empty dictionaries


# number of documents in each class
N_spam_docs = 0
N_ham_docs  = 0

# number of terms in each class
N_spam_terms = 0
N_ham_terms = 0
counts_dict ={}
keys_dict = {}

for line in sys.stdin:
    key, truth, word, count = line.split()
    keys_dict[key+"_"+truth] = {}
    keys_dict[key+"_"+truth][word] = {}
    
    for i in ['0', '1']:
        counts_dict[i] = {}
    # tabulate word counts for each class
    counts_dict[truth][word] = counts_dict[truth].get(word, 0) + int(count)
    
    if truth == '1':
        N_spam_docs  += 1
        N_spam_terms += int(count)
    else:
        N_ham_docs   += 1
        N_ham_terms += int(count)
        
priors = {'0': float(N_ham_docs)/(N_spam_docs+N_ham_docs),
          '1': float(N_spam_docs)/(N_spam_docs+N_ham_docs)}

prior_counts = {'0': float(N_ham_terms),
          '1': float(N_spam_terms)}

## Calculate conditional probabilities
## P(word | class) 
posteriors = {}
for category in ['0', '1']:
    posteriors[category] = {}
    for word in counts_dict[category].keys():
        posteriors[category][word] = float(counts_dict[category][word])/float(prior_counts[category])

for category in ['0', '1']:
    for key, word in keys_dict.iteritems():
        print key.split('_')[0] + "\t" +key.split('_')[1] + "\t" + str(word) + '\t' +str(priors['0']) + \
        '\t' + str(priors['1']) + '\t' + str(posteriors['0'][word]) + '\t' + str(posteriors['1'][word])
        #keys_dict[key][word]['prior_ham']    = priors['0']
        #keys_dict[key][word]['prior_spam']   = priors['1']
        #keys_dict[key][word]['posterior_ham'] = posteriors['0'][word]
        #keys_dict[key][word]['posterior_spam'] = posteriors['1'][word]

        
        

#print "Priors are: "
#for category in priors:
#    print category + " " + str(priors[category]) + "n = " +str(prior_counts[category]) 

spam_vocab = counts_dict['1'].keys()
ham_vocab  = counts_dict['0'].keys()

spam_vocab_n = len(counts_dict['1'].keys())
ham_vocab_n  = len(counts_dict['0'].keys())

# all unique words from both classes
vocab = list(set(counts_dict['0'].keys()).union(counts_dict['1'].keys()))
len_vocab = len(vocab)

#print "\nPosteriors are: "
#for category in posteriors:
#    for word in posteriors[category]:
#        print word + " in class " + category + " " + str(posteriors[category][word]) + "\n"


Overwriting reducer.py


In [165]:
%%writefile reducer2.py

## Testing the classifer
## Without laplacian transform 
#print "DOC_ID | TRUTH | CLASS "
#print "=======================\n"


def cum_log_probs(prev_key, label, prediction):# ham_score, spam_score):
    if prev_key is not None:
        print prev_key + "\t" + label + "\t" +str(prediction)# +  " " +ham_score + " " + spam_score

prev_key = None
counts = 0
prediction =0
correct = 0

'''
for line in sys.stdin:
    word, value =line.split("\t",1)
    if word!=prev_word:
        wcount(prev_word, counts)
        prev_word = word 
        counts = 0
    counts += eval(value)

wcount(prev_word, counts)
'''

# This could probably be another MapReduce job...
for line in sys.stdin:
    for key, truth, word, prior_ham, prior_spam, posterior_ham, posterior_spam in line.split():
        print key, truth, word, count
        if key != prev_key:
            # Dump the previous key's statistics
            if prev_key is not None:
                if int(prediction) == int(truth):
                    correct +=1
                #print prev_key + "\t" + truth + "\t" +str(prediction) +  " " +str(score[0]) + " " + str(score[1])
                cum_logs_prob(prev_key, truth, prediction)# str(score[0]), str(score[1]))
            #initialize scores for new sample
            score = [0,0]
            score[0]= log(prior_ham) + log(float(posterior_ham))
            score[1] = log(prior_spam) + log(float(posterior_spam))
            #for category in ['0', '1']:
            #    idx = int(category)
            #    score[idx] =  log(priors[category])
                    #score[idx] += log(float(posteriors[category][word]))
                    
            prev_key = key
            
        else:
            score[0] += log(float(posterior_ham))
            score[1] += log(float(posterior_spam))
            #for category in ['0', '1']:
            #    for word in posteriors[category]:
            #        score[idx] += log(float(posteriors[category][word]))
            
            score = np.array(score)
            prediction = score.argmax()
            
#print prev_key + "\t" + truth + "\t" +str(prediction) +  " " +str(score[0]) + " " + str(score[1])
cum_log_probs(prev_key, truth, prediction)# str(score[0]), str(score[1]))

accuracy = float(correct)/(N_spam_docs+N_ham_docs)*100.0
print "Accuracy: ", accuracy



Overwriting reducer2.py


In [166]:
!chmod a+x mapper.py
!chmod a+x reducer.py
!chmod a+x reducer2.py
!./mapper.py enronemail_1h.txt assistance | sort | ./reducer.py |./reducer2.py

./reducer2.py: 8: ./reducer2.py: Syntax error: "(" unexpected
Traceback (most recent call last):
  File "./reducer.py", line 56, in <module>
    '\t' + str(priors['1']) + '\t' + str(posteriors['0'][word]) + '\t' + str(posteriors['1'][word])
TypeError: unhashable type: 'dict'


### HW2.4 
*Repeat HW2.3 with the following modification: use Laplace plus-one smoothing.* 
*Compare the misclassifcation error rates for 2.3 versus 2.4 and explain the differences.*

*For a quick reference on the construction of the Multinomial NAIVE BAYES classifier that you will code,
please consult the "Document Classification" section of the following wikipedia page:*

https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_classification

*OR the original paper by the curators of the Enron email data:*

http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf

### HW2.5. 
*Repeat HW2.4. This time when modeling and classification ignore 
tokens with a frequency of less than three (3) in the training set. 
How does it affect the misclassifcation error of learnt naive multinomial 
Bayesian Classifier on the training dataset:*

### HW2.6

*Benchmark your code with the Python SciKit-Learn implementation of the multinomial Naive Bayes algorithm*

*It always a good idea to benchmark your solutions against publicly available libraries such as SciKit-Learn.*
*The Machine Learning toolkit available in Python. In this exercise, 
we benchmark ourselves against the SciKit-Learn implementation of multinomial Naive Bayes.  
For more information on this implementation see:* http://scikit-learn.org/stable/modules/naive_bayes.html   

*In this exercise, please complete the following:*

- *Run the Multinomial Naive Bayes algorithm (using default settings) from SciKit-Learn over the same training data used in HW2.5 and report the misclassification error (please note some data preparation might be needed to get the Multinomial Naive Bayes algorithm from SkiKit-Learn to run over this dataset)*
- *Prepare a table to present your results, where rows correspond to approach used (SkiKit-Learn versus your Hadoop implementation) and the column presents the training misclassification error*
- *Explain/justify any differences in terms of training error rates over the dataset in HW2.5 between your Multinomial Naive Bayes implementation (in Map Reduce) versus the Multinomial Naive Bayes implementation in SciKit-Learn* 

In [157]:
%%writefile sklearn_run.py
#!/usr/bin/env python

import sys
import os
import numpy as np
import pandas as pd
import re

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import *


data = pd.read_csv("enronemail_1h.txt",sep='\t',header=None)
data.columns = ['ID', 'TRUTH', 'SUBJECT', 'TEXT']
data = data.replace(np.nan,' ', regex=True)
train_data , train_labels = data['SUBJECT']+data['TEXT'] , data['TRUTH']
vec = CountVectorizer()
vec_t = vec.fit_transform(train_data)

#Fit and predict Naivebayes
clf = MultinomialNB(alpha = 1.0)       
clf.fit(vec_t, train_labels)
y_pred = clf.predict(vec_t)
err = 1- metrics.accuracy_score(train_labels, y_pred)
print "Training Error: " + str(err)


Overwriting sklearn_run.py


In [158]:
!chmod a+x sklearn_run.py
!./sklearn_run.py

Traceback (most recent call last):
  File "./sklearn_run.py", line 24, in <module>
    clf.fit(vec_t, train_labels)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 531, in fit
    Y = labelbin.fit_transform(y)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/base.py", line 455, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 308, in fit
    self.classes_ = unique_labels(y)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 99, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0,
       0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0,
       1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0,
       1.0, 0.0, 1.0, 1.0, 

### HW 2.6.1 OPTIONAL (note this exercise is a stretch HW and optional)
-  *Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) over the same training data used in HW2.6 and report the misclassification error* 
-  *Discuss the performance differences in terms of misclassification error rates over the dataset in HW2.5 between the  Multinomial Naive Bayes implementation in SciKit-Learn with the  Bernoulli Naive Bayes implementation in SciKit-Learn. Why such big differences. Explain.* 

*Which approach to Naive Bayes would you recommend for SPAM detection? Justify your selection.*

### HW2.7 OPTIONAL (note this exercise is a stretch HW and optional)

*The Enron SPAM data in the following folder enron1-Training-Data-RAW is in raw text form (with subfolders for SPAM and HAM that contain raw email messages in the following form:*

- Line 1 contains the subject
- The remaining lines contain the body of the email message.

*In Python write a script to produce a TSV file called train-Enron-1.txt that has a similar format as the enronemail_1h.txt that you have been using so far. Please pay attend to funky characters and tabs. Check your resulting formated email data in Excel and in Python (e.g., count up the number of fields in each row; the number of SPAM mails and the number of HAM emails). Does each row correspond to an email record with four values? Note: use "NA" to denote empty field values.*

### HW2.8 OPTIONAL
*Using Hadoop Map-Reduce write job(s) to perform the following:*
 - *Train a multinomial Naive Bayes Classifier with Laplace plus one smoothing using the data extracted in HW2.7 (i.e., train-Enron-1.txt). Use all white-space delimitted tokens as independent input variables 
(assume spaces, fullstops, commas as delimiters). Drop tokens with a frequency of less than three (3).*
 - *Test the learnt classifier using enronemail_1h.txt and report the misclassification error rate. 
    Remember to use all white-space delimitted tokens as independent input variables 
    (assume spaces, fullstops, commas as delimiters). 
    How do we treat tokens in the test set that do not appear in the training set?*



### HW2.8.1 OPTIONAL
-  *Run  both the Multinomial Naive Bayes and the Bernoulli Naive Bayes algorithms 
from SciKit-Learn (using default settings) over the same training data used in 
HW2.8 and report the misclassification error on both the training set and the testing set*
- *Prepare a table to present your results, where rows correspond to approach used (SciKit-Learn Multinomial NB; SciKit-Learn Bernouili NB; Your Hadoop implementation)  and the columns presents the training misclassification error, and the misclassification error on the test data set*
- *Discuss the performance differences in terms of misclassification error rates over the test and training datasets by the different implementations. Which approch (Bernouili versus Multinomial) would you recommend for SPAM detection? Justify your selection.*