<h2>Project 3: Na&iuml;ve Bayes and the Perceptron</h2>

<blockquote>
    <center>
    <img src="nb.png" width="200px" />
    </center>
      <p><cite><center>"All models are wrong, but some are useful."<br>
       -- George E.P. Box
      </center></cite></p>
</blockquote>

<h3>Introduction</h3>
<!--Aðalbrandr-->

<p>A&eth;albrandr is visiting America from Finland and has been having the hardest time distinguishing boys and girls because of the weird American names like Jack and Jane.  This has been causing lots of problems for A&eth;albrandr when he goes on dates. When he heard that Cornell has a Machine Learning class, he asked that we help him identify the gender of a person based on their name to the best of our ability.  In this project, you will implement Na&iuml;ve Bayes to predict if a name is male or female.</p>
  

<strong>How to submit:</strong> You can submit your code using the red <strong>Submit</strong> button above. This button will send any code below that is surrounded by <strong>#&lt;GRADED&gt;</strong> and <strong>#&lt;/GRADED&gt;</strong> tags to the autograder, which will then run several tests over your code. By clicking on the <strong>Details</strong> dropdown next to the Submit button, you will be able to view your submission report once the autograder has finished running. This submission report contains a summary of the tests you have passed or failed, as well as a log of any errors generated by your code when we ran it.

Note that this may take a while depending on how long your code takes to run! Once your code is submitted you may navigate away from the page as you desire -- the most recent submission report will always be available from the Details menu.


<p><strong>Academic Integrity:</strong> We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; <em>please</em> don't let us down. If you do, we will pursue the strongest consequences available to us.
                </p>

<p><strong>Getting Help:</strong> You are not alone!  If you find yourself stuck on something, contact the course staff for help.  Office hours, section, and <a href="https://piazza.com/class/iyag4nk2rsxsv">Piazza</a> are there for your support; please use them.  If you can't make our office hours, let us know and we will schedule more.  We want these projects to be rewarding and instructional, not frustrating and demoralizing.  But we don't know when or how to help unless you ask.  </p>

<h3> Of boys and girls </h3>

<p> Take a look at the files <code>girls.train</code> and <code>boys.train</code>, for example with the Unix command <pre>cat girls.train</pre> 
<pre>
...
Addisyn
Danika
Emilee
Aurora
Julianna
Sophia
Kaylyn
Litzy
Hadassah
</pre>
Believe it or not, these are all more or less common girl names. The problem with the current file is that the names are in plain text, which makes it hard for a machine learning algorithm to do anything useful with them. We therefore need to transform them into a vector format, where each name becomes a vector that represents a point in a high-dimensional input space. </p>

<p>That is exactly what the following Julia function <code>name2features</code> does: </p>

In [1]:
function name2features(filename;B=128,FIX=3,LoadFile=true)
#  Output:
#  X : n feature vectors of dimension B, (nxB)
    
 function hashfeatures(baby)
  v=zeros(B,1);
  for m=1:FIX
    featurestring=string("prefix",baby[1:min(m,length(baby))]);
    v[mod(hash(featurestring),B)+1]=1;
    featurestring=string("suffix",baby[end-min(m,length(baby))+1:end]);
    v[mod(hash(featurestring),B)+1]=1;
  end;
  return(v)
 end;

    # read in baby names
    if LoadFile
     f=open(filename);
     babynames=[n for n=split(readstring(f),"\n") if length(n)>0];
     close(f);
    else
        babynames=split(filename,"\n"); 
    end;
    
    # compute feature vectors
    X=zeros(length(babynames),B);
    for l=1:length(babynames)
        X[l,:]=hashfeatures(babynames[l]);
    end;
 return(X);
end;

<p>Called on <code>girls.train</code> with the default <code>B=128</code>, it reads every name in the file and converts it into a 128-dimensional feature vector. </p> 

<p>Can you figure out what the features are? (Understanding how these features are constructed will help you later on in the competition.)<br></p>
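<p>As a hint, the sketch below lists the strings that <code>hashfeatures</code> hashes for a single name, without building the vector. The helper name <code>featurestrings</code> is hypothetical and not part of the assignment, and the sketch assumes plain ASCII names, since it slices the string by index.</p>

```julia
# Hypothetical helper (not part of the assignment): list the strings that
# hashfeatures would hash for one name. Assumes an ASCII name.
function featurestrings(baby; FIX=3)
    strs = String[]
    for m = 1:FIX
        k = min(m, length(baby))    # never read past the end of a short name
        push!(strs, string("prefix", baby[1:k]))
        push!(strs, string("suffix", baby[end-k+1:end]))
    end
    return strs
end

# Each string is mapped to one of B buckets, exactly as in hashfeatures.
B = 128
for s in featurestrings("Emilee")
    idx = mod(hash(s), B) + 1       # bucket in 1..B; the value depends on the Julia version
    println(s, " -> ", idx)
end
```

<p>For "Emilee" with <code>FIX=3</code>, the hashed strings are the prefixes "E", "Em", "Emi" and the suffixes "e", "ee", "lee", each tagged with "prefix" or "suffix". Two different strings can land in the same bucket, which becomes more likely as <code>B</code> shrinks.</p>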

<p>We have provided you with a Julia function <code>genTrainFeatures</code>, which calls <code>name2features</code>, transforms the names into features, and loads them into memory.</p>

In [2]:
function genTrainFeatures(dimension=128,fix=3);
# This function calls the Julia function name2features 
# to convert names into feature vectors and loads in the training data. 
#
#
# Output: 
#  x: n feature vectors of dimensionality d (nxd)
#  y: n labels (-1 = girl, +1 = boy)
#
## Load in the data
    Xgirls=name2features("girls.train",B=dimension,FIX=fix);
    Xboys=name2features("boys.train",B=dimension,FIX=fix);
    X=[Xgirls;Xboys];
# Generate Labels
    Y=[-ones(size(Xgirls,1),1);ones(size(Xboys,1),1)];

# shuffle data into random order
    ii=randperm(length(Y));
    X=X[ii,:];
    Y=Y[ii];
    return(X,Y)
end;

You can run the following command to load the features and labels of all boy and girl names. 

In [3]:
X,Y=genTrainFeatures(128);


<h3> The Na&iuml;ve Bayes Classifier </h3>

<p> The Na&iuml;ve Bayes classifier is a linear classifier based on Bayes' rule. The following questions ask you to complete these functions in a predefined order. <br>
<strong>As a general rule, you should avoid tight loops at all costs and use vectorized operations instead.</strong></p>
<p>(a) Estimate the class probability P(Y) in <b><code>naivebayesPY</code></b>. This should return the probability that a sample in the training set is positive or negative, independent of its features.
</p>
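<p>To see what the plus-one smoothing in the stub does, here is a minimal sketch on toy labels rather than the assignment data. All variable names ending in <code>toy</code> or <code>smooth</code> are made up for this illustration.</p>

```julia
# Toy labels, not the assignment data: two positives and one negative.
ytoy = [1; 1; -1]

# Same smoothing step as in the stub: append one example of each class,
# so neither class can end up with a count of zero.
ysmooth = [ytoy; -1; 1]
n = length(ysmooth)

# Fraction of the smoothed labels that fall in each class.
pos_toy = sum(ysmooth .== 1) / n     # 3 of 5 labels are +1
neg_toy = sum(ysmooth .== -1) / n    # 2 of 5 labels are -1
```

<p>Without smoothing the estimates would be 2/3 and 1/3; with the two added examples they become 3/5 and 2/5. The effect fades as the training set grows.</p>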


In [5]:
#<GRADED>
function naivebayesPY(x,y)
# Computation of P(Y)
# Input:
#  x : n input vectors of d dimensions (nxd)
#  y : n labels (-1 or +1) (nx1)
#
# Output:
#  pos: probability p(y=1)
#  neg: probability p(y=-1)
#
# add one positive and negative example to avoid division by zero ("plus-one smoothing")
    y=[y;-1;1];
    n=length(y);
    ## fill in code here
    
    return (pos,neg)
end;
#</GRADED>
pos,neg=naivebayesPY(X,Y);

<p>(b) Estimate the conditional probabilities P(X|Y) in <b><code>naivebayesPXY</code></b>.  Use a <b>multinomial</b> distribution as the model. This will return one probability vector per class label, with one entry for each feature.
</p> 
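<p>For reference, a sketch of the usual multinomial estimate, in notation introduced here rather than taken from the project: let $\theta_{\alpha c}$ be the probability of feature $\alpha$ in class $c$, and let the sums run over the smoothed data, that is, including the all-ones row that the stub appends for each class.</p>

$$
\hat\theta_{\alpha c} = \frac{\sum_{i:\,y_i=c} x_{i\alpha}}{\sum_{i:\,y_i=c}\sum_{\beta=1}^{d} x_{i\beta}}
$$

<p>Under this reading, the added all-ones row contributes $1$ to every numerator and $d$ to every denominator, so no feature gets probability zero and each class's vector still sums to one.</p>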

In [6]:
#<GRADED>
function naivebayesPXY(x,y)
# Computation of P(X|Y)
# Input:
#  x : n input vectors of d dimensions (nxd)
#  y : n labels (-1 or +1) (nx1)
#
# Output:
#  posprob: probability vector of p(x|y=1) (1xd)
#  negprob: probability vector of p(x|y=-1) (1xd)
#
# add one positive and negative example to avoid division by zero ("plus-one smoothing")

    n,d=size(x);
    x=[x;ones(2,d)];
    y=[y;-1;1];
    ## fill in code here

    return (posprob,negprob)
end;
#</GRADED>

posprob,negprob=naivebayesPXY(X,Y);

<p>(c) Solve for the log ratio, $\log\left(\frac{P(Y=1 | X)}{P(Y=-1|X)}\right)$, using Bayes' rule.
 Implement this in 
<b><code>naivebayes</code></b>.
</p>
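<p>One way to organize the computation, sketched under the multinomial model from part (b): write $\theta_{\alpha c}$ for the estimated probability of feature $\alpha$ in class $c$ (notation introduced for this sketch only). Bayes' rule rewrites each posterior as likelihood times prior divided by $P(x)$. In the ratio, $P(x)$ and the multinomial coefficient cancel:</p>

$$
\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}
= \log\frac{P(x\mid Y=1)\,P(Y=1)}{P(x\mid Y=-1)\,P(Y=-1)}
= \sum_{\alpha=1}^{d} x_\alpha\left(\log\theta_{\alpha,+1}-\log\theta_{\alpha,-1}\right) + \log P(Y=1) - \log P(Y=-1)
$$

<p>Working with sums of logarithms rather than products of probabilities also avoids numerical underflow when $d$ is large.</p>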


In [7]:
#<GRADED>
function naivebayes(x,y,xtest);
# Computation of the log ratio log(P(Y=1|X=xtest)/P(Y=-1|X=xtest)) using Bayes' rule
# Input:
#  x : n input vectors of d dimensions (nxd)
#  y : n labels (-1 or +1)
#  xtest: input vector of d dimensions (1xd)
#
# Output:
#  logratio: log (P(Y = 1|X=xtest)/P(Y=-1|X=xtest))

## fill in code here

    return(logratio)
end;
#</GRADED>

p=naivebayes(X,Y,transpose(X[1,:]))

1×10 Array{Float64,2}:
 0.861401  1.84994  2.83745  -0.118037  …  -1.96061  -0.477412  -2.73153

<p>(d) Na&iuml;ve Bayes can also be written as a linear classifier.  Implement this in <b><code>naivebayesCL</code></b>.
</p>
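<p>The sketch below checks the linear form numerically on made-up numbers. It assumes the multinomial model, takes the weight of each feature to be the difference of its log probabilities under the two classes, and takes the bias to be the difference of the log class priors. All names ending in <code>_toy</code> are hypothetical.</p>

```julia
# Hypothetical toy numbers, chosen only for illustration (d = 3 features).
posprob_toy = [0.5, 0.3, 0.2]    # stands in for P(x_a | y=+1), sums to 1
negprob_toy = [0.2, 0.3, 0.5]    # stands in for P(x_a | y=-1), sums to 1
pos_toy, neg_toy = 0.6, 0.4      # stands in for P(y=+1) and P(y=-1)

xtoy = [1.0, 0.0, 1.0]           # one toy feature vector

# Log ratio computed directly from the multinomial Naive Bayes model.
direct = sum(xtoy .* log.(posprob_toy)) + log(pos_toy) -
         (sum(xtoy .* log.(negprob_toy)) + log(neg_toy))

# The same quantity written as a linear function w'x + b.
w_toy = log.(posprob_toy) .- log.(negprob_toy)
b_toy = log(pos_toy) - log(neg_toy)
linear = sum(w_toy .* xtoy) + b_toy
```

<p>Both expressions give the same value up to floating-point rounding. Here the two active features cancel each other out, so the positive sign of the result comes entirely from the bias, that is, from the class prior.</p>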


In [8]:
#<GRADED>
function naivebayesCL(x,y);
# Implementation of a Naive Bayes classifier
# Input:
#  x : n input vectors of d dimensions (nxd)
#  y : n labels (-1 or +1)

# Output:
#  w : weight vector
#  b : bias (scalar)


    n,d=size(x);
    # fill in code here
    
    return (w[:],b)
end;
#</GRADED>

w,b=naivebayesCL(X,Y);

<p>(e) Implement <b><code>classifyLinear</code></b>, which applies a linear weight vector and bias to a set of input vectors and outputs their predictions.  (You can use your answer from the previous project.)</p>
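<p>As a reminder of the mechanics, here is a minimal sketch on toy inputs with a hand-picked weight vector and bias. The names are hypothetical, and how to treat a score of exactly zero is left open here.</p>

```julia
# Hypothetical toy inputs (3 points, 2 features) and a hand-picked classifier.
xtoy = [1.0 0.0; 0.0 1.0; 1.0 1.0]
w_toy = [1.0, -2.0]
b_toy = 0.5

scores = xtoy * w_toy .+ b_toy    # one score per row: w'x + b
preds_toy = sign.(scores)         # +1 or -1; a score of exactly 0 would give 0
```

<p>The scores are 1.5, -1.5 and -0.5, so the predictions are +1, -1 and -1. A single matrix-vector product handles all rows at once, with no explicit loop.</p>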
 
 

In [9]:
#<GRADED>
function  classifyLinear(x,w,b=0);
# Make predictions with a linear classifier
# Input:
#  x : n input vectors of d dimensions (nxd)
#  w : weight vector
#  b : bias (optional)
#
# Output:
#  preds: predictions
#

    ## fill in code here
    
    return(preds);
end;
#</GRADED>

error=mean(classifyLinear(X,w,b).!=Y[:]);
@printf("Training error: %2.2f%%",error*100);

Training error: 23.17%

You can now test your code with the following interactive name classification script:

In [None]:
DIMS=1200;
println("Loading data ...")
X,Y=genTrainFeatures(DIMS);
println("Training classifier ...")
w,b=naivebayesCL(X,Y);
error=mean(classifyLinear(X,w,b).!=Y[:]);
@printf("Training error: %2.2f%%\n",error*100);

while true # loop until the user enters an empty line
 print("Please enter your name>")
 yourname=chomp(readline());
    if length(yourname)==0;break;end;
 xtest=name2features(yourname,B=DIMS,LoadFile=false);
    pred=classifyLinear(xtest,w,b)[1];
    if pred>0
        println("$yourname, I am sure you are a nice boy.\n")
 else
        println("$yourname, I am sure you are a nice girl.\n")
 end;    
end;

Loading data ...
Training classifier ...
Training error: 23.17%
Please enter your name>STDIN> Mukund
Mukund, I am sure you are a nice boy.

Please enter your name>STDIN> Kilian
Kilian, I am sure you are a nice boy.

Please enter your name>STDIN> Jenny
Jenny, I am sure you are a nice boy.

Please enter your name>STDIN> Jennifer
Jennifer, I am sure you are a nice boy.

Please enter your name>STDIN> Anne
Anne, I am sure you are a nice girl.

Please enter your name>STDIN> Nika
Nika, I am sure you are a nice girl.

Please enter your name>STDIN> Lillian
Lillian, I am sure you are a nice girl.

Please enter your name>

<h3> Feature Extraction (Competition)</h3>

<p>(f) (<b>optional</b>) As always, this programming project also includes a competition.  We will rank all submissions by how well your Na&iuml;ve Bayes classifier performs on a secret test set. If you want to improve your classifier, modify <code>name2features2</code> below.   The autograder will use your Julia function to extract features and train your classifier on the same training set of names.  The given implementation is identical to <code>name2features</code> above.</p>
  

In [None]:
#<GRADED>
function name2features2(filename;B=128,FIX=3,LoadFile=true)
#  Output:
#  X : n feature vectors of dimension B, (nxB)
    
 function hashfeatures(baby)
  v=zeros(B,1);
  for m=1:FIX
    featurestring=string("prefix",baby[1:min(m,length(baby))]);
    v[mod(hash(featurestring),B)+1]=1;
    featurestring=string("suffix",baby[end-min(m,length(baby))+1:end]);
    v[mod(hash(featurestring),B)+1]=1;
  end;
  return(v)
 end;

    # read in baby names
    if LoadFile
     f=open(filename);
     babynames=[n for n=split(readstring(f),"\n") if length(n)>0];
     close(f);
    else
        babynames=split(filename,"\n"); 
    end;
    
    # compute feature vectors
    X=zeros(length(babynames),B);
    for l=1:length(babynames)
        X[l,:]=hashfeatures(babynames[l]);
    end;
 return(X);
end;
#</GRADED>

<h4>Credits</h4>
  Parts of this webpage were copied from or heavily inspired by John DeNero's and Dan Klein's (awesome) <a href="http://ai.berkeley.edu/project_overview.html">Pacman class</a>. The name classification idea originates from <a href="http://nickm.com">Nick Montfort</a>.