In [1]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats

# Naive Bayes Classifier [30 pts]

## Introduction
Naive Bayes is a class of simple classifiers based on the Bayes' Rule and strong (or naive) independence assumptions between features. In this problem, you will implement a Naive Bayes Classifier for the Census Income Data Set from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/) (which is a good website to browse through for datasets).

## Dataset Description
The dataset consists 32561 instances, each representing an individual. The goal is to predict whether a person makes over 50K a year based on the values of 14 features. The features, extracted from the 1994 Census database, are a mix of continuous and discrete attributes. These are enumerated below:

#### Continuous (real-valued) features
- age
- final_weight (computed from a number of attributes outside of this dataset; people with similar demographic attributes have similar values)
- education_num
- capital_gain
- capital_loss
- hours_per_week

#### Categorical (discrete) features 
- work_class: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
- marital_status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- native_country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

## Q1. Input preparation [2 pts]
First, you need to load in the above data, provided to you as a CSV file. As the data is from UCI repository, it is already quite clean. However, some instances contain missing values (represented as ? in the CSV file) and these have to be discarded from the training set. Also, replace the `income` column with `label`, which is 1 if `income` is `>50K` and 0 otherwise.

In [2]:
def load_data(file_name):
    """ loads and processes data in the manner specified above
    Inputs:
        file_name (str): path to csv file containing data
    Outputs:
        pd.DataFrame: processed dataframe
    """
    df = pd.read_csv(file_name, na_values=['#VALUE!', '?','', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'])
    incomelabel = df['income'].map({"<=50K":0,">50K":1})
    df = df.assign(label = incomelabel)
    del df['income']
    return df.dropna().reset_index(drop=True)

# AUTOLAB_IGNORE_START
df = load_data('census.csv')
print df.tail()
print len(df)
# AUTOLAB_IGNORE_STOP

       age    work_class  final_weight   education  education_num  \
30157   27       Private        257302  Assoc-acdm             12   
30158   40       Private        154374     HS-grad              9   
30159   58       Private        151910     HS-grad              9   
30160   22       Private        201490     HS-grad              9   
30161   52  Self-emp-inc        287927     HS-grad              9   

           marital_status         occupation relationship   race     sex  \
30157  Married-civ-spouse       Tech-support         Wife  White  Female   
30158  Married-civ-spouse  Machine-op-inspct      Husband  White    Male   
30159             Widowed       Adm-clerical    Unmarried  White  Female   
30160       Never-married       Adm-clerical    Own-child  White    Male   
30161  Married-civ-spouse    Exec-managerial         Wife  White  Female   

       capital_gain  capital_loss  hours_per_week native_country  label  
30157             0             0              38  Uni

Our reference code yields the following output (pay attention to the index):
```python
>>> print df.dtypes
age                int64
work_class        object
final_weight       int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
label              int64
dtype: object
    
>>> print df.tail()
       age    work_class  final_weight   education  education_num  \
30157   27       Private        257302  Assoc-acdm             12   
30158   40       Private        154374     HS-grad              9   
30159   58       Private        151910     HS-grad              9   
30160   22       Private        201490     HS-grad              9   
30161   52  Self-emp-inc        287927     HS-grad              9   

           marital_status         occupation relationship   race     sex  \
30157  Married-civ-spouse       Tech-support         Wife  White  Female   
30158  Married-civ-spouse  Machine-op-inspct      Husband  White    Male   
30159             Widowed       Adm-clerical    Unmarried  White  Female   
30160       Never-married       Adm-clerical    Own-child  White    Male   
30161  Married-civ-spouse    Exec-managerial         Wife  White  Female   

       capital_gain  capital_loss  hours_per_week native_country  label  
30157             0             0              38  United-States      0  
30158             0             0              40  United-States      1  
30159             0             0              40  United-States      0  
30160             0             0              20  United-States      0  
30161         15024             0              40  United-States      1  
>>> print len(df)
30162
```

## Overview of Naive Bayes classifier
Let $X_1, X_2, \ldots, X_k$ be the $k$ features of a dataset, with class label given by the variable $y$. A probabilistic classifier assigns the most probable class to each instance $(x_1,\ldots,x_k)$, as expressed by
$$ \hat{y} = \arg\max_y P(y\ |\ x_1,\ldots,x_k) $$

Using Bayes' theorem, the above *posterior probability* can be rewritten as
$$ P(y\ |\ x_1,\ldots,x_k) = \frac{P(y) P(x_1,\ldots,x_n\ |\ y)}{P(x_1,\ldots,x_k)} $$
where
- $P(y)$ is the prior probability of the class
- $P(x_1,\ldots,x_k\ |\ y)$ is the likelihood of data under a class
- $P(x_1,\ldots,x_k)$ is the evidence for data

Naive Bayes classifiers assume that the feature values are conditionally independent given the class label, that is,
$ P(x_1,\ldots,x_n\ |\ y) = \prod_{i=1}^{k}P(x_i\ |\ y) $. This strong assumption helps simplify the expression for posterior probability to
$$ P(y\ |\ x_1,\ldots,x_k) = \frac{P(y) \prod_{i=1}^{k}P(x_i\ |\ y)}{P(x_1,\ldots,x_k)} $$

For a given input $(x_1,\ldots,x_k)$, $P(x_1,\ldots,x_k)$ is constant. Hence, we can simplify omit the denominator replace the equality sign with proportionality as follows:
$$ P(y\ |\ x_1,\ldots,x_k) \propto P(y) \prod_{i=1}^{k}P(x_i\ |\ y) $$

Thus, the class of a new instance can be predicted as $\hat{y} = \arg\max_y P(y) \prod_{i=1}^{k}P(x_i\ |\ y)$. Here, $P(y)$ is commonly known as the **class prior** and $P(x_i\ |\ y)$ termed **feature predictor**. The rest of the assignment deals with how each of these $k+1$ probability distributions -- $P(y), P(x_1\ |\ y), \ldots, P(x_k\ |\ y)$ -- are estimated from data.


**Note**: Observe that the computation of the final expression above involve multiplication of $k+1$ probability values (which can be really low). This can lead to an underflow of numerical precision. So, it is a good practice to use a log transform of the probabilities to avoid this underflow.

** TL;DR ** Your final take away from this cell is the following expression:
$$\hat{y} = \arg\max_y \underbrace{\log P(y)}_{log-prior} + \underbrace{\sum_{i=1}^{k} \log P(x_i\ |\ y)}_{log-likelihood}$$

Each term in the sum for log-likelihood can be regarded a partial log-likelihood based on a particular feature alone.

## Feature Predictor
The beauty of a Naive Bayes classifier lies in the fact we can mix-and-match different likelihood models for each feature predictor according to the prior knowledge we have about it and these models can be varied independent of each other. For example, we might know that $P(X_i|y)$ for some continuous feature $X_i$ is normally distributed or that $P(X_i|y)$ for some categorical feature follows multinomial distribution. In such cases, we can directly plugin the pdf/pmf of these distributions in place of $P(x_i\ |\ y)$.

In this assignment, you will be using two classes of likelihood models:
- Gaussian model, for continuous real-valued features (parameterized by mean $\mu$ and variance $\sigma$)
- Categorical model, for discrete features (parameterized by $\mathbf{p} = <p_0,\ldots,p_{l-1}>$, where $l$ is the number of values taken by this categorical feature)

You need to implement a predictor class for each likelihood model. Each predictor should implement two functionalities:
- **Parameter estimation `init()`**: Learn parameters of the likelihood model using MLE (Maximum Likelihood Estimator). You need to keep track of $k$ sets of parameters, one for each class.
- **Partial Log-Likelihood computation for *this* feature `partial_log_likelihood()`**: Use the learnt parameters to compute the probability (density/mass for continuous/categorical features) of a given feature value. Report np.log() of this value.

The parameter estimation is for the conditional distributions $P(X|Y)$. Thus, while estimating parameters for a specific class (say class 0), you will use only those data points in the training set (or rows in the input data frame) which have class label 0.

## Q2. Gaussian Feature Predictor [8pts]
The Guassian distribution is characterized by two parameters - mean $\mu$ and standard deviation $\sigma$:
$$ f_Z(z) = \frac{1}{\sqrt{2\pi}\sigma} \exp{(-\frac{(z-\mu)^2}{2\sigma^2})} $$

Given $n$ samples $z_1, \ldots, z_n$ from the above distribution, the MLE for mean and standard deviation are:
$$ \hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} z_j $$

$$ \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (z_j-\hat{\mu})^2} $$

`scipy.stats.norm` would be helpful.

In [3]:
from scipy.stats import norm

class GaussianPredictor:
    """ Feature predictor for a normally distributed real-valued, continuous feature.
        Attributes: 
            mu (array_like) : vector containing per class mean of the feature
            sigma (array_like): vector containing per class std. deviation of the feature
    """
    # feel free to define and use any more attributes, e.g., number of classes, etc
    def __init__(self, x, y) :
        """ initializes the predictor statistics (mu, sigma) for Gaussian distribution
        Inputs:
            x (array_like): feature values (continuous)
            y (array_like): class labels (0,...,k-1)
        """
        x = pd.Series(x)
        y = pd.Series(y)
        self.k = len(y.unique())
        self.ycats = set(y)
        self.mu = np.array([np.mean(x[y==yi]) for yi in self.ycats])
        self.sigma = np.array([np.std(x[y==yi]) for yi in self.ycats])
    
    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class
        Inputs:
            x (array_like): vector of feature values
        Outputs:
            (array_like): matrix of log likelihood for this feature alone
        """
        x = pd.Series(x)
        
        # create dict of norm dist for each y
        norms = {yi:norm(self.mu[yi], self.sigma[yi]) for yi in self.ycats}
        
        def plli(xi):
            probs = [np.log(norms[label].pdf(xi)) for label in self.ycats]
            return probs
        plls = [plli(xi) for i,xi in x.iteritems()]
        return plls

# AUTOLAB_IGNORE_START
f = GaussianPredictor(df['age'], df['label'])
print "f.mu", f.mu
print "f.sigma",f.sigma
f.partial_log_likelihood([43,40,100,10])
# AUTOLAB_IGNORE_STOP

f.mu [ 36.60806039  43.95911028]
f.sigma [ 13.46433407  10.2689489 ]


[[-3.6316676632712879, -3.252424898473627],
 [-3.5507147309350287, -3.3223844915890508],
 [-14.602263367129872, -18.139207161753209],
 [-5.4716430361031225, -8.7160898926695758]]

In [7]:
x = pd.Series(["a","a","b"])
x.unique()

for xi in x.unique():
    print xi

a
b


Our reference code gives the following output:
```python
>>> f.mu
array([ 36.60806039  43.95911028])
>>> f.sigma
array([ 13.46433407  10.2689489 ])
>>> f.partial_log_likelihood([43,40,100,10])
array([[ -3.63166766,  -3.2524249 ],
       [ -3.55071473,  -3.32238449],
       [-14.60226337, -18.13920716],
       [ -5.47164304,  -8.71608989]])
```

## Q3. Categorical Feature Predictor [8pts]
The categorical distribution with $l$ categories $\{0,\ldots,l-1\}$ is characterized by parameters $\mathbf{p} = (p_0,\dots,p_{l-1})$:
$$ P(z; \mathbf{p}) = p_0^{[z=0]}p_1^{[z=1]}\ldots p_{l-1}^{[z=l-1]} $$

where $[z=t]$ is 1 if $z$ is $t$ and 0 otherwise.

Given $n$ samples $z_1, \ldots, z_n$ from the above distribution, the smoothed-MLE for each $p_t$ is:
$$ \hat{p_t} = \frac{n_t + \alpha}{n + l\alpha} $$

where $n_t = \sum_{j=1}^{n} [z_j=t]$, i.e., the number of times the label $t$ occurred in the sample. The smoothing is done to avoid zero-count problem (similar in spirit to $n$-gram model in NLP).

**Note:** You have to learn the number of classes and the number and value of labels from the data. We might be testing your code on a different categorical feature.

In [9]:
from collections import Counter
 
class CategoricalPredictor:
    """ Feature predictor for a categorical feature.
        Attributes: 
            p (dict) : dictionary of vector containing per class probability of a feature value;
                    the keys of dictionary should exactly match the values taken by this feature
    """
    # feel free to define and use any more attributes, e.g., number of classes, etc
    def __init__(self, x, y, alpha=1) :
        """ initializes the predictor statistics (mu, sigma) for Gaussian distribution
        Inputs:
            x (array_like): feature values (continuous)
            y (array_like): class labels (0,...,k-1)
        """
        x = pd.Series(x)
        y = pd.Series(y)
        
        xcategories = set(x)
        ycategories = set(y)
        
        xcounts = Counter(x)
#         print "xcounts",xcounts
        ycounts = Counter(y)
        self.yprobs = {k:1.0*v/len(y) for k,v in ycounts.iteritems()}
#         print self.yprobs
        
        
        self.p = {xi:[] for xi in xcounts.iterkeys()}
        
        for ycategory in ycategories:
            # get x vector grouped by current y category
            x_y = x[y==ycategory]
            
            xcount_Ycat = Counter(x_y)
            total_count_Y = sum(xcount_Ycat.values())
#             print "xcountinYcat",xcount_Ycat, "label:", ycategory
            
            for xcategory in xcategories:
                numerator = float(xcount_Ycat[xcategory] + alpha)
                denominator = float(total_count_Y + alpha*len(xcategories))
                self.p[xcategory].append(numerator/denominator)
                
        # convert to array
        self.p = {k:np.array(v) for k,v in self.p.iteritems()}
    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class
        Inputs:
            x (array_like): vector of feature values
        Outputs:
            (array_like): matrix of log likelihood for this feature
        """
        x = pd.Series(x)
        
        def plli(xi):
            probs = [np.log(self.p[xi][label]) for label in self.yprobs.iterkeys()]
            return probs
        plls = [plli(xi) for i,xi in x.iteritems()]
        return plls
    
# AUTOLAB_IGNORE_START
f = CategoricalPredictor(df['sex'], df['label'])
print "\n",f.p
f.partial_log_likelihood(['Male','Female','Male'])

# AUTOLAB_IGNORE_STOP


{'Male': array([ 0.61727578,  0.8517976 ]), 'Female': array([ 0.38272422,  0.1482024 ])}


[[-0.48243939085738619, -0.16040633530804277],
 [-0.96044059311628582, -1.9091763934826356],
 [-0.48243939085738619, -0.16040633530804277]]

In [5]:
# AUTOLAB_IGNORE_START
df.loc[:,['sex','label']].tail()
# AUTOLAB_IGNORE_STOP

Unnamed: 0,sex,label
30157,Female,0
30158,Male,1
30159,Female,0
30160,Male,0
30161,Female,1


Our reference code gives the following output:
```python
>>> f.p
{'Female': array([ 0.38272422,  0.1482024 ]),
 'Male': array([ 0.61727578,  0.8517976 ])}
>>> f.partial_log_likelihood(['Male','Female','Male'])
array([[-0.48243939 -0.16040634]
       [-0.96044059 -1.90917639]
       [-0.48243939 -0.16040634]])
```

## Q4. Putting things together [10pts]
It's time to put all the feature predictors together and do something useful! You will implement two functions in the following class.

1. **__init__()**: Compute the log prior for each class and initialize the feature predictors (based on feature type). The smoothed prior for class $t$ is given by
$$ prior(t) = \frac{n_t + \alpha}{n + k\alpha} $$
where $n_t = \sum_{j=1}^{n} [y_j=t]$, i.e., the number of times the label $t$ occurred in the sample. 

2. **predict()**: For each instance and for each class, compute the sum of log prior and partial log likelihoods for all features. Use it to predict the final class label. Break ties by predicting the class with lower id.

**Note:** Your implementation should not assume anything about the schema of the input data frame or the number of classes. The only guarantees you have are: (1) there will be a `label` column with values $0,\ldots,k-1$ for some $k$. And the datatypes of the columns will be either `object` (string, categorical) or `int64` (integer).

In [69]:
class NaiveBayesClassifier:
    """ Naive Bayes classifier for a mixture of continuous and categorical attributes.
        We use GaussianPredictor for continuous attributes and MultinomialPredictor for categorical ones.
        Attributes:
            predictor (dict): model for P(X_i|Y) for each i
            log_prior (array_like): log P(Y)
    """
    # feel free to define and use any more attributes, e.g., number of classes, etc
    def __init__(self, df, alpha=1):
        """initializes predictors for each feature and computes class prior
        Inputs:
            df (pd.DataFrame): processed dataframe, without any missing values.
        """
        
        xcols = set(df.columns)
        if "label" in xcols:
            xcols.remove("label")
            
        
        ycounts = Counter(df['label'])
        self.yprobsmooth = {k:(1.0*v+alpha)/(1.0*len(df)+len(ycounts)*alpha) for k,v in ycounts.iteritems()}
        self.log_prior = np.array([np.log(prob) for y,prob in self.yprobsmooth.iteritems()])
        
            
        self.predictor = {}
        for col in xcols:
            if df[col].dtype=='object':
                f = CategoricalPredictor(df[col], df['label'],alpha)
            else:
                f = GaussianPredictor(df[col], df['label'])
            self.predictor[col] = f
                

    def predict(self, x):
        """ predicts label for input instances from log_prior and partial_log_likelihood of feature predictors
        Inputs:
            x (pd.DataFrame): processed dataframe, without any missing values and without class label.
        Outputs:
            (array_like): array of predicted class labels (0,..,k-1)
        """
        
        xcols = set(x.columns)
        if "label" in xcols:
            xcols.remove("label")
            
#         self.new = pd.DataFrame(self.predictor[xcol].partial_log_likelihood(x[xcol]) for xcol in xcols)
        self.probs = 0.0
        for xcol in xcols:
            self.probs += np.array(self.predictor[xcol].partial_log_likelihood(x[xcol]))
            
        self.probs = self.probs + self.log_prior
        self.probs = pd.DataFrame(self.probs)
        
        preds = self.probs.idxmax(axis=1)
        return preds.values
    
# AUTOLAB_IGNORE_START
c = NaiveBayesClassifier(df, 0)
print c.log_prior
print c.predictor
y_pred = c.predict(df)
y_pred
# AUTOLAB_IGNORE_STOP

[-0.28624642 -1.39061374]
{'hours_per_week': <__main__.GaussianPredictor instance at 0x000000000F1C1708>, 'native_country': <__main__.CategoricalPredictor instance at 0x000000000F1C1388>, 'relationship': <__main__.CategoricalPredictor instance at 0x000000000FED0408>, 'work_class': <__main__.CategoricalPredictor instance at 0x000000000FED0208>, 'age': <__main__.GaussianPredictor instance at 0x000000000FED02C8>, 'marital_status': <__main__.CategoricalPredictor instance at 0x000000000FED0E88>, 'sex': <__main__.CategoricalPredictor instance at 0x000000000FED0F48>, 'race': <__main__.CategoricalPredictor instance at 0x000000000FED0108>, 'education_num': <__main__.GaussianPredictor instance at 0x000000000FED0DC8>, 'final_weight': <__main__.GaussianPredictor instance at 0x000000000FED0188>, 'capital_loss': <__main__.GaussianPredictor instance at 0x000000000FED0288>, 'education': <__main__.CategoricalPredictor instance at 0x000000000FED0448>, 'capital_gain': <__main__.GaussianPredictor instance

array([ 0.,  0.,  0., ...,  0.,  0.,  1.])

In [70]:
# AUTOLAB_IGNORE_START
print c.log_prior
print c.predictor
print c.predictor['hours_per_week'].mu 
print c.predictor['hours_per_week'].sigma
print c.predictor['work_class'].p
print y_pred.shape
y_pred
# AUTOLAB_IGNORE_STOP

[-0.28624642 -1.39061374]
{'hours_per_week': <__main__.GaussianPredictor instance at 0x000000000F1C1708>, 'native_country': <__main__.CategoricalPredictor instance at 0x000000000F1C1388>, 'relationship': <__main__.CategoricalPredictor instance at 0x000000000FED0408>, 'work_class': <__main__.CategoricalPredictor instance at 0x000000000FED0208>, 'age': <__main__.GaussianPredictor instance at 0x000000000FED02C8>, 'marital_status': <__main__.CategoricalPredictor instance at 0x000000000FED0E88>, 'sex': <__main__.CategoricalPredictor instance at 0x000000000FED0F48>, 'race': <__main__.CategoricalPredictor instance at 0x000000000FED0108>, 'education_num': <__main__.GaussianPredictor instance at 0x000000000FED0DC8>, 'final_weight': <__main__.GaussianPredictor instance at 0x000000000FED0188>, 'capital_loss': <__main__.GaussianPredictor instance at 0x000000000FED0288>, 'education': <__main__.CategoricalPredictor instance at 0x000000000FED0448>, 'capital_gain': <__main__.GaussianPredictor instance

array([ 0.,  0.,  0., ...,  0.,  0.,  1.])

Our reference code gives the following output:
```python
>>> c.log_prior
array([-0.28624642, -1.39061374])
>>> c.predictor
{'age': <__main__.GaussianPredictor instance at 0x115edbcb0>,
 'capital_gain': <__main__.GaussianPredictor instance at 0x114c19320>,
 'capital_loss': <__main__.GaussianPredictor instance at 0x114c19998>,
 'education': <__main__.CategoricalPredictor instance at 0x114c04638>,
 'education_num': <__main__.GaussianPredictor instance at 0x114c04f38>,
 'final_weight': <__main__.GaussianPredictor instance at 0x114c045a8>,
 'hours_per_week': <__main__.GaussianPredictor instance at 0x114c19ef0>,
 'marital_status': <__main__.CategoricalPredictor instance at 0x114c047a0>,
 'native_country': <__main__.CategoricalPredictor instance at 0x114c19f80>,
 'occupation': <__main__.CategoricalPredictor instance at 0x114c195a8>,
 'race': <__main__.CategoricalPredictor instance at 0x114c19bd8>,
 'relationship': <__main__.CategoricalPredictor instance at 0x114c19a28>,
 'sex': <__main__.CategoricalPredictor instance at 0x114c19d40>,
 'work_class': <__main__.CategoricalPredictor instance at 0x115edbb90>}
>>> c.predictor['hours_per_week'].mu
array([ 39.34859186  45.70657965])
>>> c.predictor['hours_per_week'].sigma
array([ 11.95051037  10.73627157])
>>> c.predictor['work_class'].p
{'Federal-gov': array([ 0.02551426,  0.04861481]),
 'Local-gov': array([ 0.0643595 ,  0.08111348]),
 'Private': array([ 0.7685177,  0.6494406]),
 'Self-emp-inc': array([ 0.02092346,  0.07991476]),
 'Self-emp-not-inc': array([ 0.07879403,  0.09509856]),
 'State-gov': array([ 0.04127306,  0.04581779]),
 'Without-pay': array([ 0.00061799,  0.        ])}
>>> y_pred.shape
(30162,)
>>> y_pred
array([0, 0, 0, ..., 0, 0, 1])
```

## Q5. Evaluation - Error rate [2pts]
If a classifier makes $n_e$ errors on a data of size $n$, its error rate is $n_e/n$. Fill the following function, to evaluate your classifier.

In [73]:
def evaluate(y_hat, y):
    """ Evaluates classifier predictions
        Inputs:
            y_hat (array_like): output from classifier
            y (array_like): true class label
        Output:
            (double): error rate as defined above
    """
    return 1.0*sum(1 for yi,y_hati in zip(y,y_hat) if yi != y_hati)/len(y)
    

# AUTOLAB_IGNORE_START
evaluate(y_pred, df['label'])
# AUTOLAB_IGNORE_STOP

0.17243551488628076

Our implementation yields 0.17240236058616804.