# A Soft Introduction to Machine Learning with Naïve Bayes

For these next two modules, we will be dipping our feet into the metaphorical ocean that is machine learning. This module is meant to familiarize you with the packages and tools you need to start making predictions from a set of data.

This week we will be working with two particular algorithms that classify some target given a set of inputs: Naïve Bayes and Decision Trees. We will provide the methods that let us hit the ground running with these algorithms, along with the complementary data preparation needed to build more accurate models. Keep in mind that there are more abstract mathematical concepts under the hood of these functions, but for the purpose of these two modules we are going to de-emphasize the math involved and focus on how to use the tools. This is not to say that the math is unimportant (quite the opposite, in fact), and the concepts will be discussed in further detail in the Statistical and Mathematical Foundations course as well as throughout the curriculum.

For this module and the next, we will only be discussing four such algorithms and when they are appropriate to use. Naïve Bayes and Decision Trees, the two we will be using today, are not as complex as many of the others, but they are still widely used in the machine learning community. Why? Because they work surprisingly well when making decisions.

Let's begin by reading in some of the dependencies that we will need for this week...

In [2]:
import pandas as pd
import numpy as np 
from sklearn import datasets

### some carpentry...

Remember the `iris` dataset that we used in the `ggplot2` lessons? Well, `sklearn` provides a copy of the dataset too, which is why we call that third line:
```python
from sklearn import datasets
```
This gives us access to some preloaded data that we can begin to play with. Today, we are going to be using the `iris` dataset again, because we are already familiar with it. Take a look at what it looks like...

In [3]:
datasets.load_iris()

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

Whoa! It looks strange! Well, that is because the `datasets` module returns each dataset as a dictionary-like object (a `Bunch`), with the different components of the dataset stored as key-value pairs. What we are interested in are the values of the `data` key and the `target` key. Take a look!

In [4]:
iris = datasets.load_iris() # load the iris dataset from sklearn
data = pd.DataFrame(data=iris.data) # create frame of input data
target = pd.DataFrame(data=iris.target) # create frame of target data

df = pd.concat((data,target), axis=1) # combine input and target together
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'] # column names
df.columns = col_names # name data frame columns

df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


So `iris.data` holds the values of the input data for the `iris` dataset, in other words the sepal length and width and the petal length and width. `iris.target` stores the values of the target variable (the variable we will attempt to predict), which in this case is the species. These values are stored as `numpy` arrays, so we transform them into `pandas` data frame objects\*. We then combine these objects into one data frame, which we call `df`.
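To make the key-value layout concrete, here is a minimal sketch that inspects the loaded object directly. It uses only attributes that `load_iris()` provides; the printed shapes follow from the dataset having 150 flowers with 4 measurements each.

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.keys())           # the components stored in the object
print(iris.data.shape)       # (150, 4): 150 flowers, 4 measurements each
print(iris.target.shape)     # (150,): one integer label per flower
print(iris.feature_names)    # names of the four measurement columns
print(iris.target_names)     # species names, ordered by integer label
```

The position of each name in `iris.target_names` is the integer used for that species in `iris.target`.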

Oh, but one more bit of carpentry. We are going to go ahead and transform the values in the `species` column to the species names.
    
\* *This is purely for easier representation of the data. You can actually skip this step in the learning process, but data frames make it easier to visualize what we are working with.*

In [5]:
vals_to_replace = {0:'setosa', 1:'versicolor', 2:'virginica'}
df['species'] = df['species'].map(vals_to_replace)

To change the values from integers to strings, I created a dictionary in which each original integer is a key and the matching species name is its value. Passing that dictionary to `map()` replaces every value in the column that matches a key with the corresponding name. Here is what we have below...

In [6]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
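One detail of dictionary mapping is worth seeing on a small example: any value that has no matching key becomes missing (`NaN`) rather than raising an error. The toy series `codes` below is made up for illustration and is not part of the iris data.

```python
import pandas as pd

# Toy integer labels; the value 3 deliberately has no entry in the dictionary
codes = pd.Series([0, 1, 2, 1, 3])
names = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

mapped = codes.map(names)    # look up each value in the dictionary
print(mapped)
print(mapped.isna().sum())   # number of values that found no matching key
```

Checking for missing values after a mapping like this is a quick way to catch a key that was left out of the dictionary.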


### Naïve Bayes Classification

Naïve Bayes classification is a popular machine learning algorithm for prediction. It is fast to train, scales well to large datasets and, despite its simplicity, can be competitive with more complex algorithms on some problems. Let's see if we can use it to predict the species of iris given the measurements taken.

First we have to import the class that will perform the task...

In [7]:
from sklearn.naive_bayes import GaussianNB
nbc = GaussianNB()

There we have it. We imported the Gaussian Naïve Bayes classifier and created an instance of it named `nbc`. Now we need to define our input variables, which we will call `X`, and our target, `y`. Remember, our inputs are the measurements and our target is the species.

In [8]:
X = iris.data
y = np.asarray(df.species)

There we go. From here, we want our model to predict our target variable given a set of inputs. To do this we must train our model by giving it the data objects we just defined above. We do this by calling the method `fit()`...

In [9]:
nbc.fit(X,y)

GaussianNB(priors=None, var_smoothing=1e-09)
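If you are curious what `fit()` actually stored, the sketch below peeks at a few of the fitted attributes. It is a standalone illustration, not part of the walkthrough: the name `model` is ours, and it trains on the integer labels in `iris.target` rather than the species names to keep the example short.

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
model = GaussianNB().fit(iris.data, iris.target)

print(model.classes_)       # the class labels the model knows about
print(model.class_prior_)   # share of training rows that fall in each class
print(model.theta_)         # per-class mean of each feature, shape (3, 4)
```

Each row of `theta_` is the average flower of one species, which is part of what the model compares a new flower against.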

And that's it!...kind of. But aren't we missing a couple things? For example, what if we had some new points that we wanted to predict? How do we do that? 

Imagine that someone plopped an iris down at your desk and asked you what species it was. Well, they are asking the wrong person because you know nothing about irises...BUT you have a model that does. So you measure the plant dimensions and you come out with this list:

* sepal l: 5
* sepal w: 3
* petal l: 2
* petal w: .5

You can now plug these measurements into your model and predict the species that is in front of you. It's as simple as calling `nbc.predict()`.

In [10]:
print(nbc.predict([[5, 3, 2, 0.5]]))

['setosa']
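Besides a single label, the classifier can also report how probable it considers each species via `predict_proba()`. The sketch below is self-contained and rebuilds the model under the assumed name `model`; it derives the species names from `iris.target_names` instead of the data frame.

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
species = iris.target_names[iris.target]   # integer labels -> species names
model = GaussianNB().fit(iris.data, species)

# One row of probabilities, with columns in the order of model.classes_
probs = model.predict_proba([[5, 3, 2, 0.5]])
for name, p in zip(model.classes_, probs[0]):
    print(f"{name}: {p:.4f}")
```

The predicted label is simply the species with the highest probability in that row.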


"SETOSA!" we tell our curious colleague. But wait...how do we know that our model was good? Well, it would be nice to assess our model. How could we do this though? Well we can find out how many targets it misclassified.

We can create another array holding the model's predictions for the same inputs we trained it on. We will call this array `y_pred`. We can then count how many points in our target variable `y` do NOT match `y_pred`.

In [11]:
y_pred = nbc.fit(X, y).predict(X)
print("Number of mislabeled points out of a total {} points : {}"
      .format(iris.data.shape[0],(y != y_pred).sum()))

Number of mislabeled points out of a total 150 points : 6


Not too shabby! Only 6 misclassified points! Or in other words...

In [12]:
1 - ((y != y_pred).sum()/iris.data.shape[0])

0.96

96% of the points were classified correctly!
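The same figure can be obtained without doing the arithmetic by hand. As a sketch, `accuracy_score` from `sklearn.metrics` and the classifier's own `score()` method both return the fraction of correct predictions; the names `model` and `acc` below are ours.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X = iris.data
y = iris.target_names[iris.target]   # species names as the target

model = GaussianNB().fit(X, y)
y_pred = model.predict(X)

acc = accuracy_score(y, y_pred)      # fraction of predictions matching y
print(acc)
print(model.score(X, y))             # predicts on X internally, same result
```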

We can now see which points were incorrectly classified. There are many ways to do this, but I would like to see it in data frame format, since it is organized and makes it easy to view all the variables at once. To do that, we have to create two new columns: one that holds the predicted species (`y_pred`) and another that shows whether or not the prediction was correct (either `True` or `False`).

In [13]:
df['evaluation'] = y_pred == y
df['pred'] = y_pred

Well, that was simple enough. Now we can filter the data frame where the `evaluation` column values equal `False` to see what it misclassified.

In [14]:
df[~df['evaluation']]  # keep only the rows where evaluation is False

| | sepal_length | sepal_width | petal_length | petal_width | species | evaluation | pred |
|---|---|---|---|---|---|---|---|
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor | False | virginica |
| 70 | 5.9 | 3.2 | 4.8 | 1.8 | versicolor | False | virginica |
| 77 | 6.7 | 3.0 | 5.0 | 1.7 | versicolor | False | virginica |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 | virginica | False | versicolor |
| 119 | 6.0 | 2.2 | 5.0 | 1.5 | virginica | False | versicolor |
| 133 | 6.3 | 2.8 | 5.1 | 1.5 | virginica | False | versicolor |


It appears that the measurements of versicolor and virginica overlap a bit, so the model misclassifies some points that fall between the two species.
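A compact way to see which species get confused with which is a confusion matrix. The sketch below uses `confusion_matrix` from `sklearn.metrics`; the names `model`, `cm`, and `table` are ours, and the model is rebuilt here so the block runs on its own.

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X = iris.data
y = iris.target_names[iris.target]

model = GaussianNB().fit(X, y)
y_pred = model.predict(X)

# Rows are the true species, columns are the predicted species
cm = confusion_matrix(y, y_pred, labels=model.classes_)
table = pd.DataFrame(cm, index=model.classes_, columns=model.classes_)
print(table)
```

Counts on the diagonal are correct predictions, and anything off the diagonal is a misclassification between that row's species and that column's species.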

### To come...

So, we learned a little bit about machine learning in this notebook using a Naïve Bayes Classifier. However, there are several things we should take into consideration when constructing a model. What are the concepts of training and testing data, and what is the difference between the two? How do we know that all of the features are contributing to the model? Can we refine our model by selecting only certain variables? During the practices, we will tackle these questions.
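As a small preview of the first question, here is a rough sketch of holding some data back for testing with `train_test_split`. The 30% test size, the `random_state` value, and the use of `stratify` are arbitrary choices made for illustration, not settings prescribed by this module.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# Hold back 30% of the flowers; stratify keeps the species balanced in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

model = GaussianNB().fit(X_train, y_train)   # learn from the training part only
train_acc = model.score(X_train, y_train)    # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)       # accuracy on data it has not seen
print(train_acc, test_acc)
```

The test accuracy is the more honest estimate of how the model would do on a new iris, since those flowers played no part in training.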