# Practical Machine Learning With Python - Part 2

In the <a href="https://savan77.github.io/blog/machine-learning-part1.html">previous post</a>, I explained what machine learning is, the types of machine learning, linear regression, logistic regression, issues to consider such as overfitting, and finally what learning really means in machine learning. In the <a href="https://savan77.github.io/blog/lab-machine-learning-part1.html">lab session</a>, I showed how to implement the algorithms and concepts from the theory session in Python.

In this session, I will explain some simple yet powerful machine learning algorithms: <b>naive Bayes, support vector machines, and decision trees</b>. From now on, I will not write separate parts for the theory and lab sessions. Instead, I will integrate theory with code in a Jupyter notebook. If you are unfamiliar with Jupyter notebooks, please go through the <a href="http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Notebook%20Basics.html">Jupyter Notebook Basics Guide</a>.

## Naive Bayes

<b>Naive Bayes</b> is a supervised learning algorithm based on <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem"><b>Bayes' theorem</b></a> and is widely used for classification. The word <b>naive</b> comes from the assumption of independence among features. That is, given a feature vector (input vector) (x<sub>1</sub>, x<sub>2</sub>,...,x<sub>n</sub>), the x<sub>i</sub>'s are assumed to be conditionally independent given <i>y</i>. We can write Bayes' theorem as follows:<br><br>


\begin{align}
P( y | x ) = \frac{P(y)P(x | y)}{P(x)}
\end{align}

where,<br><br>
P(x) is the marginal probability of the feature, also known as the evidence.<br>
P(x | y) is the probability of the feature given the target, also known as the likelihood.<br>
P(y) is the prior probability of the target (the class, in the case of classification).<br>
P(y | x) is the posterior probability of the target given the feature.<br>
<br>
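To make the four terms concrete, here is a minimal numeric sketch of Bayes' theorem for a single feature. All probabilities are hypothetical values assumed for illustration (an email containing a particular word, with target "spam"); they do not come from any dataset.

```python
# Hypothetical numbers, assumed only to illustrate Bayes' theorem.
p_spam = 0.3               # P(y): prior probability that an email is spam
p_word_given_spam = 0.6    # P(x | y): likelihood of the word in spam emails
p_word_given_ham = 0.05    # likelihood of the word in non-spam (ham) emails

# P(x): the evidence, obtained by summing over both classes
p_word = p_spam * p_word_given_spam + (1 - p_spam) * p_word_given_ham

# P(y | x): posterior probability of spam given that the word is present
posterior = p_spam * p_word_given_spam / p_word
print(round(posterior, 3))
```

With these assumed numbers the posterior is about 0.837, so seeing the word raises the probability of spam from the prior of 0.3.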
When we have more than one feature, we can rewrite this equation as:

\begin{align}
P( y | x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n | y)}{P(x_1,...,x_n)}
\end{align}


Consider spam classification as an example: the input (feature vector) is a set of words and the output is spam or ham (1 or 0). In naive Bayes, we calculate the probability of each class (spam or ham) given the feature vector, and the class with the highest probability becomes the output. Our task is therefore to evaluate the above equation for each class. Let us dig deeper into this equation and see how to use it to find the probability of each class.<br>
Using the naive Bayes assumption, we can write:<br><br>
\begin{align}
P(x_i | y, x_1,...,x_{i-1},x_{i+1},...,x_n) = P(x_i | y)
\end{align}

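Applying this to every feature lets the joint likelihood factor into a product of per-feature likelihoods, so we can rewrite Bayes' theorem as follows: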
<br><br>
\begin{align}
P( y | x_1,...,x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i| y)}{P(x_1,...,x_n)}
\end{align}

Since <b>P(x<sub>1</sub>, x<sub>2</sub>, .., x<sub>n</sub>)</b> is constant for a given input, it does not change which class scores highest. So we can say that
<br><br>

\begin{align}
P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y)
\end{align}
<br>
For the classification rule (where we want to find the class with the maximum probability), we can write the equation as:
<br><br>
\begin{align}
\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)
\end{align}
<br>
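The classification rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the priors and per-word likelihoods are hypothetical numbers, and `classify` is a helper name made up for this example. The sketch sums logarithms instead of multiplying probabilities; because the logarithm is monotonic, this does not change which class wins, and it avoids numerical underflow when there are many features.

```python
import math

# Hypothetical probabilities, assumed purely for illustration.
priors = {"spam": 0.4, "ham": 0.6}                      # P(y)
likelihoods = {                                         # P(x_i | y)
    "spam": {"free": 0.30, "meeting": 0.02, "winner": 0.20},
    "ham": {"free": 0.05, "meeting": 0.25, "winner": 0.01},
}

def classify(words):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in words:
            score += math.log(likelihoods[label][word])
        scores[label] = score
    # arg max over classes
    return max(scores, key=scores.get), scores

label, scores = classify(["free", "winner"])
print(label, scores)
```

For the words "free" and "winner", the unnormalized scores are 0.4 × 0.30 × 0.20 = 0.024 for spam and 0.6 × 0.05 × 0.01 = 0.0003 for ham, so the rule picks spam.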
Now, we can use <a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">Maximum a Posteriori</a> (MAP) estimation to estimate both P(y) and P(x<sub>i</sub>|y). Here, P(y) = (number of samples with class y) / (total number of samples); in other words, the relative frequency of class y in the training data.<br>
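As a small sketch of the frequency estimate for P(y), the snippet below counts class labels in a made-up list of training labels. The labels are hypothetical and only serve to show the arithmetic.

```python
from collections import Counter

# Hypothetical training labels, assumed for illustration.
labels = ["spam", "ham", "ham", "spam", "ham"]

counts = Counter(labels)
total = len(labels)

# P(y) = samples with class y / total number of samples
priors = {label: count / total for label, count in counts.items()}
print(priors)
```

Here two of the five labels are spam, so the estimated prior for spam is 0.4 and for ham is 0.6.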

We can obtain several variants of naive Bayes by assuming different distributions for P(x<sub>i</sub>|y). Widely used variants are <a href="https://en.wikipedia.org/wiki/Normal_distribution">Gaussian Naive Bayes</a>, <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">Multinomial Naive Bayes</a> and <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli Naive Bayes</a>.
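In scikit-learn, choosing a variant amounts to choosing a different estimator class. The sketch below fits `MultinomialNB` (word counts) and `BernoulliNB` (word presence or absence) on a tiny word-count matrix. The matrix, the labels and the new email are hypothetical values assumed for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts per email; columns are "free", "winner", "meeting".
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 0, 3],
              [0, 1, 2]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

new_email = np.array([[1, 1, 0]])  # contains "free" and "winner" once each

# MultinomialNB models how often each word occurs.
multinomial_pred = MultinomialNB().fit(X, y).predict(new_email)

# BernoulliNB binarizes the counts and models only whether a word occurs.
bernoulli_pred = BernoulliNB().fit(X, y).predict(new_email)

print(multinomial_pred, bernoulli_pred)
```

Gaussian naive Bayes, which assumes normally distributed continuous features, is the natural choice for measurements such as those in the iris dataset.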

Now, we will use the Naive Bayes implementation in scikit-learn.

In [15]:
#we will use the iris dataset
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

#load the dataset
data = load_iris()

#fit a Gaussian naive Bayes model
model = GaussianNB()
model.fit(data.data, data.target)

#evaluate: mean accuracy on the training data
print(model.score(data.data, data.target))

#predict: scikit-learn expects a 2D array of shape (n_samples, n_features)
print(model.predict([[4.2, 3, 0.9, 2.1]])) #0 = setosa, 1 = versicolor, and 2 = virginica

0.96
[1]
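The score above is measured on the same data the model was trained on, so it can be optimistic. As a hedged sketch, the accuracy can instead be measured on held-out data. The 70/30 split and `random_state=0` are arbitrary choices assumed for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_iris()

# Hold out 30% of the samples for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

model = GaussianNB().fit(X_train, y_train)

# Mean accuracy on samples the model has not seen during training
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```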




## Support Vector Machines

<b>Support Vector Machines</b> (SVMs) are supervised learning models that can be used for both classification and regression. SVMs are among the most widely used supervised learning algorithms. They are effective in high-dimensional spaces and memory efficient, because the decision function depends only on a subset of the training points.

Consider a binary classification problem, where the task is to assign one of two labels to a given input. We plot each data item as a point in n-dimensional space as follows:

![title](./images/svm1.png)

We can perform classification by finding a hyperplane that separates the two classes well. As you can see in the above image, many such hyperplanes can be drawn. How do we find the best one? We can find the optimal hyperplane by maximizing the <b>margin</b>.
![title](./images/svm2.png)

We define the margin as the distance between the hyperplane and the sample points nearest to it. These points are known as <b>support vectors</b>. In the above figure, the support vectors are drawn with filled color.
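The margin and the support vectors can be inspected with scikit-learn's `SVC`. The following is a minimal sketch, not a full treatment: the six 2D points are a hypothetical, linearly separable toy dataset, and the linear kernel with a large `C` (to approximate a hard margin) is an assumption made for this example. For a linear SVM with weight vector w, the distance from the hyperplane to the nearest points is 1 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (two features, two classes).
X = np.array([[0, 0], [1, 0], [0, 1],
              [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a large C approximates the maximum-margin hyperplane.
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

# Distance between the hyperplane and the nearest sample points
margin = 1.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin:", margin)
print("predictions:", clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

For this toy data the maximum-margin hyperplane can be worked out by hand as x<sub>1</sub> + x<sub>2</sub> = 3.5, with a margin of 5 / (2√2) ≈ 1.77, so the printed margin should be close to that value.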