# Machine learning algorithms

## Some general definitions

- __Labels__ are the outputs a model is trained to predict; when testing or making predictions, they are the values the model produces.


## Naive Bayes

- It is a supervised learning algorithm. A supervised learning algorithm is one in which predictions are made after the model has been trained on labeled examples.

- The examples used for training the model are known as the training dataset.

- It is called ```Naive``` because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features, given the class.

- It is based on ```Bayes' law```, which states that the probability of ```A``` given ```B``` equals the probability of ```B``` given ```A``` multiplied by the probability of ```A```, divided by the probability of ```B```.

![](naive_bayes/Naive.png)
![](naive_bayes/terminology.png)

- It is used for text classification, spam filtering, sentiment analysis and classifying news articles.

![](naive_bayes/Baye's_theoram.png)
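To make the formula concrete, here is a small worked calculation. The probabilities are made-up numbers chosen purely for illustration (they do not come from the email dataset). Here `A` stands for "the email is spam" and `B` for "the email contains a particular word".

```python
# Hypothetical probabilities, for illustration only.
p_a = 0.2               # P(A): an email is spam
p_b_given_a = 0.5       # P(B|A): the word appears in a spam email
p_b_given_not_a = 0.05  # P(B|not A): the word appears in a non-spam email

# Total probability of seeing the word, P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' law: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(spam | word) = {p_a_given_b:.3f}")
```

With these assumed numbers, seeing the word raises the probability of spam from 0.2 to roughly 0.71.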

- There are a number of Naive Bayes algorithms available, out of which we implemented ```GaussianNB``` to find the original writer of the emails with the help of the given datasets. The library used for this is ```sklearn```.

- ```sklearn``` (scikit-learn) is a machine learning library which provides a number of __supervised__ and __unsupervised__ learning algorithms.

Enough theory. Now it is time for some code :)

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Create some training data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Create a Gaussian classifier
clf = GaussianNB()
# Train the model with the data
clf.fit(X, Y)
print(clf.predict([[-0.8, -1]]))
```

Output:

```
[1]
```


The ```partial_fit``` method is expected to be called several times consecutively on different chunks of a dataset, so as to implement out-of-core or online learning.

This is especially useful when the whole dataset is __too big__ to fit in memory at once.

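```python
# A partial fit classifier (reuses np, X and Y from the previous cell).
# The full list of classes must be passed on the first call.
clf_pf = GaussianNB()
clf_pf.partial_fit(X, Y, np.unique(Y))
print(clf_pf.predict([[-0.8, -1]]))
```

Output:

```
[1]
```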
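The example above passes the whole dataset in a single call. As a minimal sketch of the chunked usage described earlier, the same toy data can be fed in pieces. The chunk size and the use of `np.array_split` are illustrative assumptions; in a real out-of-core setting each chunk would be loaded from disk.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf_chunked = GaussianNB()
all_classes = np.unique(Y)  # every label must be declared on the first call

# Feed the data in three chunks of two samples each, as if each chunk
# had been loaded separately.
for X_chunk, Y_chunk in zip(np.array_split(X, 3), np.array_split(Y, 3)):
    clf_chunked.partial_fit(X_chunk, Y_chunk, classes=all_classes)

prediction = clf_chunked.predict([[-0.8, -1]])
print(prediction)
```

Note that the first chunk here contains only one class, which is why the complete list of classes has to be supplied up front.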


## Support Vector Machines (SVM)

A support vector machine, also known as an SVM, is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

In simple terms, SVMs are classifiers which find the decision boundary with the largest margin in the dataset. The margin is the distance between the decision boundary and the nearest points of each class; keeping it as large as possible makes the classifier less sensitive to noise in the dataset.

Put another way, in data classification problems an SVM provides the maximum separating margin for a linearly separable dataset. That is, of all possible decision boundaries that could be chosen to separate the dataset for classification, it chooses the decision boundary which is the most distant from the points nearest to it from both classes.

![](SVM/0_DO1oOt94TAhfoHf6.png)

There are two classes here, red and blue. The line in the middle is the decision boundary. The highlighted points are the support vectors. The distance between these points and the line is the maximum possible among all possible decision boundaries, which makes this boundary the optimal one.
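The support vectors and the margin can be inspected directly on a fitted model. The following is a minimal sketch that reuses the toy points from the Naive Bayes example (an illustrative choice, not the data in the figure) and assumes a linear kernel, for which `coef_` and `intercept_` describe the boundary `w . x + b = 0`.

```python
import numpy as np
from sklearn import svm

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = svm.SVC(kernel="linear")
clf.fit(X, y)

# Training points that lie closest to the decision boundary.
print(clf.support_vectors_)

# For a linear kernel the boundary is w . x + b = 0,
# and the width of the margin is 2 / ||w||.
w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2 / np.linalg.norm(w)
print(margin_width)
```

For this data the closest points of the two classes are `(-1, -1)` and `(1, 1)`, so the margin width should come out close to the distance between them.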

scikit-learn's SVM module provides several classifiers, such as [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC) and [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC).

SVMs are used for __classification__, __regression__ and __outlier detection__.



```python
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

# Use the SVC classifier
clf = svm.SVC()

# Train and predict in the same way as with Naive Bayes
clf.fit(X, y)
print(clf.predict([[2., 2.]]))
```


In the Udacity course exercise, the SVM was more accurate than Naive Bayes: the Naive Bayes algorithm gave an accuracy of ```88.4%```, while the linear SVM gave an accuracy of ```~92%```.
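A comparison like this can be reproduced in outline as follows. This is only a sketch on synthetic stand-in data generated with `make_classification` (an assumption, since the course's email dataset is not included here), so the printed accuracies will differ from the figures quoted above, and which classifier wins depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data; NOT the email dataset from the course.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = [("GaussianNB", GaussianNB()), ("linear SVC", SVC(kernel="linear"))]

scores = {}
for name, model in models:
    model.fit(X_train, y_train)
    # Accuracy = fraction of test samples classified correctly.
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, scores[name])
```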

![Linear SVM vs complicated decision boundary](SVM/Linear_vs_curved.png)

![](SVM/Non-linear_data.png)

__Important__

__Question)__ The first thing that comes to mind is: if SVMs separate classes using linear hyperplanes, how are they able to draw a circular boundary instead of just failing?

__Answer__) Suppose we add a new feature ``` z = x^2 + y^2 ```. For a fixed value of ```z``` this is the equation of a circle, so ```z``` measures the squared distance of a point from the origin. Now, let's plot the points on a 3D graph with ```z``` as the third axis. We can observe that the blue dots are now farther out along ```z``` than they were along ```x``` or ```y```.

Once the two classes are separated along ```z```, it is possible to draw a linear hyperplane (a flat plane in 3D) between them. Projected back onto the original ```x```-```y``` plane, that hyperplane corresponds to the circular boundary.

![](SVM/circular-data-funda.png)
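The effect of adding `z = x^2 + y^2` can be tried out on synthetic data. The sketch below uses `make_circles` as a stand-in for the data in the figure (an assumption) and compares a linear SVM on the two original features with a linear SVM on the three features. The names `points` and `labels` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points; a stand-in for the data in the figure.
points, labels = make_circles(
    n_samples=200, factor=0.3, noise=0.05, random_state=0
)

# Linear SVM on the original two features (x, y).
acc_2d = SVC(kernel="linear").fit(points, labels).score(points, labels)

# Add the new feature z = x^2 + y^2 as a third column.
z = (points ** 2).sum(axis=1)
points_3d = np.column_stack([points, z])
acc_3d = SVC(kernel="linear").fit(points_3d, labels).score(points_3d, labels)

print("accuracy with (x, y):   ", acc_2d)
print("accuracy with (x, y, z):", acc_3d)
```

Both scores are training accuracies, which is enough here to show that the classes only become linearly separable once `z` is available.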


__Ques )__ Which of the three features, if added, would make the data separable by a linear hyperplane?

![](SVM/Ques_1.png)

__Ans__ |x|

![](SVM/Ans_1.png)

It is easy to see that adding new features by hand like this is very hard. Adding a single feature requires a ton of calculations, which is not feasible in real-world applications. A simple app that you would like to integrate could have 10 or even 100 features, and adding that many features using this method is simply not practical.

To save us from this, there is something known as the __Kernel Trick__.
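In scikit-learn, using a kernel is a matter of choosing the `kernel` argument of `SVC`. The following sketch fits an RBF-kernel SVM on synthetic ring-shaped data from `make_circles` (an illustrative stand-in dataset), without adding any feature by hand.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points, not linearly separable in 2D.
points, labels = make_circles(
    n_samples=200, factor=0.3, noise=0.05, random_state=0
)

# The RBF kernel lets the SVM find a curved boundary on the raw features.
clf = SVC(kernel="rbf")
clf.fit(points, labels)

acc_rbf = clf.score(points, labels)  # training accuracy
print(acc_rbf)
```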


The maximum-margin boundary is a good result, except for the fact that, in its raw form, it is applicable only to linearly separable data. What if the data isn't linearly separable? Enter kernels. The idea is that our data, which isn't linearly separable in our 'n' dimensional space, may be linearly separable in a higher dimensional space. To understand how kernels work, some math is necessary, so brace yourselves!
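As a small preview of that math, the sketch below checks one well-known identity numerically: the degree-2 polynomial kernel `(a . b)^2` on 2D inputs gives the same value as an ordinary dot product after mapping each point into a 3D feature space. The function names `poly2_kernel` and `explicit_map` are illustrative, not part of any library.

```python
import numpy as np

def poly2_kernel(a, b):
    """Degree-2 polynomial kernel, computed in the original 2D space."""
    return np.dot(a, b) ** 2

def explicit_map(v):
    """Explicit 3D feature map whose dot product equals poly2_kernel."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

kernel_value = poly2_kernel(a, b)
explicit_value = np.dot(explicit_map(a), explicit_map(b))

# Both routes give the same number, but the kernel never builds the 3D vectors.
print(kernel_value, explicit_value)
```

This is the point of the trick: the SVM only needs dot products between points, so the kernel can supply them for the higher dimensional space without computing the new features explicitly.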

