# Machine learning algorithms

## Some general definitions

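- __Labels__ are the outputs that a model produces when testing or making predictions; in supervised learning they are also the known answers supplied with the training examples.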


## Naive Bayes

- It is a supervised learning algorithm. A supervised learning algorithm is one in which the model is first trained on labeled examples and then used to make predictions.

- The examples used for training the model are known as the training dataset.

- It is called ```Naive``` because it assumes that the occurrence of a given feature is independent of the occurrence of the other features.

- It is based on the ```Bayes law```, which states that the probability of ```B``` given ```A``` is equal to the probability of ```A``` given ```B``` multiplied by the probability of ```B``` and divided by the probability of ```A```.

![](naive_bayes/Naive.png)
![](naive_bayes/terminology.png)

- It is used for text classification, spam filtering, sentiment analysis and classifying news articles.

![](naive_bayes/Baye's_theoram.png)

- There are a number of Naive Bayes algorithms available, out of which we implemented ```GaussianNB``` to find the original writer of the emails with the help of the given datasets. The library used for this is ```sklearn```.

- ```sklearn``` is a machine learning library which provides a number of __supervised__ and __unsupervised__ learning algorithms.
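As a small worked example of the Bayes law described above, the sketch below computes the probability that an email is spam given that it contains a certain word. All the probabilities and the helper name `posterior` are made up for illustration; they do not come from the email dataset.

```python
def posterior(likelihood, prior, evidence):
    """Bayes law: P(B|A) = P(A|B) * P(B) / P(A)."""
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration
p_spam = 0.2               # P(B): an email is spam
p_word_given_spam = 0.6    # P(A|B): the word appears in a spam email
p_word_given_ham = 0.05    # the word appears in a non-spam email

# P(A): total probability of seeing the word in any email
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(B|A): probability that an email containing the word is spam
p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 3))  # 0.75
```

Even though only 20% of the emails are assumed to be spam, seeing the word raises the probability of spam to 75% under these numbers.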

Enough of theory. Now it is time for some code implementation :)

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Create some training data: six 2D points and their class labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Create a Gaussian classifier
clf = GaussianNB()
# Train the model with the data
clf.fit(X, Y)
# Predict the class of a new point
print(clf.predict([[-0.8, -1]]))
```

```
[1]
```
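To see what the trained model has actually learned, the following sketch inspects a few fitted attributes of `GaussianNB` on the same toy data. It is only an illustration: it prints the class priors, the per-class feature means and the class probabilities for the new point.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB().fit(X, Y)

# Prior probability of each class, estimated from the label counts
print(clf.class_prior_)
# Mean of every feature per class (one row per class)
print(clf.theta_)
# Probability of each class for a new point (columns follow clf.classes_)
proba = clf.predict_proba([[-0.8, -1]])
print(proba)
```

Each feature is modelled with its own Gaussian per class, which is where the independence assumption shows up in practice.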


The ```partial_fit``` method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning.

This is especially useful when the whole dataset is __too big__ to fit in memory at once.

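```python
# A partial fit classifier
# (reuses np, GaussianNB, X and Y from the previous example)
clf_pf = GaussianNB()
# All possible classes must be passed on the first call
clf_pf.partial_fit(X, Y, classes=np.unique(Y))
print(clf_pf.predict([[-0.8, -1]]))
```

```
[1]
```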
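The example above passes the whole dataset in one call. A minimal sketch of the chunked usage that ```partial_fit``` is meant for could look like the following; the chunk size of 2 and the variable names are arbitrary choices for illustration, and in a real out-of-core setting each chunk would be read from disk instead of sliced from an array.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# The full list of classes must be known up front, because a single
# chunk may not contain every class
classes = np.unique(Y)

clf_stream = GaussianNB()
chunk_size = 2  # arbitrary, for illustration
for start in range(0, len(X), chunk_size):
    X_chunk = X[start:start + chunk_size]
    Y_chunk = Y[start:start + chunk_size]
    clf_stream.partial_fit(X_chunk, Y_chunk, classes=classes)

print(clf_stream.predict([[-0.8, -1]]))
```

After all chunks have been seen, the per-class means match those obtained by fitting on the whole dataset at once.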


## Support Vector Machines (SVM)

A support vector machine, also known as an SVM, is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

In simple terms, SVMs are classifiers which find the decision boundary with the largest margin in the dataset. The margin is the distance between the decision boundary and the nearest points of each class; keeping the boundary as far as possible from those points makes it less sensitive to the noise of the dataset.

Put another way, in data classification problems an SVM provides the maximum separating margin for a linearly separable dataset. That is, of all possible decision boundaries that could be chosen to separate the dataset for classification, it chooses the decision boundary which is the most distant from the points nearest to the said decision boundary from both classes.

![](SVM/0_DO1oOt94TAhfoHf6.png)

There are two classes here, red and blue. The line in the middle is the decision boundary. The highlighted points are the support vectors. The distance between these points and the line is the maximum possible among all possible decision boundaries, which makes this boundary the optimal one.
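The idea of support vectors and the margin can be sketched in code. The example below uses a tiny made-up dataset and a linear `SVC` with a large `C` (an assumption made here so that the result approximates a hard margin on separable data), then reads back the support vectors and computes the margin width as `2 / ||w||`.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable data (made up for illustration)
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
clf = svm.SVC(kernel="linear", C=1000)
clf.fit(X, y)

# The training points that lie on the edge of the margin
print(clf.support_vectors_)

# For a linear SVM the boundary is w . x + b = 0 and the
# margin width (distance between the two margin lines) is 2 / ||w||
w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print(margin_width)
```

Only the support vectors determine the boundary; moving any of the other points slightly would leave it unchanged.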

The ```sklearn.svm``` module provides several SVM classifiers such as [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC) and [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC).

SVMs are used for __classification__, __regression__ and __outlier detection__.



```python
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

# Use the SVC classifier
clf = svm.SVC()

# Train and predict in the same way as with Naive Bayes
clf.fit(X, y)
print(clf.predict([[2., 2.]]))
```


In our comparison the SVM was more accurate than Naive Bayes: in the Udacity course exercise, the Naive Bayes algorithm gave an accuracy of ```88.4%``` while the linear SVM gave an accuracy of ```~92%```.

![Linear SVM vs complicated decision boundary](SVM/Linear_vs_curved.png)

![](SVM/Non-linear_data.png)

__Important__

__Question)__ The very first point that comes to mind is: if SVMs separate classes using linear hyperplanes, how is one able to draw a circular boundary instead of just failing?

__Answer)__ The expression which we are assuming here is ``` x^2+y^2 ```, which is constant along a circle centered at the origin. Let us add it as a new feature, ```z = x^2+y^2```, and plot the points on a 3D graph. Since ```z``` measures the squared distance from the origin, the blue dots end up with larger ```z``` values than the other points.

If all the points of one class have larger ```z``` values than the points of the other class, it is possible to draw a linear hyperplane (in 3D) which separates them, and this hyperplane corresponds to a circular boundary in the original ```x```-```y``` plane.

![](SVM/circular-data-funda.png)
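The answer above can be checked with a short experiment. The sketch below builds synthetic ring-shaped data (made up for illustration, not the data in the figures), fits a linear SVM on the original two features, then adds the feature `z = x^2 + y^2` and fits the same linear SVM again. The value `C=10` is an arbitrary choice.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 100

# Inner class: points close to the origin; outer class: a ring around it
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0.0, 0.5, size=n),
                        rng.uniform(1.0, 1.5, size=n)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Linear SVM on the original (x, y) features: no straight line can
# separate a ring from the points inside it
acc_2d = SVC(kernel="linear", C=10).fit(X, labels).score(X, labels)

# Add the new feature z = x^2 + y^2 and fit the same linear SVM
z = (X ** 2).sum(axis=1)
X_3d = np.column_stack([X, z])
acc_3d = SVC(kernel="linear", C=10).fit(X_3d, labels).score(X_3d, labels)

print("training accuracy without z:", acc_2d)
print("training accuracy with z:", acc_3d)
```

With the extra feature, a flat plane at a constant height of `z` is enough to split the two classes.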


__Ques)__ Which of the three features, if added, would make the data separable by a linear hyperplane?

![](SVM/Ques_1.png)

__Ans)__ ```|x|```

![](SVM/Ans_1.png)



The maximum-margin boundary works well, except for the fact that, in its raw form, it is applicable only to linearly separable data. What if the data isn't linearly separable? Enter kernels. The idea is that our data, which isn't linearly separable in our n-dimensional space, may be linearly separable in a higher-dimensional space.

There are a number of kernels that are supported by sklearn for SVM, as listed in the [documentation](https://scikit-learn.org/stable/modules/svm.html#kernel-functions).
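A quick way to get a feel for the kernels is to try each of them on data that is not linearly separable. The sketch below uses the synthetic `make_circles` dataset from sklearn and compares the training accuracy of the four built-in kernels; the dataset settings and `gamma=1.0` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by a straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # gamma is ignored by the linear kernel
    clf = SVC(kernel=kernel, gamma=1.0)
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)

for kernel, score in scores.items():
    print(kernel, score)
```

On this kind of data the `rbf` kernel should separate the circles much better than the `linear` kernel, because it implicitly works in a higher-dimensional space.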

## Parameters in Machine Learning

__Parameters__ are the arguments that we pass when we create a classifier. These can make a huge difference in the decision boundary of our algorithm.

### Parameters for an SVM

![](SVM/parameters.png)

- Kernel

- C: controls the tradeoff between a smooth decision boundary and classifying training points correctly.

    - A large value of C means that more training points are classified correctly, at the cost of a less smooth decision boundary.

- Gamma: defines how far the influence of a single training example reaches.

    - Low values of gamma: far

    ![](svm/low-value-svm.png)

    With a low value of gamma, the decision boundary is close to linear, since far-away points get nearly the same weight as close ones.

    - High values of gamma: close

    ![](svm/high-value-svm.png)

    With a high value of gamma, the classifier gives more weight to nearby points and ignores far-away points to some extent (we can see this in the above diagram).

    ![](svm/distortion-high-value-gamma.png)

    As a result, with a high value the boundary becomes a non-uniform curve: because more weight is put on nearby points, a few points can change the boundary by a huge margin, as seen in the above diagram.
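The effect of ```C``` and ```gamma``` described above can be observed with a small experiment. This sketch uses the synthetic `make_moons` dataset and a handful of arbitrary parameter values, all chosen for illustration, and prints how closely the RBF SVM fits the training points in each case.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Noisy two-class toy data (synthetic, for illustration only)
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Effect of gamma with C held fixed
gamma_acc = {}
for gamma in (0.01, 1, 100):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    gamma_acc[gamma] = clf.score(X, y)
    print("gamma", gamma, "training accuracy", gamma_acc[gamma])

# Effect of C with gamma held fixed
c_acc = {}
for C in (0.1, 1000):
    clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)
    c_acc[C] = clf.score(X, y)
    print("C", C, "training accuracy", c_acc[C])
```

A higher training accuracy here only means that the boundary bends more closely around the training points; it says nothing yet about how well the model predicts new data.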


### Overfitting of data

If we try to make the boundary fit the training data perfectly, we end up with a very erratic line. On the other hand, we could draw a simple line.

The case in which we get an erratic line is known as overfitting. An erratic line is not preferred even though it classifies every training sample correctly, because its predictions on new data would not be that reliable.

![](SVM/Overfitting-data.png)

Overfitting can be prevented by using the correct parameters. In the case of SVM, these parameters are ```C```, ```gamma``` and ```kernel```.

It is the artistry of machine learning to tune all the parameters well.
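Overfitting usually shows up as a gap between the accuracy on the training data and the accuracy on data the model has not seen. A minimal sketch, assuming synthetic `make_moons` data and two arbitrary parameter settings (the names `simple` and `wiggly` are just labels for this example):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Moderate parameters: a fairly smooth boundary
simple = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)
# Very large C and gamma: the boundary wraps around individual points
wiggly = SVC(kernel="rbf", C=1000.0, gamma=1000.0).fit(X_train, y_train)

simple_train = simple.score(X_train, y_train)
simple_test = simple.score(X_test, y_test)
wiggly_train = wiggly.score(X_train, y_train)
wiggly_test = wiggly.score(X_test, y_test)

print("simple: train", simple_train, "test", simple_test)
print("wiggly: train", wiggly_train, "test", wiggly_test)
```

The second model scores higher on the training data but loses accuracy on the held-out points, which is the signature of an overfitted boundary.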


### SVM strengths and weaknesses

- They work very well in complicated domains where there is a clear margin of separation.

- But they don't work very well on very large datasets, because the training time can grow as the cube of the size of the dataset.

- They don't work well with lots and lots of noise. When the classes overlap heavily and you have to count independent pieces of evidence, a Naive Bayes classifier would be better.


So again, it boils down to the dataset you have and the features that are available. If you have a really big dataset or lots and lots of features, SVMs right out of the box might be very slow and they might be prone to overfitting the data.

To decide, just test it out on the testing set (which ideally should be 10% of the available dataset) and see if it works.


Training time: the SVM is much slower to train than Naive Bayes.

Even on reducing the size of the training set, though, the model still performs well enough (99.4% accuracy with 100% of the data versus 88% with 1% of the data).

Naive Bayes is great for text--it’s faster and generally gives better performance than an SVM for this particular problem. Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art and science of machine learning. In addition to picking your algorithm, depending on which one you try, there are parameter tunes to worry about as well, and the possibility of overfitting (especially if you don’t have lots of training data).

A general suggestion is to try a few different algorithms for each problem. Tuning the parameters can be a lot of work, but sklearn offers a great tool, GridSearchCV (referred to as GridCV in the course), that can find an optimal parameter tune almost automatically.
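As a rough sketch of automated parameter tuning, the example below uses sklearn's `GridSearchCV` class on the built-in iris dataset. The dataset and the grid values are arbitrary picks for illustration, not the ones used in the course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values to try (arbitrary, for illustration)
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1],
}

# Every combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search` can be used like a normal classifier (for example `search.predict(...)`), and it uses the best parameter combination it found.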

## Decision trees

Just as SVMs use the kernel trick to go from linear to non-linear decision boundaries, decision trees use a trick that lets you do non-linear decision making with simple linear decision surfaces.

In decision trees, you find the answer by asking multiple linear questions one after the other. One way to put this is shown in the diagram below. Let's say that Sam goes surfing only when it's windy and sunny. To deal with this, we plot the data on a 2-D graph with ```Sun``` and ```Wind``` as the axes.

![](decision_trees/sun_and_wind.png)

In the above example, you first ask the model if it is ```windy```. There are only two possible answers, Yes or No. If it's a No, we can see from the graph that there is no chance that he goes surfing. On the other hand, if it's windy, then we ask another linear question, i.e. is it sunny?

### Sklearn DecisionTreeClassifier

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]

# Create and train the decision tree classifier
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# Predict the class of a new point
clf.predict([[2., 2.]])
```

```
array([1])
```
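The same classifier can be applied to the surfing example from the previous section. In this sketch the features are encoded as `[windy, sunny]` with 1 for yes and 0 for no; that encoding and the four training rows are assumptions made for illustration.

```python
from sklearn import tree

# Features: [windy, sunny], encoded as 1 = yes, 0 = no
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
# Label: 1 = Sam surfs, 0 = Sam does not surf
Y = [0, 0, 0, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, Y)

# Print the learned sequence of yes/no questions
print(tree.export_text(clf, feature_names=["windy", "sunny"]))

# windy and sunny, windy only, sunny only
predictions = clf.predict([[1, 1], [1, 0], [0, 1]])
print(predictions)  # [1 0 0]
```

The printed tree shows the two questions being asked one after the other, each of which is a simple threshold on a single feature.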

### Parameters available in DecisionTreeClassifier

- __min_samples_split__: This parameter sets the minimum number of samples a node must contain before it can be split further. Looking at the example below, we can see that ```min_samples_split``` is 2, so a node with fewer than 2 samples cannot be split further.

    ![](decision_trees/min_sample_split.png)

    A high ```min_samples_split``` (simple decision boundary) helps prevent overfitting, since a node is split only when at least that many samples are present in it. On the contrary, a low ```min_samples_split``` (complex decision boundary) lets the decision boundary stretch to cover even a single point lying outside it, which causes overfitting.

    This is the difference in accuracy on using two values of ```min_samples_split```, i.e. ```2``` and ```50```:

    ```{"message": "{'acc_min_samples_split_50': 0.912, 'acc_min_samples_split_2': 0.908}"}```

- __entropy__: Controls how a DT decides where to split the data. It is a measure of impurity in a bunch of examples.
    With two classes, the maximum value that entropy can take is 1.0. The higher it is, the more impure the sample is.
    
    ![](decision_trees/entropy_formula.png)
    
    Here, p<sub>i</sub> = fraction of examples in class ```i```
    
- __Information Gain__: The decision tree algorithm tries to __maximize the information gain__. This is how it chooses which feature to split on, and in cases where a feature can take many different values, it also helps figure out where to make the split.

    ![](decision_trees/info-gain.png)
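Entropy and information gain can be computed by hand for a small example. The sketch below assumes the usual definitions, entropy as `-sum(p_i * log2(p_i))` and information gain as the parent's entropy minus the weighted average entropy of the children; the labels and the candidate split are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """entropy(parent) minus the weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: two "slow" and two "fast" examples
parent = ["slow", "slow", "fast", "fast"]
# A candidate split that sends three examples left and one right
left, right = ["slow", "slow", "fast"], ["fast"]

parent_entropy = entropy(parent)
gain = information_gain(parent, [left, right])
print(parent_entropy)
print(round(gain, 4))
```

An evenly mixed parent has the maximum two-class entropy of 1.0, and the split is worth making only to the extent that the children are purer than the parent. In sklearn, passing `criterion="entropy"` to `DecisionTreeClassifier` selects this measure; the default criterion is Gini impurity.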
    


## Bias and Variance

- __Bias__: Bias is the set of simplifying assumptions made by a model to make the target function easier to learn.

    Generally, parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias.

    - __Low Bias__: Suggests fewer assumptions about the form of the target function.
    - __High Bias__: Suggests more assumptions about the form of the target function.
    
    
    
- __Variance__: Variance is the amount by which the estimate of the target function would change if different training data were used.

    The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, the estimate should not change too much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping between the inputs and the output variables.

    Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. This means that the specifics of the training data influence the number and types of parameters used to characterize the mapping function.

    - __Low Variance__: Suggests small changes to the estimate of the target function with changes to the training dataset.
    - __High Variance__: Suggests large changes to the estimate of the target function with changes to the training dataset.
    
    
A decision tree is a low-bias and high-variance algorithm.
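The high variance of decision trees can be observed by training the same kind of tree on two different training sets and checking how often the two trees disagree. The sketch below does this on synthetic `make_moons` data for ```min_samples_split``` values of 2 and 50; the data, the split sizes and the names used are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Two different training sets drawn from the same problem
half = len(X_train) // 2
train_sets = [(X_train[:half], y_train[:half]),
              (X_train[half:], y_train[half:])]

results = {}
for mss in (2, 50):
    trees = [DecisionTreeClassifier(min_samples_split=mss, random_state=0)
             .fit(X_part, y_part) for X_part, y_part in train_sets]
    preds = [t.predict(X_test) for t in trees]
    results[mss] = {
        # How well the first tree fits its own training set
        "train_acc": trees[0].score(*train_sets[0]),
        # Size of the first tree
        "nodes": trees[0].tree_.node_count,
        # Fraction of test points on which the two trees disagree
        "disagreement": (preds[0] != preds[1]).mean(),
    }
    print(mss, results[mss])
```

The fully grown tree (```min_samples_split``` of 2) memorizes its training set and grows many more nodes, so its predictions depend heavily on which training set it happened to see.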