# Chapter 3: Classification #

## 0. Classification Template ##

In [None]:
# Classification template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting classifier to the Training set
# Create your classifier here

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Classifier (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

## 1. Logistic Regression ##

The goal is to predict the probability $p$ that a user converts given a feature $x$, such as age.

Formula: $\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x$
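
To make the formula concrete, here is a minimal sketch that inverts the log-odds to recover the probability $p$ (the sigmoid function). The coefficients `b0` and `b1` are hypothetical values picked for illustration; they are not fitted to the dataset.

```python
import math

def probability(x, b0, b1):
    """Invert ln(p / (1 - p)) = b0 + b1 * x to get p (the sigmoid function)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_odds(p):
    """Left-hand side of the formula: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, for illustration only
b0, b1 = -4.0, 0.1
for age in (20, 40, 60):
    p = probability(age, b0, b1)
    # Applying log_odds to p gives back b0 + b1 * age
    print(age, round(p, 3), round(log_odds(p), 3))
```

With these assumed coefficients the log-odds are zero at age 40, so the predicted probability there is exactly one half.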

In [None]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

## 2. k-NN ##

Among the k nearest neighbors of a new point, how many are category 1? How many are category 2? The new point is assigned to the category with the most neighbors.
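
The counting idea can be sketched in plain Python as follows. This is a simplified illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy points are hypothetical, and the distance is Euclidean (Minkowski with `p = 2`).

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0], dict(votes)

# Hypothetical, already-scaled toy points (not taken from Social_Network_Ads.csv)
X_toy = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
y_toy = [0, 0, 0, 1, 1, 1]

label, votes = knn_predict(X_toy, y_toy, (0.8, 0.8), k=5)
print(label, votes)
```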

In [None]:
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

## 3. SVM & Kernel SVM ##

The aim of this method is to separate the two classes with a hyperplane. This hyperplane should maximize the margin, i.e. the distance to the closest instances of each class. That hyperplane is the maximum margin classifier.

Most algorithms learn from the paragon of each class: the most appley apple and the most orangey orange. They then compare each new instance to these and compute how similar it is to each 'idea' of the class.
SVMs work the other way round: they look at the most orangey apple and the most appley orange. What's different between them?

Those boundary instances are called the support vectors.
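
To illustrate the margin, the sketch below measures the distance from each point to a hyperplane and picks out the points lying exactly on the margin. It does not fit an SVM: the toy points and the hyperplane `x1 + x2 - 5.5 = 0` are hypothetical and hand-picked so that the hyperplane sits midway between the two classes.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

# Hypothetical toy data and a hand-picked (not fitted) hyperplane
apples = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
oranges = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
w, b = (1.0, 1.0), -5.5

distances = {p: signed_distance(p, w, b) for p in apples + oranges}
margin = min(abs(d) for d in distances.values())

# Points lying exactly on the margin: the "most orangey apples" and "most appley orange"
closest = [p for p, d in distances.items() if math.isclose(abs(d), margin)]
print(margin, closest)
```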

In [None]:
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

NB: The kernel does not have to be linear. A linear kernel only gives a linear decision boundary, so it does not do much more than logistic regression.

For example: a Gaussian (RBF) kernel can give a circular decision boundary. Other options include the Sigmoid kernel, the Polynomial kernel, the Laplacian kernel, etc.
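
As a rough sketch, some of these kernels can be written as plain functions of two points. The parameter names `gamma`, `degree` and `coef0` echo the options of scikit-learn's `SVC`, but the default values used here are arbitrary assumptions.

```python
import math

def linear_kernel(x, z):
    """Plain dot product."""
    return sum(xi * zi for xi, zi in zip(x, z))

def gaussian_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2), also called the RBF kernel."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """(x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """tanh(gamma * x . z + coef0)"""
    return math.tanh(gamma * linear_kernel(x, z) + coef0)

# The Gaussian kernel is 1 for identical points and decays with distance
print(gaussian_kernel((1.0, 2.0), (1.0, 2.0)), gaussian_kernel((0.0, 0.0), (3.0, 4.0)))
```

The Gaussian kernel depends only on the distance between the two points, which is why its decision boundaries can close up around a cluster.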

Some visualisations are available at https://mlkernels.readthedocs.io/en/latest/

In [None]:
# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

## 4. Naive Bayes ##

Theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}$

or, more compactly, since the denominator is just $P(B)$: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

We have: $\text{Posterior Probability} = \frac{\text{Likelihood} \times \text{Prior Probability}}{\text{Marginal Likelihood}}$

With:
- Posterior Probability = $P(A|B)$, the probability that a sample is of category A given that it is in the similar area
- Likelihood = $P(B|A)$, the probability that a sample of category A is in the area deemed similar
- Prior Probability = $P(A)$, the proportion of category A in the total population
- Marginal Likelihood = $P(B)$, the probability that a sample is in the area deemed similar to the new datapoint

What's cool is that we get to choose the size of the similar area.

NB: 
1. It is called Naïve because it assumes the features are independent; it often still gives good results when the variables are not independent
2. When comparing $P(A|B)$ and $P(\neg A|B)$, people sometimes omit the Marginal Likelihood $P(B)$ since it is the same on both sides
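
Here is a worked example of the formula, as a sketch with hypothetical counts (30 samples, 10 of category A, 4 samples in the similar area of which 3 are of category A). It also checks that comparing the numerators alone ranks the two categories the same way.

```python
def posterior(likelihood, prior, marginal):
    """Bayes' theorem: Likelihood * Prior / Marginal Likelihood."""
    return likelihood * prior / marginal

# Hypothetical counts, for illustration only
n_total = 30       # all samples
n_a = 10           # samples of category A
n_similar = 4      # samples inside the similar area around the new point
n_a_similar = 3    # samples of category A inside the similar area

marginal = n_similar / n_total            # P(B)
prior_a = n_a / n_total                   # P(A)
likelihood_a = n_a_similar / n_a          # P(B|A)
p_a = posterior(likelihood_a, prior_a, marginal)

# Same computation for "not A"
prior_not_a = (n_total - n_a) / n_total
likelihood_not_a = (n_similar - n_a_similar) / (n_total - n_a)
p_not_a = posterior(likelihood_not_a, prior_not_a, marginal)

# Both posteriors share P(B), so the numerators alone give the same ranking
score_a = likelihood_a * prior_a
score_not_a = likelihood_not_a * prior_not_a
print(p_a, p_not_a, score_a > score_not_a)
```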

In [None]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

## 5. Decision Tree Classifier ##

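Quite similar to Decision Tree Regression. Decision trees had more or less died off but came back as the basis of newer methods such as Gradient Boosting.

Feature scaling is not necessary here, since a tree splits on thresholds rather than on distances. Skipping it keeps the plot axes in the original units, but the grid step of 0.01 in the plotting template would then have to be increased.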

In [None]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

## 6. Random Forest Classifier ##

A typical example of ensemble learning.
1. Pick K data points at random from the training set
2. Build the decision tree associated with these K data points
3. Choose the number N of trees to build and repeat steps 1 and 2
4. For each new point, apply a function to the predictions of the N decision trees (e.g. a majority vote for classification, an average or median for regression)
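
The four steps can be sketched in plain Python. To keep it short, each "tree" is reduced to a single split on one feature (a decision stump), the K points are drawn with replacement, and the predictions are combined by majority vote; all three choices, the function names and the toy data are assumptions made for illustration, not what `RandomForestClassifier` does internally.

```python
import random
from collections import Counter

def fit_stump(points):
    """Step 2: fit a one-split 'tree' on (x, label) pairs, keeping the split with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(points, n_trees, k, seed=0):
    """Steps 1 and 3: draw k random points (with replacement) and fit a stump, n_trees times."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in range(k)]) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Step 4: majority vote over the individual predictions."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

# Hypothetical toy data: (age, purchased)
points = [(20, 0), (25, 0), (30, 0), (35, 0), (45, 1), (50, 1), (55, 1), (60, 1)]
forest = fit_forest(points, n_trees=25, k=8)
print(predict_forest(forest, 18), predict_forest(forest, 65))
```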

In [None]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

## 7. Evaluating a model ##

+ **A false positive (type 1 error) was predicted positive but should have been negative; a false negative (type 2 error) was predicted negative but should have been positive**
+ **Confusion Matrix**

|                    |    | **$\hat{y}$ predicted value: 0** | **$\hat{y}$ predicted value: 1** |
| ------------------ | -- | -------------------------------- | -------------------------------- |
| **y actual value** | 0  | True Negative                    | False Positive                   |
| **y actual value** | 1  | False Negative                   | True Positive                    |

    Accuracy ratio = Correct / Total
    Error ratio = Incorrect / Total

+ **Accuracy paradox**
    
    Here are two confusion matrices. The second one predicts 0 for every sample, i.e. it uses no model at all, and still has a better accuracy.
    
|                    |    | **$\hat{y}$ predicted value: 0** | **$\hat{y}$ predicted value: 1** |
| ------------------ | -- | -------------------------------- | -------------------------------- |
| **y actual value** | 0  | 9700                             | 150                              |
| **y actual value** | 1  | 50                               | 100                              |

    Accuracy ratio = 9800 / 10000 = 98%

|                    |    | **$\hat{y}$ predicted value: 0** | **$\hat{y}$ predicted value: 1** |
| ------------------ | -- | -------------------------------- | -------------------------------- |
| **y actual value** | 0  | 9850                             | 0                                |
| **y actual value** | 1  | 150                              | 0                                |

    Accuracy ratio = 9850 / 10000 = 98.5%
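
The two accuracy figures can be reproduced with a few lines of Python. The helper names are hypothetical; the true positive rate is added only to show what the second matrix gives up.

```python
def accuracy(cm):
    """cm is [[TN, FP], [FN, TP]]: rows are actual values, columns are predicted values."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

def true_positive_rate(cm):
    """Share of actual positives that were predicted positive."""
    (_, _), (fn, tp) = cm
    return tp / (tp + fn)

with_model = [[9700, 150], [50, 100]]
always_zero = [[9850, 0], [150, 0]]

print(accuracy(with_model), true_positive_rate(with_model))
print(accuracy(always_zero), true_positive_rate(always_zero))
```

Predicting 0 everywhere scores higher on accuracy while never finding a single positive, which is why accuracy alone is a poor criterion on imbalanced data.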

+ **Cumulative Accuracy Profile**
    
    Rank the customers so that the most likely buyers are contacted first: the better the ranking, the larger the area between the CAP curve and the diagonal line of the random sample. A CAP curve is not a ROC curve, although they look very similar! ROC = Receiver Operating Characteristic.
    
    **CAP:** Total Purchased = f(Total Contacted) <br>
    **ROC:** True Positive Rate = f(False Positive Rate)
    
+ **How to assess a CAP curve**

    Calculate the area between the model's curve and the random line, then divide it by the same area for the perfect model. <br>
    Or... take 50% on the horizontal axis and read the corresponding value X on the CAP curve.
    
| Value at 50%   |   Quality      |
| -------------- | -------------- |  
| X < 60%        |   Rubbish      | 
| 60% < X < 70%  |   Poor         | 
| 70% < X < 80%  |   Good         | 
| 80% < X < 90%  |   Very Good    | 
| 90% < X < 100% |   Suspect      |
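
Both ways of assessing a CAP curve can be sketched in plain Python. The labels and scores below are hypothetical, the function names are illustrative, and the areas are computed with the trapezoid rule.

```python
def cap_curve(y_true, scores):
    """CAP points: (fraction contacted, fraction of purchasers reached), best scores first."""
    order = sorted(range(len(y_true)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(y_true)
    xs, ys, captured = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        captured += y_true[i]
        xs.append(rank / len(y_true))
        ys.append(captured / total_positives)
    return xs, ys

def area_under(xs, ys):
    """Trapezoid rule."""
    return sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2 for k in range(len(xs) - 1))

def accuracy_ratio(y_true, scores):
    """Area between model and random line, divided by the same area for the perfect model."""
    xs, ys = cap_curve(y_true, scores)
    _, ys_perfect = cap_curve(y_true, y_true)  # perfect model: rank by the true labels
    random_area = 0.5                          # area under the diagonal
    return (area_under(xs, ys) - random_area) / (area_under(xs, ys_perfect) - random_area)

# Hypothetical data: 1 = purchased, with the model's score for each customer
purchased = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

xs, ys = cap_curve(purchased, scores)
value_at_half = ys[len(purchased) // 2]  # CAP value after contacting 50% of customers
print(round(accuracy_ratio(purchased, scores), 3), round(value_at_half, 3))
```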