# Classification

Unlike regression, where you predict a continuous number, classification predicts a category.

Classification models include linear models such as Logistic Regression and SVM, and non-linear ones such as K-NN, Kernel SVM, and Random Forests.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Logistic Regression

Logistic Regression is used when the dependent variable(target) is categorical.<br>
Logistic Regression is a linear classifier.<br>
Logistic Regression returns probabilities.

For example,
* Predicting whether an email is spam (1) or not (0)
* Predicting whether a tumor is malignant (1) or not (0)

Linear Function: $y = b_0 + b_1x$<br>
<font color='blue'>Logistic Function: $p = \frac{1}{1+e^{-y}}$  $\longrightarrow$  $p(x) = \frac{1}{1+e^{-(b_0 + b_1x)}}$</font>, where $p(x)$ is the predicted probability that the target equals 1 given $x$<br>
<img src='https://miro.medium.com/max/875/1*RqXFpiNGwdiKBWyLJc_E7g.png' width=600><br>
Inverse of Logistic Function (the log-odds):  $\ln\left(\frac{p(x)}{1-p(x)}\right) = b_0 + b_1x$
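
The two formulas above can be checked numerically. The following is a minimal sketch, not part of the original notebook; the coefficients `b0` and `b1` and the sample inputs are made-up values chosen only for illustration.

```python
import numpy as np

def logistic(y):
    """Map a linear score y to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def logit(p):
    """Inverse of the logistic function: the log-odds of p."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

y = b0 + b1 * x      # linear function
p = logistic(y)      # probabilities p(x)

print(p)             # every value lies strictly between 0 and 1
print(logit(p))      # recovers the linear scores b0 + b1*x
```

At `x = 0.5` the linear score is 0, so the probability is exactly 0.5, which is the point where the predicted class switches.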

In [2]:
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset.head()

FileNotFoundError: [Errno 2] File b'Social_Network_Ads.csv' does not exist: b'Social_Network_Ads.csv'

In [None]:
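from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Features: columns 2 and 3 (age and estimated salary); target: the last column
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

# Standardize both features to zero mean and unit variance
X = StandardScaler().fit_transform(X)

# Hold out 25% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)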
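
The cell above standardizes the whole dataset before splitting it. A common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the scaling. The sketch below shows that variant on synthetic data; `X_demo`, `y_demo` and the distribution parameters are assumptions for illustration, not the real `Social_Network_Ads.csv` values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (age, salary) features; not the real dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[40.0, 70000.0], scale=[10.0, 30000.0], size=(400, 2))
y_demo = (X_demo[:, 0] > 40).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)       # reuse the training statistics on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Whether this changes the results noticeably depends on the data; it mainly matters when the test set is meant to stand in for unseen data.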

### Fitting Logistic Regression

Logistic Regression is a linear classifier, which means that in two dimensions the categories are separated by a straight line.
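
To make the straight line concrete: for two features, the fitted coefficients define the boundary $w_1 x_1 + w_2 x_2 + b = 0$, and any point on it gets probability 0.5 for each class. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo` and `demo_clf` are hypothetical names that are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two synthetic, roughly separable clusters (illustrative data only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
                    rng.normal(1.0, 0.5, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

demo_clf = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)
w1, w2 = demo_clf.coef_[0]
b = demo_clf.intercept_[0]

# Pick x1 freely and solve w1*x1 + w2*x2 + b = 0 for x2 to land on the boundary
x1 = 0.3
x2 = -(b + w1 * x1) / w2
proba = demo_clf.predict_proba([[x1, x2]])
print(proba)   # both class probabilities are approximately 0.5
```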

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0,solver='lbfgs')
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
y_pred

### Evaluate the Logistic Regression (Confusion Matrix)

In [None]:
from sklearn.metrics import confusion_matrix

In scikit-learn, class names start each word with an uppercase letter (e.g. `LogisticRegression`), while function names are all lowercase (e.g. `confusion_matrix`).

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

The entries 65 and 24 are the correct predictions, so there are 89 (65 + 24) correct predictions and 11 (3 + 8) incorrect predictions.
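
The arithmetic can be reproduced from the matrix itself. This sketch hard-codes the four counts quoted above; the placement of the two off-diagonal values (3 and 8) is an assumption, since the cell output is not shown, but it does not affect the totals or the accuracy.

```python
import numpy as np

# Counts quoted in the text; the off-diagonal positions are assumed
cm_example = np.array([[65, 3],
                       [8, 24]])

correct = np.trace(cm_example)   # sum of the diagonal: correct predictions
total = cm_example.sum()         # all test samples
incorrect = total - correct
accuracy = correct / total

print(correct, incorrect, accuracy)   # 89 11 0.89
```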

### Visualising

The prediction boundary is always a straight line because the logistic regression classifier is linear<br>
In 2 dimensions, the prediction boundary is a line. In 3D, the prediction boundary is a flat plane<br>
The prediction boundary can be non-linear when we build non-linear classifiers<br>

To plot the following graph, we treat every pixel of the plotting area as a data point and apply the classifier to it. Each pixel has two coordinates: age on the horizontal axis and salary on the vertical axis.<br>
Then we use the logistic regression classifier to predict whether each pixel point belongs to class 0 or class 1.<br>
Each pixel is colored red if the prediction is 0 and green if it is 1.
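
The grid-to-predictions step can be sketched with NumPy alone. Here `toy_predict` is a hypothetical stand-in for `classifier.predict`, and the coarse grid is chosen only so the result is small enough to print.

```python
import numpy as np

def toy_predict(points):
    """Hypothetical stand-in for classifier.predict: class 1 when x1 + x2 > 0."""
    return (points[:, 0] + points[:, 1] > 0).astype(int)

x1_values = np.arange(-1.0, 1.5, 0.5)   # 5 values: -1, -0.5, 0, 0.5, 1
x2_values = np.arange(-1.0, 1.5, 1.0)   # 3 values: -1, 0, 1
X1, X2 = np.meshgrid(x1_values, x2_values)   # two (3, 5) coordinate matrices

# Flatten both matrices and pair them up: one row per grid point
grid_points = np.array([X1.ravel(), X2.ravel()]).T   # shape (15, 2)

# Predict every grid point, then restore the grid shape for contourf
Z = toy_predict(grid_points).reshape(X1.shape)

print(X1.shape, grid_points.shape, Z.shape)
print(Z)
```

`Z` has the same shape as `X1` and `X2`, which is what `plt.contourf(X1, X2, Z, ...)` expects.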

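contour  - draws contour lines<br>
contourf - draws filled contours; contourf is effectively "contour filled"<br>
Filled contours: plt.contourf(x, y, z, number of levels, cmap=colormap)<br>
Contour lines: plt.contour(x, y, z, number of levels, colors=color, linewidths=line width)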

In [None]:
from matplotlib.colors import ListedColormap    # ListedColormap helps us colorize all the data points
# Local aliases so that X_train can easily be swapped for other data later (e.g. X_test)
X_set, y_set = X_train, y_train
# np.meshgrid prepares all the pixel points; min and max are padded by 1 so the points are not squeezed against the axes
# X_set[:,0] is age and X_set[:,1] is salary in this case
# Generate the grid data
X1,X2 = np.meshgrid(np.arange(start=X_set[:,0].min()-1, stop=X_set[:,0].max()+1, step=0.01),
                    np.arange(start=X_set[:,1].min()-1, stop=X_set[:,1].max()+1, step=0.01))
# Use the contourf function to fill the two prediction regions on either side of the boundary
# X1.ravel() flattens the matrix into a 1D array
plt.contourf(X1,X2,classifier.predict(np.array([X1.ravel(),X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red','green')))
plt.xlim(X1.min(),X1.max())
plt.ylim(X2.min(),X2.max())
for i, j in enumerate(np.unique(y_set)):
    # color= takes a single RGBA tuple; c= would treat it as ambiguous and emit a warning
    plt.scatter(X_set[y_set==j,0],X_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Logistic Regression (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

## K-Nearest Neighbors (K-NN)

**STEP 1:** Choose the number K of neighbors (a common default is K = 5)<br>
**STEP 2:** Take the K nearest neighbors of the new data point, usually measured by the Euclidean distance<br>
**STEP 3:** Among these K neighbors, count the number of data points in each category<br>
**STEP 4:** Assign the new data point to the category where you counted the most neighbors<br>

Euclidean Distance:  $dist(p,q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2}$<br>

K-NN is a non-linear classifier.
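
The four steps above can be sketched from scratch in a few lines. This is an illustrative implementation on made-up points, not the scikit-learn algorithm used in the next cell; `knn_predict`, `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from collections import Counter

def knn_predict(points, labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest points."""
    # STEP 2: Euclidean distance from new_point to every known point
    distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # STEP 3: count the categories among the k nearest neighbors
    votes = Counter(labels[nearest].tolist())
    # STEP 4: return the category with the most neighbors
    return votes.most_common(1)[0][0]

# Made-up toy data: class 0 near the origin, class 1 further out
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1])

# STEP 1: choose K (3 here because the toy set is tiny)
pred_a = knn_predict(X_toy, y_toy, np.array([2.0, 2.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([6.0, 6.5]), k=3)
print(pred_a, pred_b)   # 0 1
```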

### Fitting K-NN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier as knn
classifier = knn()
# Defaults: n_neighbors=5, metric='minkowski' and p=2,
# which together are equivalent to the Euclidean distance
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
# Local aliases so that X_train can easily be swapped for other data later (e.g. X_test)
X_set, y_set = X_train, y_train
# np.meshgrid prepares all the pixel points; min and max are padded by 1 so the points are not squeezed against the axes
# X_set[:,0] is age and X_set[:,1] is salary in this case
X1,X2 = np.meshgrid(np.arange(start=X_set[:,0].min()-1, stop=X_set[:,0].max()+1, step=0.01),
                    np.arange(start=X_set[:,1].min()-1, stop=X_set[:,1].max()+1, step=0.01))
# Use the contourf function to fill the two prediction regions on either side of the boundary
# X1.ravel() flattens the matrix into a 1D array
plt.contourf(X1,X2,classifier.predict(np.array([X1.ravel(),X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red','green')))
plt.xlim(X1.min(),X1.max())
plt.ylim(X2.min(),X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set==j,0],X_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('K-NN (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

## Support Vector Machine (SVM)

SVM finds the best decision boundary to separate the space<br>

The boundary is drawn at an equal distance from the closest points of the different categories, so that the margin between them is as large as possible<br>

Those closest points in the different categories are called Support Vectors<br>

The line that separates the categories (a hyperplane in more than two dimensions) is called the Maximum Margin Hyperplane<br>

SVM focuses on the points of each category that lie closest to the other category (i.e. the apple that looks like an orange, and the orange that looks like an apple). This is the opposite of most other machine learning algorithms, because the Support Vectors themselves are very close to each other.
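
A small sketch can make the support vectors and the margin visible. It fits scikit-learn's `SVC` with a linear kernel on six made-up points; `X_toy`, `y_toy`, `svm` and the choice `C=1000.0` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up points: the closest pair across the two classes is (0, 0) and (2, 2)
X_toy = np.array([[0.0, 0.0], [-1.0, -1.0], [-2.0, 0.0],
                  [2.0, 2.0], [3.0, 3.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C keeps the fit close to a hard-margin SVM on this separable data
svm = SVC(kernel='linear', C=1000.0)
svm.fit(X_toy, y_toy)

# Only the points closest to the other class end up as support vectors
print(svm.support_vectors_)

# For a linear SVM the margin width is 2 / ||w||
w = svm.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print(margin)
```

Here the margin width should match the distance between the two closest points, since the boundary sits halfway between them.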

## Kernel SVM

## Naive Bayes

## Decision Tree Classification

## Random Forest Classification

## Evaluating Classification Models Performance