# Supervised Machine Learning Models (Classification)
- **Classification**: predicting which of a fixed set of discrete outcomes, called classes, a sample belongs to. For example, the iris species dataset ([included in sklearn]) has three classes, which makes it a multi-class classification problem. Binary classification has exactly two possible outcomes, e.g. whether a passenger on the Titanic survived ([kaggle]).
- **Regression**: predicting a continuous number within a range, i.e. a *floating-point number* (programming terms) or a *real number* (mathematical terms). An example is predicting house prices in Boston ([included in sklearn]).

## Libraries

In [102]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

#k-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier

#Logistic Regression - a type of linear model
from sklearn.linear_model import LogisticRegression

#Linear support vector classifier - a type of linear model
from sklearn.svm import LinearSVC

## Reading the data

In [103]:
pwd = os.getcwd()
data = os.path.join(pwd, "data.csv")
df = pd.read_csv(data)
df.head(3)

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |


## Feature and target selection

In [104]:
features = df[["Pclass", "Sex","Fare"]]
target = df[["Survived"]]

## Splitting Dataset

In [105]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)
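
By default `train_test_split` holds out 25% of the rows for testing, and `random_state` makes the shuffle reproducible. The following is a minimal sketch on made-up data (the `toy_X` and `toy_y` arrays are hypothetical, not the Titanic data). It shows the resulting sizes and the optional `stratify` argument, which keeps the class proportions roughly equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, 30% positive class
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([1] * 30 + [0] * 70)

# Default split: 75% of rows for training, 25% for testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_X, toy_y, random_state=42
)
print(Xa_train.shape, Xa_test.shape)

# Stratified split: both parts keep close to 30% positives
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, random_state=42, stratify=toy_y
)
print(ys_train.mean(), ys_test.mean())
```

Stratifying is worth considering when one class is much rarer than the other, since a purely random split can leave the test set with too few examples of it.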

## Preprocessing

In [106]:
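ct = ColumnTransformer([
    # One-hot encode the categorical columns; sparse=False returns a dense array
    # (in scikit-learn 1.2+ this argument is named sparse_output)
    ("onehot", OneHotEncoder(sparse=False), ["Pclass", "Sex"]),
    # Standardise Fare to zero mean and unit variance
    ("scaling", StandardScaler(), ["Fare"])
    ])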

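# Learn the categories and scaling statistics from the training set, then transform it
X_train_fit_trans = ct.fit_transform(X_train)
# Reuse the fitted transformer on the test set (transform only, no refitting)
X_test_fit_trans = ct.transform(X_test)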
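
To make the fit/transform distinction concrete, here is a rough NumPy sketch of what the two transformers compute. The fare and sex values are made up for illustration, and the sketch assumes the population standard deviation (which is what `StandardScaler` uses) and one-hot columns in sorted category order:

```python
import numpy as np

# Hypothetical fares, not taken from the Titanic file
train_fare = np.array([10.0, 20.0, 30.0])
test_fare = np.array([20.0, 40.0])

# "fit": learn the mean and standard deviation from the training set only
mean, std = train_fare.mean(), train_fare.std()

# "transform": apply the same training statistics to both sets
train_scaled = (train_fare - mean) / std
test_scaled = (test_fare - mean) / std
print(train_scaled, test_scaled)

# One-hot encoding: one 0/1 column per category, in sorted order
categories = np.array(["female", "male"])
sex = np.array(["male", "female", "male"])
sex_onehot = (sex[:, None] == categories[None, :]).astype(float)
print(sex_onehot)
```

Reusing the training statistics on the test set keeps information about the test data from leaking into preprocessing, which is why only `transform` is called on `X_test`.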

# Model Selection
- The following are some of the most common supervised machine learning models for classification.
- In sklearn, each model is imported and used in a few lines of code, with nearly identical syntax:
    - Step 1. Instantiate the model class as an object
    - Step 2. Use `.fit` to feed the training data into the object
    - Step 3. Use `.score` to find the accuracy of the trained model
- What sets the models apart is the parameters they expose for tuning and the coefficients they learn, both of which affect how well the model generalises.
- Generalisation is a model's ability to make accurate predictions on unseen data.
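
The three steps above can be sketched as one loop over several models. This example uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the Titanic features), and the `models` dictionary and the `C=1.0` settings are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 300 samples, 5 features, 2 classes
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Step 1: instantiate each model class as an object
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(C=1.0),
    "linear_svm": LinearSVC(C=1.0, max_iter=10000),
}

scores = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)                 # Step 2: train on the training data
    scores[name] = model.score(Xd_test, yd_test)  # Step 3: accuracy on unseen data
    print(name, round(scores[name], 3))
```

Scoring on the held-out test set rather than on the training set is what makes the number an estimate of generalisation.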

## k-Nearest Neighbours
- Classifies a test data point by majority vote among its nearest neighbours in the training dataset
- The higher the number of neighbours, the smoother the decision boundary and the simpler the model
- The number of neighbours, k, is set with the `n_neighbors` parameter
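
As an illustration of the idea, here is a from-scratch sketch of the prediction rule, assuming Euclidean distance and unweighted majority voting. The function `knn_predict` and the toy arrays are hypothetical and are not how sklearn implements the classifier internally:

```python
import numpy as np
from collections import Counter

def knn_predict(X_ref, y_ref, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest reference points."""
    distances = np.linalg.norm(X_ref - x_new, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                # indices of the k closest points
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical 1-D toy data: one stray class-1 point sits among class-0 points
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1, 1])

x_new = np.array([2.1])
pred_k1 = knn_predict(X_toy, y_toy, x_new, k=1)  # follows the single closest point
pred_k3 = knn_predict(X_toy, y_toy, x_new, k=3)  # majority of the three closest points
print(pred_k1, pred_k3)
```

With `k=1` the prediction follows the stray point at 2.0, while with `k=3` its two class-0 neighbours outvote it. This is the smoothing effect of a larger number of neighbours.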

In [119]:
# instantiate the knn class into an object
knn = KNeighborsClassifier(n_neighbors=5)
# feed the training data into the object
knn.fit(X_train_fit_trans, np.ravel(y_train))

KNeighborsClassifier()

In [120]:
# accuracy of the model on the test data
knn.score(X_test_fit_trans, y_test)

0.7982062780269058

# Linear Models for classification
- Not to be confused with linear regression
- Two most common models:
	- logistic regression (`LogisticRegression`)
	- linear support vector machine (`LinearSVC`)
- Both use L2 regularisation by default
- Parameter `C` sets the regularisation strength. A higher `C` means less regularisation, so the model fits the training data more closely and gives more importance to individual data points. A lower `C` pushes the coefficient vector (w) towards zero, so the model adjusts to the "majority" of data points.
- A higher `C` gives a more complex model; a lower `C` gives a less complex model
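
The effect of `C` on the coefficient vector can be checked empirically. The sketch below uses synthetic data (an assumption for illustration, not the Titanic features) and compares the length of the learned coefficient vector for three arbitrary values of `C`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 5 features, 2 classes
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_c, y_c)
    # Euclidean length of the coefficient vector w
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(coef_norms[C], 3))
```

The smallest `C` should give the shortest coefficient vector, since stronger L2 regularisation pulls the coefficients towards zero.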
## Logistic Regression

In [109]:
logreg = LogisticRegression(C=100)
logreg.fit(X_train_fit_trans, np.ravel(y_train))

LogisticRegression(C=100)

In [110]:
logreg.score(X_test_fit_trans, y_test)

0.7802690582959642

In [111]:
logreg.coef_

array([[ 0.67128411,  0.26544327, -0.89948136,  1.31842335, -1.28117734,
         0.15360815]])

In [112]:
logreg.intercept_

array([0.03725291])
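
The coefficients and intercept define the model's prediction: logistic regression passes the weighted sum `x · w + b` through the sigmoid function to get a probability for class 1 (survived). The sketch below plugs in the values printed above. The column order (three one-hot `Pclass` columns, then `Sex` female and male, then scaled `Fare`) is an assumption based on the order of transformers in the `ColumnTransformer`, and the two passenger rows are hypothetical:

```python
import numpy as np

# Coefficients and intercept as printed by logreg.coef_ and logreg.intercept_
w = np.array([0.67128411, 0.26544327, -0.89948136,
              1.31842335, -1.28117734, 0.15360815])
b = 0.03725291

def survival_probability(row):
    """Sigmoid of the linear score x . w + b."""
    z = row @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Assumed column order: Pclass_1, Pclass_2, Pclass_3, Sex_female, Sex_male, scaled Fare
# A scaled Fare of 0 corresponds to the average fare in the training set
first_class_woman = np.array([1, 0, 0, 1, 0, 0.0])
third_class_man = np.array([0, 0, 1, 0, 1, 0.0])

p_woman = survival_probability(first_class_woman)
p_man = survival_probability(third_class_man)
print(round(p_woman, 3), round(p_man, 3))
```

Under these assumptions, positive coefficients (first class, female) raise the predicted survival probability and negative ones (third class, male) lower it.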

## Linear Support Vector Machine

In [113]:
linear_svm = LinearSVC(C=10, max_iter=100000)
linear_svm.fit(X_train_fit_trans, y_train.values.ravel())

LinearSVC(C=10, max_iter=100000)

In [114]:
linear_svm.score(X_test_fit_trans, y_test)

0.7847533632286996

In [115]:
linear_svm.coef_

array([[ 0.21858849,  0.08175322, -0.29026886,  0.51807711, -0.50800427,
         0.05475173]])

In [116]:
linear_svm.intercept_

array([0.01007285])
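
A linear SVM turns these numbers into a prediction through the sign of the linear score `x · w + b`: a positive score means class 1 (survived), otherwise class 0. The sketch below uses the values printed above. As before, the column order and the two passenger rows are assumptions made for illustration:

```python
import numpy as np

# Coefficients and intercept as printed by linear_svm.coef_ and linear_svm.intercept_
w_svm = np.array([0.21858849, 0.08175322, -0.29026886,
                  0.51807711, -0.50800427, 0.05475173])
b_svm = 0.01007285

def svm_predict(row):
    """Return 1 if the linear score is positive, else 0."""
    score = row @ w_svm + b_svm
    return int(score > 0)

# Assumed column order: Pclass_1, Pclass_2, Pclass_3, Sex_female, Sex_male, scaled Fare
first_class_woman = np.array([1, 0, 0, 1, 0, 0.0])
third_class_man = np.array([0, 0, 1, 0, 1, 0.0])

pred_woman = svm_predict(first_class_woman)
pred_man = svm_predict(third_class_man)
print(pred_woman, pred_man)
```

The signs of the coefficients match those of the logistic regression model, so both linear models rank these two passengers the same way even though the magnitudes differ.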