# Supervised Learning: Classification

In classification, the label column is discrete: the outcome can take one of two or more possible values (classes).

The figure below illustrates how the outcome values differ between classification and regression, and what we are trying to achieve in each case. In classification, the outcome takes one of a small set of values (only two in the simplest case), so we are trying to find the boundary that separates the classes. In regression, we are trying to find the curve (not necessarily a straight line!) that best follows the shape of our data.

![Classification vs. regression](img/classification_vs_regression.png)

Classification problems can be grouped into:
- binary problems: is this tumor cancerous or not?
- multi-class problems: what type of animal is this?
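
To make the distinction concrete, a multi-class target can be turned into a binary one by asking a yes/no question about a single class. The snippet below is a minimal sketch with made-up labels; the animal names are illustrative and do not come from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical multi-class labels: each sample is one of three animals.
animals = np.array(["cat", "dog", "bird", "dog", "cat"])

# Multi-class problem: the target can take more than two distinct values.
print(np.unique(animals))

# Binary problem: collapse the target to a yes/no question ("is it a dog?").
is_dog = (animals == "dog").astype(int)
print(is_dog)
```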

## Dataset

We use the Iris dataset: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

Features: Sepal Length, Sepal Width, Petal Length and Petal Width

Labels: Setosa, Versicolour, and Virginica
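
The feature and label names listed above can also be read directly from scikit-learn. As a small sketch: calling `load_iris()` without `return_X_y=True` returns an object that carries these names alongside the data (the variable name `iris` is our own choice).

```python
from sklearn.datasets import load_iris

# Without return_X_y=True, load_iris returns a Bunch object that also
# carries the column names and the class names.
iris = load_iris()

print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # class names; index i corresponds to label i
print(iris.data.shape)     # (n_samples, n_features)
```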

In [1]:
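import numpy as np

from sklearn.datasets import load_iris

# Load the features (X) and the integer-encoded labels (y)
X, y = load_iris(return_X_y=True)

print(X[:5])         # first five samples
print(np.unique(y))  # distinct class labels
print(len(X))        # number of samples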

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 1 2]
150


In [2]:
from sklearn.model_selection import train_test_split

# Hold out 25% of the samples for testing; the split is random on each run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(112, 4)
(38, 4)
(112,)
(38,)
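
Because the split above is random, the scores reported later in this notebook will vary from run to run. One possible way to make the split repeatable, and to keep the three classes balanced in both subsets, is sketched below. The seed value `0` and the variable names `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices made here so as not to overwrite the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps class proportions
# roughly equal in both subsets. The seed value 0 is an arbitrary choice.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(np.bincount(y_tr))  # samples per class in the training set
print(np.bincount(y_te))  # samples per class in the test set
```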


## Algorithms

Further reading for restless souls:

Andriy Burkov, *The Hundred-Page Machine Learning Book*

### Logistic Regression

In [3]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [4]:
# Predicted class label for each test sample
lr.predict(X_test)

array([2, 0, 0, 0, 0, 1, 0, 1, 2, 0, 0, 1, 1, 1, 0, 1, 2, 1, 2, 2, 1, 1,
       1, 2, 2, 2, 0, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2, 0])

In [5]:
# Predicted probability of each class, one row per test sample
lr.predict_proba(X_test)

array([[5.71856018e-04, 1.84312183e-01, 8.15115961e-01],
       [8.10552507e-01, 1.89272604e-01, 1.74889122e-04],
       [8.78474321e-01, 1.21474505e-01, 5.11743505e-05],
       [8.97136168e-01, 1.02732673e-01, 1.31159183e-04],
       [8.28684658e-01, 1.71206990e-01, 1.08351937e-04],
       [2.61006847e-02, 7.95666221e-01, 1.78233095e-01],
       [7.89584557e-01, 2.10267051e-01, 1.48392401e-04],
       [6.84036691e-02, 7.31908730e-01, 1.99687601e-01],
       [1.96733806e-03, 2.58319180e-01, 7.39713482e-01],
       [8.86319153e-01, 1.13651561e-01, 2.92864062e-05],
       [8.52046413e-01, 1.47814111e-01, 1.39475782e-04],
       [1.12656659e-02, 6.72369225e-01, 3.16365110e-01],
       [1.56620293e-02, 6.47578662e-01, 3.36759308e-01],
       [4.59375279e-02, 5.69325553e-01, 3.84736919e-01],
       [8.45623127e-01, 1.54287498e-01, 8.93752934e-05],
       [2.44569802e-02, 5.70138341e-01, 4.05404678e-01],
       [1.41134288e-03, 3.62013051e-01, 6.36575606e-01],
       [4.47655203e-02, 7.17697
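
To see how `predict_proba` relates to `predict`, the following sketch refits a logistic regression on its own split and checks two properties: each row of probabilities sums to 1, and the predicted label is the class with the highest probability. The names `clf`, `X_tr`, `X_te` and the settings `random_state=0` and `max_iter=1000` are assumptions made for this example, not values used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# max_iter is raised above the default to give the solver room to converge.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)

# One row per sample, one column per class (ordered as in clf.classes_).
print(proba.shape)

# Each row is a probability distribution, so it sums to 1.
print(np.allclose(proba.sum(axis=1), 1.0))

# The predicted label is the class with the highest probability.
print(np.array_equal(clf.classes_[proba.argmax(axis=1)], clf.predict(X_te)))
```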

In [6]:
from sklearn.metrics import accuracy_score

accuracy_score(y_true=y_test, y_pred=lr.predict(X_test))

0.9473684210526315
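
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand and also builds a confusion matrix, which shows which classes get mixed up. The label arrays are made up for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (not the notebook's predictions).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
manual_acc = np.mean(y_true == y_pred)
print(manual_acc, accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```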

### Random Forest

In [7]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [8]:
accuracy_score(y_true=y_test, y_pred=rf.predict(X_test))

0.9736842105263158
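
A fitted random forest also exposes `feature_importances_`, which gives a rough idea of which features the trees relied on. The following is a minimal sketch; `n_estimators=100` and `random_state=0` are illustrative choices, and the model is fitted on the full dataset only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# n_estimators and random_state are illustrative choices, not tuned values.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importances: one non-negative value per feature, summing to 1.
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```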

### SVM

In [9]:
from sklearn.svm import SVC

svm = SVC()

svm.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [10]:
accuracy_score(y_true=y_test, y_pred=svm.predict(X_test))

1.0
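
A perfect score on 38 test samples depends heavily on which samples happened to land in the test set. One common way to get a steadier estimate is cross-validation, sketched below with 5 folds (the number of folds is an arbitrary choice for this example).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4/5 of the data, score on the rest,
# and repeat so that every sample is used for testing exactly once.
scores = cross_val_score(SVC(), X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```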

### K-Nearest Neighbours (kNN)

In [11]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [12]:
accuracy_score(y_true=y_test, y_pred=knn.predict(X_test))

0.9736842105263158
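
The classifier above uses the default `n_neighbors=5`. As a rough sketch of how this setting could be explored, the loop below compares a few values of k using 5-fold cross-validation. The candidate values and the `results` dictionary are our own illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The candidate values of k are arbitrary examples, not recommendations.
results = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(model, X, y, cv=5).mean()

for k, acc in results.items():
    print(f"k={k}: mean accuracy {acc:.3f}")
```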