---

## Model Comparison
#### Language: Python 3.8.8
#### Author: Tianjian Sun



---
### Table of Contents

* [Introduction](#Introduction)
* [Algorithms](#Algorithm)
    * [Classification](#ClassificationAlgorithms)
    * [Regression](#RegressionAlgorithms)
* [Applications on data sets](#Applications)
    * [Classification](#Classification)

---

### Introduction <a class="anchor" id="Introduction"></a>
In this section we compare the supervised learning models covered in class, for both classification and regression. The models compared are listed in the Algorithms section below.

Of these models, **KNN** and **Linear Regression** are loaded from scripts in the *models* folder, while the other models are imported from *sklearn*.

---

### Algorithms <a class="anchor" id="Algorithm"></a>

#### Classification <a class="anchor" id="ClassificationAlgorithms"></a>

* K-Nearest Neighbor
* Logistic Regression
* Multilayer Neural networks
* Decision Tree
* Random Forest

#### Regression <a class="anchor" id="RegressionAlgorithms"></a>

* K-Nearest Neighbor
* Linear Regression
* Multilayer Neural networks
* Decision Tree
* Random Forest

---

### Applications on data sets <a class="anchor" id="Applications"></a>

* classification: *digits* data set

#### Classification <a class="anchor" id="Classification"></a>

The *digits* data set is loaded from *sklearn*. Each data point is an 8x8 image of a handwritten digit from 0 to 9, flattened into 64 pixel features.
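
To make the data layout concrete, the short sketch below inspects the data set before any scaling. It only uses `load_digits` from *sklearn*; the variable names (`digits`, `n_features`, `classes`) are chosen here for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# images holds one 8x8 array per sample; data holds the same pixels flattened.
n_samples, height, width = digits.images.shape
n_features = digits.data.shape[1]

# The target contains the digit shown in each image.
classes = sorted(set(digits.target.tolist()))

print(n_samples, height, width, n_features)
print(classes)
```

The 64 features used in the rest of this section are the flattened pixels of each 8x8 image.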

Import modules

In [1]:
from functions.models import K_Nearest_Neighbor
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import pandas as pd

Load and scale data.

In [2]:
X, y = load_digits(return_X_y=True, as_frame=True)
X_scaler = StandardScaler()
X = pd.DataFrame(X_scaler.fit_transform(X))
X.shape

(1797, 64)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24, stratify=y)
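
The cells above fit the scaler on the full data set before splitting. A common variation, sketched here as an alternative and not as the procedure behind the reported scores, is to split first and fit `StandardScaler` on the training rows only, so that the test rows do not influence the scaling statistics. The names `scaler`, `X_train_scaled`, and `X_test_scaled` are introduced for this sketch.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Same split settings as in the notebook: 20% test, stratified by digit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24, stratify=y
)

# Learn the per-feature mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the training statistics to transform the test rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The scores reported in this section were not re-computed with this variation, so they may differ slightly if it is used.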

Instantiate the five models, then train and evaluate each one.

In [4]:
KNN = K_Nearest_Neighbor(k=10)
Logit_R = LogisticRegression()
MLP = MLPClassifier()
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()

In [5]:
KNN.fit(X_train, y_train)
KNN_y_pred = [KNN.predict(x) for x in X_test.to_numpy()]
print(f'KNN, accuracy score = {accuracy_score(y_test, KNN_y_pred)}')

KNN, accuracy score = 0.9722222222222222
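
The `K_Nearest_Neighbor` class comes from the course scripts and its source is not shown here. As a rough illustration of what such a classifier does, the sketch below assumes Euclidean distance, a majority vote among the `k` nearest training points, and a `predict` method that takes one sample at a time (matching how it is called above). The class name `SimpleKNN` and the toy data are hypothetical.

```python
from collections import Counter
import math


class SimpleKNN:
    """Hypothetical stand-in, not the K_Nearest_Neighbor class used above."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, x):
        # Euclidean distance from x to every stored training point.
        dists = [math.dist(row, x) for row in self.X_]
        # Indices of the k closest training points.
        nearest = sorted(range(len(dists)), key=dists.__getitem__)[: self.k]
        # Majority vote over their labels.
        votes = Counter(self.y_[i] for i in nearest)
        return votes.most_common(1)[0][0]


# Two well-separated toy clusters, labelled 0 and 1.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_toy = [0, 0, 0, 1, 1, 1]

knn = SimpleKNN(k=3).fit(X_toy, y_toy)
toy_preds = [knn.predict(x) for x in [[0.05, 0.05], [5.0, 5.1]]]
print(toy_preds)  # [0, 1]
```

This brute-force version computes every distance for each query, which is workable for a data set of this size but slow for large ones.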


In [6]:
Logit_R.fit(X_train, y_train)
Logit_R_y_pred = Logit_R.predict(X_test)
print(f'Logistic Regression, accuracy score = {accuracy_score(y_test, Logit_R_y_pred)}')

Logistic Regression, accuracy score = 0.975


In [7]:
MLP.fit(X_train, y_train)
MLP_y_pred = MLP.predict(X_test)
print(f'Multilayer Neural networks, accuracy score = {accuracy_score(y_test, MLP_y_pred)}')

Multilayer Neural networks, accuracy score = 0.9833333333333333


In [8]:
DT.fit(X_train, y_train)
DT_y_pred = DT.predict(X_test)
print(f'Decision Tree, accuracy score = {accuracy_score(y_test, DT_y_pred)}')

Decision Tree, accuracy score = 0.8555555555555555


In [9]:
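RF.fit(X_train, y_train)
RF_y_pred = RF.predict(X_test)  # predict with the fitted random forest, not DT
print(f'Random Forest, accuracy score = {accuracy_score(y_test, RF_y_pred)}')


Comparing the five models, KNN (k=10), Logistic Regression, and the Multilayer Neural Network perform best, each with over 97% accuracy, while the Decision Tree is clearly weaker at about 85.6%. The Random Forest score recorded in the original run (0.8555555555555555) was computed from the Decision Tree's predictions (`DT.predict`), so it only duplicates the Decision Tree score; the cell above has to be re-run to obtain the Random Forest's actual accuracy.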
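
One way to keep such a comparison consistent is to fit and score every model in a single loop, so that each prediction always comes from the estimator that was just fitted. The sketch below does this for the four *sklearn* classifiers only (the custom KNN is left out because its source is not shown). The `models` and `scores` dictionaries, the `random_state=0` arguments, and `max_iter=1000` for Logistic Regression are assumptions added here, so the scores it prints may differ from those reported above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same preprocessing and split settings as in the notebook.
X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24, stratify=y
)

# random_state and max_iter are additions for repeatable, converged runs.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multilayer Neural Network": MLPClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score the predictions of the model that was just fitted.
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from highest to lowest test accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28s} {score:.4f}")
```

Because the fit, predict, and score steps share one loop variable, adding another model only requires a new dictionary entry.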