# Model Assessment
You should build a machine learning pipeline with a complete model assessment step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Define a grid of hyperparameters for every selected model.
- Run a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with k-fold cross-validation on the training set to find the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

## Importing Modules

In [1]:
import pandas as pd
import sklearn.model_selection
import sklearn.metrics
import sklearn.svm
import sklearn.tree
import sklearn.neighbors
import plotly.express as px

## Loading the Dataset

In [2]:
df = pd.read_csv("../../datasets/mnist.csv")
df = df.set_index("id")
df.head(3)

| id | class | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31953 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34452 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 60897 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Splitting the Data into Training and Test Sets

In [3]:
x = df.drop(["class"], axis=1)
y = df["class"]

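# Default split: 75% training, 25% test. No random_state is set, so the split differs between runs.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)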
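
If a repeatable, class-balanced split is wanted, `train_test_split` accepts `random_state` and `stratify`. The snippet below is a minimal sketch on hypothetical stand-in data (`x_demo`, `y_demo` are made up for illustration, not taken from `mnist.csv`); the explicit `test_size=0.25` simply restates the library default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 400 samples, 5 features, 4 balanced classes.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 5))
y_demo = np.repeat([0, 1, 2, 3], 100)

# stratify keeps the class proportions the same in both subsets;
# random_state makes the shuffle repeatable across runs.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(x_tr.shape, x_te.shape)
print(np.bincount(y_te))  # number of test samples per class
```

Whether stratification matters for this dataset depends on how balanced its classes are, which the notebook does not report.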

## Model Selection and Hyperparameter Tuning

### Decision Tree

In [4]:
parameters_grid = {
    "criterion": ["gini", "entropy"], 
    "max_depth": range(1, 20, 3),   # [1, 4, 7, ...]
    "min_samples_split": range(2, 20, 3)
}
model_1 = sklearn.model_selection.GridSearchCV(sklearn.tree.DecisionTreeClassifier(), 
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_1.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best hyperparameter combination
print("Accuracy of best decision tree classifier = {:.2f}".format(model_1.best_score_))
print("Best found hyperparameters of decision tree classifier = {}".format(model_1.best_params_))

Accuracy of best decision tree classifier = 0.77
Best found hyperparameters of decision tree classifier = {'criterion': 'entropy', 'max_depth': 19, 'min_samples_split': 2}
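
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` also exposes `cv_results_`, which records the score of every hyperparameter combination. The sketch below shows one way to inspect it as a DataFrame. It assumes scikit-learn's small bundled digits dataset (`load_digits`) as a stand-in for `mnist.csv` and uses a deliberately tiny grid, so the numbers it prints are not comparable to the results above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's mnist.csv.
x_small, y_small = load_digits(return_X_y=True)

grid = {"criterion": ["gini", "entropy"], "max_depth": [4, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid,
                      scoring="accuracy", cv=3)
search.fit(x_small, y_small)

# cv_results_ holds one entry per hyperparameter combination.
results = pd.DataFrame(search.cv_results_)
columns = ["param_criterion", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between the top combinations is larger than the fold-to-fold variation.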


### SVM

In [5]:
parameters_grid = {
    "kernel": ["linear", "rbf", "poly"], 
    "C": [0.001, 0.01, 0.1, 1, 10, 100]
}
model_2 = sklearn.model_selection.GridSearchCV(sklearn.svm.SVC(), 
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_2.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best hyperparameter combination
print("Accuracy of best SVM classifier = {:.2f}".format(model_2.best_score_))
print("Best found hyperparameters of SVM classifier = {}".format(model_2.best_params_))

Accuracy of best SVM classifier = 0.95
Best found hyperparameters of SVM classifier = {'C': 10, 'kernel': 'rbf'}
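
The search above feeds raw pixel values to the SVM. SVMs are generally sensitive to feature scale, so a possible variation, not something this notebook does, is to put a scaler and the classifier into one `Pipeline` and grid-search over that, which refits the scaler inside each cross-validation fold. The sketch below assumes the bundled digits dataset as stand-in data and `MinMaxScaler` as the scaler; both are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's mnist.csv.
x_small, y_small = load_digits(return_X_y=True)

# make_pipeline names each step after its lowercased class name,
# so SVC's hyperparameters are addressed as "svc__<name>" in the grid.
scaled_svm = make_pipeline(MinMaxScaler(), SVC())
grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [1, 10]}

search = GridSearchCV(scaled_svm, grid, scoring="accuracy", cv=3)
search.fit(x_small, y_small)
print(search.best_params_, round(search.best_score_, 3))
```

Whether scaling changes the outcome on `mnist.csv` would have to be checked by running both variants on the same split.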


### KNN

In [6]:
parameters_grid = {
    "n_neighbors": [1, 5, 10, 15, 20], 
    "metric": ["minkowski", "euclidean", "manhattan"]  # "minkowski" with the default p=2 is the same as "euclidean"
}
model_3 = sklearn.model_selection.GridSearchCV(sklearn.neighbors.KNeighborsClassifier(),
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_3.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best hyperparameter combination
print("Accuracy of best KNN classifier = {:.2f}".format(model_3.best_score_))
print("Best found hyperparameters of KNN classifier = {}".format(model_3.best_params_))

Accuracy of best KNN classifier = 0.92
Best found hyperparameters of KNN classifier = {'metric': 'minkowski', 'n_neighbors': 5}
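
With one search per algorithm, the overall winner is the search with the highest cross-validated score. The comparison can be done by eye, as in this notebook, or programmatically. The sketch below is one way to do the latter; it assumes the bundled digits dataset as stand-in data, and the names `candidates`, `searches`, `best_name`, and `best_model` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's mnist.csv.
x_small, y_small = load_digits(return_X_y=True)

# One (estimator, grid) pair per algorithm; the grids are kept tiny for speed.
candidates = {
    "decision tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [5, 10]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [1, 5]}),
}

searches = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="accuracy", cv=3)
    searches[name] = search.fit(x_small, y_small)

# Pick the algorithm whose best configuration has the highest mean CV accuracy.
best_name = max(searches, key=lambda name: searches[name].best_score_)
# With the default refit=True, best_estimator_ is already refit on all data passed to fit().
best_model = searches[best_name].best_estimator_
print(best_name, searches[best_name].best_params_)
```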


## Testing the Best Model

In [7]:
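# GridSearchCV refits the best configuration on the whole training set (refit=True by default),
# so model_2.predict uses the SVM trained on all of x_train.
y_predicted = model_2.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
# With the default average=None, each returned array has one entry per class (digits 0 to 9).
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(y_test, y_predicted)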

print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1-Score =", f1)
print("Confusion Matrix:\n", cm)

Accuracy = 0.955
Precision = [0.97849462 0.9765625  0.94186047 0.94736842 0.95698925 0.87096774
 0.98958333 0.99065421 0.9375     0.94680851]
Recall = [1.         0.96899225 0.92045455 0.95575221 0.94680851 0.95294118
 0.95       0.95495495 0.94736842 0.94680851]
F1-Score = [0.98913043 0.97276265 0.93103448 0.95154185 0.95187166 0.91011236
 0.96938776 0.97247706 0.94240838 0.94680851]
Confusion Matrix:
 [[ 91   0   0   0   0   0   0   0   0   0]
 [  0 125   1   1   0   2   0   0   0   0]
 [  1   1  81   1   2   0   1   0   1   0]
 [  0   0   1 108   0   3   0   1   0   0]
 [  0   0   1   0  89   1   0   0   0   3]
 [  1   0   0   1   0  81   0   0   2   0]
 [  0   0   1   0   0   3  95   0   1   0]
 [  0   1   1   0   1   0   0 106   1   1]
 [  0   0   0   2   0   2   0   0  90   1]
 [  0   1   0   1   1   1   0   0   1  89]]
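
The per-class metrics above can be recomputed directly from the confusion matrix, which makes their definitions concrete. In scikit-learn's convention, rows are true classes and columns are predicted classes, so recall divides each diagonal entry by its row sum and precision divides it by its column sum. The sketch below uses plain Python on the matrix copied from the output above.

```python
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [91, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 125, 1, 1, 0, 2, 0, 0, 0, 0],
    [1, 1, 81, 1, 2, 0, 1, 0, 1, 0],
    [0, 0, 1, 108, 0, 3, 0, 1, 0, 0],
    [0, 0, 1, 0, 89, 1, 0, 0, 0, 3],
    [1, 0, 0, 1, 0, 81, 0, 0, 2, 0],
    [0, 0, 1, 0, 0, 3, 95, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 106, 1, 1],
    [0, 0, 0, 2, 0, 2, 0, 0, 90, 1],
    [0, 1, 0, 1, 1, 1, 0, 0, 1, 89],
]
n = len(cm)

total = sum(sum(row) for row in cm)            # number of test samples
correct = sum(cm[i][i] for i in range(n))      # diagonal = correct predictions
accuracy = correct / total

# Recall: correct predictions for class i over all samples truly in class i (row sum).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions for class i over all samples predicted as i (column sum).
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
# F1: harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print("Accuracy =", accuracy)  # 0.955
for digit in range(n):
    print(digit, round(precision[digit], 4), round(recall[digit], 4), round(f1[digit], 4))
```

Read this way, the weakest precision belongs to digit 5 (81 correct out of 93 predictions of 5), while digit 0 is never missed.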
