
# Model Assessment
You should build a machine learning pipeline with a complete model assessment step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Define a grid of hyperparameters for every selected model.
- Conduct [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with k-fold cross-validation on the training set to find the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
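
The selection procedure in the list above can be sketched end to end as follows. This is a minimal illustration, not the notebook's solution: it assumes scikit-learn's built-in `load_digits` data as a small stand-in for `mnist.csv`, and the `candidates` dictionary, its grids, and `random_state=0` are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_digits (8x8 digit images) stands in for mnist.csv.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hypothetical candidates: each algorithm paired with its own small grid.
candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 5, 10]}),
    "tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [4, 8, 16]}),
}

# Run a 5-fold grid search per algorithm, using the training set only.
searches = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="accuracy", cv=5)
    searches[name] = search.fit(X_train, y_train)

# Pick the algorithm with the highest mean cross-validated accuracy.
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_search = searches[best_name]

# With the default refit=True, the best estimator is already retrained on the
# whole training set, so it can be scored on the held-out test set directly.
test_accuracy = accuracy_score(y_test, best_search.predict(X_test))
print(best_name, best_search.best_params_, round(test_accuracy, 3))
```

The test set is touched only once, after the choice of algorithm and hyperparameters has been made from the cross-validation scores.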

In [None]:
import pandas as pd
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.compose
import sklearn.tree
import sklearn.metrics
import sklearn.neighbors
import sklearn.svm


**Load the mnist dataset using Pandas**

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/mnist.csv")
df


Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,36953,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1981,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,25268,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3996,6473,6,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3997,5821,7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3998,1751,9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df.head()

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,36953,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1981,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Split the dataset into training and test sets using Scikit-Learn**

In [None]:
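# Features: the 784 pixel columns; target: the digit label in `class`
x = df.drop(['id', 'class'], axis=1)
y = df['class']
# With no test_size given, train_test_split holds out 25% of the rows for testing
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)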
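
The split above is random and unseeded, so each run produces different subsets. A possible refinement, sketched here on toy data rather than on the notebook's dataframe, is to pass `random_state` for reproducibility and `stratify` to keep the class proportions equal in both subsets. The toy `features` and `labels` and the seed `42` are assumptions made for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with an 80/20 class imbalance (illustrative data only).
features = [[i] for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify keeps the class ratio in both subsets; random_state fixes the shuffle.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

train_counts = Counter(l_train)
test_counts = Counter(l_test)
print(train_counts, test_counts)
```

Both subsets keep the 4:1 ratio of the full label list, which matters most when some classes are rare.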

In [None]:
print('df_size:', df.shape)
print('x_train_size:', x_train.shape)
print('x_test_size:', x_test.shape)
print('y_train_size:', y_train.shape)
print('y_test_size:', y_test.shape)


df_size: (4000, 786)
x_train_size: (3000, 784)
x_test_size: (1000, 784)
y_train_size: (3000,)
y_test_size: (1000,)


In [None]:
x_train.dtypes

pixel1      int64
pixel2      int64
pixel3      int64
pixel4      int64
pixel5      int64
            ...  
pixel780    int64
pixel781    int64
pixel782    int64
pixel783    int64
pixel784    int64
Length: 784, dtype: object

In [None]:
print(x_train.isnull())

      pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  \
1555   False   False   False   False   False   False   False   False   False   
2701   False   False   False   False   False   False   False   False   False   
2313   False   False   False   False   False   False   False   False   False   
2534   False   False   False   False   False   False   False   False   False   
491    False   False   False   False   False   False   False   False   False   
...      ...     ...     ...     ...     ...     ...     ...     ...     ...   
1351   False   False   False   False   False   False   False   False   False   
1543   False   False   False   False   False   False   False   False   False   
3012   False   False   False   False   False   False   False   False   False   
2562   False   False   False   False   False   False   False   False   False   
2789   False   False   False   False   False   False   False   False   False   

      pixel10  ...  pixel775  pixel776 
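
Printing the full boolean mask is hard to read for 784 columns. A more compact check aggregates the mask into counts. The sketch below uses a small hypothetical frame (`frame`, with one deliberately missing value) instead of `x_train`.

```python
import pandas as pd

# Hypothetical frame with one missing pixel value, for illustration only.
frame = pd.DataFrame({"pixel1": [0, 255, None], "pixel2": [0, 0, 128]})

missing_per_column = frame.isnull().sum()      # one count per column
total_missing = int(missing_per_column.sum())  # single number for the frame
has_missing = bool(frame.isnull().values.any())

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
print("total missing:", total_missing)
```

Applied to the training features, the same idea would be `x_train.isnull().sum().sum()`, which returns a single integer.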

**Data Preprocessing**

In [None]:
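# All pixel columns are integers, so every feature is selected for scaling
numerical_attributes = x_train.select_dtypes(include=['int64', 'float64']).columns

# Standardize the numerical columns; any other column would pass through unchanged
ct = sklearn.compose.ColumnTransformer([
        ('standard_scaling', sklearn.preprocessing.StandardScaler(), numerical_attributes)],
        remainder='passthrough')

# Fit the scaler on the training set only, then apply the same scaling to both sets
ct.fit(x_train)
x_train = ct.transform(x_train)
x_test = ct.transform(x_test)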
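
One caveat with scaling before the grid search: the scaling statistics are computed from all training rows, including rows that later serve as validation folds during cross-validation. A common alternative is to place the scaler and the classifier in a `Pipeline`, so the scaler is refit inside each fold. The sketch below assumes `load_digits` as a stand-in dataset; the step names `scaler` and `knn` are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for the notebook's mnist.csv.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler becomes a pipeline step, so each CV fold fits it on that
# fold's training part only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hyperparameters of a step are addressed as "<step name>__<parameter>".
grid = {"knn__n_neighbors": [1, 5, 10]}
search = GridSearchCV(pipeline, grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

For standard scaling the effect on the scores is usually small, but the pipeline form keeps the cross-validation estimate free of information from the validation folds.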

**Decision Tree**

In [None]:
parameters_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(1, 20, 3),         # 1, 4, 7, ..., 19
    "min_samples_split": range(2, 20, 3)  # 2, 5, 8, ..., 17
    }
# 5-fold grid search over every combination, using all available CPU cores
model_1 = sklearn.model_selection.GridSearchCV(sklearn.tree.DecisionTreeClassifier(), parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_1.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Accuracy of the best decision tree classifier = {:.2f}".format(model_1.best_score_))
print("Best found hyperparameters of decision tree classifier = {}".format(model_1.best_params_))


Accuracy of the best decision tree classifier = 0.74
Best found hyperparameters of decision tree classifier = {'criterion': 'entropy', 'max_depth': 19, 'min_samples_split': 11}
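
Beyond `best_score_` and `best_params_`, a fitted grid search exposes the full table of results and the refitted estimator. The sketch below shows one way to inspect them; it assumes `load_digits` as a stand-in dataset and a smaller grid than the one used above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_digits stands in for the notebook's training data.
X, y = load_digits(return_X_y=True)

grid = {"criterion": ["gini", "entropy"], "max_depth": [4, 10, 16]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), grid, scoring="accuracy", cv=5
)
search.fit(X, y)

# cv_results_ holds one entry per hyperparameter combination.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# best_estimator_ is the tree refit on all data passed to fit().
best_tree = search.best_estimator_
print("depth:", best_tree.get_depth(), "leaves:", best_tree.get_n_leaves())
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top-ranked combination is meaningfully better than the runners-up.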



**SVM**

In [None]:
parameters_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],  # kernel type
    "C": [0.001, 0.01, 0.1, 1, 10, 100],             # regularization parameter
    "gamma": ["scale", "auto"]                       # kernel coefficient for "rbf", "poly", and "sigmoid"; ignored by "linear"
    }
model_2 = sklearn.model_selection.GridSearchCV(sklearn.svm.SVC(), parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_2.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Accuracy of the best SVM classifier = {:.2f}".format(model_2.best_score_))
print("Best found hyperparameters of SVM classifier = {}".format(model_2.best_params_))

Accuracy of the best SVM classifier = 0.92
Best found hyperparameters of SVM classifier = {'C': 100, 'gamma': 'scale', 'kernel': 'poly'}
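
Because `gamma` has no effect on the linear kernel, the single-dictionary grid evaluates each linear model twice. `GridSearchCV` also accepts a list of dictionaries, which allows `gamma` to be varied only where it matters. The sketch below only counts the candidates with `ParameterGrid` and fits nothing; the names `full_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The single-dictionary grid from the cell above: every combination is tried.
full_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
}

# Alternative: a list of dictionaries, so gamma is varied only for the
# kernels that use it.
split_grid = [
    {"kernel": ["linear"], "C": [0.001, 0.01, 0.1, 1, 10, 100]},
    {
        "kernel": ["rbf", "poly", "sigmoid"],
        "C": [0.001, 0.01, 0.1, 1, 10, 100],
        "gamma": ["scale", "auto"],
    },
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 48 42
```

With 5-fold cross-validation, the split grid saves 6 candidates, or 30 model fits, without removing any distinct model from the search.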


**KNN**

In [None]:
parameters_grid = {
    "n_neighbors": [1, 5, 10, 15, 20],
    # "minkowski" with the default p=2 is the same distance as "euclidean"
    "metric": ["minkowski", "euclidean", "manhattan"]
    }
model_3 = sklearn.model_selection.GridSearchCV(sklearn.neighbors.KNeighborsClassifier(), parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_3.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Accuracy of the best KNN classifier = {:.2f}".format(model_3.best_score_))
print("Best found hyperparameters of KNN classifier = {}".format(model_3.best_params_))

Accuracy of the best KNN classifier = 0.89
Best found hyperparameters of KNN classifier = {'metric': 'manhattan', 'n_neighbors': 1}
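
The three metrics in the KNN grid are closely related: the Minkowski distance of order `p` reduces to the Manhattan distance for `p = 1` and to the Euclidean distance for `p = 2`, which is the default `p` in `KNeighborsClassifier`. The pure-Python sketch below illustrates this on two made-up vectors; the helper `minkowski` is written for this example and is not a scikit-learn function.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = [0, 3, 7]
b = [4, 0, 7]

manhattan = sum(abs(x - y) for x, y in zip(a, b))               # 7
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # 5.0

print(minkowski(a, b, 1), manhattan)
print(minkowski(a, b, 2), euclidean)
```

Since `"minkowski"` with the default `p` and `"euclidean"` describe the same distance, the grid effectively compares two metrics, not three.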


**Testing the Best Model**
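
Accuracy, precision, recall, and F1 can all be derived from a confusion matrix, which makes it a useful cross-check for the reported numbers. The sketch below uses a small hypothetical 3-class matrix, not the notebook's results, with rows as true classes and columns as predicted classes.

```python
# Hypothetical 3-class confusion matrix: rows are true classes,
# columns are predicted classes (the convention scikit-learn uses).
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Precision of class k: diagonal entry divided by its column sum
# (everything predicted as k).
precision = [
    cm[k][k] / sum(cm[i][k] for i in range(n_classes)) for k in range(n_classes)
]
# Recall of class k: diagonal entry divided by its row sum
# (everything that truly is k).
recall = [cm[k][k] / sum(cm[k]) for k in range(n_classes)]
# F1 is the harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print(accuracy, precision, recall, f1)
```

The same arithmetic applies to a 10-class digit matrix: the diagonal sum divided by the total gives accuracy, and column and row sums give per-class precision and recall.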

In [None]:
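# model_2 (SVM) had the highest cross-validated accuracy: 0.92, versus 0.89 for KNN and 0.74 for the decision tree.
# GridSearchCV refits the best estimator on the whole training set by default (refit=True), so predict() uses that refitted model.
y_predicted = model_2.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
# Without an `average` argument, precision, recall, and F1 are arrays with one value per class
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(y_test, y_predicted)
cr = sklearn.metrics.classification_report(y_test, y_predicted)
print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1-Score =", f1)
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", cr)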

Accuracy = 0.93
Precision = [1.         0.96031746 0.98850575 0.94736842 0.8952381  0.89361702
 0.97959184 0.95145631 0.81308411 0.88421053]
Recall = [0.96774194 0.98373984 0.89583333 0.91836735 0.96907216 0.93333333
 0.88888889 0.89090909 0.94565217 0.90322581]
F1-Score = [0.98360656 0.97188755 0.93989071 0.93264249 0.93069307 0.91304348
 0.93203883 0.92018779 0.87437186 0.89361702]
Confusion Matrix:
 [[ 90   0   0   0   0   1   0   0   2   0]
 [  0 121   0   2   0   0   0   0   0   0]
 [  0   1  86   0   3   0   1   1   4   0]
 [  0   1   1  90   0   1   0   0   5   0]
 [  0   0   0   0  94   0   0   0   1   2]
 [  0   0   0   3   0  84   1   0   1   1]
 [  0   2   0   0   2   5  96   0   3   0]
 [  0   0   0   0   3   0   0  98   1   8]
 [  0   1   0   0   0   3   0   1  87   0]
 [  0   0   0   0   3   0   0   3   3  84]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.98        93
           1       0.96      0