# Model Training and Evaluation
You should build a machine learning pipeline with a complete model training and evaluation step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Define a grid of hyperparameters for every selected model.
- Conduct [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [random search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) using k-fold cross-validation on the training set to find the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
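
The list above names random search as an alternative to grid search. The following is a minimal sketch of that alternative, not part of the assignment solution. It assumes scikit-learn's bundled `load_digits` data as a stand-in for `mnist.csv`, and the names `X_demo`, `y_demo`, `param_distributions`, and `search` are illustrative.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 8x8 digit images bundled with scikit-learn,
# used here only so the sketch runs without mnist.csv.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each entry is either a list to sample from or a scipy distribution.
param_distributions = {
    "n_neighbors": randint(1, 15),  # integers from 1 to 14 inclusive
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# n_iter fixes how many parameter settings are sampled and cross-validated.
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=8,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_tr, y_tr)

print("Best CV accuracy:", search.best_score_)
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_te, y_te))
```

Random search evaluates a fixed number of sampled settings, so its cost is controlled by `n_iter` rather than by the size of the grid.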

In [1]:
from google.colab import files
uploaded = files.upload()

Saving mnist.csv to mnist.csv


In [2]:
import pandas as pd

df = pd.read_csv("mnist.csv")
df.head()

      id  class  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  ...  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784
0  31953      5       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
1  34452      8       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
2  60897      5       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
3  36953      0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
4   1981      3       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0


In [5]:
df.columns

Index(['id', 'class', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
       'pixel6', 'pixel7', 'pixel8',
       ...
       'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779', 'pixel780',
       'pixel781', 'pixel782', 'pixel783', 'pixel784'],
      dtype='object', length=786)

In [7]:
X = df.drop(['id','class'], axis=1)
y = df["class"]

Data exploration

In [8]:
print(df.shape)
print(df.info())
print(df.describe())
print(df.isnull().sum())

(4000, 786)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Columns: 786 entries, id to pixel784
dtypes: int64(786)
memory usage: 24.0 MB
None
                 id        class  pixel1  pixel2  pixel3  pixel4  pixel5  \
count   4000.000000  4000.000000  4000.0  4000.0  4000.0  4000.0  4000.0   
mean   34415.179250     4.439500     0.0     0.0     0.0     0.0     0.0   
std    20508.890104     2.879655     0.0     0.0     0.0     0.0     0.0   
min       17.000000     0.000000     0.0     0.0     0.0     0.0     0.0   
25%    16575.750000     2.000000     0.0     0.0     0.0     0.0     0.0   
50%    34435.500000     4.000000     0.0     0.0     0.0     0.0     0.0   
75%    52111.500000     7.000000     0.0     0.0     0.0     0.0     0.0   
max    69998.000000     9.000000     0.0     0.0     0.0     0.0     0.0   

       pixel6  pixel7  pixel8  ...     pixel775     pixel776     pixel777  \
count  4000.0  4000.0  4000.0  ...  4000.000000  4000.000000  4000.00

In [9]:
print(y.value_counts())

class
1    486
7    426
3    417
8    416
6    391
2    390
0    376
4    369
9    366
5    363
Name: count, dtype: int64
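
The raw counts above are easier to compare as proportions, and the `describe()` output suggests that some pixel columns never change. A minimal sketch of both checks follows, assuming a hypothetical miniature frame `toy` with the same layout as `mnist.csv`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame (assumption), mimicking the mnist.csv layout.
toy = pd.DataFrame({
    "class":  [1, 1, 7, 3, 7, 1],
    "pixel1": [0, 0, 0, 0, 0, 0],        # constant column: no information
    "pixel2": [0, 12, 255, 0, 34, 0],
})

# Relative class frequencies instead of raw counts.
class_share = toy["class"].value_counts(normalize=True).sort_index()
print(class_share)

# Pixel columns that hold a single value across all rows.
pixels = toy.drop(columns="class")
constant_cols = pixels.columns[pixels.nunique() == 1].tolist()
print("Constant columns:", constant_cols)
```

On the real data, the same two expressions can be applied to `y` and `X` respectively.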


Split Train and Test

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Preprocessing

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
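
One caveat with scaling before the search: the scaler above is fitted on the whole training set, so during cross-validation each validation fold has already contributed to the scaling statistics. A possible alternative, sketched here under the assumption that `load_digits` stands in for `mnist.csv`, is to put the scaler and the classifier in a `Pipeline` so the scaler is refitted inside every fold. The names `pipe`, `pipe_grid`, and `pipe_search` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled digits dataset.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler becomes a pipeline step, so it is fitted on the training part
# of each fold only and then applied to that fold's validation part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
pipe_grid = {"knn__n_neighbors": [3, 5, 7]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring="accuracy")
pipe_search.fit(X_tr, y_tr)

print("Best params:", pipe_search.best_params_)
print("Test accuracy:", pipe_search.score(X_te, y_te))
```

With this layout the unscaled training and test sets are passed in directly, and no separate `fit_transform` call is needed.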

KNN

In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Full grid, used later in the combined search over all models.
param_grid_knn = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"]
}

# Quick preliminary search over n_neighbors only, with 3-fold cross-validation.
param_grid = {
    "n_neighbors": [3, 5, 7]
}

grid = GridSearchCV(knn, param_grid, cv=3)

grid.fit(X_train, y_train)

Decision Tree

In [14]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)

param_grid_dt = {
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"]
}

Gradient Boosting

In [16]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=42)

param_grid_gb = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5]
}

Grid Search with k-Fold Cross-Validation

In [None]:
from sklearn.model_selection import GridSearchCV

models = [
    ("KNN", knn, param_grid_knn),
    ("Decision Tree", dt, param_grid_dt),
    ("Gradient Boosting", gb, param_grid_gb)
]

best_models = {}

for name, model, param_grid in models:
    grid = GridSearchCV(
        model,
        param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1
    )

    grid.fit(X_train, y_train)

    best_models[name] = grid
    print(f"{name} Best Score:", grid.best_score_)
    print(f"{name} Best Params:", grid.best_params_)
    print("------------------------------------------------")

KNN Best Score: 0.8943749999999999
KNN Best Params: {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
------------------------------------------------
Decision Tree Best Score: 0.764375
Decision Tree Best Params: {'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 2}
------------------------------------------------
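
Beyond `best_score_` and `best_params_`, a fitted search also exposes `cv_results_`, which holds the score of every parameter combination. A minimal sketch of turning it into a ranked table follows. It assumes `load_digits` as stand-in data and a small illustrative decision tree grid; `demo_grid` and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled digits dataset.
X_demo, y_demo = load_digits(return_X_y=True)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [5, 10, None], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort.
results = pd.DataFrame(demo_grid.cv_results_)
summary = (
    results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

The `std_test_score` column shows how much the score varies across folds, which helps when two settings have nearly equal means.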


In [None]:
best_model = max(best_models.items(), key=lambda x: x[1].best_score_)

print("Best Algorithm:", best_model[0])
print("Best CV Score:", best_model[1].best_score_)

In [None]:
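# GridSearchCV (refit=True by default) has already refitted the best estimator
# on the whole training set; the explicit fit below repeats that step.
final_model = best_model[1].best_estimator_

final_model.fit(X_train, y_train)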

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = final_model.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
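
The cell above imports `confusion_matrix` but does not call it. As a small sketch of what that metric reports, here it is on made-up labels together with a macro-averaged F1 score; `y_true_demo` and `y_pred_demo` are hypothetical values, not results from the model.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels (assumption), chosen so the matrix is easy to check by hand.
y_true_demo = [0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Macro averaging gives every class the same weight regardless of its size.
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")
print("Macro F1:", macro_f1)
```

On the real test set, the same calls would take `y_test` and `y_pred`; off-diagonal entries then show which digits are confused with each other.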