# Loading Data

1. Load mnist_test.csv from https://www.kaggle.com/datasets/oddrationale/mnist-in-csv?select=mnist_test.csv as data.

2. Split data into X and y. X should have shape (10000, 784) and y should have shape (10000, 1).

3. Split X and y into a train set (80%) and a test set (20%). The train set is for fitting your model, while the test set is for evaluating it. As a result, X_train.shape should be (8000, 784), y_train.shape (8000, 1), X_test.shape (2000, 784), and y_test.shape (2000, 1). (Hint: use sklearn.model_selection.train_test_split.)

In [35]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

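# Read the CSV; the first column holds the digit label, the remaining 784 columns hold pixel values
data_pd = pd.read_csv('mnist_test.csv')
data = np.array(data_pd)

X = data[:,1:]  # pixel features
y = data[:,0]   # digit labels (1-D)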
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(10000, 784)
(10000,)
(8000, 784)
(2000, 784)
(8000,)
(2000,)
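The task statement asks for y with shape (10000, 1), while the printed shapes show a 1-D y of shape (10000,). The sketch below illustrates the difference on a tiny made-up array (`data_demo` is a hypothetical stand-in for the CSV contents, not the real data): an integer column index drops the axis, while a slice keeps it.

```python
import numpy as np

# Hypothetical stand-in for the CSV: 5 rows, label in column 0, then 4 "pixel" columns
data_demo = np.arange(25).reshape(5, 5)

X_demo = data_demo[:, 1:]
y_flat = data_demo[:, 0]    # integer index drops the axis -> shape (5,)
y_col = data_demo[:, :1]    # slice keeps the axis         -> shape (5, 1)

print(X_demo.shape, y_flat.shape, y_col.shape)

# ravel() converts the column form back to the 1-D form
assert np.array_equal(y_col.ravel(), y_flat)
```

scikit-learn classifiers generally expect the 1-D form, so the (10000,) shape produced by `data[:,0]` is the one that works directly with `fit`.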


## Fitting and Evaluating Your Model
1. Use sklearn.linear_model.RidgeClassifier to fit X_train and y_train to get a multi-class classification model. 
2. Test your model on (X_test, y_test) and get testing accuracy by using clf.score(X_test, y_test) assuming your model is named as "clf".

In [36]:
from sklearn.linear_model import RidgeClassifier

clf = RidgeClassifier()
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)


0.849
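As a rough illustration of what `clf.score` reports, the sketch below checks that it matches the fraction of correct predictions. It uses scikit-learn's bundled `load_digits` data (8x8 images, 64 features) as a small stand-in for the MNIST CSV; that substitution and the `_demo` names are assumptions made for the example only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: the small digits set that ships with scikit-learn (not MNIST)
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf_demo = RidgeClassifier().fit(Xtr, ytr)

# score() is the mean accuracy: the share of test rows predicted correctly
manual_acc = np.mean(clf_demo.predict(Xte) == yte)
score_acc = clf_demo.score(Xte, yte)

# For multi-class targets there is one decision score per class;
# the predicted label is the class with the largest score
scores_demo = clf_demo.decision_function(Xte)
argmax_pred = clf_demo.classes_[np.argmax(scores_demo, axis=1)]

print(manual_acc, score_acc, scores_demo.shape)
```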


## Optimizing the RidgeClassifier
1. In sklearn.linear_model.RidgeClassifier, the argument "alpha" sets the regularization strength. By default, alpha is 1. A large alpha has both benefits and drawbacks: the larger the alpha, the more strongly the coefficients are shrunk and the more likely the model is to underfit, so a higher alpha does not necessarily mean better results. On the other hand, a low alpha may lead to overfitting and a more complicated model.

  More information: https://towardsdatascience.com/preventing-overfitting-with-lasso-ridge-and-elastic-net-regularization-in-machine-learning-d1799b05d382

  Please try different values of alpha to train your model and evaluate its test accuracy. Note: do not sweep narrow ranges such as 1-10 or 1-50; those values are too similar. (Hint: you'll want to try alphas that are different powers of 10.) Out of the values you chose, which alpha is the best choice for MNIST classification?

2. Instead of fitting the full dimension (784) of data to the RidgeClassifier, you can apply PCA to the data (PCA over X with the shape 10000*784) to reduce the dimension from 784 to 100 (for example) and train another RidgeClassifier with 100-dimension features. 

  Typically, we want the explained variance to be between 95–99% (which is what we would set n_components to). With alpha=1, iterate through the array [0.95, 0.96, ..., 0.99] and set n_components to the current value. Each time, print the shape of X_reduced; the second value in the tuple is the number of components kept. For example, (10000, 168) means 168 components are left. For further explanation, see the scikit-learn PCA documentation:
  > if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
 
  Then, set n_components to 784 and run again. I should see the results from 0.95-0.99 variance as well as the result of running n_components = 784 in your answer. Then answer this question: what is the best reduced dimension number of components to get the highest test accuracy? (Hint: after applying PCA to 10000 samples, remember to split train and test.)
  
  More information on variance and PCA:  https://stackoverflow.com/questions/32857029/python-scikit-learn-pca-explained-variance-ratio-cutoff
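To make the quoted rule concrete, here is a minimal sketch of how a fractional n_components maps to a component count. It uses scikit-learn's bundled `load_digits` data (64 features) as a stand-in for MNIST, which is an assumption of this example; `svd_solver="full"` is passed explicitly to match the quoted condition.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 1797 samples with 64 features (not the MNIST CSV)
X_demo, _ = load_digits(return_X_y=True)

# Fit once with all components to get the full cumulative explained-variance curve
full_pca = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

kept = {}
for target in [0.95, 0.99]:
    pca_demo = PCA(n_components=target, svd_solver="full").fit(X_demo)
    k = pca_demo.n_components_   # number of components actually kept
    kept[target] = k
    # The first k components together reach the requested share of variance
    print(f"target={target}: kept {k} components, explained {cumulative[k - 1]:.4f}")
```

The count is the smallest number of leading components whose cumulative explained variance exceeds the requested fraction, which is why the kept dimension grows as the fraction approaches 1.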


In [37]:
# 1 

# Track the best test accuracy and the alpha that produced it
acc_best = 0
alpha_best = None
for alpha in [10, 100, 1000, 10000, 100000]:
  clf = RidgeClassifier(alpha=alpha)
  clf.fit(X_train, y_train)

  # Check classifier's accuracy on the test set
  test_acc = clf.score(X_test, y_test)
  print(f"For {alpha}, acc: {test_acc}")

  # Update the best choice of alpha according to the test accuracy
  if test_acc > acc_best:
    acc_best = test_acc
    alpha_best = alpha
    clf_best = clf

# Report the stored best alpha, not the loop variable (which always holds the last value tried)
print(f"Best choice of alpha is {alpha_best} with acc: {acc_best}")



For 10, acc: 0.8495
For 100, acc: 0.851
For 1000, acc: 0.85
For 10000, acc: 0.851
For 100000, acc: 0.856
Best choice of alpha is 100000 with acc: 0.856
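A compact way to sweep powers of 10 and keep each score paired with its alpha is sketched below. It runs on scikit-learn's bundled `load_digits` data as a stand-in for MNIST (an assumption of this example), so the scores will not match the ones printed above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (not the MNIST CSV)
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Seven alphas spaced by powers of 10: 0.01, 0.1, ..., 10000
alphas = np.logspace(-2, 4, num=7)

# Map each alpha to its test accuracy so the best alpha can be looked up afterwards
scores = {
    float(a): RidgeClassifier(alpha=a).fit(Xtr, ytr).score(Xte, yte)
    for a in alphas
}
best_alpha = max(scores, key=scores.get)

for a, s in scores.items():
    print(f"alpha={a:g}: acc={s:.4f}")
print(f"best alpha: {best_alpha:g}")
```

Storing the scores in a dictionary keeps the best alpha separate from the loop variable, so the final report cannot accidentally show the last value tried.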


In [40]:
# 2 
from sklearn.decomposition import PCA

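# Fractions in (0, 1) ask PCA to keep enough components to explain that share of the variance;
# the integer 784 keeps every component.
for n_components in [0.95, 0.96, 0.97, 0.98, 0.99, 784]:
  pca = PCA(n_components=n_components)
  data_reduced = pca.fit_transform(X)
  print(data_reduced.shape)

  # Split after PCA, reusing the same seed so the rows match the earlier split
  X_train, X_test, y_train, y_test = train_test_split(data_reduced, y, test_size=0.2, random_state=42)

  clf = RidgeClassifier(alpha=1)
  clf.fit(X_train, y_train)

  # Check classifier's accuracy on the test set
  test_acc = clf.score(X_test, y_test)
  print(f"For {n_components}, acc: {test_acc}")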

print("(10000,253) is the best reduced dimension number of components to get the highest test accuracy of 86.35%")



(10000, 149)
For 0.95, acc: 0.8615
(10000, 174)
For 0.96, acc: 0.8605
(10000, 207)
For 0.97, acc: 0.863
(10000, 253)
For 0.98, acc: 0.8635
(10000, 323)
For 0.99, acc: 0.8595
(10000, 784)
For 784, acc: 0.849
(10000,253) is the best reduced dimension number of components to get the highest test accuracy of 86.35%
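The cell above fits PCA on all 10000 rows before splitting, as the hint suggests. An alternative ordering, shown here only as a sketch, splits first and lets a Pipeline fit PCA on the training rows alone, so the test rows do not influence the projection. The example uses scikit-learn's bundled `load_digits` data as a stand-in for MNIST; that choice and the variable names are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data (not the MNIST CSV)
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# PCA is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(
    PCA(n_components=0.95, svd_solver="full"),
    RidgeClassifier(alpha=1),
)
pipe.fit(Xtr, ytr)

n_kept = pipe.named_steps["pca"].n_components_
acc = pipe.score(Xte, yte)   # test rows are projected with the train-fitted PCA
print(f"kept {n_kept} components, test acc {acc:.4f}")
```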


## Confusion Matrix and Classification Report
1. Read https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html to understand how to use a confusion matrix. Based on what you learned there, can you plot the confusion matrix? (Hint: use clf.predict(X_test) to get the predicted labels for X_test.)

2. Read https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report to understand how to use a classification report. Based on what you learned there, can you output the classification report? Which label has the lowest precision?
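The link between the two outputs can be sketched on a tiny hand-made example (the labels below are invented for illustration). In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, so per-class precision is the diagonal divided by the column sums and recall is the diagonal divided by the row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Made-up labels for three classes (illustration only)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_hat = [0, 0, 1, 1, 2, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_hat)   # rows: true label, columns: predicted label
print(cm)

precision = np.diag(cm) / cm.sum(axis=0)   # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / everything truly in that class

# Label with the lowest precision
worst_label = int(np.argmin(precision))
print(precision, recall, worst_label)

# Same per-class precision as scikit-learn's own metric
assert np.allclose(precision, precision_score(y_true, y_hat, average=None))
```

For an actual plot, recent scikit-learn versions also provide `ConfusionMatrixDisplay.from_predictions`, which requires matplotlib.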

In [39]:
from sklearn.metrics import confusion_matrix, classification_report

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[193   0   1   2   0   3   3   0   1   0]
 [  0 211   2   0   0   0   2   0   1   0]
 [  3  10 171   4   4   2   8   3   6   2]
 [  1   7   4 178   1   5   1   5   2   4]
 [  0   1   5   1 184   2   6   0   3  13]
 [  2   2   2  10   6 125   5   3  12   7]
 [  6   1   1   1   3   5 180   2   1   0]
 [  1   8   1   1   7   0   0 161   0   8]
 [  1   6   5  11   5   6   2   7 140   3]
 [  4   3   0   3  15   1   0  15   2 155]]
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       203
           1       0.85      0.98      0.91       216
           2       0.89      0.80      0.84       213
           3       0.84      0.86      0.85       208
           4       0.82      0.86      0.84       215
           5       0.84      0.72      0.77       174
           6       0.87      0.90      0.88       200
           7       0.82      0.86      0.84       187
           8       0.83      0.75      0.79       186
           9       0.81     