# Task 2: Optimize the code to improve the accuracy using the given tutorial

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [24]:
train_df = pd.read_csv('MNIST/mnist_train.csv')
test_df = pd.read_csv('MNIST/mnist_test.csv')

print('Train df shape: ', train_df.shape,'\n', 'Test df shape: ', test_df.shape)

Train df shape:  (60000, 785) 
 Test df shape:  (10000, 785)


### Data Preprocessing (Normalization & Standardization)

In [25]:
# Setting up and preparing the training and testing DataFrames:
# the 'label' column is the target; the remaining 784 columns are pixel values.

X_train = train_df.drop(columns='label')
y_train = train_df['label']
X_test = test_df.drop(columns='label')
y_test = test_df['label']

print('X_train shape is: ', X_train.shape)
print('y_train shape is: ', y_train.shape)
print('X_test shape is: ', X_test.shape)
print('y_test shape is: ', y_test.shape)

X_train shape is:  (60000, 784)
y_train shape is:  (60000,)
X_test shape is:  (10000, 784)
y_test shape is:  (10000,)


In [26]:
# Checking the Min & Max values in the training data
train_min = X_train.to_numpy().min()
train_max = X_train.to_numpy().max()

print('X_train Min. Value = ', train_min)
print('X_train Max. Value = ', train_max)

X_train Min. Value =  0
X_train Max. Value =  255


> NORMALIZATION

Normalization is the process of rescaling data values to fit within the range 0 to 1. Putting all features on a common scale often helps the model and can improve its accuracy. Mathematically, normalization can be written in two slightly different forms:

            Formula (1): Xnormalized = (X - Xminimum) / range of X        ---> where the range of X = Max value - Min value

            Formula (2): Xnormalized = a + ( ((X - Xminimum) * (b - a)) / range of X)

Formula (2) is used when the data needs to fall within a custom range with minimum (a) and maximum (b); Formula (1) is the special case a = 0, b = 1.
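As a minimal sketch of the two formulas above, the helper below (the name `min_max_scale` and the sample `pixels` array are illustrative assumptions, not part of the notebook) implements Formula (2) and recovers Formula (1) with its default arguments:

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Rescale x linearly so its minimum maps to a and its maximum to b."""
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    value_range = x_max - x_min
    if value_range == 0:
        # Constant input: every value maps to the lower bound a
        return np.full_like(x, a)
    return a + (x - x_min) * (b - a) / value_range

# Hypothetical pixel intensities on the 0..255 scale
pixels = np.array([0, 51, 102, 255])
print(min_max_scale(pixels))             # Formula (1): range [0, 1]
print(min_max_scale(pixels, -1.0, 1.0))  # Formula (2): custom range [-1, 1]
```

Because the minimum pixel value here is 0 and the maximum is 255, the default call gives the same result as dividing by 255.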

> STANDARDIZATION:

Standardization rescales the data using its own statistics, which makes it easier to see how far apart or how close the data points are. For example, we can plot the distribution of the data points on a scatterplot to look for abnormalities or high variation (outliers); an analyst can then uncover information that these outliers might otherwise hide. We then apply a mathematical transformation to put the data on a common scale, so that further analysis can be carried out more accurately.

Mathematically:
There are several methods for rescaling data, such as the Z-score, feature clipping, and log scaling. The Z-score tells us how many standard deviations (SD) a data point lies above or below the mean (u). It is calculated using the following equation:

                       Z-Score of X = (X - u) / SD
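The Z-score equation can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the scikit-learn implementation; the function name `z_score` and the small `X_demo` array are assumptions made for the example. Like `StandardScaler` with default settings, it uses the population standard deviation and works column by column:

```python
import numpy as np

def z_score(X):
    """Standardize each column of X to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    std = X.std(axis=0)   # population SD (ddof=0)
    std[std == 0] = 1.0   # constant columns become 0 instead of dividing by zero
    return (X - mean) / std

# Hypothetical data: three samples, three features (the last one is constant)
X_demo = np.array([[1.0, 10.0, 5.0],
                   [2.0, 20.0, 5.0],
                   [3.0, 60.0, 5.0]])
Z = z_score(X_demo)
print(Z.mean(axis=0))  # column means after standardization
print(Z.std(axis=0))   # column standard deviations after standardization
```

The guard for constant columns matters for MNIST-like data, where border pixels can be 0 in every image.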

In [27]:
# Conversion to float
X_train = X_train.astype('float32')  
X_test = X_test.astype('float32')

# Normalization: rescales the values of X_train and X_test to the range 0 to 1.
# -- for that reason, we converted all integer values to float values in the step above.
# Here we use Formula (1): the range = 255 - 0 = 255, and since Xmin = 0, X - Xmin is simply X.
#--- That is why we divide by the range of 255.

X_train = X_train/255.0   
X_test = X_test/255.0

# Here we are using the Z-score formula for Standardization:  Z-Score of X = (X - u) / SD
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

In [35]:
standardized_X_test.shape

(10000, 784)

In [63]:
from sklearn.cluster import MiniBatchKMeans
total_clusters = len(np.unique(y_test))
# Initialize the K-Means model
kmeans = MiniBatchKMeans(n_clusters = total_clusters)
# Fitting the model to the normalized training set (X_train); standardized_X is not used here
kmeans.fit(X_train)



In [64]:
kmeans.labels_

array([8, 2, 0, ..., 8, 9, 5])

In [65]:
def retrieve_info(cluster_labels,y_train):
  """Associate the most probable label with each cluster in the KMeans model.

  Returns a dictionary mapping each cluster index to a digit label.
  """
  reference_labels = {}
  # Loop over every cluster index present in cluster_labels
  for i in range(len(np.unique(cluster_labels))):
    mask = cluster_labels == i
    # Most frequent true label among the points assigned to cluster i
    num = np.bincount(y_train[mask]).argmax()
    reference_labels[i] = num
  return reference_labels

In [82]:
reference_labels = retrieve_info(kmeans.labels_,y_train)
# Translate every cluster index into its associated digit label
number_labels = np.array([reference_labels[c] for c in kmeans.labels_])

In [83]:
print(reference_labels)

{0: 2, 1: 5, 2: 6, 3: 1, 4: 0, 5: 4, 6: 9, 7: 5, 8: 9, 9: 8, 10: 0, 11: 8, 12: 2, 13: 7, 14: 9, 15: 6, 16: 7, 17: 3, 18: 3, 19: 2, 20: 0, 21: 8, 22: 1, 23: 7, 24: 1, 25: 0, 26: 4, 27: 0, 28: 3, 29: 3, 30: 3, 31: 7, 32: 5, 33: 5, 34: 0, 35: 4, 36: 6, 37: 2, 38: 4, 39: 5}


In [68]:
# Comparing Predicted values and Actual values
print(number_labels[:20].astype('int'))
print(y_train[:20])

[3 0 4 1 7 2 1 3 1 4 3 1 8 6 1 4 2 4 6 4]
0     5
1     0
2     4
3     1
4     9
5     2
6     1
7     3
8     1
9     4
10    3
11    5
12    3
13    6
14    1
15    7
16    2
17    8
18    6
19    9
Name: label, dtype: int64


In [69]:
# Calculating accuracy score on the training set
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(y_train, number_labels))

0.5557166666666666


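As you can see, the accuracy rate has gone up from 11.3% to 55.6%. That is an increase of about 44 percentage points, which is a big jump. We attribute the better performance of our algorithm to the preprocessing steps introduced above. Note, however, that the model was fit on `X_train`, which is only normalized to the range 0 to 1; `standardized_X` was computed but not passed to `kmeans.fit`, so this run does not isolate the effect of the standardization (Z-score) step.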

But, how does the Z-score actually help?

K-means clustering works as follows. After the initialization step (randomly selecting the first centroids), the algorithm assigns every data point to its nearest centroid. It then re-evaluates each cluster: it calculates the mean of the points in the cluster and makes that mean the new centroid. It does this for each of the 10 clusters. It then repeats the same process iteratively, recalculating the distance between every data point in the dataset and the new centroids to check whether any point is now closer to a different centroid. The ultimate goal is not only to group points into clusters based on their distance from the centroid, but to do so while minimizing the total squared distance between the points and their centroids.

This is where the Z-score comes in handy. K-means relies on Euclidean distance, so features with a large variance dominate the distance calculation. The Z-score expresses every value as a number of standard deviations from its feature's mean, which puts all features on the same scale. This can speed up clustering and may improve the accuracy of the predictions, although the size of the effect depends on the data.
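The loop described above can be sketched directly in NumPy. This is a plain K-means loop written for illustration, not what `MiniBatchKMeans` does internally (which updates centroids from small random batches); the function name `lloyd_kmeans`, the fixed iteration count, and the toy two-blob data are all assumptions:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain K-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Final assignment and total squared distance (often called inertia)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()
    return centroids, labels, inertia

# Toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

centroids, labels, inertia = lloyd_kmeans(X_demo, k=2)
print(centroids)
print(inertia)
```

The quantity returned as `inertia` is the total squared distance that the algorithm tries to reduce at every iteration.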

Also, it's worth mentioning that the Z-score rescales the data so that each feature has a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, so the result is a standard normal distribution only if the data were normally distributed to begin with. When that holds, it is easier to calculate the probability of certain values occurring. In any case, it becomes easier to compare data sets that have different means and standard deviations.
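A quick check of what standardization does and does not change, using a deliberately skewed synthetic sample (the exponential sample and the `skewness` helper are assumptions made for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Strongly right-skewed sample, clearly not normally distributed
skewed = rng.exponential(scale=2.0, size=10_000)

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

standardized = (skewed - skewed.mean()) / skewed.std()

print(standardized.mean(), standardized.std())       # mean and SD after the Z-score
print(skewness(skewed), skewness(standardized))      # skewness before and after
```

The mean and standard deviation move to 0 and 1, while the skewness is the same before and after, so the shape of the distribution is preserved.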

## But can we do even better than 55% accuracy?? 

The other optimization technique that we will explore is changing the number of clusters.
The hypothesis is that, given that every other element of the code is correct, more clusters make it easier for the algorithm to achieve its goal of reducing the distance between the data points and the centroids, and hence the variability within each cluster. Therefore, we predict better performance and a higher accuracy rate as we increase the number of clusters.

So, we are just going to run the code again with one simple adjustment: increasing the number of clusters to 40 (an arbitrary choice). Let's see how we do.
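The first half of the hypothesis, that more clusters reduce the distance to the centroids, can be illustrated without running K-means at all. In the sketch below the partitions are made by hand on synthetic data (the `wcss` helper and the splits are assumptions for illustration): each partition refines the previous one, and the within-cluster sum of squares drops each time.

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

one = np.zeros(200, dtype=int)                # a single cluster
two = (X[:, 0] > 0).astype(int)               # split on the first feature
four = two + 2 * (X[:, 1] > 0).astype(int)    # split each half on the second feature

print(wcss(X, one), wcss(X, two), wcss(X, four))
```

A lower sum of squares does not by itself guarantee a higher accuracy, which is why the accuracy is measured again after the change.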

In [78]:
from sklearn.cluster import MiniBatchKMeans
# Initialize the K-Means model with 40 clusters instead of 10
kmeans = MiniBatchKMeans(n_clusters = 40)
# Fitting the model to the normalized training set (X_train)
kmeans.fit(X_train)

print('\nKmeans Labels: ', kmeans.labels_)

def retrieve_info(cluster_labels,y_train):
  """Associate the most probable label with each cluster in the KMeans model.

  Returns a dictionary mapping each cluster index to a digit label.
  """
  reference_labels = {}
  # Loop over every cluster index present in cluster_labels
  for i in range(len(np.unique(cluster_labels))):
    mask = cluster_labels == i
    # Most frequent true label among the points assigned to cluster i
    num = np.bincount(y_train[mask]).argmax()
    reference_labels[i] = num
  return reference_labels

reference_labels = retrieve_info(kmeans.labels_,y_train)
# Translate every cluster index into its associated digit label
number_labels = np.array([reference_labels[c] for c in kmeans.labels_])




Kmeans Labels:  [ 7 27 26 ...  7 15 35]


In [79]:
print(reference_labels)

{0: 2, 1: 5, 2: 6, 3: 1, 4: 0, 5: 4, 6: 9, 7: 5, 8: 9, 9: 8, 10: 0, 11: 8, 12: 2, 13: 7, 14: 9, 15: 6, 16: 7, 17: 3, 18: 3, 19: 2, 20: 0, 21: 8, 22: 1, 23: 7, 24: 1, 25: 0, 26: 4, 27: 0, 28: 3, 29: 3, 30: 3, 31: 7, 32: 5, 33: 5, 34: 0, 35: 4, 36: 6, 37: 2, 38: 4, 39: 5}


In [80]:
# Comparing Predicted values and Actual values
print(number_labels[:10].astype('int'))
print(y_train[:10])

[5 0 4 1 4 2 1 3 1 4]
0    5
1    0
2    4
3    1
4    9
5    2
6    1
7    3
8    1
9    4
Name: label, dtype: int64


In [81]:
# Calculating accuracy score on the training set
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(y_train, number_labels))

0.7841166666666667


And as we can see, the accuracy rate has gone up by about another 23 percentage points, from 55.6% to 78.4%, which supports our hypothesis.
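The accuracies reported here are measured on the training set. A minimal sketch of how the same cluster-to-label mapping could be applied to held-out data is shown below, using toy stand-ins rather than MNIST. The helpers `majority_label_map` and `nearest_centroid`, the centroids, and the tiny arrays are all assumptions for illustration; with the fitted scikit-learn model, `kmeans.predict(X_test)` would play the role of `nearest_centroid`.

```python
import numpy as np

def majority_label_map(cluster_ids, y_true):
    """Map each cluster id to the most frequent true label among its members."""
    return {int(c): int(np.bincount(y_true[cluster_ids == c]).argmax())
            for c in np.unique(cluster_ids)}

def nearest_centroid(X, centroids):
    """Assign each row of X to the index of its closest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Toy stand-ins for fitted centroids and labelled data (not MNIST)
centroids = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0]])
X_tr = np.array([[0.1, 0.0], [0.6, 0.4], [5.1, 4.9], [4.8, 5.2], [0.4, 0.6]])
y_tr = np.array([7, 7, 3, 3, 7])
X_te = np.array([[0.2, 0.1], [5.0, 5.1], [0.5, 0.4]])
y_te = np.array([7, 3, 7])

# Learn the cluster-to-label mapping from the training data only
mapping = majority_label_map(nearest_centroid(X_tr, centroids), y_tr)
# Apply it to the held-out points
pred = np.array([mapping[c] for c in nearest_centroid(X_te, centroids)])
accuracy = (pred == y_te).mean()
print(mapping, accuracy)
```

As with the 40-cluster run, several clusters can map to the same label: here clusters 0 and 1 both represent the label 7.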