# This notebook covers machine learning techniques for solving the computer vision problem given in the problem statement. It was run as a Jupyter notebook on Google Colab, which provides hosted compute (including GPUs) and makes computationally expensive tasks easier to run

Setting up Google Colab

In [137]:
!pip install PyDrive



In [0]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Importing the necessary libraries

In [0]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Importing all classifier libraries

In [0]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm 
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

Downloading the training images from Google Drive. The id field refers to the ID of the file in Google Drive

In [0]:
download = drive.CreateFile({'id': ''})
download.GetContentFile('train_image.pkl')
with open('train_image.pkl', 'rb') as f:
    train_images = pickle.load(f)

Downloading the training labels from Google Drive. The id field refers to the ID of the file in Google Drive

In [0]:
download = drive.CreateFile({'id': ''})
download.GetContentFile('train_label.pkl')
with open('train_label.pkl', 'rb') as f:
    train_labels = pickle.load(f)

Exploration of training data

In [145]:
len(train_images[0])
 

784

In [0]:
type(train_images)

list

In [0]:
# getting a glimpse of the training data
# the index is derived from the number of labels instead of a hard-coded 8000
data = pd.DataFrame({'label': train_labels,
                     'index': range(len(train_labels))})

In [0]:
data.head()

|   | index | label |
|---|-------|-------|
| 0 | 0     | 0     |
| 1 | 1     | 0     |
| 2 | 2     | 0     |
| 3 | 3     | 0     |
| 4 | 4     | 0     |


Assigning the label values to a variable y. y holds the class label that should be predicted for each image

In [0]:
y=data['label'].values


In [0]:
len(y)

8000

In [0]:
type(y)

numpy.ndarray

Converting the training images into a NumPy array to check their shape

In [0]:
X = np.array(train_images)

In [147]:
X.shape

(8000, 784)

Dividing the data into a training set and a hold-out validation set (called the cross validation set in this notebook). This is important because we can measure each algorithm's performance on the validation set and experiment to improve it. Once we have our best performance, we use that algorithm on the test set. Evaluation on the test set is done only once.

In [0]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42, test_size=0.2)
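When no separate test file is available, the same idea extends to a three-way split. The sketch below is only an illustration on synthetic data: the array sizes, the four classes, the `stratify` argument and the 60/20/20 proportions are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the image data: 1000 samples, 784 features, 4 classes.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 256, size=(1000, 784))
y_demo = np.repeat(np.arange(4), 250)

# First carve out a held-out test set (20%), to be used only once at the end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Then split the remainder into training (60% overall) and validation (20% overall).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(X_tr.shape, X_va.shape, X_test.shape)
```

Passing `stratify` keeps the class proportions roughly equal across the splits, which matters most when some classes are rare.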

## Model 1 --> *K-Nearest Neighbour*. This can be used for both classification and regression problems. Since our problem involves multiclass classification, we can start with this classifier

The main parameter of the K-Nearest Neighbour classifier is the number of neighbours (n_neighbors). We can experiment with different values, starting with 2

In [0]:
model = KNeighborsClassifier(n_neighbors=2)

In [152]:
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=2, p=2,
           weights='uniform')

Testing the accuracy on the training set and the validation set. Checking the training set is also important, since comparing the two accuracies tells us whether the model is overfitting, underfitting, or a good fit

In [153]:
# predict returns the predicted class for each input sample
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.88390625
Accuracy on Validation Set 0.774375
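Since the same two accuracy lines are repeated for every model, they could be wrapped in a small helper. This is a sketch only: `report_accuracy` is a hypothetical name, and scikit-learn's bundled 8x8 digits dataset stands in for the notebook's images.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def report_accuracy(model, X_train, y_train, X_val, y_val):
    """Return (train_acc, val_acc, gap) for an already fitted classifier."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    return train_acc, val_acc, train_acc - val_acc


# Stand-in data (an assumption): 8x8 digit images shipped with scikit-learn.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=2).fit(X_tr, y_tr)
train_acc, val_acc, gap = report_accuracy(knn, X_tr, y_tr, X_va, y_va)
print("train={:.3f} val={:.3f} gap={:.3f}".format(train_acc, val_acc, gap))
```

A large positive gap suggests overfitting, while low accuracy on both sets suggests underfitting.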


In [154]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [155]:
# predict returns the predicted class for each input sample
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.88546875
Accuracy on Validation Set 0.794375


In [156]:
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform')

In [157]:
# predict returns the predicted class for each input sample
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.861875
Accuracy on Validation Set 0.789375


Hence n_neighbors=3 performs best among the values tried (2, 3 and 4) for the K-Nearest Neighbour classifier
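The manual comparison of neighbour counts can also be written as a loop over candidate values. The sketch below uses scikit-learn's bundled digits dataset as a stand-in, so its numbers will differ from the ones in this notebook; the range 1 to 10 is an arbitrary choice.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption): 8x8 digit images shipped with scikit-learn.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Validation accuracy for each candidate number of neighbours.
val_scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_scores[k] = accuracy_score(y_va, knn.predict(X_va))

best_k = max(val_scores, key=val_scores.get)
print("best n_neighbors:", best_k, "validation accuracy:", val_scores[best_k])
```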

## Model 2 --> *Support Vector Machines*. These can also be used for both classification and regression problems. They are used heavily in classification because of advantages such as support for custom kernels and effectiveness in high-dimensional spaces

#### The first SVM variant uses C-Support Vector Classification (SVC), named for its C parameter, which acts as an inverse regularization strength. A large C penalizes misclassified training points more heavily, so the decision boundary bends to fit the training data, outliers included; a small C tolerates more margin violations and gives a smoother boundary
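For reference, the role of C is easiest to see in the standard soft-margin objective. This is the textbook form for the binary case, not something taken from this notebook; the symbols are assumptions: training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, $m$ training examples, a feature map $\phi$ induced by the kernel, and slack variables $\xi_i$.

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{subject to}\quad
y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

The larger C is, the more each margin violation $\xi_i$ costs relative to the margin term $\frac{1}{2}\lVert w\rVert^{2}$, which is why large values push the model towards fitting every training point.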

In [0]:
# by default the kernel is the Gaussian (RBF) kernel and gamma is the kernel coefficient; gamma shouldn't be too large, as that
# leads to overfitting. It is set to 'scale' here; setting it to 'auto' can lead to overfitting on unscaled data like this
# C is 1.0 by default
model = svm.SVC(gamma='scale')
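To see why the two gamma settings behave so differently on raw pixel values, the sketch below computes both by hand. It assumes the definitions used by current scikit-learn releases (`'auto'` is `1 / n_features`, `'scale'` is `1 / (n_features * X.var())`; older releases used a different formula for `'scale'`) and uses random 8-bit values as a stand-in for the images.

```python
import numpy as np

# Synthetic stand-in for unscaled 8-bit pixel data: 200 images, 784 features.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 784)).astype(float)

n_features = X_demo.shape[1]
gamma_auto = 1.0 / n_features                    # what gamma='auto' resolves to
gamma_scale = 1.0 / (n_features * X_demo.var())  # what gamma='scale' resolves to

print("gamma='auto' :", gamma_auto)
print("gamma='scale':", gamma_scale)
```

On data with a large variance, `'scale'` yields a much smaller gamma, which makes the RBF kernel smoother and less prone to memorizing individual training points.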

In [0]:
# training the model
model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### The Gaussian kernel is expected to perform best here. As a rule of thumb, it suits problems where there is no huge difference between the number of features n (784) and the number of training examples m (6400), as is the case here; the linear and polynomial kernels tend to perform well when there is a huge difference between n and m
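As a reminder of what the Gaussian (RBF) kernel computes, the sketch below evaluates `exp(-gamma * ||a - b||^2)` by hand on a few random points and compares the result with scikit-learn's `rbf_kernel`. The data and the gamma value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))  # 5 arbitrary points with 3 features each
gamma = 0.5                  # arbitrary kernel coefficient

# K[i, j] = exp(-gamma * ||a_i - a_j||^2), computed by broadcasting.
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

K_sklearn = rbf_kernel(A, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))
```

Each entry lies in (0, 1] and equals 1 only for identical points, so the kernel behaves like a similarity score that decays with distance.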

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.87875
Accuracy on Validation Set 0.8225


The above model provides good accuracy, but let's see whether it can be better by tuning the parameters

### Now, setting C to a larger value. A larger C penalizes training errors more heavily, so the decision boundary fits the training points (outliers included) more closely. However, setting C too large (a relative term; the right value has to be found by experiment) may cause overfitting

In [0]:
model = svm.SVC(C=10, gamma='scale')

In [0]:
# training the model
model.fit(X_train, y_train)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.9803125
Accuracy on Validation Set 0.84125


C=10 is close to overfitting the training set (training accuracy 0.98 against 0.84 on validation), but it also gives better validation accuracy, so let's try one more value, C=5

In [0]:
model = svm.SVC(C=5, gamma='scale')
# training the model
model.fit(X_train, y_train)

SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.95171875
Accuracy on Validation Set 0.84


#### With C=5, training accuracy drops by about 0.03 (0.980 to 0.952), which narrows the gap to the validation set, while validation accuracy drops only slightly (0.84125 to 0.84). So we can go with C=5, since we have to account for overfitting as well
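The manual search over C could also be done with `GridSearchCV`, which is imported at the top of this notebook but not used. The sketch below is one possible setup, not the procedure followed here: the digits dataset is a stand-in, and the 3-fold cross-validation and the candidate list are assumptions.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (an assumption): 8x8 digit images shipped with scikit-learn.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Try each C with 3-fold cross-validation on the training split only.
param_grid = {"C": [1, 5, 10]}
search = GridSearchCV(svm.SVC(gamma="scale"), param_grid, cv=3)
search.fit(X_tr, y_tr)

print("best C:", search.best_params_["C"])
print("mean CV accuracy:", search.best_score_)
print("validation accuracy:", search.score(X_va, y_va))
```

Because the search scores every candidate on several folds, the chosen C depends less on one particular train/validation split.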

Let's test a linear SVM, here through LinearSVC. A linear SVM is similar to logistic regression (both learn a linear decision boundary), hence we don't need to test logistic regression separately. As described above, this algorithm is expected to give lower accuracy, but let's confirm

In [0]:
model = svm.LinearSVC()

In [0]:
model.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.85625
Accuracy on Validation Set 0.74625


As expected, the linear model gives lower accuracy
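The similarity between a linear SVM and logistic regression can be checked by fitting both on the same data. The sketch below is an illustration under stated assumptions: the digits dataset is a stand-in, and the `StandardScaler` step, `dual=False` and `max_iter=1000` are additions that are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data (an assumption): 8x8 digit images shipped with scikit-learn.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Scaling is an addition here; it usually helps both linear solvers converge.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(dual=False))
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = {}
for name, clf in [("LinearSVC", linear_svm), ("LogisticRegression", log_reg)]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_va, y_va)
    print(name, "validation accuracy:", scores[name])
```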

The polynomial kernel is also expected to give lower accuracy than the Gaussian kernel, as described above

In [0]:
model = svm.SVC(C=5, kernel='poly', gamma='scale')

In [0]:
model.fit(X_train, y_train)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.97546875
Accuracy on Validation Set 0.819375


### Model 2 covered several sub-models (different kernels and parameter values); in the end we go with the Gaussian kernel with C=5

## Model 3 --> *Naive Bayes* classifier. This applies Bayes' theorem, under the assumption that the features are independent given the class, to predict the class.
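For reference, the decision rule behind Gaussian Naive Bayes can be written as follows. This is the textbook form; the symbols are assumptions: $n$ features $x_1, \dots, x_n$, class index $k$, and per-class, per-feature mean $\mu_{jk}$ and variance $\sigma_{jk}^{2}$ estimated from the training set.

```latex
\hat{y} = \arg\max_{k}\; P(y = k)\prod_{j=1}^{n} P(x_j \mid y = k),
\qquad
P(x_j \mid y = k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^{2}}}
\exp\!\left(-\frac{(x_j - \mu_{jk})^{2}}{2\sigma_{jk}^{2}}\right)
```

The product over features is where the independence assumption enters, and neighbouring pixels in an image are usually far from independent.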

In [0]:
# Naive Bayes
model = GaussianNB()

In [0]:
# training 
model.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.64828125
Accuracy on Validation Set 0.664375


Naive Bayes performs poorly, with a validation accuracy of about 0.66

## Model 4 --> *Decision Tree* classifiers. These use a tree structure in which the root node splits into branches and each branch splits further. Each internal node makes a decision on a feature value.

In [0]:
model = DecisionTreeClassifier()

In [0]:
model.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 1.0
Accuracy on Validation Set 0.714375


With default parameters the decision tree performs poorly: it overfits the training set (training accuracy 1.0) and gives a low validation accuracy, so we need to tune its parameters

In [0]:
# lowering max_depth might lower the overfitting, since large value of max_depth causes overfitting
model = DecisionTreeClassifier(criterion = 'entropy', splitter = 'random', max_depth=10)

In [0]:
# training the model
model.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random')

In [0]:
# checking the accuracy on training and validation set
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.9140625
Accuracy on Validation Set 0.76125


These settings lower the overfitting and increase the validation accuracy (0.714 to 0.761)
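The effect of tree depth on overfitting can be seen by sweeping `max_depth` and comparing training and validation accuracy. The sketch below uses the digits dataset as a stand-in; the list of depths and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (an assumption): 8x8 digit images shipped with scikit-learn.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

results = {}
for depth in [2, 5, 10, None]:  # None places no limit on the depth
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))
    print("max_depth={}: train={:.3f} val={:.3f}".format(depth, *results[depth]))
```

A widening gap between the two numbers as depth grows is the usual sign that the tree is memorizing the training set.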

## Model 5 --> *Random Forest* classifiers. A random forest is a collection of decision trees, each trained on a random sample of the data, whose predictions are aggregated. It might therefore outperform a single decision tree

In [0]:
# n_estimators is the number of trees --> more trees generally improve accuracy, but with diminishing returns and a higher training cost
model = RandomForestClassifier(n_estimators=10, n_jobs=-1)

In [0]:
# training of model
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [0]:
# checking the accuracy on training and validation set
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.9921875
Accuracy on Validation Set 0.794375


Random forests generally reduce overfitting, but they can still overfit for the same reason as a decision tree; even so, a forest might outperform a single decision tree of the same depth. It might also overfit because of many unimportant features

In [0]:
feature_imp = pd.Series(model.feature_importances_).sort_values(ascending=False)
feature_imp

610    0.041644
621    0.041361
453    0.040548
190    0.034115
565    0.020913
526    0.020482
425    0.020343
499    0.018120
145    0.016997
106    0.015908
247    0.015058
202    0.015051
230    0.013330
37     0.012510
400    0.011545
720    0.011273
245    0.011006
770    0.008762
17     0.008560
200    0.007417
739    0.007371
772    0.006218
38     0.005798
46     0.005605
742    0.005219
746    0.004662
35     0.004534
773    0.004445
258    0.004364
256    0.004310
         ...   
755    0.000000
756    0.000000
757    0.000000
393    0.000000
392    0.000000
364    0.000000
168    0.000000
55     0.000000
83     0.000000
84     0.000000
110    0.000000
111    0.000000
113    0.000000
138    0.000000
139    0.000000
140    0.000000
167    0.000000
195    0.000000
363    0.000000
196    0.000000
224    0.000000
249    0.000000
279    0.000000
280    0.000000
281    0.000000
308    0.000000
335    0.000000
336    0.000000
362    0.000000
0      0.000000
Length: 784, dtype: float64

As we can see, there are many unimportant features towards the end of the list, but we don't eliminate them: they are pixel values, and their importance is only slightly lower than that of many other pixels. Instead, let's prune the random forest to the same depth as the decision tree used above
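One way to inspect the importances is to count the zero entries and reshape the vector back into the image grid. The sketch below does this on the 8x8 digits dataset as a stand-in. If the 784 values in this notebook are flattened 28x28 images (an assumption, since the original shape is not stated), the equivalent call would be `reshape(28, 28)`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (an assumption): 8x8 images flattened to 64 features.
X_d, y_d = load_digits(return_X_y=True)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_d, y_d)
importances = forest.feature_importances_

# Pixels with zero importance never reduced impurity in any tree.
n_zero = int(np.sum(importances == 0))
print("features with zero importance:", n_zero, "of", importances.size)

# Reshape to the image grid to see where the informative pixels lie.
importance_map = importances.reshape(8, 8)
print(np.round(importance_map, 3))
```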

In [0]:
# max_depth = 10 -> pruning
model = RandomForestClassifier(criterion = 'entropy', n_estimators=20, n_jobs=-1, max_depth=10)

In [0]:
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.96078125
Accuracy on Validation Set 0.81125


This performs better than the previous random forest: overfitting decreases (training accuracy 0.992 to 0.961) and validation accuracy increases (0.794 to 0.811)

Let's try out one more variant, increasing the number of trees to 100

In [0]:
model = RandomForestClassifier(criterion = 'entropy', n_estimators=100, n_jobs=-1, max_depth=10)

In [0]:
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [0]:
print("Accuracy on Training Set {}".format(accuracy_score(y_train, model.predict(X_train))))
print("Accuracy on Validation Set {}".format(accuracy_score(y_val, model.predict(X_val))))

Accuracy on Training Set 0.9684375
Accuracy on Validation Set 0.815625


Increasing the number of trees to 100 doesn't affect the accuracy much
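The diminishing return from adding trees can be checked with a sweep over `n_estimators`. The sketch below uses the digits dataset as a stand-in, so the numbers will differ from those above; the list of tree counts and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption): 8x8 digit images shipped with scikit-learn.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Validation accuracy for forests of increasing size at a fixed depth.
val_by_trees = {}
for n_trees in [10, 20, 50, 100]:
    forest = RandomForestClassifier(
        criterion="entropy", n_estimators=n_trees, max_depth=10, random_state=0
    ).fit(X_tr, y_tr)
    val_by_trees[n_trees] = forest.score(X_va, y_va)
    print("n_estimators={}: val={:.3f}".format(n_trees, val_by_trees[n_trees]))
```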

# So out of all the machine learning models, we choose Model 2 --> SVM classifier with Gaussian kernel and C=5, which gives the best balance between validation accuracy (0.84) and overfitting among the models described in this notebook