# CS 498AM1 Applied Machine Learning
## Problem 4 : Classifying MNIST Images using Decision Tree Classifier
### Prepared by: Vardhan Dongre (vdongre2@illinois.edu)

#### Problem Description: 
Investigate classifying MNIST using a decision forest. Using the same parameters for your forest construction i.e., same depth of tree; same number of trees for untouched raw pixels and stretched bounding box raw pixels

__compare the following cases:__ <font color = blue>untouched raw pixels and stretched
bounding box raw pixels</font>

• __Untouched:__ do not re-center the digits, but use the images as is.

• __Bounding box:__ construct a b × b bounding box so that the horizontal (resp. vertical) range of ink pixels is centered in the box.

• __Stretched bounding box:__ construct an b×b bounding box so that the horizontal
(resp. vertical) range of ink pixels runs the full horizontal (resp. vertical) range of
the box. Obtaining this representation will involve rescaling image pixels: you find
the horizontal and vertical ink range, cut that out of the original image, then resize
the result to b × b.

__Dataset:__ http://yann.lecun.com/exdb/mnist/

__References:__ 

https://stackoverflow.com/questions/21521571/how-to-read-mnist-database-in-r (Reader Code in R)

https://gist.github.com/mfathirirhas/f24d61d134b014da029a (Reader Code in Python)

https://colah.github.io/posts/2014-10-Visualizing-MNIST/ (Interesting insight on Dimensionality)

https://scikit-image.org/docs/stable/auto_examples/transform/plot_rescale.html (Resize images using scikit-image)


In [28]:
import numpy as np  
import pandas as pd
import struct
import matplotlib.pyplot as plt
 
from sklearn.metrics import accuracy_score

from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

In [29]:
def loadImageSet(filename):  
  
    binfile = open(filename, 'rb') 
    buffers = binfile.read()  
  
    head = struct.unpack_from('>IIII', buffers, 0)  
  
    offset = struct.calcsize('>IIII')   
    imgNum = head[1]  
    width = head[2]  
    height = head[3]  
  
    bits = imgNum * width * height   
    bitsString = '>' + str(bits) + 'B'   
  
    imgs_frame = struct.unpack_from(bitsString, buffers, offset)  
  
    binfile.close()  
    imgs = np.reshape(imgs_frame, [imgNum, width * height])  
  
    return imgs,head
  
def loadLabelSet(filename):  
  
    binfile = open(filename, 'rb')   
    buffers = binfile.read()  
  
    head = struct.unpack_from('>II', buffers, 0)   
  
    labelNum = head[1]  
    offset = struct.calcsize('>II')   
  
    numString = '>' + str(labelNum) + "B"  
    labels = struct.unpack_from(numString, buffers, offset) 
  
    binfile.close()  
    labels = np.reshape(labels, [labelNum])  
  
    return labels,head  

In [30]:
# Reference : 
#opencv2 resize 
# img = cv2.imread('your_image.jpg')
# res = cv2.resize(img, dsize=(54, 140), interpolation=cv2.INTER_CUBIC)
def resize_img(img,tol=0):
    stretch_list=[]
    for im in img:
        single=np.reshape(im,[28,28])
        mask = single>tol
        cropped=single[np.ix_(mask.any(1),mask.any(0))]
        resized=resize(cropped,(20,20))
        im = np.reshape(resized, [20 * 20])
        stretch_list.append(im)        
    return np.array(stretch_list)

In [48]:
def DecisionTree(depth,trees,train_x,train_y,test_x,test_y):
    clf = RandomForestClassifier(max_depth=depth, max_leaf_nodes=trees)
    clf.fit(train_x, train_y)
    pred = clf.predict(test_x)
    accuracy = accuracy_score(test_y, pred)
    return accuracy

In [45]:
train_x= 'train-images-idx3-ubyte'  
train_y= 'train-labels-idx1-ubyte'  
test_x='t10k-images-idx3-ubyte'
test_y='t10k-labels-idx1-ubyte'
  
train_imgs,train_data_head = loadImageSet(train_x)  
test_imgs,test_data_head = loadImageSet(test_x)

train_imgs_threshold=1*(train_imgs>128)
test_imgs_threshold=1*(test_imgs>128)



train_imgs_resize = resize_img(train_imgs)

# Adjusting the transformed output of resize()
avg = (np.amax(train_imgs_resize)+np.amin(train_imgs_resize))/2.0
train_imgs_resize_threshold=1*(train_imgs_resize>avg)

test_imgs_resize = resize_img(test_imgs)
test_imgs_resize_threshold=1*(test_imgs_resize>avg)

train_labels,train_labels_head = loadLabelSet(train_y)
test_labels,test_labels_head = loadLabelSet(test_y)

depth_list=[8,16,32]
trees_list=[20,30,40]
stretched_df = pd.DataFrame(index=depth_list, columns=trees_list)
stretched_df = stretched_dict.fillna(0)
untouched_df = pd.DataFrame(index=depth_list, columns=trees_list)
untouched_df = untouched_dict.fillna(0)
for i in depth_list:
    for j in trees_list:
        #print("resized")
        stretched_df.loc[i,j] = DecisionTree(i,j,train_imgs_resize_threshold,train_labels,test_imgs_resize_threshold,test_labels)
        #print("untouched")
        untouched_df.loc[i,j] = DecisionTree(i,j,train_imgs,train_labels,test_imgs,test_labels)

# Results

print('Untouched raw pixels ')
print(untouched_df)
print('Stretched bounding box raw pixels ')
print(stretched_df)

  .format(dtypeobj_in, dtypeobj_out))


Untouched raw pixels 
        20      30      40
8   0.8156  0.8248  0.8481
16  0.7942  0.8265  0.8390
32  0.8033  0.8305  0.8470
Stretched bounding box raw pixels 
        20      30      40
8   0.7603  0.7975  0.8184
16  0.7721  0.8099  0.8116
32  0.7677  0.8149  0.8272


### Keeping both parameters same (same depth of tree; same number of trees )

In [49]:
depth_d = 32
trees_n = 32
print("stretched")
print(DecisionTree(depth_d,trees_n,train_imgs_resize_threshold,train_labels,test_imgs_resize_threshold,test_labels))
print("untouched")
print(DecisionTree(depth_d,trees_n,train_imgs,train_labels,test_imgs,test_labels))

stretched




0.8087
untouched




0.8198


## Which works best? 
From the above results it seems the results were comparable with the untouched performing slightly better than stretched images. Although image resizing should have performed better as it is a very common technique for boosting performance of DL models. 

## Why?
There might be downside to stretching if we cut the image and lose useful information. However, if the image is carefully resized it makes it easier for the model to identify the useful features in the data.