Sawyer Byrd

Homework 3

In [1]:
import os
import pandas as pd
import numpy as np
from PIL import Image

In [2]:
file_path = '/home/sawbyrd/CMSC422/HW3/ATT'

Below is a function that organizes the images used in this project into a data array.

It returns a data array where each row is a 1-Dim representation of an image

This returns an array with original size representation, as well as one with the images resized for Task 3

And a labels array that contains the label for each image.

In [3]:
def organize_data(path):
    data_arr = []
    resize_data_arr = []
    labels_arr = []
    for file_name in os.listdir(path):  # Looping through files in the directory
        if file_name.endswith('.png'):
            id = int(file_name.split('_')[0])   # getting "Class Label"
            
            # Converting image into 1-Dim np array
            image_path = os.path.join(path, file_name)
            image = Image.open(image_path)
            resized = image.resize((56, 46))    # Also creating a resized image for Task 3
            image = np.array(image).flatten()
            resized = np.array(resized).flatten()
            
            data_arr.append(image)
            resize_data_arr.append(resized)
            labels_arr.append(id)
    
    # Converting data to pandas dataframe
    data = pd.DataFrame(data_arr)
    labels = pd.DataFrame(labels_arr, columns=['Label'])
    data = pd.concat([data, labels], axis=1)
    
    # Converting resized data to pandas dataframe
    resize = pd.DataFrame(resize_data_arr)
    resize = pd.concat([resize, labels], axis=1)
    
    return data, resize

In [4]:
data, resized = organize_data(file_path)
print('Original Data')
print(data.head())
print('Resized Data: ')
print(resized.head())

Original Data
     0    1    2    3    4    5    6    7    8    9  ...  10295  10296  10297  \
0  110  111  110  111  108  110  108  112  109  110  ...    156    128     72   
1  124  126  124  125  125  127  121  127  124  124  ...     68     69     78   
2  103  104  105  105  104  106  101  105  101  104  ...     78     75     75   
3   44   43   32   32   30   30   38   40   48   66  ...     42     42     40   
4   86   90   87   90   91   88   88   90   87   91  ...    132    127    131   

   10298  10299  10300  10301  10302  10303  Label  
0     61     65     61     63     59     60     30  
1     74     76     78     78     78     77     23  
2     75     76     73     76     76     75     29  
3     33     29     37     43     43     37      1  
4    139    139    137    127    124    126     39  

[5 rows x 10305 columns]
Resized Data: 
     0    1    2    3    4    5    6    7    8    9  ...  2567  2568  2569  \
0  110  110  109  110  109  109  107  103   94   92  ...   170

Transfering data into a pandas dataframe

Shuffling the data before splitting into groups

In [5]:
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
resized = resized.sample(frac=1, random_state=42).reset_index(drop=True)
print('Original: ')
print(data.head())
print('Resized: ')
print(resized.head)

Original: 
     0    1    2    3    4    5    6    7    8    9  ...  10295  10296  10297  \
0  108  107  106  110  111  111  111  106  109  109  ...     48     50     51   
1   50   49   50   48   52   49   49   52   49   52  ...    153     99     50   
2   89   87   92   88   91   84   91   90   85   89  ...     91     96     84   
3  116  123  120  123  121  125  120  121  123  125  ...     57    134    164   
4  104  101  104  104  105  103  104  104  103  106  ...     76     73     77   

   10298  10299  10300  10301  10302  10303  Label  
0     49     51     50     43     43     40     26  
1     90    133    163    159    111    117     14  
2     55     19     63     93    112    109     39  
3    147    133    128    108     96     98     37  
4     74     77     73     78     73     79     29  

[5 rows x 10305 columns]
Resized: 
<bound method NDFrame.head of        0    1    2    3    4    5    6    7    8    9  ...  2567  2568  2569  \
0    109  108  110  111  109  110  111

Task 1 Code

Below are 2 functions for classifying using 1NN.

L2_distance calculates the L2 distance (Euclidean distance) between 2 features.

predics_1NN uses 1NN to predict the label for a single test sample.

In [6]:
# L2 (Euclidean) Distance
def L2_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Predicts a label for a single test sample
def predict_1NN(train_feat, train_labels, test_feat):
    # Initializing vars for closest neighbor
    best_guess = None
    best_dist = float('inf')
    
    # Looping through train labels and finding closest neighbor
    for i, label in enumerate(train_feat):
        dist = L2_distance(label, test_feat)
        if dist < best_dist:
            best_dist = dist
            best_guess = train_labels[i]
            
    return best_guess
    

Task 2 Code

Below is the function that implements PCA using SVD

In [7]:
def pca_svd(X, n_comp):
    U, S, Vt = np.linalg.svd(X, full_matrices=False) # Computing SVD
    comps = Vt[:n_comp]     # Getting the n left singular values
    pca = np.dot(X, comps.T)     # Transforming the data using PCA
    
    return pca, comps

Below is a function that combines the helper functions to implement 1NN with K-Fold cross validation

The "PCA_option variable is a boolean. When True, the algorithm will run with PCA on each fold. When False, it will run without PCA

In [8]:
def full_1NN_kfold_alg(data, n_folds, PCA_option):
    folds = np.array_split(data, n_folds) # Splitting data into n_splits groups
    
    accuracy_arr = []
    # Looping through each fold
    for i, fold in enumerate(folds):
        # Collecting train and test arrays
        test = fold
        train = pd.concat([folds[j] for j in range(n_folds) if j != i], axis=0)
        
        # Seperating test array into features and labels
        test_labels = test['Label'].values
        test_feat = test.iloc[:, :-1].values
        # Seperating features from labels
        train_labels = train['Label'].values
        train_feat = train.iloc[:, :-1].values
        
        # Using PCA to tranform the data (If option selected)
        if PCA_option:
            train_feat, comps = pca_svd(train_feat, 100)    # Project the train data to first 100 components
            test_feat = test_feat @ comps.T    # Apply the same transformation to the test data

        
        predictions = []
        # Predicting label of each feature using predict1NN
        for feat in test_feat:
            prediction = predict_1NN(train_feat, train_labels, feat)
            predictions.append(prediction)
        
        # Computing the accuracy of  each fold
        predictions = np.array(predictions)
        accuracy = np.mean(predictions == test_labels)
        accuracy_arr.append(accuracy)
        print('Accuracy for Fold ', i, ': ', accuracy * 100, '%')
    
    # Computing average accuracy of all folds
    accuracy_arr = np.array(accuracy_arr)
    print('--------------------')
    print('Average Accuracy: \n', accuracy_arr.mean() * 100, '%')

Task 1 Implementation and Results

In [9]:
print('-----------------------------------------------------')
print('| Task 1: 1NN With 5-Fold Cross Validation (No PCA) |')
print('-----------------------------------------------------')
full_1NN_kfold_alg(data, 5, False)

-----------------------------------------------------
| Task 1: 1NN With 5-Fold Cross Validation (No PCA) |
-----------------------------------------------------


  return bound(*args, **kwds)


Accuracy for Fold  0 :  100.0 %
Accuracy for Fold  1 :  98.75 %
Accuracy for Fold  2 :  97.5 %
Accuracy for Fold  3 :  97.5 %
Accuracy for Fold  4 :  98.75 %
--------------------
Average Accuracy: 
 98.5 %


Task 2 Implementation and results

In [10]:
print('-------------------------------------------------------')
print('| Task 2: 1NN With 5-Fold Cross Validation (With PCA) |')
print('-------------------------------------------------------')
full_1NN_kfold_alg(data, 5, True)

-------------------------------------------------------
| Task 2: 1NN With 5-Fold Cross Validation (With PCA) |
-------------------------------------------------------
Accuracy for Fold  0 :  97.5 %
Accuracy for Fold  1 :  96.25 %
Accuracy for Fold  2 :  96.25 %
Accuracy for Fold  3 :  95.0 %
Accuracy for Fold  4 :  98.75 %
--------------------
Average Accuracy: 
 96.74999999999999 %


Task 3 Implementation and results

In [12]:
print('--------------------------------------------------------------------------')
print('| Task 2: 1NN With 5-Fold Cross Validation (With PCA and Resized Images) |')
print('--------------------------------------------------------------------------')
full_1NN_kfold_alg(resized, 5, True)

--------------------------------------------------------------------------
| Task 2: 1NN With 5-Fold Cross Validation (With PCA and Resized Images) |
--------------------------------------------------------------------------


  return bound(*args, **kwds)


Accuracy for Fold  0 :  97.5 %
Accuracy for Fold  1 :  96.25 %
Accuracy for Fold  2 :  96.25 %
Accuracy for Fold  3 :  95.0 %
Accuracy for Fold  4 :  98.75 %
--------------------
Average Accuracy: 
 96.74999999999999 %
