# Image Classification

In [1]:
#importing libraries
import numpy as np
from PIL import Image
import glob

#loading files
jersey = glob.glob("C:/Users/rhyth/Documents/CS156/Jersey - n03595614/*")
shirt = glob.glob("C:/Users/rhyth/Documents/CS156/Shirt - n04197391/*")

#to store jersey and shirt data
jersey_data, shirt_data = [], []

## Data Preprocessing
Here I load two folders, which contains all the images, categorized by shirts and jerseys.
Then, I open each image in the folder one-by-one and resize them to 20 x 20 and then obtain the flattened array form of the image and then store the respective dataset (jersey or shirt)

In [2]:
#data preprocessing
for image in jersey: #for jerseys
    img = Image.open(image) #open image
    img = img.resize((20, 20), resample=0) #resize
    img = np.array(img).flatten() #transform to array & flatten
    jersey_data.append(img) #store

for image in shirt: #for shirts
    img = Image.open(image) #open image
    img = img.resize((20, 20), resample=0) #resize
    img = np.array(img).flatten() #transform to array & flatten
    shirt_data.append(img) #store

After having the data ready in two different dataset, I classified the elements in dataset to 0: for shirt and 1: for jersey. 

In [3]:
#labelling the data
labelled_jersey = np.asarray([(pic, 1) for pic in jersey_data]) #class 1 for jersey
labelled_shirt = np.asarray([(pic, 0) for pic in shirt_data]) #class 0 for shirt
#deleting an element which had a problematic shape, manually found it
#using a loop over the dataset and array size, required size 1200, rest are invalid
#only this one had array length of 400, while others were 1200 (20*20*3, 3 for the RGB) 
labelled_jersey = np.delete(labelled_jersey, 987, 0)

  return array(a, dtype, copy=False, order=order)


Now preparing (for ML) a merged list of features from jersey and shirt, into X variable and labels for y variable.

In [4]:
#separating X & y and stacking them
X = np.append(labelled_jersey[:,0], labelled_shirt[:,0])
y = np.append(labelled_jersey[:,1], labelled_shirt[:,1])
X = np.stack(i for i in X)
y = np.stack(i for i in y)

  if self.run_code(code, result):


Now that our data is ready, we will split it into training set and testing set into 80%-20% ratio, respectively.

In [5]:
#splitting data into training & testing with 80:20 ratio respectively
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Logistic Regression Classifier
Here we use a logistic regression model to train and then test it's accuracy on train & test set.

In [6]:
from sklearn.linear_model import LogisticRegression
#logistic regression model
lr = LogisticRegression()
#fitting model to train data
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
#train data accuracy
print("Train data .score(): ", lr.score(X_train,y_train))
#test data accuracy
print("Test data .score(): ", lr.score(X_test,y_test))

Train data .score():  0.9776785714285714
Test data .score():  0.5490196078431373


We have obtained, 97.8% accuracy on training set which signals overfitting although 54.9% accuracy on test data, which is still okay enough.

## Support Vector Classifier (SVC)
Here I repeat linear classification but with support vectors using RBF kernel

In [8]:
#Support Vector Classifier Model
from sklearn.svm import SVC
clf = SVC(kernel='rbf', gamma='auto') #RBF Kernel

#process: model fitting and computing accuracy scores
#input: training and testing data with features + labels, svc model
#output: accuracy score on training and testing data
def fit_metrics(X_train, y_train, X_test, y_test, clf):
    clf.fit(X_train,y_train) #model fitting
    y_train_pred = clf.predict(X_train) #predicting on train data
    y_test_pred = clf.predict(X_test) #predicting on test data
    print("Train data accuracy_score(): ", accuracy_score(y_train_pred,y_train))
    print("Test data accuracy_score(): ", accuracy_score(y_test_pred,y_test))

In [9]:
from sklearn.metrics import accuracy_score
#metrics of SVC model
fit_metrics(X_train, y_train, X_test, y_test, clf) 

Train data accuracy_score():  0.9776785714285714
Test data accuracy_score():  0.48484848484848486


## Reduced Representation: Principal Component Analysis (PCA)
Now, we reduce the data into 2 principal components and then fit our SVC model and measure accuracy. I chose 2 after trying with different n_components value, because only the first two had major explained variance ratio. Example, with 10 components I got the explained variance ratio as 
array([0.30917742, 0.15128824, 0.04738025, 0.03173914, 0.02739319, 0.02393147, 0.02155885, 0.01492888, 0.01339326, 0.01066094])

In [10]:
#Principal Component Analysis decomposition of data
#train and test data PCA transformation
from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
pca.fit(X_train) #PCA fitting
X_train_PCA = pca.transform(X_train) #PCA transformation
X_test_PCA = pca.transform(X_test) #PCA transformation

In [11]:
#Metrics of SVC trained model with PCA transformed data
fit_metrics(X_train_PCA, y_train, X_test_PCA, y_test, clf)

Train data accuracy_score():  0.9776785714285714
Test data accuracy_score():  0.48663101604278075


In [12]:
#PCA Explained Variance Ratio, only the first 2 had major ratio
pca.explained_variance_ratio_

array([0.30917742, 0.15128824])

Here, we get again ~ 97% accuracy on training but decreased test accuracy of 48.7%. PCA did not improve our model

## Reduced Representation: Linear Discriminant Analysis (LDA)
Similarly, we attempt LDA to reduce our data and then fit SVC model to obtain accuracy

In [13]:
#Linear Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
X_train_LDA = lda.fit_transform(X_train, y_train)
X_test_LDA = lda.fit_transform(X_test, y_test)



In [14]:
#F#Metrics of SVC trained model with LDA transformed data
fit_metrics(X_train_LDA, y_train, X_test_LDA, y_test, clf)

Train data accuracy_score():  0.8705357142857143
Test data accuracy_score():  0.5098039215686274


We get 87% accuracy on train data, and our test score accuracy is 50% which signals overfitting and also is less than 54% (the highest score) we got for linear classifier without any data reduction. Anyway, this model is no better than taking a guess.