# CS369 Image Classifier
## Starter Code

This notebook is a starting point for writing your image classifier.

Start by setting the `root_path` variable to point to the dataset on your computer (a relative path is ok). You can verify that you're loading the data correctly by printing out the list of label names.

As is, this code loads each image and converts the image into a 1D luminance histogram. This is a very simple feature vector, and you are encouraged to experiment with more complicated ones to improve the accuracy of your predictions.
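As one possible direction, a per-channel color histogram keeps color information that the luminance histogram discards. The sketch below is only an illustration, not a required approach: the helper name `color_histogram`, the bin count of 32, and the synthetic test image are all assumptions made for this example.

```python
import numpy as np

def color_histogram(img, nbins=32):
    """Concatenate per-channel histograms of an (H, W, 3) uint8 RGB image.

    Hypothetical helper for illustration; not part of the starter code.
    """
    channels = []
    for c in range(3):
        # fixed 0-255 range so every image is binned the same way
        hist, _ = np.histogram(img[..., c], bins=nbins, range=(0, 256))
        channels.append(hist)
    feature = np.concatenate(channels).astype(float)
    # divide by the pixel count so images of different sizes are comparable
    return feature / (img.shape[0] * img.shape[1])

# synthetic stand-in for np.array(Image.open(f))
rng = np.random.default_rng(0)
demo_img = rng.integers(0, 256, size=(150, 150, 3), dtype=np.uint8)
demo_feature = color_histogram(demo_img)
print(demo_feature.shape)
```

Each image then yields a vector of length `3 * nbins`, which could be stored in the data frame alongside (or instead of) `lumhist`.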

The labels, filenames, and histogram feature vectors are stored in a pandas data frame in case you want to save and load them instead of re-computing them each time.
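One way to cache the data frame is pandas' pickle support, which keeps the NumPy arrays in the `lumhist` column intact (a plain CSV export would turn them into strings). This is a minimal sketch on a tiny stand-in data frame; the file name `features.pkl` and the temporary directory are assumptions for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# small stand-in for the notebook's data frame (same column names)
demo_df = pd.DataFrame([
    {'labelname': 'a', 'filename': '1.jpg', 'labelnum': 0, 'lumhist': np.arange(4)},
    {'labelname': 'b', 'filename': '2.jpg', 'labelnum': 1, 'lumhist': np.arange(4) * 2},
])

# hypothetical cache location; in the notebook, pick any path you like
cache_path = os.path.join(tempfile.mkdtemp(), 'features.pkl')

# save once after computing the features ...
demo_df.to_pickle(cache_path)

# ... and load in later sessions instead of re-reading every image
loaded_df = pd.read_pickle(cache_path)
restored = np.vstack(loaded_df['lumhist'])
print(restored.shape)
```

Only load pickle files you created yourself, since unpickling can execute arbitrary code.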

The last part of the code trains a simple SVM classifier and reports its accuracy on the same data it was trained on, which tends to overstate how well the model generalizes. You're encouraged to split the data into Train and Validation subsets so you can check that your model isn't over-fitting to the training data.
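A split along these lines could use `sklearn.model_selection.train_test_split`. The sketch below runs on synthetic stand-ins for `feature_matrix` and `label_array`; the 25% validation fraction, the `stratify` option, and the fixed `random_state` are choices made for this example, not requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-ins for feature_matrix and label_array (3 classes, 40 samples each)
rng = np.random.default_rng(0)
demo_labels = np.repeat(np.arange(3), 40)
demo_features = rng.normal(size=(120, 16)) + demo_labels[:, None]

# hold out 25% for validation; stratify keeps class proportions equal in both subsets
X_train, X_val, y_train, y_val = train_test_split(
    demo_features, demo_labels,
    test_size=0.25, stratify=demo_labels, random_state=0)

# fit on the training subset only, so the scaler never sees validation data
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
val_acc = clf.score(X_val, y_val)
print('Train Accuracy: {}'.format(train_acc))
print('Validation Accuracy: {}'.format(val_acc))
```

A training accuracy much higher than the validation accuracy is a sign of over-fitting.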

You will need to add several components to the code; they are listed in the final cell of this notebook. We will talk about these in class, and you can also look up the documentation for the suggested functions online.

A rubric describing how the project will be graded will be provided separately.

In [16]:
import numpy as np
import pandas as pd
from PIL import Image
from glob import glob

from skimage.exposure import histogram
from skimage.color import rgb2gray

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [29]:
# Path to Dataset
# root_path = './Intel\ Training\ Dataset/'
root_path = './Intel Training Dataset/'

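import os

# each subfolder of root_path holds the images for one class label
subfolders = sorted(p for p in glob(os.path.join(root_path, '*')) if os.path.isdir(p))
# basename works with both '/' and '\\' path separators
label_names = [os.path.basename(p) for p in subfolders]
# print(label_names)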

In [30]:
# create a list to organize labels, filenames, and feature vectors
data = []

for i, (label, subfolder) in enumerate(zip(label_names, subfolders)):
    # get list of file paths for each subfolder
    file_paths = sorted(glob(subfolder + '/*.jpg'))
    for f in file_paths:
        # read image
        img = np.array(Image.open(f))
        fname = f.split('/')[-1].split('_')[-1]
        # convert to luminance histogram (feature vector)
        img_hist, _ = histogram(rgb2gray(img), nbins=256,
                                source_range='dtype',
                                normalize=False)
        # append to data list with labels
        data.append({'labelname': label,
                     'filename': fname,
                     'labelnum': i,
                     'lumhist': img_hist})

# convert to dataframe for storage
# can also export to a file here
df = pd.DataFrame(data=data)

In [31]:
# pull the labels and feature vectors out of the data frame
label_array = df['labelnum']  # 1D vector of integer class labels
feature_matrix = np.vstack(df['lumhist'])  # 2D array, one row per image

In [32]:
# train a simple classifier
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(feature_matrix, label_array)

# report overall accuracy on the training data
print('Total Accuracy: {}'.format(clf.score(feature_matrix, label_array)))

Total Accuracy: 0.4985743380855397


In [12]:
# Project To Do's
# 0. split the data into Train and Validation sets
# 1. use sklearn.metrics.confusion_matrix to get more detailed results
# 2. use sklearn.model_selection.GridSearchCV to try different params
# 3. try different feature vectors and classifiers to improve accuracy
# 4. use python's time.time() function to measure compute time costs
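To Do items 1, 2, and 4 above could be combined roughly as follows. This is a sketch on synthetic data, not a reference solution: the parameter grid, the 3-fold cross-validation, and the stand-in arrays are all assumptions chosen to keep the example small.

```python
import time

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-ins for feature_matrix and label_array
rng = np.random.default_rng(0)
demo_labels = np.repeat(np.arange(3), 40)
demo_features = rng.normal(size=(120, 16)) + demo_labels[:, None]
X_train, X_val, y_train, y_val = train_test_split(
    demo_features, demo_labels,
    test_size=0.25, stratify=demo_labels, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
# make_pipeline names each step after its lowercased class, hence the 'svc__' prefix
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 'auto']}

# time the whole search (every parameter combination times every CV fold)
start = time.time()
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
elapsed = time.time() - start

# rows are true labels, columns are predicted labels
cm = confusion_matrix(y_val, search.predict(X_val))

print('Best params: {}'.format(search.best_params_))
print('Search time: {:.2f} s'.format(elapsed))
print(cm)
```

Off-diagonal entries of the confusion matrix show which classes get mistaken for each other, which can guide the choice of new feature vectors.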