# Image features exercise

In this exercise we will show that we can improve classification performance by training classifiers not on raw pixels but on features computed from the raw pixels.

All of your work for this exercise will be done in this notebook.

In [None]:
import random
import numpy as np
from camalab.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (15., 12.) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

## Load data
Similar to previous exercises, we will load CIFAR-10 data from disk.

In [None]:
def get_CIFAR10_data(num_training=5000, num_validation=500, num_test=500):
  # Load the raw CIFAR-10 data
  # Change this to your own path, or place the dataset at this path.
  cifar10_dir = 'camalab/datasets/cifar-10-batches-py'
    
  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir, 3)
  
  # Subsample the data
  mask = range(num_training, num_training + num_validation)
  X_val = X_train[mask]
  y_val = y_train[mask]
  mask = range(num_training)
  X_train = X_train[mask]
  y_train = y_train[mask]
  mask = range(num_test)
  X_test = X_test[mask]
  y_test = y_test[mask]

  return X_train, y_train, X_val, y_val, X_test, y_test

X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

## Extract Features
For each image we will compute a Histogram of Oriented
Gradients (HOG) as well as a color histogram. We form our final feature vector for each image by concatenating
the HOG and color histogram feature vectors.

Roughly speaking, HOG should capture the texture of the image while ignoring
color information, and the color histogram represents the color of the input
image while ignoring texture. As a result, we expect that using both together
ought to work better than using either alone. Verifying this assumption would
be a good thing to try for the bonus section.
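
To make the two kinds of feature concrete, here is a minimal sketch. Both functions are hypothetical stand-ins written for illustration, not the `hog` and `color_histogram` functions of this assignment: `toy_color_histogram` bins raw RGB intensities per channel, and `toy_orientation_histogram` builds a single global histogram of gradient orientations, without the cells and block normalization that full HOG uses.

```python
import numpy as np

def toy_color_histogram(img, nbin=10):
    """Hypothetical stand-in: per-channel intensity histogram of an HxWx3 image."""
    feats = []
    for c in range(img.shape[2]):
        hist, _ = np.histogram(img[:, :, c], bins=nbin, range=(0, 255))
        feats.append(hist)
    return np.concatenate(feats).astype(np.float64)

def toy_orientation_histogram(img, nbin=9):
    """Hypothetical stand-in: one global histogram of gradient orientations,
    weighted by gradient magnitude (a heavily simplified relative of HOG)."""
    gray = img.mean(axis=2)                 # discard color information
    gy, gx = np.gradient(gray)              # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi      # unsigned orientation
    hist, _ = np.histogram(angle, bins=nbin, range=(0, np.pi), weights=magnitude)
    return hist

# A random image stands in for a CIFAR-10 sample (assumption: 32x32x3, values 0..255).
rng = np.random.RandomState(0)
img = rng.randint(0, 256, size=(32, 32, 3)).astype(np.float64)

color_feat = toy_color_histogram(img)
orient_feat = toy_orientation_histogram(img)
combined = np.concatenate([color_feat, orient_feat])  # final per-image feature vector
```

The color feature depends only on pixel intensities per channel, while the orientation feature is computed from a grayscale copy, which is why the two are expected to carry complementary information.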

You should import your `hog` and `color_histogram` functions. Both operate on a single
image and return a feature vector for that image. 

Your feature extraction function should take a set of images and a list of feature functions, evaluate each feature function on each image, and store the results in a matrix where each row is the concatenation of all feature vectors for a single image.
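
One possible shape for such a function is sketched here. The name `extract_features_sketch` and the two toy feature functions are assumptions made for illustration; the layout (one row per image) follows what the preprocessing code in the next cell expects, since it averages over `axis=0` and appends one bias entry per row.

```python
import numpy as np

def extract_features_sketch(imgs, feature_fns):
    """Apply every feature function to every image.

    Returns an (N, D) matrix with one row per image, where D is the total
    length of all feature vectors concatenated together.
    """
    rows = []
    for img in imgs:
        parts = [np.asarray(fn(img), dtype=np.float64).ravel() for fn in feature_fns]
        rows.append(np.concatenate(parts))
    return np.vstack(rows)

# Hypothetical feature functions, used only to exercise the sketch.
def mean_per_channel(img):
    return img.mean(axis=(0, 1))            # 3 values for an HxWx3 image

def intensity_histogram(img, nbin=4):
    hist, _ = np.histogram(img, bins=nbin, range=(0, 255))
    return hist                             # nbin values

imgs = np.random.RandomState(0).randint(0, 256, size=(5, 32, 32, 3)).astype(np.float64)
feats = extract_features_sketch(imgs, [mean_per_channel, intensity_histogram])
print(feats.shape)  # (5, 7)
```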

In [None]:

# color_histogram
num_color_bins = 10 # Number of bins in the color histogram
X_train_feats1 = None  # Use your function
X_val_feats1 = None
X_test_feats1 = None

# HOG
X_train_feats2 = None  # Use your function
X_val_feats2 = None
X_test_feats2 = None

# Concatenating the HOG and color histogram feature vectors.
X_train_feats = None  # you may use the 'np.concatenate' function
X_val_feats = None
X_test_feats = None

# Preprocessing: Subtract the mean feature
mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)
X_train_feats -= mean_feat  
X_val_feats -= mean_feat
X_test_feats -= mean_feat

# Preprocessing: Divide by standard deviation. This ensures that each feature
# has roughly the same scale.
std_feat = np.std(X_train_feats, axis=0, keepdims=True)
X_train_feats /= std_feat
X_val_feats /= std_feat
X_test_feats /= std_feat

# Preprocessing: Add a bias dimension
# The bias dimension has no effect on k-NN distances, but other classifiers may need it.
X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])
X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])
X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])
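
The preprocessing steps above can be checked on a small toy matrix. This is a sketch under stated assumptions: the random data stands in for real features, and the guard against a zero standard deviation is an addition that the cell above does not include. Note that the validation data is normalized with the training statistics, not its own.

```python
import numpy as np

rng = np.random.RandomState(0)
train = rng.rand(8, 3) * 10.0   # toy stand-in for training features (rows = samples)
val = rng.rand(4, 3) * 10.0     # toy stand-in for validation features

# Statistics are computed on the training set only.
mean_feat = train.mean(axis=0, keepdims=True)
std_feat = train.std(axis=0, keepdims=True)
std_feat[std_feat == 0] = 1.0   # assumption: avoid dividing by zero for constant features

train_n = (train - mean_feat) / std_feat
val_n = (val - mean_feat) / std_feat

# Append a constant bias column of ones.
train_n = np.hstack([train_n, np.ones((train_n.shape[0], 1))])
val_n = np.hstack([val_n, np.ones((val_n.shape[0], 1))])
```

After this, every non-bias column of `train_n` has zero mean and unit standard deviation; the columns of `val_n` are only approximately so, because they reuse the training mean and standard deviation.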

## k-NN on features
Using the k-NN code developed earlier in the assignment, train a k-NN classifier on top of the features. This should achieve better results than k-NN applied directly to raw pixels.

In [None]:
from camalab.classifiers import KNearestNeighbor

classifier = KNearestNeighbor()
classifier.train(X_train_feats, y_train)

In [None]:
# Use the validation set to tune the 'k'

k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
k_to_accuracies = {} # you should store the results in this dict

################################################################################
# TODO:                                                                        #
# Use the simple validation method, because the dataset has already been       #
# divided into a training set and a validation set.                            #
################################################################################
pass
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

for k in sorted(k_to_accuracies):
    print('k = %d, accuracy = %f' % (k, k_to_accuracies[k]))
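
For reference, simple validation for choosing `k` can be sketched on toy data as follows. This is a self-contained illustration, not the assignment's solution: `knn_predict` is a hypothetical stand-in for the `KNearestNeighbor` class, and two Gaussian blobs stand in for the image features.

```python
import numpy as np

def knn_predict(X_train, y_train, X, k):
    """Hypothetical stand-in for a k-NN classifier: majority vote among the
    k nearest training points under Euclidean distance."""
    # Squared distances, shape (num_queries, num_train).
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy data: two well-separated Gaussian blobs with labels 0 and 1.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(60, 2) + [-3.0, 0.0], rng.randn(60, 2) + [3.0, 0.0]])
y = np.array([0] * 60 + [1] * 60)
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

# A single fixed split: train on the first 90 points, validate on the rest.
X_tr, y_tr = X[:90], y[:90]
X_va, y_va = X[90:], y[90:]

k_to_acc = {}
for k in [1, 3, 5, 10]:
    pred = knn_predict(X_tr, y_tr, X_va, k)
    k_to_acc[k] = np.mean(pred == y_va)

# Pick the k with the highest validation accuracy (smallest k wins ties).
best_k = max(sorted(k_to_acc), key=k_to_acc.get)
```

The test set plays no part in this loop; it is only used once, after `best_k` has been fixed.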

In [None]:
# Evaluate your classifier

best_k = 1  # choose your best k

# Compute and display the accuracy
y_test_pred = classifier.predict(X_test_feats, k=best_k)
test_accuracy = np.mean(y_test == y_test_pred)
print(test_accuracy)