# Scene Categorization

The goal of this assignment is to introduce you to image categorization. We will focus on the task of scene categorization. You task is to implement image features, train a classifier using the training samples, and then evaluate the the classifier on the test set.


Dataset: In the supplemental material, we have supplied images with 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. The dataset has been split into a train set (1888 images) and test set (800 images), placed in train and test folders separately. The associated labels are stored in `gs.mat`, for example, label id of `42.jpg` in the training folder corresponds to `train_gs(42)`. Its actual label name will be `names{train_gs(42)}`.

In [32]:
import numpy as np
import pandas as pd
import cv2
import re
import time
from imageio import imread
from skimage.transform import resize
from pathlib import Path
from scipy.io import loadmat
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans,MiniBatchKMeans
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KDTree, KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from keras.applications import *
from keras.models import Model

In [14]:
def numerical_sort(value):
    numbers = re.compile(r'(\d+)')
    value = str(value)
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts

def get_images():
    train_path = Path('./data/train')
    test_path = Path('./data/test')
    X_train = []
    X_test = []
    for image_path in sorted(train_path.glob('*.jpg'),key=numerical_sort):
        image = imread(image_path)
        X_train.append(image)
    for image_path in sorted(test_path.glob('*.jpg'),key=numerical_sort):
        image = imread(image_path)
        X_test.append(image)
    X_train = np.array(X_train)
    X_test = np.array(X_test)
    return X_train,X_test

def get_labels():
    label_path = Path('./data/gs.mat')
    labels = loadmat(label_path)
    y_train = labels['train_gs'].flatten()
    y_test = labels['test_gs'].flatten()
    return y_train,y_test

def histogram_distance(h1,h2):
    h1 = h1.astype(np.float32)
    h2 = h2.astype(np.float32)
    method = cv2.HISTCMP_INTERSECT
    return cv2.compareHist(h1,h2,method)

def get_scores(model,X_train,y_train,X_test,y_test):
    yhat = model.predict(X_train)
    score = accuracy_score(y_train,yhat)
    print('Training Accuracy: %f' % round(score,3))    
    yhat = model.predict(X_test)
    score = accuracy_score(y_test,yhat)
    print('Test Accuracy: %f' % round(score,3))
    print('Confusion Matrix:\n', confusion_matrix(y_test,yhat))

In [3]:
X_train,X_test = get_images()
print('training data size:',X_train.shape)
print('test data size:',X_test.shape)

y_train,y_test = get_labels()
print('training label size:',y_train.shape)
print('test label size:',y_test.shape)

training data size: (1888, 256, 256, 3)
test data size: (800, 256, 256, 3)
training label size: (1888,)
test label size: (800,)


## Color histogram and kNN classifier

Implement a function to compute the color histogram of an image. For example, you can use the Matlab function hist for computing marginal histogram of RGB channels.

Use nearest neighbor classifier (kNN) to categorize the test images.

- Describe your quantization/binning method and parameters
- Report number of K for the kNN classifer
- Display the confusion matrix and categorization accuracy.

In [14]:
def get_color_hists(X_train,X_test,bins=10):
    X_train_hist = np.zeros((len(X_train),bins*3))
    X_test_hist = np.zeros((len(X_test),bins*3))
    for i in range(len(X_train)):
        red = cv2.calcHist([X_train[i,:,:,0]],[0],None,[bins],[0,256]).flatten()
        green = cv2.calcHist([X_train[i,:,:,1]],[0],None,[bins],[0,256]).flatten()
        blue = cv2.calcHist([X_train[i,:,:,2]],[0],None,[bins],[0,256]).flatten()
        X_train_hist[i] = np.concatenate([red,green,blue])
    for i in range(len(X_test)):
        red = cv2.calcHist([X_test[i,:,:,0]],[0],None,[bins],[0,256]).flatten()
        green = cv2.calcHist([X_test[i,:,:,1]],[0],None,[bins],[0,256]).flatten()
        blue = cv2.calcHist([X_test[i,:,:,2]],[0],None,[bins],[0,256]).flatten()
        X_test_hist[i] = np.concatenate([red,green,blue])
    return X_train_hist,X_test_hist

In [15]:
X_train_hist,X_test_hist = get_color_hists(X_train,X_test,bins=10)

In [16]:
model = KNeighborsClassifier(n_neighbors=10,n_jobs=-1)
model.fit(X_train_hist,y_train)
get_scores(model,X_train_hist,y_train,X_test_hist,y_test)

Training Accuracy: 0.560000
Test Accuracy: 0.475000
Confusion Matrix:
 [[30 16  3 19  8 10  4 10]
 [ 3 81  0  7  2  1  3  3]
 [ 9  4 58  7  5  6  6  5]
 [ 4 25  1 45  4  2 13  6]
 [16 14  2  9 34  6  3 16]
 [ 5 28  3 11  4 32  7 10]
 [ 1  9  0 10  5  1 68  6]
 [11  9  0 17 13  9  9 32]]


## Bag of visual words model and nearest neighbor classifier

Implement K-means cluster algorithm to compute visual word dictionary. The feature dimension of SIFT features is 128.

Use the included SIFT word descriptors included in `sift_desc.mat` to build bag of visual words as your image representation.

Use nearest neighbor classifier (kNN) to categorize the test images.

- Describe the number of visual words you use, K-means stopping criterion, and the categorization accuracy.
- Display the confusion matrix and categorization accuracy.

In [15]:
def get_sift():
    sift_path = Path('./data/sift_desc.mat')
    sift = loadmat(sift_path)
    train = sift['train_D'].flatten()
    test = sift['test_D'].flatten()
    train_sifts = [t.T for t in train]
    test_sifts = [t.T for t in test]
    return train_sifts,test_sifts

In [16]:
train_sifts,test_sifts = get_sift()

In [None]:
class K_Means:
    def __init__(self, n_clusters, max_iters=1000):
        self.n_clusters = n_clusters
        self.centroids = None
        self.labels = None
        self.max_iters = max_iters

    def fit(self, X):
        mins = X.min()
        maxs = X.max()
        self.centroids = np.zeros((self.n_clusters, X.shape[1]))
        for idx, col in enumerate(X.T):
            col_min, col_max = np.min(col), np.max(col)
            for k in range(self.n_clusters):
                self.centroids[k, idx] = np.random.uniform(col_min, col_max, size=(1))
        self.labels = np.random.random_integers(low=0, high=self.n_clusters - 1, size=(X.shape[0]))
        for i in range(self.max_iters):
            print(f"Iteration {i}")
            changed = 0
            distances = self.get_distances(X)
            groups = {key: [] for key in range(self.n_clusters)}
            for idx, feat in enumerate(X):
                closest = distances[idx].argmin()
                if self.labels[idx] != closest:
                    self.labels[idx] = closest
                    changed += 1
                groups[closest].append(feat)
            for group in sorted(groups):
                group_feats = groups[group]
                if len(group_feats):
                    self.centroids[group] = np.mean(groups[group], axis=0)
                else:
                    self.centroids[group] = np.random.uniform(mins, maxs, size=(X.shape[1]))
            if not changed:
                break

    def predict(self, x):
        dist = self.get_distances(x)
        preds = dist.argmin()
        return preds
    
    def get_distances(self, X):
        try:
            p_squared = np.square(X).sum(axis=1)
        except:
            p_squared = np.square(X)
        q_squared = np.square(self.centroids).sum(axis=1)
        product   = -2*X.dot(self.centroids.T)
        distances = np.sqrt(product+q_squared+np.matrix(p_squared).T)
        return distances

In [17]:
def vis_bow(train_sifts,test_sifts,n_clusters=100):
    stacked_train_sifts = np.vstack(train_sifts)
    stacked_test_sifts = np.vstack(test_sifts)
    kmeans = MiniBatchKMeans(n_clusters=n_clusters) # K_Means(n_clusters=n_clusters)
    kmeans.fit(stacked_train_sifts)
    train_clusters = [kmeans.predict(words) for words in train_sifts]
    X_train_bow = np.array([np.bincount(words, minlength=n_clusters) for words in train_clusters])
    test_clusters = [kmeans.predict(words) for words in test_sifts]
    X_test_bow = np.array([np.bincount(words, minlength=n_clusters) for words in test_clusters])
    return X_train_bow,X_test_bow

In [20]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=100)

In [21]:
model = KNeighborsClassifier(n_neighbors=15,n_jobs=-1)
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.635000
Test Accuracy: 0.559000
Confusion Matrix:
 [[60  1 18  0  6 12  2  1]
 [ 0 91  0  1  2  3  2  1]
 [19  2 51  3  7 10  4  4]
 [ 6  7  1 58  2  4 13  9]
 [ 7  6  3  1 60 15  6  2]
 [31  7  3  2 10 41  2  4]
 [ 2 10  1  9 11 13 50  4]
 [11  6  5 15 12  6  9 36]]


## Bag of visual words model and a discriminative classifier

Use the bag of visual word representation.

Replace the nearest neighbor classifier with SVM classifer. Use 1 vs. all SVM for training the multi-class classifier.

- Report the training time and testing time for SVM
- Display the confusion matrix and categorization accuracy.

In [22]:
start = time.clock()
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
end = time.clock()
print('training time:',round(end-start,3),'seconds')

start = time.clock()
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)
end = time.clock()
print('testing time:',round(end-start,3),'seconds')

training time: 0.767 seconds
Training Accuracy: 0.707000
Test Accuracy: 0.615000
Confusion Matrix:
 [[68  0  8  0 10 12  1  1]
 [ 0 84  0  0  7  3  4  2]
 [37  0 32  4  8  9  4  6]
 [ 2  4  1 65  1  7  6 14]
 [ 4  2  1  0 67 18  8  0]
 [21  3  1  2 12 57  1  3]
 [ 2  7  0  8 14  5 57  7]
 [ 0  1  4 13  7  7  6 62]]
testing time: 0.314 seconds


## CNN model and a discriminative classifier

Using pre-trained convolutional neural network as a feature extractor and a SVM for scene categorization. You can use the ConvNet library in MATLAB MatConvNet with one of the pre-trained classification models here.

Use 1 vs. all SVM classifier to categorize the test images

- Describe the model you used.
- Display the confusion matrix and categorization accuracy.

In [8]:
vgg = vgg16.VGG16(weights='imagenet', include_top=False,pooling='avg')
X_train_vgg16 = vgg.predict(vgg16.preprocess_input(X_train),verbose=1,batch_size=1)
X_test_vgg16 = vgg.predict(vgg16.preprocess_input(X_test),verbose=1,batch_size=1)



In [13]:
model = SVC(C=1,kernel='linear')
model.fit(X_train_vgg16,y_train)
get_scores(model,X_train_vgg16,y_train,X_test_vgg16,y_test)

Training Accuracy: 1.000000
Test Accuracy: 0.938000
Confusion Matrix:
 [[94  0  0  0  0  6  0  0]
 [ 0 95  0  0  2  3  0  0]
 [ 0  0 97  0  0  1  2  0]
 [ 0  0  1 88  0  0  6  5]
 [ 0  1  0  0 96  3  0  0]
 [ 8  2  0  0  2 88  0  0]
 [ 0  0  1  3  0  0 95  1]
 [ 0  0  0  3  0  0  0 97]]


## Graduate Points

up to 10 points: Use “soft assignment” to assign visual words to histogram bins. Each visual word will cast a distance-weighted vote to multiple bins.

In [None]:
def vis_bow_soft(train_sifts,test_sifts,n_components=100):
    stacked_train_sifts = np.vstack(train_sifts)
    stacked_test_sifts = np.vstack(test_sifts)
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(stacked_train_sifts)
    train_clusters = [gmm.predict_proba(words) for words in train_sifts] # word is now a vector...must change
    X_train_bow = np.array([np.bincount(words, minlength=n_clusters) for words in train_clusters])
    test_clusters = [gmm.predict_proba(words) for words in test_sifts]
    X_test_bow = np.array([np.bincount(words, minlength=n_clusters) for words in test_clusters])
    return X_train_bow,X_test_bow

In [None]:
X_train_soft,X_test_soft = vis_bow_soft(train_sifts,test_sifts,n_clusters=100)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_soft,y_train)
get_scores(model,X_train_soft,y_train,X_test_soft,y_test)

up to 10 points: Implement one of the advanced feature encoding, e.g. fisher vector encoding, super vector, or LLC. See The devil is in the details: an evaluation of recent feature encoding methods, BMVC 2011 for more details. Compare the results with that from (C). You can use built-in Gaussian mixture models in MATLAB if you want to implement the fisher vector encoding.

In [None]:
# too much work

up to 10 points: For bag of visual word models, experiment with different number of visual word, e.g. K = 25, 50, 100, 200, 400, 800, 1600. Report the categorization accuracy for each K.

In [18]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=25)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.583000
Test Accuracy: 0.505000
Confusion Matrix:
 [[58  1 14  1  7 14  1  4]
 [ 0 83  0  3  5  2  5  2]
 [40  0 25  7  7 18  2  1]
 [ 0  6  3 51  3  6  7 24]
 [ 6  2  2  1 47 25  7 10]
 [17  4  4  4 19 44  1  7]
 [ 0  8  1 17  9  6 44 15]
 [ 3  3  3  8 13 13  5 52]]


In [19]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=50)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.658000
Test Accuracy: 0.575000
Confusion Matrix:
 [[59  0 13  0  9 15  2  2]
 [ 0 86  0  1  7  2  4  0]
 [35  0 27  3  8 18  4  5]
 [ 1  6  1 63  1  5  7 16]
 [ 3  3  1  0 64 22  4  3]
 [16  1  7  1 17 51  3  4]
 [ 0  8  0 11 17  4 49 11]
 [ 3  1  2 12  8  6  7 61]]


In [20]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=100)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.700000
Test Accuracy: 0.610000
Confusion Matrix:
 [[65  0  7  0 10 13  1  4]
 [ 0 85  0  2  9  1  3  0]
 [33  0 35  2  7 12  8  3]
 [ 1  6  2 57  0  5  7 22]
 [ 4  2  1  0 66 18  6  3]
 [17  2  1  0 15 63  1  1]
 [ 1  5  0  7 14  7 55 11]
 [ 1  2  3 10 11  7  4 62]]


In [21]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=200)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.757000
Test Accuracy: 0.640000
Confusion Matrix:
 [[70  0  6  0  7 13  3  1]
 [ 0 85  0  0  7  3  4  1]
 [38  0 30  5 12 11  2  2]
 [ 1  2  0 69  3  4  7 14]
 [ 3  3  1  0 74 17  1  1]
 [17  4  3  1 16 58  1  0]
 [ 0  4  2  8 10  4 62 10]
 [ 0  1  3 14 11  2  5 64]]


In [29]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=400)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.798000
Test Accuracy: 0.651000
Confusion Matrix:
 [[71  0  4  0  7 16  1  1]
 [ 0 88  0  0  7  2  3  0]
 [38  0 30  3  7 17  4  1]
 [ 2  4  1 62  0  4  8 19]
 [ 4  3  1  0 75 15  2  0]
 [13  2  1  1 14 66  1  2]
 [ 2  3  0  7 12  5 61 10]
 [ 2  0  4  8  7  6  5 68]]


In [31]:
X_train_bow,X_test_bow = vis_bow(train_sifts,test_sifts,n_clusters=800)
model = SVC(C=.001,kernel='linear')
model.fit(X_train_bow,y_train)
get_scores(model,X_train_bow,y_train,X_test_bow,y_test)

Training Accuracy: 0.818000
Test Accuracy: 0.654000
Confusion Matrix:
 [[74  0  4  0  6 15  1  0]
 [ 0 84  0  0 10  4  2  0]
 [38  0 28  1  7 18  6  2]
 [ 2  1  0 68  1  5  8 15]
 [ 3  4  1  0 69 22  1  0]
 [18  1  0  0 12 68  1  0]
 [ 0  3  0  5 13  5 62 12]
 [ 6  0  2  9  5  5  3 70]]


up to 10 points: Try using two different pre-trained CNN models. Report the accuracy of each of the models.

In [34]:
resnet = resnet50.ResNet50(weights='imagenet',include_top=False,pooling='avg')
X_train_resnet50 = resnet.predict(resnet50.preprocess_input(X_train),verbose=1,batch_size=1)
X_test_resnet50 = resnet.predict(resnet50.preprocess_input(X_test),verbose=1,batch_size=1)

model = SVC(C=1,kernel='linear')
model.fit(X_train_resnet50,y_train)
get_scores(model,X_train_resnet50,y_train,X_test_resnet50,y_test)

Training Accuracy: 1.000000
Test Accuracy: 0.961000
Confusion Matrix:
 [[97  0  1  0  0  2  0  0]
 [ 0 97  0  0  1  2  0  0]
 [ 0  0 98  0  0  0  2  0]
 [ 0  0  0 92  0  0  5  3]
 [ 0  0  0  0 99  1  0  0]
 [ 4  2  0  0  2 92  0  0]
 [ 0  0  1  2  0  0 97  0]
 [ 0  0  0  3  0  0  0 97]]


In [36]:
inception = inception_v3.InceptionV3(weights='imagenet',include_top=False,pooling='avg')
X_train_inception = inception.predict(inception_v3.preprocess_input(X_train),verbose=1,batch_size=1)
X_test_inception = inception.predict(inception_v3.preprocess_input(X_test),verbose=1,batch_size=1)

model = SVC(C=1,kernel='linear')
model.fit(X_train_inception,y_train)
get_scores(model,X_train_inception,y_train,X_test_inception,y_test)

Training Accuracy: 1.000000
Test Accuracy: 0.930000
Confusion Matrix:
 [[95  0  0  0  0  5  0  0]
 [ 1 89  0  0  4  5  0  1]
 [ 1  0 93  0  0  0  4  2]
 [ 0  0  2 89  0  0  6  3]
 [ 0  0  0  0 97  3  0  0]
 [ 5  1  2  0  0 92  0  0]
 [ 0  0  1  4  0  0 93  2]
 [ 0  0  0  1  0  1  2 96]]


up to 10 points: For one specific CNN model (e.g., AlexNet or VGGNet), report the classification accuracy when you use different levels of feature activations, e.g., Pool4, Pool5, Fc6, Fc7.

In [5]:
X_train_resized = np.array([resize(image,(224,224),anti_aliasing=True) for image in X_train])
X_test_resized = np.array([resize(image,(224,224),anti_aliasing=True) for image in X_test])
vgg = vgg16.VGG16(weights='imagenet')

In [6]:
cnn = Model(inputs=vgg.input,outputs=vgg.get_layer('block3_pool').output)
X_train_cnn = cnn.predict(vgg16.preprocess_input(X_train_resized),verbose=1,batch_size=1).reshape(len(X_train),-1)
X_test_cnn = cnn.predict(vgg16.preprocess_input(X_test_resized),verbose=1,batch_size=1).reshape(len(X_test),-1)

model = SVC(C=1,kernel='linear')
model.fit(X_train_cnn,y_train)
get_scores(model,X_train_cnn,y_train,X_test_cnn,y_test)

Training Accuracy: 1.000000
Test Accuracy: 0.741000
Confusion Matrix:
 [[69  2  6  0  2 21  0  0]
 [ 0 87  0  1  8  4  0  0]
 [15  1 65  4  2 10  2  1]
 [ 4  0  0 69  1  2  8 16]
 [ 6  5  3  0 71  8  4  3]
 [18  2  3  0  7 69  1  0]
 [ 0  1  5  4  5  1 84  0]
 [ 3  0  1  7  5  1  4 79]]


In [7]:
cnn = Model(inputs=vgg.input,outputs=vgg.get_layer('block4_pool').output)
X_train_cnn = cnn.predict(vgg16.preprocess_input(X_train_resized),verbose=1,batch_size=1).reshape(len(X_train),-1)
X_test_cnn = cnn.predict(vgg16.preprocess_input(X_test_resized),verbose=1,batch_size=1).reshape(len(X_test),-1)

model = SVC(C=1,kernel='linear')
model.fit(X_train_cnn,y_train)
get_scores(model,X_train_cnn,y_train,X_test_cnn,y_test)

Training Accuracy: 1.000000
Test Accuracy: 0.710000
Confusion Matrix:
 [[69  1  4  1  2 23  0  0]
 [ 0 83  0  3  8  5  0  1]
 [15  2 68  6  3  6  0  0]
 [ 2  2  0 67  1  3 11 14]
 [ 7  5  3  1 60 16  4  4]
 [16  3  5  0  7 68  1  0]
 [ 0  1  4 10  5  1 77  2]
 [ 0  3  0 10  7  1  3 76]]


In [12]:
cnn = Model(inputs=vgg.input,outputs=vgg.get_layer('block5_pool').output)
X_train_cnn = cnn.predict(vgg16.preprocess_input(X_train_resized),verbose=1,batch_size=1).reshape(len(X_train),-1)
X_test_cnn = cnn.predict(vgg16.preprocess_input(X_test_resized),verbose=1,batch_size=1).reshape(len(X_test),-1)

model = SVC(C=1,kernel='linear')
model.fit(X_train_cnn,y_train)
get_scores(model,X_train_cnn,y_train,X_test_cnn,y_test)

Training Accuracy: 0.851000
Test Accuracy: 0.691000
Confusion Matrix:
 [[65  1  6  3  8 17  0  0]
 [ 1 83  0  3  8  5  0  0]
 [21  1 66  5  2  4  0  1]
 [ 8  2  5 61  0  2  8 14]
 [10  2  2  2 61 19  1  3]
 [17  2  2  1  5 71  1  1]
 [ 0  3  5  9  9  0 74  0]
 [ 3  4  1  7  7  4  2 72]]
