# Bag-of-words model - Feature extraction

<br>
**DISCLAIMER: THIS NOTEBOOK TAKES A LONG TIME TO RUN. IT IS BETTER TO USE THE GENERATED CODEBOOK AND TRANSFORMED FEATURES PRODUCED BY THIS MODEL, WHICH YOU CAN DOWNLOAD FROM THIS FOLDER: https://drive.google.com/drive/folders/0Bxk-xCNz8VClZEs5YVFoV3MyZEE?usp=sharing. THE DETAILS ARE DESCRIBED IN THIS NOTEBOOK.**

<br>
In this notebook, I walk through the BoW pipeline again and show how an image can be represented with it. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts over a vocabulary of local image features. The two most important steps are feature extraction using SIFT and codebook generation using k-means clustering.
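
As a minimal sketch of the idea (not the notebook's actual pipeline), the toy example below assigns a few made-up 2-D "descriptors" to the nearest of three made-up codewords and counts how often each codeword is used. The arrays `toy_codebook` and `toy_descriptors` are assumptions for illustration only; real SIFT descriptors are 128-dimensional.

```python
import numpy as np

# Hypothetical 3-word codebook and 5 descriptors (2-D for readability)
toy_codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_descriptors = np.array([
    [0.5, 0.2],   # closest to word 0
    [9.0, 1.0],   # closest to word 1
    [0.1, 9.5],   # closest to word 2
    [1.0, 8.0],   # closest to word 2
    [0.3, 0.1],   # closest to word 0
])

# Pairwise Euclidean distances, shape (n_descriptors, n_words)
diffs = toy_descriptors[:, None, :] - toy_codebook[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))

# Index of the nearest codeword for every descriptor
words = dists.argmin(axis=1)

# Bag-of-visual-words vector: occurrence count of each codeword
bow = np.bincount(words, minlength=len(toy_codebook))
print(words)  # [0 1 2 2 0]
print(bow)    # [2 1 2]
```

The resulting vector has one entry per codeword, regardless of how many descriptors the image produced.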

In [3]:
import argparse as ap
import numpy as np
import os
import cv2
import pandas as pd
from sklearn.svm import LinearSVC
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from scipy.cluster.vq import kmeans, vq
from sklearn.preprocessing import StandardScaler
import skimage
from os.path import isfile, join
from os import listdir
%matplotlib inline

## *1. Obtain training data*

In [4]:
# Get the training class names (one sub-directory per class) and store them in a list.
# Sorting makes the class ids reproducible; entries that are not directories are skipped.
train_path = "data/CifarTrain/"
training_names = sorted(
    name for name in os.listdir(train_path)
    if os.path.isdir(os.path.join(train_path, name))
)

# image_paths and the corresponding class id for each entry of image_paths
image_paths = []
image_classes = []
for class_id, training_name in enumerate(training_names):
    class_dir = os.path.join(train_path, training_name)
    # Only take the files directly inside the class directory (no recursion)
    class_path = [
        os.path.join(class_dir, fn)
        for fn in os.listdir(class_dir)
        if os.path.isfile(os.path.join(class_dir, fn))
    ]
    image_paths += class_path
    image_classes += [class_id] * len(class_path)
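
To see what the two parallel lists look like, here is a small sketch on a throwaway directory tree. The class names `cat` and `dog` and the file names are hypothetical, and numbering the classes from 0 in sorted order is an assumption of this sketch.

```python
import os
import tempfile

# Build a throwaway tree: <root>/cat/{a.png,b.png}, <root>/dog/{c.png}
root = tempfile.mkdtemp()
for class_name, files in {"cat": ["a.png", "b.png"], "dog": ["c.png"]}.items():
    os.makedirs(os.path.join(root, class_name))
    for fn in files:
        open(os.path.join(root, class_name, fn), "w").close()

toy_paths, toy_classes = [], []
# sorted() makes the class-id assignment independent of the OS listing order
for class_id, class_name in enumerate(sorted(os.listdir(root))):
    class_dir = os.path.join(root, class_name)
    files = sorted(os.listdir(class_dir))
    toy_paths += [os.path.join(class_dir, fn) for fn in files]
    toy_classes += [class_id] * len(files)

print([os.path.basename(p) for p in toy_paths])  # ['a.png', 'b.png', 'c.png']
print(toy_classes)                               # [0, 0, 1]
```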

## *2. Extract feature points and their descriptors*
In this model, I use SIFT as both detector and descriptor; it is one of the most popular methods in computer vision. There are many other detectors and descriptors, such as FAST, ORB, and SURF. SIFT is computationally expensive, but it performs well for feature extraction. In this step, I use SIFT to collect all the features in the training set as input for the next step.

In [None]:
# Create the SIFT keypoint detector / descriptor extractor.
# cv2.SIFT_create() is the constructor in OpenCV >= 4.4; OpenCV 2.4 used cv2.SIFT().
sift = cv2.SIFT_create()
# List where all the descriptors are stored
des_list = []
for i, image_path in enumerate(image_paths):
    # Print progress every 1000 images
    if i % 1000 == 0:
        print(i)
    # read image
    im = cv2.imread(image_path)

    # find the keypoints with SIFT
    kp = sift.detect(im, None)
    # compute the descriptors with SIFT
    kp, des = sift.compute(im, kp)

    des_list.append(des)

# Drop images for which SIFT found no keypoints (des is None or empty)
des_list_ = [x for x in des_list if x is not None and len(x) > 0]

In [16]:
descriptors = np.concatenate(des_list_,axis=0)

In [17]:
# Save the stacked descriptors to disk (the features/ directory must exist) and free the lists
np.save('features/train_feature_n',descriptors)
del des_list_,des_list

## *3. Perform k-means clustering to train the codebook*
Because the number of features is huge and many of them differ only slightly, we use clustering to group similar features together. The output of the clustering is a dictionary of centroids, called the codebook. The number of words in the codebook should ideally be high (around 10,000 words). Because of the long training time, I use only 100 words here.

In [5]:
descriptors = np.load('features/train_feature_n.npy')

In [6]:
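# Perform k-means clustering
# The third argument (iter=1) runs k-means once; a larger value repeats the run
# and keeps the codebook with the lowest distortion
k = 100
voc, variance = kmeans(descriptors, k, 1)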

In [7]:
np.save('codebook_100',voc)
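
To illustrate what the codebook contains, here is a small sketch on synthetic data using the same `scipy.cluster.vq` functions. The 4-D "descriptors" and the choice of initial centroids are assumptions made so that the toy run is deterministic; they are not part of the notebook's pipeline.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Two well-separated synthetic "descriptor" clusters in 4-D (stand-ins for 128-D SIFT)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 4))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(50, 4))
toy_descriptors = np.vstack([cluster_a, cluster_b])

# Passing an array as the second argument uses it as the initial centroids,
# which keeps this toy run deterministic
initial_guess = toy_descriptors[[0, 50]]
toy_codebook, distortion = kmeans(toy_descriptors, initial_guess)

# vq assigns every descriptor to its nearest codeword
words, distances = vq(toy_descriptors, toy_codebook)
print(toy_codebook.shape)  # (2, 4): one centroid per word
print(np.bincount(words))  # [50 50]: each cluster maps to its own word
```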

## *4. Represent training data with BoW*
After we have the codebook, we represent each image again. The new vector has a length equal to the number of words in the codebook, and each entry is the number of occurrences of that visual word in the image.

In [23]:
voc = np.load('codebook_100.npy')

In [24]:
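# Calculate the histogram of features
sift = cv2.SIFT_create()  # cv2.SIFT() in OpenCV 2.4
train_features = np.zeros((len(image_paths), k), "float32")
for i in range(len(image_paths)):
    im = cv2.imread(image_paths[i])
    kp = sift.detect(im, None)
    kp, des = sift.compute(im, kp)
    # des is None when no keypoints were found; the histogram row then stays all zero
    if des is not None:
        # Assign each descriptor to its nearest codeword
        words, distance = vq(des, voc)
        for w in words:
            train_features[i][w] += 1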



In [26]:
from imutils import paths

In [27]:
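test_path = "cifarTest/"
image_paths_test = list(paths.list_images(test_path))
# File stems are assumed to look like "<index><separator><label>":
# the last character is the class digit, everything before the separator is the index
x = [os.path.splitext(os.path.basename(p))[0] for p in image_paths_test]
testClass = [int(i[-1]) for i in x]
idxTest = np.argsort([int(i[:-2]) for i in x])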
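
The file-name parsing can be checked on a few made-up names. The `<index>_<label>.png` pattern is an assumption inferred from the slicing in the cell above, and the paths below are hypothetical.

```python
import os
import numpy as np

# Hypothetical file names following the "<index>_<label>.png" pattern
toy_test_paths = ["cifarTest/10_3.png", "cifarTest/2_7.png", "cifarTest/1_0.png"]

stems = [os.path.splitext(os.path.basename(p))[0] for p in toy_test_paths]
labels = np.array([int(s[-1]) for s in stems])     # last character is the class digit
indices = np.array([int(s[:-2]) for s in stems])   # drop separator and digit

order = np.argsort(indices)                        # numeric, not lexicographic, order
sorted_paths = np.array(toy_test_paths)[order]
sorted_labels = labels[order]                      # reorder labels together with paths

print(sorted_paths.tolist())   # ['cifarTest/1_0.png', 'cifarTest/2_7.png', 'cifarTest/10_3.png']
print(sorted_labels.tolist())  # [0, 7, 3]
```

Any array derived from the paths, labels included, has to be reordered with the same index so that rows and labels stay aligned.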

In [28]:
# Reorder the test images by their numeric index
image_paths_test = np.array(image_paths_test)[idxTest]
# Keep the labels aligned with the reordered paths
testClass = np.array(testClass)[idxTest]
# Calculate the histogram of features
test_features = np.zeros((len(image_paths_test), k), "float32")
for i in range(len(image_paths_test)):
    im = cv2.imread(image_paths_test[i])
    kp = sift.detect(im, None)
    kp, des = sift.compute(im, kp)
    # des is None when no keypoints were found; the histogram row then stays all zero
    if des is not None:
        words, distance = vq(des, voc)
        for w in words:
            test_features[i][w] += 1
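
The training and test loops do the same counting, so it could be factored into one function. The sketch below is one possible way to do that; `bow_histogram` is a hypothetical helper name, and the toy codebook and descriptors are made up.

```python
import numpy as np
from scipy.cluster.vq import vq

def bow_histogram(des, codebook):
    """Count how many descriptors fall on each codeword (hypothetical helper)."""
    if des is None or len(des) == 0:
        # Image without keypoints: all-zero histogram
        return np.zeros(len(codebook), dtype="float32")
    words, _ = vq(des, codebook)
    # bincount replaces the explicit "for w in words" loop
    return np.bincount(words, minlength=len(codebook)).astype("float32")

toy_codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_des = np.array([[0.1, 0.2], [9.0, 9.5], [10.2, 9.9]])
print(bow_histogram(toy_des, toy_codebook))  # [1. 2.]
print(bow_histogram(None, toy_codebook))     # [0. 0.]
```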



## *5. Apply tf-idf weighting for representation*
The idea is to strengthen useful, discriminative words and down-weight common ones. In tf-idf, term frequency (tf) captures the idea that a word appearing many times in a document is likely to be representative of that document. Inverse document frequency (idf) captures the idea that a word appearing in almost all documents is of little use for classifying them, and vice versa. There are many variants of tf and idf weighting. In this case, I implement 4 ways to weight tf:
  1. Raw: number of occurrences of the word in the document
  2. Frequency: raw count normalized by the total number of words in the document
  3. Log normalization: compresses large differences between counts
  4. Double normalization: normalized by the maximum value in the feature vector
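
Written out as formulas, with $f_{w,d}$ the number of times visual word $w$ occurs in image $d$ (the symbols are notation introduced here, not the notebook's), the four variants as implemented in this notebook's code are roughly:

$$
\begin{aligned}
\text{raw:}\quad & \mathrm{tf}(w,d) = f_{w,d} \\
\text{frequency:}\quad & \mathrm{tf}(w,d) = \frac{f_{w,d}}{\sum_{w'} f_{w',d}} \\
\text{log:}\quad & \mathrm{tf}(w,d) = \begin{cases} 1 + \log f_{w,d} & \text{if } f_{w,d} > 0 \\ 0 & \text{otherwise} \end{cases} \\
\text{double normalization:}\quad & \mathrm{tf}(w,d) = \frac{f_{w,d}}{\max_{w'} f_{w',d}}
\end{aligned}
$$

The code adds a small constant ($10^{-10}$) to the two denominators to avoid division by zero for images without keypoints. The double normalization here is the plain ratio to the maximum count; textbook definitions often use $0.5 + 0.5 \cdot f_{w,d} / \max_{w'} f_{w',d}$ instead.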

In [29]:
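def tf_transform(data, method=''):
    # Work on a float copy so the caller's count matrix is never modified in place
    # (the 'log' branch would otherwise overwrite the raw counts)
    data = np.array(data, dtype='float32')
    if method == 'frequency':
        # Normalize by the total number of words in each document (image)
        word_count_per_doc = np.sum(data, axis=1, keepdims=True) + 1e-10
        return data / word_count_per_doc
    if method == 'log':
        # 1 + log(count) for non-zero counts; zeros stay zero
        nonzero = data != 0
        data[nonzero] = 1 + np.log(data[nonzero])
        return data
    if method == 'doublenorm':
        # Normalize by the count of the most frequent word in each document
        max_count_per_doc = np.max(data, axis=1, keepdims=True) + 1e-10
        return data / max_count_per_doc
    # raw count approach
    return data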

In [30]:
# Compute the smoothed inverse document frequency (idf) of every visual word on the training set
nbr_occurences = np.sum((train_features > 0) * 1, axis = 0)
idf = np.array(np.log((1.0*len(image_paths)+1) / (1.0*nbr_occurences + 1)), 'float32')
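
A small worked example may help to see the effect of the weighting. The count matrix below is made up (3 images, 3 visual words); the idf has the same smoothed form as the cell above, combined with the "frequency" tf.

```python
import numpy as np

# Hypothetical count matrix: 3 images (rows) x 3 visual words (columns)
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [3, 0, 0]], dtype="float32")

n_docs = counts.shape[0]
# Number of images in which each word occurs at least once: [3, 1, 1]
doc_freq = np.sum(counts > 0, axis=0)
# Smoothed idf: log((N + 1) / (n_w + 1)) -> [0, log 2, log 2]
idf = np.log((n_docs + 1.0) / (doc_freq + 1.0))

# "frequency" tf: divide each row by its total count
tf = counts / counts.sum(axis=1, keepdims=True)
tfidf = idf * tf

print(np.round(tfidf, 3))
```

Word 0 occurs in every image, so its idf is 0 and it contributes nothing to any row; the third image, which contains only word 0, ends up with an all-zero vector.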

## *6. Save new vectors of feature for next step*

In [31]:
name_2 = '_100'
train_TfIdf = idf*tf_transform(train_features,'frequency')
np.save('features/BoW_train_frequency'+name_2,train_TfIdf)
train_TfIdf = idf*tf_transform(train_features,'doublenorm')
np.save('features/BoW_train_doublenorm'+name_2,train_TfIdf)
train_TfIdf = idf*tf_transform(train_features,'log')
np.save('features/BoW_train_log'+name_2,train_TfIdf)
train_TfIdf = idf*tf_transform(train_features,'')
np.save('features/BoW_train_raw'+name_2,train_TfIdf)
np.save('features/BoW_train _labels',image_classes)

In [32]:
test_TfIdf = idf*tf_transform(test_features,'frequency')
np.save('features/BoW_test_frequency' +name_2,test_TfIdf)
test_TfIdf = idf*tf_transform(test_features,'doublenorm')
np.save('features/BoW_test_doublenorm' +name_2,test_TfIdf)
test_TfIdf = idf*tf_transform(test_features,'log')
np.save('features/BoW_test_log' +name_2,test_TfIdf)
test_TfIdf = idf*tf_transform(test_features,'')
np.save('features/BoW_test_raw' +name_2,test_TfIdf)
np.save('features/Bow_test_labels',testClass)