# Classifier
[Implementation](./classifier.py)
[Configuration](./Classifier.ini)

The `Classifier` class learns from a set of feature vectors and predicts the class of a single feature vector. It also includes an `X_scaler`, which is fit automatically when the classifier is trained.
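
The idea of a scaler that is fit as part of training can be pictured with the sketch below. It is not the code in classifier.py: `ScaledClassifier`, its method bodies, and the synthetic data are assumptions made for illustration, built on sklearn's `StandardScaler` and `SVC`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class ScaledClassifier:
    """Hypothetical stand-in for Classifier: the scaler is fit inside train()."""

    def __init__(self, model=None):
        self.X_scaler = StandardScaler()
        self.model = model if model is not None else SVC(kernel='rbf', C=1.0)

    def train(self, X, y):
        # Fit the scaler on the training features only, then fit the model
        X_scaled = self.X_scaler.fit_transform(X)
        self.model.fit(X_scaled, y)

    def predict(self, X):
        # Reuse the statistics learned during training; never refit here
        return self.model.predict(self.X_scaler.transform(X))

    def accuracy(self, X, y):
        return float(np.mean(self.predict(X) == y))


# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
y_demo = np.concatenate([np.zeros(50), np.ones(50)])

demo_clf = ScaledClassifier()
demo_clf.train(X_demo, y_demo)
print('Accuracy: %.2f%%' % (demo_clf.accuracy(X_demo, y_demo) * 100))
```

Keeping the scaler inside the classifier means a pickled classifier carries its own scaling statistics, so prediction code does not need to handle scaling separately.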

## Preprocessing image data
First we need to load all the vehicle and non-vehicle image data and extract the feature vectors from them.

In [2]:
import numpy as np
from sklearn.utils import shuffle

from images import ImageLoader
from featureextractor import FeatureExtractor

img_load = ImageLoader()
feat_ext = FeatureExtractor()

print('Load vehicle images')
%time images_vehicle = img_load.get_all('train_vehicle')
print('Load non-vehicle images')
%time images_nonvehicle = img_load.get_all('train_non-vehicle')

label_vehicle = np.ones(images_vehicle.shape[0])
label_nonvehicle = np.zeros(images_nonvehicle.shape[0])

img_set = np.concatenate((images_vehicle, images_nonvehicle))
lbl_set = np.concatenate((label_vehicle, label_nonvehicle))

print('Convert to feature vector')
def features(img_set):
    feat_set = []
    for img in img_set:
        feat_set.append(feat_ext.feature_vector(img))
    feat_set = np.array(feat_set)
    return feat_set
%time feat_set = features(img_set)
feat_set = np.vstack(feat_set)

img_set, feat_set, lbl_set = shuffle(img_set, feat_set, lbl_set)

print('Vehicle images: %d' % label_vehicle.shape[0])
print('Non-Vehicle images: %d' % label_nonvehicle.shape[0])

Load vehicle images
Wall time: 3.44 s
Load non-vehicle images
Wall time: 3.42 s
Convert to feature vector
Wall time: 31.1 s
Vehicle images: 8792
Non-Vehicle images: 8968


## Support vector machine
The first classifier evaluated is a support vector machine (SVM). The standard [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) from sklearn is used in several configurations.

Parameters:
- kernel: Type of kernel to be used (usually linear or rbf)
- c: Regularization parameter `C` of the SVC; the strength of the regularization is inversely proportional to `C`

Configurations:
1. **kernel=rbf, c=1**
2. kernel=rbf, c=1.5
3. kernel=rbf, c=2
4. kernel=linear, c=1
5. kernel=linear, c=1.5
6. kernel=linear, c=2
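
How `Classifier.set_config` turns a configuration into an estimator is not shown in this notebook. A minimal sketch, assuming the `c` and `kernel` entries map directly onto the `C` and `kernel` arguments of `SVC` and that values arrive as strings (as in the config dictionaries used in the next cell); `build_svc` is a hypothetical helper:

```python
from sklearn.svm import SVC


def build_svc(config):
    """Hypothetical helper: build an SVC from a config dict with string values."""
    svm_cfg = config['SVM']
    # Values are strings, so C has to be converted to float
    return SVC(kernel=svm_cfg['kernel'], C=float(svm_cfg['c']))


example_cfg = {'classifier': {'type': 'SVM'}, 'SVM': {'c': '1.5', 'kernel': 'rbf'}}
svc = build_svc(example_cfg)
print(svc)
```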

Because training takes quite long, only a reduced data set is used for this evaluation. At the end of the notebook, the best-performing classifier is trained on the full data set, with 10% held out for testing.
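
To put the reduced set in perspective, the arithmetic below uses the image counts printed by the preprocessing cell and the 4000/400 split from the cell that follows. The variable `step` is introduced here for illustration.

```python
n_total = 8792 + 8968           # vehicle + non-vehicle images from the load step
num_train, num_test = 4000, 400

fr_train = num_train / n_total  # fraction of the data used for training
fr_test = num_test / n_total    # fraction of the data used for testing
step = 100.0 / num_test         # accuracy change per misclassified test sample

print('Total samples: %d' % n_total)
print('Train fraction: %.4f' % fr_train)
print('Test fraction: %.4f' % fr_test)
print('One test sample = %.2f accuracy points' % step)
```

With only 400 test samples, each misclassification moves the accuracy by 0.25 percentage points, so small differences between configurations should be read with caution.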

In [5]:
%load_ext autoreload
%autoreload 2

from sklearn.model_selection import train_test_split
import random
import time
import pickle

from classifier import Classifier

clf = Classifier()

# Subset of training and testset for debugging
REDUCED_SET = True
num_train = 4000
num_test = 400
fr_train = num_train / lbl_set.shape[0]
fr_test = num_test / lbl_set.shape[0]

config11 = {'classifier' : {'type' : 'SVM'}, 'SVM' : {'c' : '1', 'kernel' : 'rbf'}}
config12 = {'classifier' : {'type' : 'SVM'}, 'SVM' : {'c' : '1.5', 'kernel' : 'rbf'}}
config13 = {'classifier' : {'type' : 'SVM'}, 'SVM' : {'c' : '2', 'kernel' : 'rbf'}}
config14 = {'classifier' : {'type' : 'SVM'}, 'SVM' : {'c' : '1', 'kernel' : 'linear'}}
config15 = {'classifier' : {'type' : 'SVM'}, 'SVM' : {'c' : '1.5', 'kernel' : 'linear'}}
config16 = {'classifier' : {'type' : 'SVM'}, 'SVM' : {'c' : '2', 'kernel' : 'linear'}}
cfg_set = [config11, config12, config13, config14, config15, config16]

accuracy = []
time_train = []
for cfg_idx,cfg in enumerate(cfg_set):
    clf.set_config(cfg)
    
    if REDUCED_SET:
        X_train, X_test, y_train, y_test = train_test_split(feat_set, lbl_set, train_size=fr_train, test_size=fr_test)
    else:
        X_train, X_test, y_train, y_test = train_test_split(feat_set, lbl_set, test_size=0.1)

    t_start = time.time()
    clf.train(X_train, y_train)
    t_end = time.time()
    time_train.append(t_end-t_start)
    
    accuracy.append(clf.accuracy(X_test, y_test))

    print('Config Set %d' % (cfg_idx+1))
    print('Train time: %.1fs' % time_train[cfg_idx])
    print('Accuracy: %.2f%%' % (accuracy[cfg_idx]*100))
    
    pickle_file = './classifiers/clf_svm' + str(cfg_idx+1) + '.pkl'
    
    print('Write classifier to pickle file %s' % pickle_file)
    with open(pickle_file, 'wb') as fid:
        pickle.dump(clf, fid)   
    print('\n')

Config Set 1
Train time: 21.2s
Accuracy: 99.50%
Write classifier to pickle file ./classifiers/clf_svm1.pkl


Config Set 2
Train time: 20.3s
Accuracy: 98.50%
Write classifier to pickle file ./classifiers/clf_svm2.pkl


Config Set 3
Train time: 20.6s
Accuracy: 98.25%
Write classifier to pickle file ./classifiers/clf_svm3.pkl


Config Set 4
Train time: 9.7s
Accuracy: 97.00%
Write classifier to pickle file ./classifiers/clf_svm4.pkl


Config Set 5
Train time: 9.7s
Accuracy: 96.50%
Write classifier to pickle file ./classifiers/clf_svm5.pkl


Config Set 6
Train time: 9.6s
Accuracy: 97.50%
Write classifier to pickle file ./classifiers/clf_svm6.pkl




## Decision Tree
As a second classifier, a decision tree classifier ([DTC](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)) is used.

Parameters:
- criterion: Function used to measure the quality of a split, either gini or entropy
- min_samples_split: Minimum number of samples required to split an internal node
- min_samples_leaf: Minimum number of samples required at a leaf node

Configurations:
1. crit=gini, split=2, leaf=1
2. crit=gini, split=4, leaf=1
3. crit=gini, split=2, leaf=2
4. crit=entropy, split=2, leaf=1
5. crit=entropy, split=3, leaf=1
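
As with the SVM, the mapping from configuration to estimator is not shown in this notebook. A minimal sketch, assuming the three entries map directly onto the `DecisionTreeClassifier` arguments of the same name and that values arrive as strings; `build_tree` is a hypothetical helper:

```python
from sklearn.tree import DecisionTreeClassifier


def build_tree(config):
    """Hypothetical helper: build a decision tree from a config dict of strings."""
    dt_cfg = config['DT']
    return DecisionTreeClassifier(
        criterion=dt_cfg['criterion'],
        # Integer values are interpreted by sklearn as absolute sample counts
        min_samples_split=int(dt_cfg['min_samples_split']),
        min_samples_leaf=int(dt_cfg['min_samples_leaf']),
    )


example_cfg = {
    'classifier': {'type': 'DT'},
    'DT': {'criterion': 'gini', 'min_samples_split': '2', 'min_samples_leaf': '2'},
}
tree = build_tree(example_cfg)
print(tree)
```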

In [8]:
%load_ext autoreload
%autoreload 2

from sklearn.model_selection import train_test_split
import random
import time
import pickle

from classifier import Classifier

clf = Classifier()

# Subset of training and testset for debugging
REDUCED_SET = False
num_train = 1000
num_test = 200
fr_train = num_train / lbl_set.shape[0]
fr_test = num_test / lbl_set.shape[0]

config21 = {'classifier' : {'type' : 'DT'}, 'DT' : {'criterion' : 'gini', 'min_samples_split' : '2', 'min_samples_leaf' : '1'}}
config22 = {'classifier' : {'type' : 'DT'}, 'DT' : {'criterion' : 'gini', 'min_samples_split' : '4', 'min_samples_leaf' : '1'}}
config23 = {'classifier' : {'type' : 'DT'}, 'DT' : {'criterion' : 'gini', 'min_samples_split' : '2', 'min_samples_leaf' : '2'}}
config24 = {'classifier' : {'type' : 'DT'}, 'DT' : {'criterion' : 'entropy', 'min_samples_split' : '2', 'min_samples_leaf' : '1'}}
config25 = {'classifier' : {'type' : 'DT'}, 'DT' : {'criterion' : 'entropy', 'min_samples_split' : '3', 'min_samples_leaf' : '1'}}
cfg_set = [config21, config22, config23, config24, config25]

accuracy = []
time_train = []
for cfg_idx,cfg in enumerate(cfg_set):
    clf.set_config(cfg)
    
    if REDUCED_SET:
        X_train, X_test, y_train, y_test = train_test_split(feat_set, lbl_set, train_size=fr_train, test_size=fr_test)
    else:
        X_train, X_test, y_train, y_test = train_test_split(feat_set, lbl_set, test_size=0.1)
    
    t_start = time.time()
    clf.train(X_train, y_train)
    t_end = time.time()
    time_train.append(t_end-t_start)
    
    accuracy.append(clf.accuracy(X_test, y_test))

    print('Config Set %d' % (cfg_idx+1))
    print('Train time: %.1fs' % time_train[cfg_idx])
    print('Accuracy: %.2f%%' % (accuracy[cfg_idx]*100))
    
    pickle_file = './classifiers/clf_dt' + str(cfg_idx+1) + '.pkl'
    
    print('Write classifier to pickle file %s' % pickle_file)
    with open(pickle_file, 'wb') as fid:
        pickle.dump(clf, fid)   

    print('\n')

Config Set 1
Train time: 138.6s
Accuracy: 95.21%
Write classifier to pickle file ./classifiers/clf_dt1.pkl


Config Set 2
Train time: 127.4s
Accuracy: 95.38%
Write classifier to pickle file ./classifiers/clf_dt2.pkl


Config Set 3
Train time: 120.1s
Accuracy: 96.45%
Write classifier to pickle file ./classifiers/clf_dt3.pkl


Config Set 4
Train time: 55.6s
Accuracy: 95.95%
Write classifier to pickle file ./classifiers/clf_dt4.pkl


Config Set 5
Train time: 52.6s
Accuracy: 95.95%
Write classifier to pickle file ./classifiers/clf_dt5.pkl




## Save config
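The best results are achieved with the SVM classifiers: the rbf configurations reach 98.25% to 99.50% accuracy and the linear configurations 96.50% to 97.50%. The decision tree classifiers reach 95.21% to 96.45%. Note that the SVM configurations were evaluated on the reduced set (400 test samples), whereas the decision trees were trained and tested on the full set, so the two groups of results are not strictly comparable. **Config set 1 of the SVC** is saved to the .ini file, and all classifiers are additionally dumped as pickle files for later experimentation. A trained classifier can then be loaded from its pickle file without retraining.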
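
Loading a dumped classifier back could look like the sketch below. The helper names `save_classifier` and `load_classifier` are hypothetical, and a plain dictionary stands in for the trained `Classifier` so that the example runs on its own.

```python
import os
import pickle
import tempfile


def save_classifier(clf, path):
    """Hypothetical helper: dump a trained classifier to a pickle file."""
    with open(path, 'wb') as fid:
        pickle.dump(clf, fid)


def load_classifier(path):
    """Hypothetical helper: restore a classifier without retraining it."""
    with open(path, 'rb') as fid:
        return pickle.load(fid)


# A dict stands in for the trained Classifier instance in this sketch
stand_in = {'type': 'SVM', 'c': 1.0, 'kernel': 'rbf'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifier.pkl')
    save_classifier(stand_in, path)
    restored = load_classifier(path)

print(restored)
```

When a real `Classifier` is unpickled, the `classifier` module has to be importable in the loading process, and pickle files should only be loaded from trusted sources.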

Furthermore, it might be beneficial to combine two or more classifiers for more robust detection.
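
One possible way to combine classifiers is majority voting. The sketch below is not part of this project: it uses sklearn's `VotingClassifier` directly instead of the `Classifier` class, the choice of three estimators is an assumption made to avoid tied votes, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Three estimators with hard (majority) voting, so a vote cannot tie
ensemble = VotingClassifier(
    estimators=[
        ('svm_rbf', make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))),
        ('svm_linear', make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))),
        ('dt', DecisionTreeClassifier(criterion='gini', min_samples_leaf=2)),
    ],
    voting='hard',
)

# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
y_demo = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

ensemble.fit(X_demo, y_demo)
ensemble_accuracy = ensemble.score(X_demo, y_demo)
print('Ensemble accuracy: %.2f%%' % (ensemble_accuracy * 100))
```

Whether such an ensemble actually improves detection on the vehicle data would have to be measured; training time and prediction time both grow with the number of estimators.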

In [7]:
%load_ext autoreload
%autoreload 2
%load_ext line_profiler

import pickle
from sklearn.model_selection import train_test_split

from classifier import Classifier

clf = Classifier()

config = config11
clf.set_config(config)

X_train, X_test, y_train, y_test = train_test_split(feat_set, lbl_set, test_size=0.1)
clf.train(X_train, y_train)
print('Accuracy: %.2f%%' % (clf.accuracy(X_test, y_test)*100))

pickle_file = './classifier.pkl'
print('Writing to %s' % pickle_file)
clf.write_config()

with open(pickle_file, 'wb') as fid:
    pickle.dump(clf, fid)      

Accuracy: 99.49%
Writing to ./classifier.pkl
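
The nested dictionaries with string values used for the configurations resemble what Python's `configparser` works with. How `write_config` is implemented is not shown here, so the following is only a sketch under the assumption that a `configparser`-style file is written; the actual layout of Classifier.ini may differ.

```python
import configparser
import io

config = {'classifier': {'type': 'SVM'}, 'SVM': {'c': '1', 'kernel': 'rbf'}}

# Write the nested dict as INI text (a StringIO stands in for the .ini file)
parser = configparser.ConfigParser()
parser.read_dict(config)
buffer = io.StringIO()
parser.write(buffer)
ini_text = buffer.getvalue()
print(ini_text)

# Read the INI text back into a nested dict of strings
restored = configparser.ConfigParser()
restored.read_string(ini_text)
restored_cfg = {section: dict(restored[section]) for section in restored.sections()}
```

All values come back as strings, which is why numeric parameters such as `c` need an explicit conversion before they are passed to sklearn.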
