# Image Classifier for Satellite Images Using Non-Deep-Learning Methods

This notebook demonstrates how to build an image classifier using non-deep-learning methods.

More specifically, it will cover the following topics in order:

## Agenda
####    1. Data Preparation: converting image data into vector form
####    2. Vanilla Random Forest as a benchmark
####    3. Data Augmentation to deal with class imbalance
####    4. Random Forest with augmented data
####    5. XGBoost (with vanilla data?)
####    6. Performance comparison with CNN using pretrained InceptionV3
####    7. Conclusions

## 1. Data Preparation: converting image data into vector form

In [2]:
import cv2
import os
from custom_dset_new import train_val_test_split
import numpy as np
from tqdm import tqdm

In [3]:
data_dir = '/Users/ruoyangzhang/Documents/PythonWorkingDirectory/Assignment_data/images'

In [4]:
train_data, val_data, test_data = train_val_test_split(data_dir, train_split=0.8, val_split=0.2, test_split = 0)

In [5]:
ordered_train_dirs = [dir for dir in sorted(list(train_data.keys())) if os.path.split(dir)[-1] != '.DS_Store']

In [6]:
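def convert_to_vector(img_dir):
    """Read a 28x28 colour image and return it as a (1, 2352) row vector in RGB order."""
    img = cv2.imread(img_dir)  # OpenCV loads images in BGR channel order
    rgb_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # reorder channels to RGB
    return rgb_img.reshape(1, 28 * 28 * 3)  # flatten to a single row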

In [7]:
input_images = np.array([convert_to_vector(dir) for dir in tqdm(ordered_train_dirs)])

100%|██████████| 259200/259200 [02:41<00:00, 1602.37it/s]


In [8]:
input_images.shape = (input_images.shape[0], input_images.shape[2])

In [9]:
labels = np.array([train_data[dir] for dir in ordered_train_dirs])

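Next, we convert the held-out data into NumPy arrays in the same way. Since `test_split` is 0, the validation split serves as our test set.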

In [10]:
ordered_test_dirs = [dir for dir in sorted(list(val_data.keys())) if os.path.split(dir)[-1] != '.DS_Store']

In [11]:
test_images = np.array([convert_to_vector(dir) for dir in tqdm(ordered_test_dirs)])

100%|██████████| 64800/64800 [00:36<00:00, 1768.60it/s]


In [12]:
test_images.shape = (test_images.shape[0], test_images.shape[2])

In [13]:
test_labels = np.array([val_data[dir] for dir in ordered_test_dirs])
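
To make the shapes in this step explicit, here is a minimal NumPy-only sketch of the same conversion on synthetic data. The helper name `flatten_bgr_image` and the random stand-in images are illustrative assumptions; they are not part of `custom_dset_new` or of the dataset.

```python
import numpy as np

def flatten_bgr_image(img_bgr):
    """Reorder a (H, W, 3) BGR image to RGB and flatten it to shape (1, H*W*3)."""
    rgb = img_bgr[..., ::-1]   # reverse the channel axis: BGR -> RGB
    return rgb.reshape(1, -1)  # row-major flatten, one row per image

# Synthetic stand-ins for two 28x28 colour images (pixel values are arbitrary).
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8) for _ in range(2)]

# Stack the (1, 2352) rows into a single (n_images, 2352) feature matrix.
features = np.concatenate([flatten_bgr_image(img) for img in fake_images], axis=0)
print(features.shape)  # (2, 2352)
```

Concatenating along axis 0 yields the 2-D matrix directly, so the separate reshape of the stacked 3-D array is not needed in this variant.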

## 2. Vanilla Random Forest as a benchmark

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from collections import Counter

In [15]:
clf = RandomForestClassifier(n_estimators=100, n_jobs = 5)

In [16]:
clf.fit(input_images,labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=5,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [17]:
preds = clf.predict(test_images)

In [18]:
print("Accuracy:",metrics.accuracy_score(test_labels, preds))

Accuracy: 0.9661574074074074


In [19]:
metrics.confusion_matrix(test_labels, preds)

array([[24130,     0,     0,     0,     0,     0],
       [   13, 11355,     1,     4,     0,    69],
       [   94,    30,  1310,    35,   138,    49],
       [    1,    56,     0, 14065,    21,   503],
       [   18,     0,    50,     2,  2856,     0],
       [   21,   625,     1,   461,     1,  8891]])

In [20]:
bookmark = {0: 'precision', 1: 'recall   ', 2: 'fscore   ', 3: 'support'}
class_dict = {0: 'water', 1: 'trees', 2: 'road', 3: 'barren_land', 4: 'building', 5: 'grassland'}
# Note: despite the variable names, the counts come from val_data, i.e. the split we evaluate on
train_class_count = Counter(val_data.values())
train_class_balance = {k:round(v/sum(train_class_count.values()),4) for k,v in train_class_count.items()}

print('{}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}'\
              .format('balance  ',
                      class_dict[0], train_class_balance[0],
                      class_dict[1], train_class_balance[1],
                      class_dict[2], train_class_balance[2],
                      class_dict[3], train_class_balance[3],
                      class_dict[4], train_class_balance[4],
                      class_dict[5], train_class_balance[5]))

print('---------------')

for i, scores in enumerate(metrics.precision_recall_fscore_support(test_labels, preds)):
    if i < 3:
        print('{}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}, {}: {:.4f}'\
              .format(bookmark[i],
                      class_dict[0], scores[0],
                      class_dict[1], scores[1],
                      class_dict[2], scores[2],
                      class_dict[3], scores[3],
                      class_dict[4], scores[4],
                      class_dict[5], scores[5]))
        print('---------------')

balance  , water: 0.3724, trees: 0.1766, road: 0.0256, barren_land: 0.2260, building: 0.0452, grassland: 0.1543
---------------
precision, water: 0.9939, trees: 0.9411, road: 0.9618, barren_land: 0.9655, building: 0.9469, grassland: 0.9347
---------------
recall   , water: 1.0000, trees: 0.9924, road: 0.7911, barren_land: 0.9603, building: 0.9761, grassland: 0.8891
---------------
fscore   , water: 0.9970, trees: 0.9661, road: 0.8681, barren_land: 0.9629, building: 0.9613, grassland: 0.9113
---------------


The results suggest that class imbalance is costing us performance. We note the following observations:

    1. the class 'road' is heavily underrepresented (2.56% of the samples above), which likely explains its low recall (0.7911) and the lowest fscore of all classes (0.8681)
    
    2. curiously, the class 'building', although also underrepresented (4.52%), achieves acceptable performance (fscore: 0.9613), possibly due to its visual distinctiveness
    
    3. in contrast, the class 'grassland', despite a fairly large share of the samples (15.43%), has a recall (0.8891) and fscore (0.9113) below the overall accuracy, which suggests that it is harder to distinguish from other classes, especially 'trees' and 'barren_land'
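
The per-class scores discussed above can be recomputed directly from the confusion matrix, which also shows where the errors go. The sketch below copies the matrix from the output of cell 19 (rows are true classes, columns are predicted classes, as in scikit-learn) and uses only NumPy.

```python
import numpy as np

class_names = ['water', 'trees', 'road', 'barren_land', 'building', 'grassland']
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [24130,     0,    0,     0,    0,    0],
    [   13, 11355,    1,     4,    0,   69],
    [   94,    30, 1310,    35,  138,   49],
    [    1,    56,    0, 14065,   21,  503],
    [   18,     0,   50,     2, 2856,    0],
    [   21,   625,    1,   461,    1, 8891],
])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)     # correct predictions / all samples of the true class
precision = true_pos / cm.sum(axis=0)  # correct predictions / all predictions of that class
fscore = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(class_names, precision, recall, fscore):
    print(f'{name:12s} precision={p:.4f} recall={r:.4f} fscore={f:.4f}')
print(f'accuracy={accuracy:.4f}')
```

Reading the rows: of the 1,656 'road' samples, 138 are predicted as 'building' and 94 as 'water'; of the 10,000 'grassland' samples, 625 are predicted as 'trees' and 461 as 'barren_land'.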

### Going forward:

The vanilla Random Forest reaches a respectable 96.6% accuracy, with acceptable though not excellent class-wise performance, and it does so with minimal engineering.

Going forward, we keep its performance as our baseline benchmark.

We aim to improve the model's predictions by artificially rebalancing the classes, using data augmentation on the 2 worst-performing classes:

    1. road (class 2)
    2. grassland (class 5)

## 3. Data Augmentation to deal with class imbalance

We have written data augmentation functions (image_transformation.py) which provide the following image transformations:

    1. random rotation between -25 and 25 degrees
    2. random rotation between 26 and 75 degrees
    3. random rotation either -90 or 90 degrees
    4. adding random noise to the data
    5. horizontal flip
    6. vertical flip
    7. transpose
    8. zoom (maximum 1.4x)
    
With these 8 options, we can add up to 8 augmented copies of each image without resorting to composite transformations.
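
To illustrate what such transformations do to a 28x28x3 array, here is a minimal NumPy-only sketch of five of them. The function names, their bodies, and the noise level `sigma` are assumptions made for illustration; the actual implementations in our augmentation module may differ, and the arbitrary-angle rotations and zoom are left out because they need interpolation.

```python
import numpy as np

def flip_horizontal(img):
    """Mirror the image left to right."""
    return img[:, ::-1, :]

def flip_vertical(img):
    """Mirror the image top to bottom."""
    return img[::-1, :, :]

def transpose_image(img):
    """Swap the height and width axes, keeping the channel axis last."""
    return img.transpose(1, 0, 2)

def rotate_quarter_turn(img, rng):
    """Rotate by 90 degrees, counter-clockwise (k=1) or clockwise (k=3)."""
    k = int(rng.choice([1, 3]))
    return np.rot90(img, k=k, axes=(0, 1))

def add_gaussian_noise(img, rng, sigma=10.0):
    """Add zero-mean Gaussian noise and clip back to the valid uint8 range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8)  # synthetic stand-in
augmented = [
    flip_horizontal(image), flip_vertical(image), transpose_image(image),
    rotate_quarter_turn(image, rng), add_gaussian_noise(image, rng),
]
print([a.shape for a in augmented])
```

Because the images are square, every transformation here preserves the (28, 28, 3) shape, so the augmented images can go through the same flattening step as the originals.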
    
We will try 2 strategies:

    1. Only augmenting the aforementioned 2 classes
        a. increasing the volume of the class 'road' 8-fold
        b. increasing the volume of the class 'grassland' 2-fold
    2. Balancing all classes in the training set
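
For the second strategy, a rough way to size the augmentation is to ask how many extra copies per image each class needs to approach the largest class. The sketch below is a hypothetical helper, not the method used in the notebook. It uses the validation-split counts (the row sums of the confusion matrix above) as a stand-in for the training counts, which are not printed here, and it assumes that each of the 8 transformations is applied at most once per image.

```python
# Per-class counts of the validation split (row sums of the confusion matrix above),
# used only as a stand-in for the training-set counts.
class_counts = {'water': 24130, 'trees': 11442, 'road': 1656,
                'barren_land': 14646, 'building': 2926, 'grassland': 10000}
MAX_EXTRA_COPIES = 8  # assumption: one copy per transformation, no composites

def augmentation_plan(counts, max_extra_copies):
    """Return, per class, how many extra copies per image bring it close to the largest class."""
    target = max(counts.values())
    plan = {}
    for name, n in counts.items():
        wanted = target // n - 1  # floor, so a class never overshoots the target
        plan[name] = min(wanted, max_extra_copies)
    return plan

plan = augmentation_plan(class_counts, MAX_EXTRA_COPIES)
for name, extra in plan.items():
    new_total = class_counts[name] * (1 + extra)
    print(f'{name:12s} extra copies per image: {extra} -> {new_total} images')
```

Under these assumptions 'road' hits the cap of 8 extra copies and still ends up well below 'water', so fully balancing it would require composite transformations or some other resampling scheme.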

In [271]:
from image_augmentation import *

In [272]:
test_dirs = [list(train_data.keys())[0], list(train_data.keys())[2]]

In [273]:
test_res = image_augmentation(test_dirs, 2)

100%|██████████| 2/2 [00:00<00:00, 444.74it/s]

the functions to be used for augmentation are: 
1 random_rotation_75
2 random_rotation_90





In [135]:
fun_list = function_list = [random_rotation_25, random_rotation_75, random_rotation_90, random_noise, horizontal_flip, vertical_flip, transpose, zoom]

In [138]:
for fun in fun_list:
    print(fun.__name__)

random_rotation_25
random_rotation_75
random_rotation_90
random_noise
horizontal_flip
vertical_flip
transpose
zoom


In [167]:
random.choice([90,2])

90

In [227]:
li = []

In [245]:
test_arrays = [np.array([[1,2,3]]), np.array([[1,2,4]]), np.array([[1,5,4]])]

In [246]:
for arra in test_arrays:
    li.append(arra)

In [247]:
np.concatenate(test_arrays, axis = 0)

array([[1, 2, 3],
       [1, 2, 4],
       [1, 5, 4]])

In [248]:
np.concatenate(li, axis = 0)

array([[1, 2, 3],
       [1, 2, 4],
       [1, 2, 3],
       [1, 2, 4],
       [1, 5, 4]])

In [249]:
print(li[0], li[0].shape)

[[1 2 3]] (1, 3)


In [252]:
np.concatenate(li, axis = 0).shape

(5, 3)

In [258]:
len(np.concatenate(li, axis = 0))

5

## 5. XGBoost

In [161]:
import xgboost as xgb

In [162]:
data_dmatrix = xgb.DMatrix(data = input_images, label = labels)

In [None]:
xg_clf = xgb.XGBClassifier(objective='multi:softmax', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 100, alpha = 10, n_estimators=2000)

In [None]:
xg_clf.fit(input_images,labels)

preds = xg_clf.predict(test_images)