# Vehicle silhouettes

## Objective
To classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

## Description

The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale-independent features, utilising both classical moment-based measures such as scaled variance, skewness and kurtosis about the major/minor axes and heuristic measures such as hollows, circularity, rectangularity and compactness.

Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
	
## Source
https://www.kaggle.com/rajansharma780/vehicle

## Attributes

| # | Attribute | Type | Definition |
|---|-----------|------|------------|
| 1 | compactness | float | `average perimeter**2/area` |
| 2 | circularity | float | `average radius**2/area` |
| 3 | distance_circularity | float | `area/(av.distance from border)**2` |
| 4 | radius_ratio | float | `(max.rad-min.rad)/av.radius` |
| 5 | pr_axis_aspect_ratio | float | `(minor axis)/(major axis)` |
| 6 | max_length_aspect_ratio | float | `(length perp. max length)/(max length)` |
| 7 | scatter_ratio | float | `(inertia about minor axis)/(inertia about major axis)` |
| 8 | elongatedness | float | `area/(shrink width)**2` |
| 9 | pr_axis_rectangularity | float | `area/(pr.axis length*pr.axis width)` |
| 10 | max_length_rectangularity | float | `area/(max.length*length perp. to this)` |
| 11 | scaled_variance_major_axis | float | `(2nd order moment about minor axis)/area` |
| 12 | scaled_variance_minor_axis | float | `(2nd order moment about major axis)/area` |
| 13 | scaled_radius_gyration | float | `(mavar+mivar)/area` |
| 14 | skewness_major_axis | float | `(3rd order moment about major axis)/sigma_min**3` |
| 15 | skewness_minor_axis | float | `(3rd order moment about minor axis)/sigma_maj**3` |
| 16 | kurtosis_minor_axis | float | `(4th order moment about major axis)/sigma_min**4` |
| 17 | kurtosis_major_axis | float | `(4th order moment about minor axis)/sigma_maj**4` |
| 18 | hollows_ratio | float | `(area of hollows)/(area of bounding polygon)` |

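## Target variable
19.	vehicle_class	string	Target class. Values: Opel, Saab, Bus, Van

Note: the CSV loaded below contains only three labels in its `class` column (bus, car, van), so the notebook works with three classes.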

# Tasks:
1.	Obtain the multi-class dataset from the given link
2.	Load the dataset
3.	Apply pre-processing techniques: Encoding, Scaling
4.	Divide the dataset into training (70%) and testing (30%)
5.	Build your own random forest model from scratch (using individual decision tree models from sklearn)
6.	Train the random forest model
7.	Test the random forest model
8.	Train and test the random forest model using sklearn.
9.	Compare the performance of both the models

## Useful links:
https://machinelearningmastery.com/implement-random-forest-scratch-python/

https://towardsdatascience.com/random-forests-and-decision-trees-from-scratch-in-python-3e4fa5ae4249

https://www.analyticsvidhya.com/blog/2018/12/building-a-random-forest-from-scratch-understanding-real-world-data-products-ml-for-programmers-part-3/

# Part 1: Random Forest from scratch

Random forests are an ensemble learning method for classification and regression that operate by constructing multiple decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
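
The two aggregation rules named above can be sketched in a few lines of NumPy. This is only an illustration: the arrays `tree_class_votes` and `tree_regression_outputs` are made-up per-tree outputs, and `majority_vote` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical per-tree outputs: each row is one tree, each column one sample.
tree_class_votes = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [2, 1, 2, 0],
])
tree_regression_outputs = np.array([
    [1.0, 2.0],
    [3.0, 2.0],
    [2.0, 5.0],
])

def majority_vote(votes):
    """Return the most frequent class label in each column (sample)."""
    n_classes = votes.max() + 1
    # counts[c, j] = number of trees that voted class c for sample j
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

# Classification: mode of the tree votes per sample.
class_prediction = majority_vote(tree_class_votes)
# Regression: mean of the tree outputs per sample.
regression_prediction = tree_regression_outputs.mean(axis=0)

print(class_prediction)       # [0 1 2 1]
print(regression_prediction)  # [2. 3.]
```

Ties in `majority_vote` go to the lowest class index, which is one possible convention among several.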

In [71]:
# Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

from sklearn.tree import DecisionTreeClassifier
from collections import Counter

In [2]:
# Load the dataset 
data = pd.read_csv("dataset/vehicle.csv")
data.head()

Unnamed: 0,compactness,circularity,distance_circularity,radius_ratio,pr.axis_aspect_ratio,max.length_aspect_ratio,scatter_ratio,elongatedness,pr.axis_rectangularity,max.length_rectangularity,scaled_variance,scaled_variance.1,scaled_radius_of_gyration,scaled_radius_of_gyration.1,skewness_about,skewness_about.1,skewness_about.2,hollows_ratio,class
0,95,48.0,83.0,178.0,72.0,10,162.0,42.0,20.0,159,176.0,379.0,184.0,70.0,6.0,16.0,187.0,197,van
1,91,41.0,84.0,141.0,57.0,9,149.0,45.0,19.0,143,170.0,330.0,158.0,72.0,9.0,14.0,189.0,199,van
2,104,50.0,106.0,209.0,66.0,10,207.0,32.0,23.0,158,223.0,635.0,220.0,73.0,14.0,9.0,188.0,196,car
3,93,41.0,82.0,159.0,63.0,9,144.0,46.0,19.0,143,160.0,309.0,127.0,63.0,6.0,10.0,199.0,207,van
4,85,44.0,70.0,205.0,103.0,52,149.0,45.0,19.0,144,241.0,325.0,188.0,127.0,9.0,11.0,180.0,183,bus


In [4]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)
data.shape

(846, 19)

In [7]:
data['class'].unique()
le = LabelEncoder()

data['class'] = le.fit_transform(data['class'])

In [8]:
data['class'].unique()

array([2, 1, 0])
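
The integer codes are easier to read with the label mapping next to them. A small sketch, using only the five labels visible in the `data.head()` output rather than the full CSV; the names `labels`, `encoder` and `mapping` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# The class labels of the first five rows shown by data.head()
labels = ["van", "van", "car", "van", "bus"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'bus': 0, 'car': 1, 'van': 2}

# inverse_transform turns predicted codes back into readable labels.
decoded = encoder.inverse_transform(codes)
```

Because `LabelEncoder` sorts the labels, `unique()` returning `[2, 1, 0]` simply reflects the order in which van, car and bus first appear in the data.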

In [11]:
cols = list(data.columns)
cols.remove('class')
print(cols)
print(len(cols))

['compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio', 'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'skewness_about.2', 'hollows_ratio']
18


In [12]:
scaler = MinMaxScaler()
data[cols] = scaler.fit_transform(data[cols])

In [13]:
data.head()

Unnamed: 0,compactness,circularity,distance_circularity,radius_ratio,pr.axis_aspect_ratio,max.length_aspect_ratio,scatter_ratio,elongatedness,pr.axis_rectangularity,max.length_rectangularity,scaled_variance,scaled_variance.1,scaled_radius_of_gyration,scaled_radius_of_gyration.1,skewness_about,skewness_about.1,skewness_about.2,hollows_ratio,class
0,0.478261,0.576923,0.597222,0.323144,0.274725,0.150943,0.326797,0.457143,0.25,0.585714,0.242105,0.233813,0.471698,0.144737,0.272727,0.390244,0.366667,0.533333,2
1,0.391304,0.307692,0.611111,0.161572,0.10989,0.132075,0.24183,0.542857,0.166667,0.357143,0.210526,0.17506,0.308176,0.171053,0.409091,0.341463,0.433333,0.6,2
2,0.673913,0.653846,0.916667,0.458515,0.208791,0.150943,0.620915,0.171429,0.5,0.571429,0.489474,0.540767,0.698113,0.184211,0.636364,0.219512,0.4,0.5,1
3,0.434783,0.307692,0.583333,0.240175,0.175824,0.132075,0.20915,0.571429,0.166667,0.357143,0.157895,0.14988,0.113208,0.052632,0.272727,0.243902,0.766667,0.866667,2
4,0.26087,0.423077,0.416667,0.441048,0.615385,0.943396,0.24183,0.542857,0.166667,0.371429,0.584211,0.169065,0.496855,0.894737,0.409091,0.268293,0.133333,0.066667,0
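
The scaler above is fitted on all 846 rows, so the minimum and maximum of the future test rows influence the scaled training data. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic stand-in data (`X_demo`, `y_demo` are not the vehicle dataset), so it illustrates the pattern rather than reproducing this notebook's numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 250, size=(100, 3))  # synthetic stand-in features
y_demo = rng.integers(0, 3, size=100)         # synthetic stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler_demo = MinMaxScaler()
# Learn the per-column min/max from the training rows only ...
X_tr_scaled = scaler_demo.fit_transform(X_tr)
# ... and reuse them for the test rows (values may fall slightly outside [0, 1]).
X_te_scaled = scaler_demo.transform(X_te)
```

For tree-based models the effect of scaling is usually small, since splits depend on feature ordering rather than magnitude, but the fit-on-train habit carries over to models where it matters.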


In [28]:
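# Check for null values, then fill them with the column mean
data.isna().sum()

# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply to `data`
data[cols] = data[cols].fillna(data[cols].mean())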

In [29]:
data.isna().sum()

compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
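
Mean imputation of several columns can be written as a single vectorised call. A toy illustration on a made-up two-column frame (`toy` is not part of the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 8.0],
})

# toy.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaNs are skipped when averaging).
filled = toy.fillna(toy.mean())
print(filled)
```

Here column `a` receives 2.0 and column `b` receives 6.0 in place of the missing entries.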

In [30]:
X = data.drop(columns=['class'])
y = data['class']

In [31]:
print("X shape : ", X.shape)
print("y shape : ", y.shape)

X shape :  (846, 18)
y shape :  (846,)


In [32]:
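# Divide the dataset into training and testing sets
# Note: test_size = 0.2 gives an 80/20 split (676/170 rows), not the 70/30 split listed in the tasks
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True)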

In [33]:
print("X train shape : ", X_train.shape)
print("y train shape : ", y_train.shape)
print("X test shape : ", X_test.shape)
print("y test shape : ", y_test.shape)

X train shape :  (676, 18)
y train shape :  (676,)
X test shape :  (170, 18)
y test shape :  (170,)
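
The split above is neither seeded nor stratified, so the rows in each set change on every run. A hedged sketch of a reproducible, class-balanced 70/30 split, shown on toy arrays (`X_toy`, `y_toy`) rather than the vehicle data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with uneven class sizes: 20, 50 and 30 samples.
y_toy = np.array([0] * 20 + [1] * 50 + [2] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy,
    test_size=0.3,      # 70/30 split
    stratify=y_toy,     # keep the class proportions in both parts
    random_state=42,    # same rows on every run
)

print(np.bincount(y_a))  # class counts in the training part
print(np.bincount(y_b))  # class counts in the testing part
```

With `stratify`, each class contributes 30% of its samples to the test part, which keeps per-class metrics comparable between runs.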


In [65]:
# Randomly choose the features from training set and build decision tree
# Randomness in the features will help us to achieve different DTrees every time
# You can keep minimum number of random features every time so that trees will have sufficient features
# Note: You can use builtin function for DT training using Sklearn

class MyRandomForestClassifier:
    def __init__(self, n_estimators, max_features, voting='majority', *args, **kwargs):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.voting = voting
        self.params1 = args
        self.params2 = kwargs
        
        self.trees = [self._create_tree() for t in range(self.n_estimators)]
        
        # Will be filled at training time
        self.f_idxs = []
    
    def _create_tree(self):
        return DecisionTreeClassifier(*self.params1, **self.params2)
    
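    def fit(self, X, y):
        assert(X.ndim==2), "X is not 2D"
        # Fixed seed: every instance draws the same sequence of feature subsets
        np.random.seed(42)
        
        X = np.array(X)
        y = np.array(y)
        
        # Reset so that calling fit() again does not accumulate feature indices
        self.f_idxs = []
        for tree in self.trees:
            # Random subset of max_features column indices for this tree
            f_idx = np.random.permutation(X.shape[1])[:self.max_features]
            self.f_idxs.append(f_idx)
            
            tree.fit(X[:, f_idx], y)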
    
    def predict(self, X, y=None):
        assert(self.f_idxs), "Fit the model before prediction"
        if self.voting=='weighted' and y is None:
            raise ValueError("weighted voting needs y to compute each tree's accuracy")
        
        X = np.array(X)
        
        if self.voting=='majority':
            all_samples_prediction = []
            for i, tree in enumerate(self.trees):
                all_samples_prediction.append(tree.predict(X[:, self.f_idxs[i]]))

            real_predictions = []
            for i in range(X.shape[0]):
                one_sample_prediction = [t[i] for t in all_samples_prediction]
                real_predictions.append(Counter(one_sample_prediction).most_common(1)[0][0])

            return real_predictions
        
        elif self.voting=='average':
            probability_matrix = None
            for i, tree in enumerate(self.trees):
                y_pred = tree.predict_proba(X[:, self.f_idxs[i]])
                if i==0:
                    probability_matrix = y_pred
                else:
                    probability_matrix += y_pred
            return np.argmax(probability_matrix, axis=1)
        
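        elif self.voting=='weighted':
            # Caution: each tree is weighted by its accuracy on (X, y) as passed here.
            # When y holds the test labels, they influence the predictions being
            # evaluated, so the reported score is optimistic.
            probability_matrix = None
            for i, tree in enumerate(self.trees):
                y_pred = tree.predict_proba(X[:, self.f_idxs[i]])
                acc = accuracy_score(y, tree.predict(X[:, self.f_idxs[i]]))
                if i==0:
                    probability_matrix = y_pred*acc
                else:
                    probability_matrix += y_pred*acc
            return np.argmax(probability_matrix, axis=1)
        
        else:
            raise ValueError(f"Unknown voting scheme: {self.voting!r}")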
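
The class above gives each tree a random feature subset but trains every tree on all rows. Random forests as usually described also draw a bootstrap sample of the rows for each tree (and scikit-learn's `RandomForestClassifier` picks candidate features at each split rather than once per tree). The following is a minimal sketch of adding row bootstrapping, on synthetic data; `bagged_trees`, `feature_sets` and the sizes used are illustrative choices, not the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the vehicle dataset).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)
rng = np.random.default_rng(0)

bagged_trees, feature_sets = [], []
for _ in range(25):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    rows = rng.integers(0, len(X_syn), size=len(X_syn))
    # Random feature subset for this tree, without replacement.
    feats = rng.choice(X_syn.shape[1], size=5, replace=False)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_syn[rows][:, feats], y_syn[rows])
    bagged_trees.append(tree)
    feature_sets.append(feats)

# One row of predictions per tree; each tree sees only its own feature subset.
votes = np.stack([
    t.predict(X_syn[:, f]) for t, f in zip(bagged_trees, feature_sets)
])
print(votes.shape)  # (25, 200)
```

The `votes` matrix can then be aggregated with any of the voting rules discussed in this part.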



In [66]:
# Train N number of decision trees using random feature selection strategy
# Number of trees N can be user input

# by default 100 trees are there in sklearn

model0 = MyRandomForestClassifier(n_estimators = 100, max_features=10, voting='majority')
model0.fit(X_train, y_train)


In [69]:
# Apply different voting mechanisms such as 
# max voting/average voting/weighted average voting (using accuracy as weightage)
# Perform the ensembling (here: majority-vote predictions on the testing set).
y_pred_test = model0.predict(X_test)

In [72]:
# Apply the trained individual trees to the testing set
# Note: You should have saved the feature sets used for training individual trees,
# so that the same features can be chosen in the testing set
# Get predictions on the testing set
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99        38
           1       0.98      0.96      0.97        84
           2       0.92      0.96      0.94        48

    accuracy                           0.96       170
   macro avg       0.97      0.97      0.97       170
weighted avg       0.97      0.96      0.96       170



In [73]:
# Evaluate the results using accuracy, precision, recall and f-measure

# Test the model with testing set and print the accuracy, precision, recall and f-measure
y_pred_test = model0.predict(X_test)
print("Accuracy on testing data : ", accuracy_score(y_test, y_pred_test))
print("Precision on testing data : ", precision_score(y_test, y_pred_test, average='weighted'))
print("Recall on testing data : ", recall_score(y_test, y_pred_test, average='weighted'))
print("F1 Score on testing data : ", f1_score(y_test, y_pred_test, average="weighted"))

Accuracy on testing data :  0.9647058823529412
Precision on testing data :  0.9655053153791638
Recall on testing data :  0.9647058823529412
F1 Score on testing data :  0.9649387515485236
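
With `average='weighted'`, each per-class score is weighted by that class's support (its number of true samples). The arithmetic can be checked on a small made-up example; `y_true` and `y_pred` below are illustrative and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# Per-class precision: class 0 -> 1/1, class 1 -> 2/3, class 2 -> 1/2
per_class = precision_score(y_true, y_pred, average=None)
# Support: number of true samples per class -> [2, 3, 1]
support = np.bincount(y_true)

# Support-weighted mean: (2*1 + 3*(2/3) + 1*0.5) / 6 = 0.75
manual = np.sum(per_class * support) / support.sum()
weighted = precision_score(y_true, y_pred, average="weighted")
print(manual, weighted)
```

The same weighting applies to recall and F1; weighted recall always equals accuracy, which is why those two numbers coincide in the output above.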


In [75]:
# Apply different voting mechanisms such as 
# max voting/average voting/weighted average voting (using accuracy as weightage)
# Compare different voting mechanisms and their accuracies on the testing set
# Note: fit() reseeds NumPy with 42, so all three models draw the same feature
# subsets; only the voting rule differs between them
model1 = MyRandomForestClassifier(n_estimators=100, max_features=8, voting='majority')
model2 = MyRandomForestClassifier(n_estimators=100, max_features=8, voting='average')
model3 = MyRandomForestClassifier(n_estimators=100, max_features=8, voting='weighted')

model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

print("Majority voting RF accuracy:", accuracy_score(y_test, model1.predict(X_test)))
print("Average voting RF accuracy:", accuracy_score(y_test, model2.predict(X_test)))
print("Weighted voting RF accuracy:", accuracy_score(y_test, model3.predict(X_test, y_test)))

Majority voting RF accuracy: 0.9647058823529412
Average voting RF accuracy: 0.9647058823529412
Weighted voting RF accuracy: 0.9647058823529412
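
The weighted mode above scores each tree on the labels passed to `predict`, which here are the test labels. One way to avoid that dependence is to hold out a validation split during training and derive the weights from it. The sketch below assumes synthetic data and hypothetical names (`trees`, `feats`, `weights`, `predict_weighted`); it is not the class used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the vehicle dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=1
)
# Hold out a validation split that is used only to weight the trees.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

rng = np.random.default_rng(1)
trees, feats, weights = [], [], []
for _ in range(20):
    f = rng.choice(X_syn.shape[1], size=5, replace=False)
    t = DecisionTreeClassifier(random_state=0).fit(X_fit[:, f], y_fit)
    trees.append(t)
    feats.append(f)
    # Weight = accuracy on the validation split, fixed before any test data is seen.
    weights.append(accuracy_score(y_val, t.predict(X_val[:, f])))

def predict_weighted(X_new):
    """Accuracy-weighted sum of class probabilities; needs no labels."""
    proba = sum(
        w * t.predict_proba(X_new[:, f]) for t, f, w in zip(trees, feats, weights)
    )
    return np.argmax(proba, axis=1)

preds = predict_weighted(X_val)
```

The trade-off is that the trees see fewer training rows; out-of-bag samples are another common source of per-tree weights when bootstrapping is used.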


In [76]:
# Compare the Random forest models with different number of trees N
model4 = MyRandomForestClassifier(n_estimators=10, max_features=8, voting='average')
model5 = MyRandomForestClassifier(n_estimators=100, max_features=8, voting='average')
model6 = MyRandomForestClassifier(n_estimators=1000, max_features=8, voting='average')

model4.fit(X_train, y_train)
model5.fit(X_train, y_train)
model6.fit(X_train, y_train)

print("RF-10 accuracy:", accuracy_score(y_test, model4.predict(X_test)))
print("RF-100 accuracy:", accuracy_score(y_test, model5.predict(X_test)))
print("RF-1000 accuracy:", accuracy_score(y_test, model6.predict(X_test)))



RF-10 accuracy: 0.9117647058823529
RF-100 accuracy: 0.9647058823529412
RF-1000 accuracy: 0.9764705882352941


In [77]:
# Compare different numbers of features per tree (2 to 17; range() stops before X.shape[1] = 18)
for nf in range(2, X.shape[1]):
    model = MyRandomForestClassifier(n_estimators=10, max_features=nf, voting='average')
    
    model.fit(X_train, y_train)
    
    print(f"RF with max features {nf} accuracy:", accuracy_score(y_test, model.predict(X_test)))


RF with max features 2 accuracy: 0.7058823529411765
RF with max features 3 accuracy: 0.8235294117647058
RF with max features 4 accuracy: 0.888235294117647
RF with max features 5 accuracy: 0.9
RF with max features 6 accuracy: 0.9235294117647059
RF with max features 7 accuracy: 0.9176470588235294
RF with max features 8 accuracy: 0.9117647058823529
RF with max features 9 accuracy: 0.9235294117647059
RF with max features 10 accuracy: 0.9352941176470588
RF with max features 11 accuracy: 0.9470588235294117
RF with max features 12 accuracy: 0.9058823529411765
RF with max features 13 accuracy: 0.9235294117647059
RF with max features 14 accuracy: 0.9058823529411765
RF with max features 15 accuracy: 0.9058823529411765
RF with max features 16 accuracy: 0.8823529411764706
RF with max features 17 accuracy: 0.9
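
The accuracies above come from a single 170-row test set and 10 trees, so differences of one or two points may be noise. A possible way to get steadier estimates is k-fold cross-validation. The sketch below uses synthetic data and scikit-learn's `RandomForestClassifier`, whose `max_features` is applied at every split rather than once per tree, so it illustrates the evaluation pattern only and is not a drop-in replacement for the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 18 features (not the vehicle dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=18, n_informative=8, n_classes=3, random_state=0
)

cv_means = {}
for nf in (2, 6, 10, 14, 18):
    forest = RandomForestClassifier(n_estimators=20, max_features=nf, random_state=0)
    # Five folds: every row is used for testing exactly once.
    scores = cross_val_score(forest, X_syn, y_syn, cv=5)
    cv_means[nf] = scores.mean()

for nf, mean_acc in cv_means.items():
    print(f"max_features={nf}: mean CV accuracy {mean_acc:.3f}")
```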


# Part 2: Random Forest using Sklearn

In [None]:
# Use the preprocessed dataset here

In [59]:
# Train the Random Forest model using Sklearn's built-in RandomForestClassifier

model1 = RandomForestClassifier(max_depth = 5)

In [60]:
# fit the model on the training data.
model1.fit(X_train, y_train)

RandomForestClassifier(max_depth=5)

In [61]:
# testing the model on the training data.
y_pred_train = model1.predict(X_train)
print("Accuracy on training data : ", accuracy_score(y_train, y_pred_train))
print("Precision on training data : ", precision_score(y_train, y_pred_train, average='weighted'))
print("Recall on training data : ", recall_score(y_train, y_pred_train, average='weighted'))
print("F1 Score on training data : ", f1_score(y_train, y_pred_train, average="weighted"))

Accuracy on training data :  0.9733727810650887
Precision on training data :  0.9741883094803371
Recall on training data :  0.9733727810650887
F1 Score on training data :  0.973491347819417


In [62]:
# Test the model with testing set and print the accuracy, precision, recall and f-measure
y_pred_test = model1.predict(X_test)
print("Accuracy on testing data : ", accuracy_score(y_test, y_pred_test))
print("Precision on testing data : ", precision_score(y_test, y_pred_test, average='weighted'))
print("Recall on testing data : ", recall_score(y_test, y_pred_test, average='weighted'))
print("F1 Score on testing data : ", f1_score(y_test, y_pred_test, average="weighted"))

Accuracy on testing data :  0.9411764705882353
Precision on testing data :  0.9419104717758157
Recall on testing data :  0.9411764705882353
F1 Score on testing data :  0.9413805090898635


In [50]:
# Play with parameters such as
# number of decision trees
# Criterion for splitting
# Max depth
# Minimum samples per split and leaf

model2 = RandomForestClassifier(max_depth = 10, min_samples_leaf = 9, min_samples_split = 3)

In [51]:
# fit the model on the training data.
model2.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, min_samples_leaf=9, min_samples_split=3)

In [52]:
# testing the model on the training data.
y_pred_train = model2.predict(X_train)
print("Accuracy on training data : ", accuracy_score(y_train, y_pred_train))
print("Precision on training data : ", precision_score(y_train, y_pred_train, average='weighted'))
print("Recall on training data : ", recall_score(y_train, y_pred_train, average='weighted'))
print("F1 Score on training data : ", f1_score(y_train, y_pred_train, average="weighted"))

Accuracy on training data :  0.9733727810650887
Precision on training data :  0.974529571932277
Recall on training data :  0.9733727810650887
F1 Score on training data :  0.9735783541828936


In [53]:
# Test the model with testing set and print the accuracy, precision, recall and f-measure
y_pred_test = model2.predict(X_test)
print("Accuracy on testing data : ", accuracy_score(y_test, y_pred_test))
print("Precision on testing data : ", precision_score(y_test, y_pred_test, average='weighted'))
print("Recall on testing data : ", recall_score(y_test, y_pred_test, average='weighted'))
print("F1 Score on testing data : ", f1_score(y_test, y_pred_test, average="weighted"))

Accuracy on testing data :  0.9529411764705882
Precision on testing data :  0.9532497016619613
Recall on testing data :  0.9529411764705882
F1 Score on testing data :  0.9529620591270563


In [54]:
model3 = RandomForestClassifier(max_depth = 10)

In [55]:
# fit the model on the training data.
model3.fit(X_train, y_train)

RandomForestClassifier(max_depth=10)

In [56]:
# testing the model on the training data.
y_pred_train = model3.predict(X_train)
print("Accuracy on training data : ", accuracy_score(y_train, y_pred_train))
print("Precision on training data : ", precision_score(y_train, y_pred_train, average='weighted'))
print("Recall on training data : ", recall_score(y_train, y_pred_train, average='weighted'))
print("F1 Score on training data : ", f1_score(y_train, y_pred_train, average="weighted"))

Accuracy on training data :  1.0
Precision on training data :  1.0
Recall on training data :  1.0
F1 Score on training data :  1.0


In [58]:
# Test the model with testing set and print the accuracy, precision, recall and f-measure
y_pred_test = model3.predict(X_test)
print("Accuracy on testing data : ", accuracy_score(y_test, y_pred_test))
print("Precision on testing data : ", precision_score(y_test, y_pred_test, average='weighted'))
print("Recall on testing data : ", recall_score(y_test, y_pred_test, average='weighted'))
print("F1 Score on testing data : ", f1_score(y_test, y_pred_test, average="weighted"))

Accuracy on testing data :  0.9647058823529412
Precision on testing data :  0.9650027650842623
Recall on testing data :  0.9647058823529412
F1 Score on testing data :  0.9647194803618042
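
The parameters tried by hand above (number of trees, split criterion, depth, minimum samples per leaf) can also be searched systematically. A minimal sketch with `GridSearchCV` on synthetic data; the grid values are arbitrary examples, not recommendations for the vehicle dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with 18 features (not the vehicle dataset).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=18, n_informative=8, n_classes=3, random_state=0
)

param_grid = {
    "n_estimators": [10, 30],
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 9],
}

# Every combination is scored with 3-fold cross-validation on the given data.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In a real run the search would be fitted on the training split only, keeping the test split for a final check of `search.best_estimator_`.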
