- Domain: Object Recognition

- Context: Silhouettes of different vehicles, viewed from multiple angles, are given. The attributes of each silhouette are captured in the data set. The objective is to build a model that classifies a silhouette and identifies the type of vehicle from it.

In [None]:
#1. Data pre-processing – Perform all the necessary preprocessing so that the data is ready to be fed to an Unsupervised algorithm

In [None]:
# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

In [None]:
# Load the data and name it as vData (i.e. Vehicle Data)
vData = pd.read_csv("../input/vehicle2/vehicle-2.csv")
vData.shape

- The given data set has a total of 19 properties (columns) and 846 rows (vehicle data points)

In [None]:
# Display the data set info
vData.info()

- 19 properties are provided to capture the measurements of a vehicle from different angles
- 4 properties are of integer type
- 14 properties hold decimal point values
- The class property is of object type and denotes the type of a vehicle. As per the problem statement, 4 types of vehicle were used to prepare the data set. The silhouette of each type of vehicle is captured through the other properties.

In [None]:
# Make a copy of the original data set vData for further processing through Unsupervised learning technique 
# in later part of the project
vDataOrig = vData.copy() #vDataOrig = Original copy of master data set for Vehicle silhouette

# Let's get the sample look of the data
vDataOrig.head()

- The class column contains string values that denote the type of the vehicle
- Other properties are all numerical

In [None]:
# The vehicle property "class" is a categorical variable
# Hence splitting it into 3 separate columns where 1 = True and 0 = False
vData = pd.get_dummies(vData, columns=['class'], dtype=int) # dtype=int keeps the dummies as 0/1 instead of booleans
vData.head()

# vData i.e. the original Vehicle Data set shall be used for EDA and Supervised learning technique models

In [None]:
# As another approach, apply Label Encoding to turn the "class" property into numerical values
# Applying this on the Original data set vDataOrig

from sklearn import preprocessing
encode_column = 'class' # Specify the column which needs to be encoded
labelEncoder = preprocessing.LabelEncoder()
labelEncoder.fit(vDataOrig[encode_column])
vDataOrig[encode_column] = labelEncoder.transform(vDataOrig[encode_column]) 

vDataOrig.head() # Have a look into the columns after encoding the class column

- After performing the encoding, the vehicle type has been labeled into the following groups:
    - 0 = Bus
    - 1 = Car
    - 2 = Van

In [None]:
# Check if there are any missing values in the data set
vData.isnull().values.any()

In [None]:
# Check how many blank values are present in each column
vData.isnull().sum()

- Above indicates that there are 41 blank records in the overall dataset

In [None]:
# View the median values of each column to replace all the blank values with corresponding median
vData.median()

In [None]:
# Replace all the blank values with corresponding median
from sklearn.impute import SimpleImputer #Use the SimpleImputer library to replace all blanks in a generic way

# Get the Imputer initialized for the data set with "class" dummy column 
imputer_for_vData = SimpleImputer(missing_values = np.nan , strategy = 'median')
imputer_for_vData.fit(vData)

# Get the Imputer initialized for the original data set with encoded "class" column 
imputer_for_vDataOrig = SimpleImputer(missing_values = np.nan , strategy = 'median')
imputer_for_vDataOrig.fit(vDataOrig)

col_names_vData = vData.columns.values # Get the column names for the data set with "class" dummy column
col_names_vDataOrig = vDataOrig.columns.values # Get the column names for the original data set with encoded "class" column

# Get the new data set after replacing the blanks with corresponding column median
vData = pd.DataFrame(imputer_for_vData.transform(vData), columns=col_names_vData)
vDataOrig = pd.DataFrame(imputer_for_vDataOrig.transform(vDataOrig), columns=col_names_vDataOrig)

In [None]:
# Check how many blank values are present in each column after replacing blanks with the median
vData.isnull().sum()

- Above indicates there are no more blanks in the data set vData (with "class" dummy columns)

In [None]:
# Check how many blank values are present in each column after replacing blanks with the median
vDataOrig.isnull().sum()

- Above indicates there are no more blanks in the data set vDataOrig (with the encoded "class" column)

In [None]:
# This is just to compare if the median changed after the replacement of the blank values
vData.median()

- Above indicates that the median of each column did not change after the blank fields in the data set were replaced

In [None]:
# Have a look into the original dataset
vData.head(10)

- Row# 5, column named "circularity" had a blank field
- The median value of the column is 44
- Above indicates the blank replacement was successful 
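
The median staying unchanged is expected: every filled value equals the column median, so the middle of the sorted column does not move. A minimal sketch on a toy frame (the values are made up for illustration; only the column names echo the data set):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one blank per column (values are illustrative only)
toy = pd.DataFrame({
    "circularity": [40.0, 44.0, np.nan, 48.0, 44.0],
    "compactness": [90.0, 95.0, 100.0, np.nan, 85.0],
})

medians_before = toy.median()  # NaN values are skipped by default

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

medians_after = filled.median()
print(filled)
print(medians_before.equals(medians_after))
```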

In [None]:
# Let's have a quick look into the data description to get an idea about the 5 point summary
vData.describe().T

- Except for the 'skewness_about' and 'skewness_about.1' columns, all other properties have non-zero values.
- The minimum value of the 'skewness_about' and 'skewness_about.1' columns is zero; these zeros are not replaced because the properties themselves range from very low values up to 11 and 41 respectively.
- Comparing the 75% value with the max value, the above 5 point summary indicates possible outliers in the following columns:
    - radius_ratio
    - pr.axis_aspect_ratio
    - max.length_aspect_ratio
    - skewness_about
    - skewness_about.1

In [None]:
# Let's review the skewness of the properties
vData.skew()

- The following properties are positively skewed (with value >1.5). This indicates that these attributes are not normally distributed. The tail of the distribution is longer on the right side. The mean is greater than the median for these parameters.
    - pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_radius_of_gyration.1

- The following properties are negatively skewed. This indicates that these attributes are also not normally distributed. The tail of the distribution is longer on the left side. The mean is less than the median for these parameters.
    - hollows_ratio, class_car
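
The link between the sign of the skewness and the order of mean and median can be checked on a small made-up series. This is only a sketch with illustrative numbers, not values from the vehicle data:

```python
import pandas as pd

# A long right tail: one large value pulls the mean above the median
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

# Mirroring the values gives a long left tail instead
left_tailed = -right_tailed

for name, s in [("right_tailed", right_tailed), ("left_tailed", left_tailed)]:
    print(name, "skew:", round(s.skew(), 2), "mean:", s.mean(), "median:", s.median())
```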

In [None]:
#2. Understanding the attributes - Find relationship between different attributes (Independent variables) and 
# choose carefully which all attributes have to be a part of the analysis and why

In [None]:
# Describe the correlation of the variables through graphical Heat map
# +ve and -ve numbers indicate how the variables are correlated to each other

colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize=(15,15))
plt.title('Silhouette properties Correlation of attributes', y=1.05, size=19)
sns.heatmap(vData.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

- Dark color - indicates low correlation values. Because the color scale runs from the most negative value up to 1.0, the darkest cells can mark a strong negative relationship rather than a weak one
- Brighter color - indicates the variables are comparatively strongly positively associated
    - Ex:
        - compactness has low or negative correlation with elongatedness, scaled_radius_of_gyration.1 and whether a vehicle is a bus or a van
        - elongatedness has low or negative correlation with many properties, as follows
            - whether a vehicle type is car, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration
        - pr.axis_rectangularity is strongly related to max.length_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration
        - scaled_variance is strongly related to scaled_variance.1, scaled_radius_of_gyration
        - By referring to the elongatedness property, the chance of identifying a vehicle as a Van is higher
        - By referring to compactness, circularity, distance_circularity, radius_ratio, the chance of identifying a vehicle as a Car is higher

In [None]:
# Display the pair plot
sns.pairplot(vData)
plt.show()

- From the above pair plots we see the relations between the parameters captured for silhouettes of various vehicles from different angles:
- Ex:
    - elongatedness and scatter_ratio are strongly negatively related, i.e. a small increase in scatter_ratio comes with a significant decrease in elongatedness
    - pr.axis_rectangularity and max.length_rectangularity show no relationship with skewness and hollows_ratio. The correlation matrix showed the same
    - pr.axis_rectangularity has a strong positive relationship with compactness, circularity, distance_circularity, radius_ratio

In [None]:
# Measure of possible outliers in the dataset -
# It can be done by checking the box plot of each properties or by evaluating the z scores of each columns

#As there are 21 properties given in the dataset, let's get the Z score of the entire data set 
#to check outliers statistically

from scipy.stats import zscore

# Get the z score
z_vData = vData.apply(zscore)

# Set the limit to 3 sigma to check for outliers in the vehicle data provided
limit = 3
t1 = np.where(z_vData > limit) # store the outlier positions in a tuple of (row indices, column indices)

# print the index value of original data set which contains outliers >3 sigma
print(t1[0]) # depicts the row# containing outliers
print(t1[1]) # depicts the column# containing outliers

In [None]:
list1 = t1[0] # row indices of the data set containing outliers
list2 = t1[1] # column indices of the data set containing outliers

# print each outlier column and its value from the data set provided
for row, col in zip(list1, list2):   # walk through the (row, column) pairs together
    print("Outliers exist in properties: ", vData.columns[col], " and the value is: ", vData.iloc[row, col])

- The following are a few of the properties with multiple outliers in the dataset; the programmatic check above shows the same.
    - pr.axis_aspect_ratio
    - scaled_variance
    - skewness_about
    - skewness_about.1
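
The check above compares the z score against `+3` only, so it reports unusually high values. A hedged variant that also reports unusually low values compares the absolute z score instead. The sketch below uses a made-up two-column frame (`a` and `b` are hypothetical names) and computes the z score by hand with `ddof=0`, which is the default of `scipy.stats.zscore`:

```python
import numpy as np
import pandas as pd

# Toy data: column "a" has one very high value, column "b" one very low value
toy = pd.DataFrame({
    "a": list(range(19)) + [200],
    "b": list(range(19)) + [-200],
})

# Population z score (ddof=0), matching the default of scipy.stats.zscore
z = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

limit = 3
rows_high, cols_high = np.where(z > limit)          # high side only
rows_any, cols_any = np.where(np.abs(z) > limit)    # both sides

for row, col in zip(rows_any, cols_any):
    print("Outlier in", toy.columns[col], "at row", row, "value", toy.iloc[row, col])
```

Whether low-side outliers matter is a modelling decision; the two-sided form simply makes them visible.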

In [None]:
# Let's review the outliers visually through below graph
# plot pr.axis_aspect_ratio for car type vehicle
sns.boxplot(x = "class_car", y = "pr.axis_aspect_ratio", data = vData); 
plt.show()

In [None]:
# Let's review the outliers visually through below graph
# plot scaled_variance for van type vehicle
sns.boxplot(x = "class_van", y = "scaled_variance", data = vData); 
plt.show()

In [None]:
# Let's review the outliers visually through below graph
# plot skewness_about to view outliers in it
sns.boxplot(x = "skewness_about", data = vData); # pass the column by keyword; recent seaborn versions do not accept it positionally
plt.show()

In [None]:
# Get the target column distribution for models to be used in Supervised Technique

# In the given data set, vehicle properties are provided together with the type of each vehicle, 
# ex: whether it's a Car, Bus or Van.

# Hence for Supervised learning, we can consider all vehicle properties as independent variables and 
# each vehicle type individually as a target variable.

# In this case there are 3 target variables i.e. class_bus, class_car and class_van. 
# We are going to build the Supervised learning models for each of these target variables.

# The target column is 'class_bus' which is indicated by 0 and 1
# 1 - means vehicle is a bus
# 0 - means vehicle is not a bus

# Let's measure the % of split where a vehicle is a bus or not

vBus_true = len(vData.loc[vData['class_bus'] == 1])
vBus_false = len(vData.loc[vData['class_bus'] == 0])

print("Number of true cases for Bus: {0} ({1:2.2f}%)".format(vBus_true, (vBus_true / (vBus_true + vBus_false)) * 100 ))
print("Number of false cases for Bus: {0} ({1:2.2f}%)".format(vBus_false, (vBus_false / (vBus_true + vBus_false)) * 100))
print("---------------------------------------------------------------------")

# The target column is 'class_car' which is indicated by 0 and 1
# 1 - means vehicle is a car
# 0 - means vehicle is not a car

# Let's measure the % of split where a vehicle is a car or not

vCar_true = len(vData.loc[vData['class_car'] == 1])
vCar_false = len(vData.loc[vData['class_car'] == 0])

print("Number of true cases for Car: {0} ({1:2.2f}%)".format(vCar_true, (vCar_true / (vCar_true + vCar_false)) * 100 ))
print("Number of false cases for Car: {0} ({1:2.2f}%)".format(vCar_false, (vCar_false / (vCar_true + vCar_false)) * 100))
print("---------------------------------------------------------------------")

# The target column is 'class_van' which is indicated by 0 and 1
# 1 - means vehicle is a van
# 0 - means vehicle is not a van

# Let's measure the % of split where a vehicle is a van or not

vVan_true = len(vData.loc[vData['class_van'] == 1])
vVan_false = len(vData.loc[vData['class_van'] == 0])

print("Number of true cases for Van: {0} ({1:2.2f}%)".format(vVan_true, (vVan_true / (vVan_true + vVan_false)) * 100 ))
print("Number of false cases for Van: {0} ({1:2.2f}%)".format(vVan_false, (vVan_false / (vVan_true + vVan_false)) * 100))

In [None]:
# Target column "class" i.e. Vehicle Type wise count
vDataOrig['class'].value_counts() # Series.value_counts replaces the deprecated top-level pd.value_counts

- As per the encoding performed:
    - 0 = Bus
    - 1 = Car
    - 2 = Van
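
The per-class counts and percentage split can also be read in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up encoded column (the eight values below are illustrative, not the real distribution):

```python
import pandas as pd

# Toy encoded "class" column: 0 = Bus, 1 = Car, 2 = Van (illustrative values)
toy_class = pd.Series([0, 1, 1, 2, 1, 0, 1, 2])

counts = toy_class.value_counts().sort_index()
share = (toy_class.value_counts(normalize=True).sort_index() * 100).round(2)

print(pd.DataFrame({"count": counts, "percent": share}))
```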

In [None]:
#3. Split the data into train and test (specifying "random state" when using train_test_split from Sklearn)

from sklearn.model_selection import train_test_split

X = vData.drop(['class_bus', 'class_car', 'class_van'],axis=1)     # Predictor feature columns (18 X m)
X_Orig = vDataOrig.drop('class', axis=1) # Get the independent features from the original data set with encoded "class" column

Y_bus = vData['class_bus']   # Predicted vehicle type Bus (1=True, 0=False) (1 X m)
Y_car = vData['class_car']   # Predicted vehicle type Car (1=True, 0=False) (1 X m)
Y_van = vData['class_van']   # Predicted vehicle type Van (1=True, 0=False) (1 X m)
Y_Orig = vDataOrig['class']   # Predicted vehicle type "class" (0=Bus, 1=Car and 2=Van) (1 X m)

# Split data considering class_bus as target
x_train_bus, x_test_bus, y_train_bus, y_test_bus = train_test_split(X, Y_bus, test_size=0.3, random_state=5) # 5 is just any random seed number

# Split data considering class_car as target
x_train_car, x_test_car, y_train_car, y_test_car = train_test_split(X, Y_car, test_size=0.3, random_state=5) # 5 is just any random seed number

# Split data considering class_van as target
x_train_van, x_test_van, y_train_van, y_test_van = train_test_split(X, Y_van, test_size=0.3, random_state=5) # 5 is just any random seed number

# Split data considering encoded class as target
x_train_orig, x_test_orig, y_train_orig, y_test_orig = train_test_split(X_Orig, Y_Orig, test_size=0.3, random_state=5) # 5 is just any random seed number


In [None]:
# Have a look into the train data for vehicle type = Bus
x_train_bus.head()

In [None]:
# Have a look into the train data for vehicle type = Car
x_train_car.head()

In [None]:
# Have a look into the train data for vehicle type = Van
x_train_van.head()

- The train data sets for the target variables Bus, Car and Van are the same
- However, separate variables are used for each train-test split to avoid any confusion during model building
- Random state = 5 is used in all cases
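
The train sets coincide because, without stratification, `train_test_split` picks rows from the number of samples and the `random_state` alone, not from the target values. A minimal sketch with toy data (`X_toy`, `y_a` and `y_b` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# One toy feature frame and two different toy targets over the same rows
X_toy = pd.DataFrame({"f1": np.arange(20), "f2": np.arange(20) * 2})
y_a = pd.Series(np.arange(20) % 2)
y_b = pd.Series(np.arange(20) % 3)

xa_train, xa_test, ya_train, ya_test = train_test_split(X_toy, y_a, test_size=0.3, random_state=5)
xb_train, xb_test, yb_train, yb_test = train_test_split(X_toy, y_b, test_size=0.3, random_state=5)

# Same seed and same number of rows give the same row selection
same_rows = xa_train.index.equals(xb_train.index) and xa_test.index.equals(xb_test.index)
print(same_rows, len(xa_train), len(xa_test))
```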

In [None]:
# Have a look into the train data for encoded "class"
x_train_orig.head()

In [None]:
# Validate the size of data set in the train and test data

print("{0:0.2f}% data (Vehicle class = Bus) is in training set".format((len(x_train_bus)/len(vData.index)) * 100))
print("{0:0.2f}% data (Vehicle class = Bus) is in test set".format((len(x_test_bus)/len(vData.index)) * 100))
print("---------------------------------------------------------")
print("{0:0.2f}% data (Vehicle class = Car) is in training set".format((len(x_train_car)/len(vData.index)) * 100))
print("{0:0.2f}% data (Vehicle class = Car) is in test set".format((len(x_test_car)/len(vData.index)) * 100))
print("---------------------------------------------------------")
print("{0:0.2f}% data (Vehicle class = Van) is in training set".format((len(x_train_van)/len(vData.index)) * 100))
print("{0:0.2f}% data (Vehicle class = Van) is in test set".format((len(x_test_van)/len(vData.index)) * 100))


In [None]:
# Scale the data set

# There are multiple ways to scale a data set. As we don't know what units or scales were used for each of the properties,
# we standardize them with StandardScaler: fit on the train data, then apply the same transform to the test data.

# Apply the Fit Transform scaling on Train and Test data set - 
from sklearn.preprocessing import StandardScaler

scaling = StandardScaler()

x_train_bus_TransScale = scaling.fit_transform(x_train_bus) #Fit the scale on train data - for bus
x_test_bus_TransScale = scaling.transform(x_test_bus) #transform the scale on test data

x_train_car_TransScale = scaling.fit_transform(x_train_car) #Fit the scale on train data - for car
x_test_car_TransScale = scaling.transform(x_test_car) #transform the scale on test data

x_train_van_TransScale = scaling.fit_transform(x_train_van) #Fit the scale on train data - for van
x_test_van_TransScale = scaling.transform(x_test_van) #transform the scale on test data

x_train_orig_TransScale = scaling.fit_transform(x_train_orig) #Fit the scale on train data - for encoded 'class' type
x_test_orig_TransScale = scaling.transform(x_test_orig) #transform the scale on test data

In [None]:
#4. Train a Support vector machine using the train set and get the accuracy on the test set

from sklearn.svm import SVC #load the support vector classifier

# Instantiate the SVM Classifiers
vData_bus_svm_model_TransScale = SVC() #For Vehicle Type = Bus 
vData_car_svm_model_TransScale = SVC() #For Vehicle Type = Car
vData_van_svm_model_TransScale = SVC() #For Vehicle Type = Van
vDataOrig_svm_model_TransScale = SVC() #For Vehicle Type = Encoded Class

# Fit the SVM model with train data set -  For Vehicle Type = Bus 
vData_bus_svm_model_TransScale.fit(x_train_bus_TransScale, y_train_bus)

# Fit the SVM model with train data set -  For Vehicle Type = Car
vData_car_svm_model_TransScale.fit(x_train_car_TransScale, y_train_car)

# Fit the SVM model with train data set -  For Vehicle Type = Van 
vData_van_svm_model_TransScale.fit(x_train_van_TransScale, y_train_van)

# Fit the SVM model with train data set -  For Vehicle Type = Encoded Class 
vDataOrig_svm_model_TransScale.fit(x_train_orig_TransScale, y_train_orig)

print("Accuracy of SVM model on training data (Vehicle Type = Bus): {:.2f}".format(vData_bus_svm_model_TransScale.score(x_train_bus_TransScale, y_train_bus)))
print("Accuracy of SVM model on test data (Vehicle Type = Bus): {:.2f}".format(vData_bus_svm_model_TransScale.score(x_test_bus_TransScale, y_test_bus)))
print("----------------------------------------------------------------------")
print("Accuracy of SVM model on training data (Vehicle Type = Car): {:.2f}".format(vData_car_svm_model_TransScale.score(x_train_car_TransScale, y_train_car)))
print("Accuracy of SVM model on test data (Vehicle Type = Car): {:.2f}".format(vData_car_svm_model_TransScale.score(x_test_car_TransScale, y_test_car)))
print("----------------------------------------------------------------------")
print("Accuracy of SVM model on training data (Vehicle Type = Van): {:.2f}".format(vData_van_svm_model_TransScale.score(x_train_van_TransScale, y_train_van)))
print("Accuracy of SVM model on test data (Vehicle Type = Van): {:.2f}".format(vData_van_svm_model_TransScale.score(x_test_van_TransScale, y_test_van)))
print("----------------------------------------------------------------------")
print("Accuracy of SVM model on training data (Vehicle Type = Encoded Class): {:.2f}".format(vDataOrig_svm_model_TransScale.score(x_train_orig_TransScale, y_train_orig)))
print("Accuracy of SVM model on test data (Vehicle Type = Encoded Class): {:.2f}".format(vDataOrig_svm_model_TransScale.score(x_test_orig_TransScale, y_test_orig)))

# Predict on the test data - For Vehicle Type = Bus
vData_bus_svm_test_predict_TransScale = vData_bus_svm_model_TransScale.predict(x_test_bus_TransScale)

# Predict on the test data - For Vehicle Type = Car
vData_car_svm_test_predict_TransScale = vData_car_svm_model_TransScale.predict(x_test_car_TransScale)

# Predict on the test data - For Vehicle Type = Van
vData_van_svm_test_predict_TransScale = vData_van_svm_model_TransScale.predict(x_test_van_TransScale)

# Predict on the test data - For Vehicle Type = Encoded Class
vDataOrig_svm_test_predict_TransScale = vDataOrig_svm_model_TransScale.predict(x_test_orig_TransScale)

In [None]:
# calculate confusion matrix for SVM model - For Vehicle Type = Bus
from sklearn import metrics

cm_bus_svm=metrics.confusion_matrix(y_test_bus, vData_bus_svm_test_predict_TransScale, labels=[1, 0])

df_cm_bus_svm = pd.DataFrame(cm_bus_svm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm_bus_svm, annot=True, cmap="PuRd")

- The confusion matrix (Based on scaled data) - SVM: For Vehicle Type = Bus
    - True Positives (TP): we correctly predicted that vehicle type is a Bus - 59
    - True Negatives (TN): we correctly predicted that vehicle type is not a Bus - 190
    - False Positives (FP): we incorrectly predicted that vehicle type is a Bus (a "Type I error") - 2
    - False Negatives (FN): we incorrectly predicted that vehicle type is not a Bus (a "Type II error") - 6

In [None]:
# calculate confusion matrix for SVM model - For Vehicle Type = Car
from sklearn import metrics

cm_car_svm=metrics.confusion_matrix(y_test_car, vData_car_svm_test_predict_TransScale, labels=[1, 0])

df_cm_car_svm = pd.DataFrame(cm_car_svm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm_car_svm, annot=True, cmap="PuRd")

- The confusion matrix (Based on scaled data) - SVM: For Vehicle Type = Car
    - True Positives (TP): we correctly predicted that vehicle type is a Car - 130
    - True Negatives (TN): we correctly predicted that vehicle type is not a Car - 120
    - False Positives (FP): we incorrectly predicted that vehicle type is a Car (a "Type I error") - 5
    - False Negatives (FN): we incorrectly predicted that vehicle type is not a Car (a "Type II error") - 6

In [None]:
# calculate confusion matrix for SVM model - For Vehicle Type = Van
from sklearn import metrics

cm_van_svm=metrics.confusion_matrix(y_test_van, vData_van_svm_test_predict_TransScale, labels=[1, 0])

df_cm_van_svm = pd.DataFrame(cm_van_svm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm_van_svm, annot=True, cmap="PuRd")

- The confusion matrix (Based on scaled data) - SVM: For Vehicle Type = Van
    - True Positives (TP): we correctly predicted that vehicle type is a Van - 46
    - True Negatives (TN): we correctly predicted that vehicle type is not a Van - 190
    - False Positives (FP): we incorrectly predicted that vehicle type is a Van (a "Type I error") - 9
    - False Negatives (FN): we incorrectly predicted that vehicle type is not a Van (a "Type II error") - 10

In [None]:
# calculate confusion matrix for SVM model - For Vehicle Type = Encoded Class
from sklearn import metrics

cm_orig_svm=metrics.confusion_matrix(y_test_orig, vDataOrig_svm_test_predict_TransScale, labels=[1, 0])

df_cm_orig_svm = pd.DataFrame(cm_orig_svm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm_orig_svm, annot=True, cmap="PuRd")

- The confusion matrix (Based on scaled data) - SVM: For Vehicle Type = Encoded Class
    - Only Car (1) and Bus (0) appear here because the matrix was built with labels=[1, 0]; the Van (2) rows and columns are left out
    - Cars correctly predicted as Car - 130
    - Buses correctly predicted as Bus - 62
    - Buses incorrectly predicted as Car - 2
    - Cars incorrectly predicted as Bus - 0
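
For a three-class target, passing all three labels gives the full picture. The sketch below uses made-up true and predicted labels (not the notebook's results) to contrast `labels=[0, 1, 2]` with `labels=[1, 0]`:

```python
from sklearn.metrics import confusion_matrix

# Toy encoded labels: 0 = Bus, 1 = Car, 2 = Van (illustrative values)
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 0]

# Rows are the true class, columns the predicted class, in the order of `labels`
cm_full = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Restricting the labels keeps only the Car and Bus rows and columns
cm_partial = confusion_matrix(y_true, y_pred, labels=[1, 0])

print(cm_full)
print(cm_partial)
print("samples covered:", cm_full.sum(), "vs", cm_partial.sum())
```

In the restricted matrix every sample whose true or predicted class is Van is absent, so its cells no longer add up to the size of the test set.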

In [None]:
#5. Perform K-fold cross validation and get the cross validation score of the model

In [None]:
# Scale the original data set where X doesn't contain any categorical properties

from scipy.stats import zscore
XScaled_Orig = X_Orig.apply(zscore)

In [None]:
from sklearn.model_selection import cross_val_score #Import the Library used to capture cross validation score
from sklearn.model_selection import RepeatedStratifiedKFold #This is the library to apply the K-Fold technique

vDataOrig_svm_k_fold_model_TransScale = SVC() # Instantiate the support vector model for Vehicle Type = Encoded Class

# Define the cross validation method where the original data set is split into 10 folds
# During each iteration (k-1) folds are used for training and the kth fold is used for testing
# The process is repeated 3 times and the scores are averaged for a more stable estimate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=5) 

# Get the Score
score = cross_val_score(vDataOrig_svm_k_fold_model_TransScale, 
                         XScaled_Orig, Y_Orig, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# Print the score of the k-fold cross validation applied through Support Vector Machine model
print('Model Mean Score: %.3f | Score Standard Deviation: (%.3f)' % (np.mean(score), np.std(score)))
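
The cross validation above scales the full data set once before it is split into folds, so the rows of each test fold contribute to the mean and standard deviation used for scaling. One common alternative, sketched here on synthetic data from `make_classification` (an assumption, since the vehicle file is not loaded in this snippet), is to put the scaler and the classifier in a pipeline so that scaling is refit on the training folds only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the vehicle features
X_toy, y_toy = make_classification(n_samples=150, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=5)

# The scaler is fit inside each training fold, then applied to the held-out fold
model = make_pipeline(StandardScaler(), SVC())

cv_toy = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=5)
scores = cross_val_score(model, X_toy, y_toy, scoring="accuracy", cv=cv_toy)

print("folds evaluated:", len(scores))
print("mean accuracy: %.3f | std: %.3f" % (scores.mean(), scores.std()))
```

On a data set of this size the difference from pre-scaling is usually small, but the pipeline form keeps the estimate free of that leakage.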

In [None]:
#6. Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data

# Here we shall focus on the Principal Component Analysis (PCA), hence not considering any categorical properties
# i.e. "class" in target calculation.

# X_Orig holds the independent features, which don't include the vehicle type i.e. class
# Y_Orig is the dependent feature, consisting of the values of the vehicle class

XScaled_Orig.head() #Have a look into the sample data set

- Above sample data point doesn't include any 'class' column i.e. the categorical variable type of vehicle

In [None]:
# Get the covariance matrix for PCA analysis
covMatrix = np.cov(XScaled_Orig,rowvar=False)
print(covMatrix)

- The above covariance matrix depicts the relation between all possible pairs of dimensions

In [None]:
# Perform PCA for the 18 features given
from sklearn.decomposition import PCA # Import PCA library

pca_vDataOrig = PCA(n_components=18) # Instantiate the PCA with 18 components
pca_vDataOrig.fit(XScaled_Orig) # Fit the PCA model on the scaled original data set

In [None]:
# Print the Eigen values of each component
print(pca_vDataOrig.explained_variance_)

In [None]:
# Print corresponding Eigen vectors
print(pca_vDataOrig.components_)

In [None]:
# Print the % of the variation for each component
print(pca_vDataOrig.explained_variance_ratio_)

In [None]:
# Plot the % of variation for each component
# This depicts the influence of the components in an ordered way
plt.bar(list(range(1,19)),pca_vDataOrig.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()

- The above plot indicates that the first 7 components account for the majority of the variation in the overall data set

In [None]:
# Plot the cumulative variation against the number of components
# This gives a pictorial view of how many components contribute most significantly in the dataset
plt.step(list(range(1,19)),np.cumsum(pca_vDataOrig.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('Principal component')
plt.show()

- The above plot of cumulative variation shows that the first 7 components capture around 95% of the variance
- This way dimensionality can be reduced from 18 to 7 
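
Instead of reading the count off the plot, the number of components can be computed from the cumulative explained variance. A minimal sketch with hypothetical ratios (the array below is made up, not the notebook's output):

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = np.array([0.50, 0.20, 0.15, 0.08, 0.04, 0.03])

cumulative = np.cumsum(ratios)

# Position of the first cumulative value that reaches the target, plus one
target = 0.95
n_components = int(np.searchsorted(cumulative, target) + 1)

print(cumulative.round(2))
print("components needed:", n_components)
```

scikit-learn's `PCA` also accepts a fraction such as `n_components=0.95` and picks the count itself, which may be a convenient shortcut here.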

In [None]:
# Let's review the PCA considering # of components = 7
pca_vDataOrig_7comp = PCA(n_components=7)
pca_vDataOrig_7comp.fit(XScaled_Orig)

print(pca_vDataOrig_7comp.components_)
print(pca_vDataOrig_7comp.explained_variance_ratio_)
X_pca_vDataOrig_7comp = pca_vDataOrig_7comp.transform(XScaled_Orig)

In [None]:
# Have a look into the pair plot - with 7 selected components
# This is to depict the relations between top 7 influencing components
sns.pairplot(pd.DataFrame(X_pca_vDataOrig_7comp))

- A few observations follow:
    - Except components 0 and 2, all other components appear approximately normally distributed
    - A small change in component 2 corresponds to a significant change in all other 6 components
    - Components 0, 1, 3, 4 and 5 have a scattered relation with each other

In [None]:
# Let's store the eigen values and the eigen vectors into the variables for further calculation
vData_e_vals, vData_e_vecs = np.linalg.eig(covMatrix)
print('Eigenvalues \n%s' %vData_e_vals)
print('\nEigenvectors \n%s' %vData_e_vecs)

In [None]:
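# Build the eigen pairs and view the first 7 components
eigen_pairs = [(np.abs(vData_e_vals[i]), vData_e_vecs[:,i]) for i in range(len(vData_e_vals))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True) # sort on the eigen value only, never comparing the vector arrays
eigen_pairs[:7]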

In [None]:
# Build dimensionally reduced datasets (with 7 top components in this case)
w = np.hstack((eigen_pairs[0][1].reshape(18,1), eigen_pairs[1][1].reshape(18,1),
               eigen_pairs[2][1].reshape(18,1), eigen_pairs[3][1].reshape(18,1),
               eigen_pairs[4][1].reshape(18,1), eigen_pairs[5][1].reshape(18,1),
               eigen_pairs[6][1].reshape(18,1)))
print('Matrix W:\n', w)
XScaled_PCA = XScaled_Orig.dot(w) # Project the full data set onto the 7 components

In [None]:
XScaled_PCA.shape #View the shape of the Scaled Dataset built after choosing the 7 most contributing components

- The above shows that the new data set built with the 7 principal components retains all 846 vehicle silhouette records

In [None]:
# Independent data set built from the original data set without the categorical/target column "class"
X_pca_vDataOrig_7comp # This is the NumPy array built earlier for PCA with 7 components

In [None]:
XScaled_PCA #Data set built with 7 components

- The data set built with 7 components matches the NumPy array calculated above (individual components may differ in sign, since eigenvectors are only defined up to sign)
- This scaled data set with 7 components (i.e. XScaled_PCA) will be used further for comparison
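
The agreement between the manual eigen-decomposition and scikit-learn's `PCA` can be verified numerically rather than by eye. The sketch below uses random toy data and three components (both assumptions for illustration), and compares absolute values because each component may come out with a flipped sign. It uses `np.linalg.eigh`, which is meant for symmetric matrices such as a covariance matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated toy features, standardized column by column
rng = np.random.default_rng(5)
raw = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_std = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Manual route: eigen-decomposition of the covariance matrix
cov = np.cov(X_std, rowvar=False)
e_vals, e_vecs = np.linalg.eigh(cov)       # eigenvalues come back in ascending order
top = np.argsort(e_vals)[::-1][:3]         # indices of the 3 largest eigenvalues
manual = X_std @ e_vecs[:, top]

# Library route
library = PCA(n_components=3).fit_transform(X_std)

match_up_to_sign = np.allclose(np.abs(manual), np.abs(library), atol=1e-6)
print(match_up_to_sign)
```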

In [None]:
#7. Repeat steps 3,4 and 5 but this time, use Principal Components instead of the original data. 
# And the accuracy score should be on the same rows of test data that were used earlier. (hint: set the same random state) 

In [None]:
#3.a Split the data into train and test (specifying "random state" when using train_test_split from Sklearn)

# Split data considering encoded class as target
# 5 is just any random seed number and it is the same as used earlier
x_train_PCA, x_test_PCA, y_train_PCA, y_test_PCA = train_test_split(XScaled_PCA, Y_Orig, test_size=0.3, random_state=5) 

In [None]:
#4.a Train a Support vector machine using the train set and get the accuracy on the test set
# For the comparison after Principal Component Analysis
svm_classifier = SVC()
svm_classifier.fit(x_train_orig_TransScale, y_train_orig)
print ('Before PCA score', svm_classifier.score(x_test_orig_TransScale, y_test_orig))

svm_classifier.fit(x_train_PCA, y_train_PCA)
print ('After PCA score', svm_classifier.score(x_test_PCA, y_test_PCA))

In [None]:
#5.a K-fold cross validation model for SVM on dimensionally reduced data set prepared for PCA

from sklearn.model_selection import cross_val_score #Import the Library used to capture cross validation score
from sklearn.model_selection import RepeatedStratifiedKFold #This is the library to apply the K-Fold technique

vDataPCA_svm_k_fold_model_TransScale = SVC() # Instantiate the support vector model for Vehicle Type = Encoded Class

# Define the cross validation method where the PCA-reduced data set is split into 10 folds
# During each iteration (k-1) folds are used for training and the kth fold is used for testing
# The process is repeated 3 times and the scores are averaged for a more stable estimate
cv_pca = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=5) 

# Get the Score
score_pca = cross_val_score(vDataPCA_svm_k_fold_model_TransScale, 
                         XScaled_PCA, Y_Orig, scoring='accuracy', cv=cv_pca, n_jobs=-1, error_score='raise')

# Print the score of the k-fold cross validation applied through Support Vector Machine model
print('Model Mean Score on PCA data set: %.3f | Score Standard Deviation: (%.3f)' % (np.mean(score_pca), np.std(score_pca)))

In [None]:
#8. Compare the accuracy scores and cross validation scores of Support vector machines – 
# one trained using raw data and the other using Principal Components, and mention your findings

In [None]:
#Reasoning on which model is best and their corresponding performance

print("Classification Report - SVM (for Vehicle Type - Bus)")
print(metrics.classification_report(y_test_bus, vData_bus_svm_test_predict_TransScale, labels=[1, 0]))
print("")
print("Classification Report - SVM (for Vehicle Type - Car)")
print(metrics.classification_report(y_test_car, vData_car_svm_test_predict_TransScale, labels=[1, 0]))
print("")
print("Classification Report - SVM (for Vehicle Type - Van)")
print(metrics.classification_report(y_test_van, vData_van_svm_test_predict_TransScale, labels=[1, 0]))
print("")
print("Classification Report - SVM (for Vehicle Type - Encoded Class)")
print(metrics.classification_report(y_test_orig, vDataOrig_svm_test_predict_TransScale, labels=[1, 0]))

# Reasoning on which model is best and their corresponding performance

- From the above comparative study we make the following observations:
- Precision:
    - Ability of a classifier not to label an instance positive that is actually negative, i.e. of all instances classified positive, what percent was correct. Best observed in the SVM model for the encoded class column.
- Recall:
    - Ability of a classifier to find all positive instances, i.e. of all instances that were actually positive, what percent was classified correctly. In this respect the SVM model for the encoded class has the highest score in predicting the vehicle type.
- f1 score: 
    - Harmonic mean of Precision and Recall. The SVM model for the encoded class has the highest f1 value for the prediction of the vehicle type denoted by 1
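
As a worked example of these definitions, the metrics can be recomputed from the four confusion-matrix counts reported earlier for Vehicle Type = Bus (TP 59, TN 190, FP 2, FN 6). This is a sketch of the arithmetic only, not a replacement for `classification_report`:

```python
# Counts taken from the Bus confusion matrix reported above
tp, tn, fp, fn = 59, 190, 2, 6

precision = tp / (tp + fp)   # of everything predicted Bus, the share that was Bus
recall = tp / (tp + fn)      # of all actual Buses, the share that was found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```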


- Model Accuracy comparison between SVM computed before and after PCA:
    - Before PCA the total number of features was 18 and after PCA the component count is 7
    - By dropping 11 dimensions, the overall model accuracy reduced only by about 4 percentage points (94% to 90%)
    - From a cost analysis perspective it's a significant gain
    

- Model Accuracy comparison between K-Fold cross validation computed before and after PCA:
    - Before PCA the total number of features was 18 and after PCA the component count is 7
    - By dropping 11 dimensions, the overall model accuracy reduced only by 5 percentage points (97% to 92%)
    - From a cost analysis perspective it's still a significant gain as the model accuracy is 92%

- Hence from all the above analysis we can infer the following conclusions:


- Before Principal Component Analysis (PCA): the SVM model with the encoded class property yields the best accuracy (94%) in predicting the vehicle type based on the silhouette provided.


- After Principal Component Analysis (PCA): the k-fold cross validation model yields the best accuracy (92%) after reducing the dimensionality from 18 to 7