Vehicle Recognition

#Overview
Welcome to my kernel!
Data Description: The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

#Objective:
The objective is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

#Importing the Libraries and Basic EDA

In [None]:
#import the necessary libraries
import os

import warnings
warnings.filterwarnings('ignore')

#import numerical and dataframe libraries
import numpy as np
import pandas as pd

#Importing libraries for visualization

import matplotlib.pyplot as plt
import seaborn as sns

#Library for Data Pre-processing
#(sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

#Traditional Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


#Decision Tree and other Ensemble Techniques
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier

#Library for Model Evaluation 
from sklearn import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, auc
from sklearn.metrics import roc_curve

#Other Libraries
from collections import Counter
from scipy import stats
from matplotlib.colors import ListedColormap

from sklearn.decomposition import PCA
from scipy.stats import zscore

In [None]:
#load the csv file and make the data frame
vehicle_df = pd.read_csv('/kaggle/input/vehicle/vehicle.csv')

In [None]:
#display the first 5 rows of dataframe
vehicle_df.head()

In [None]:
print("The dataframe has {} rows and {} columns".format(vehicle_df.shape[0],vehicle_df.shape[1]))

In [None]:
#display the information of dataframe
vehicle_df.info()

From the output above we can see that every column except 'class' is numeric and that some columns contain null values.
The 'class' column is our target column.

In [None]:
#display how many null values each column has
vehicle_df.isnull().sum()

From the output above we can see that the largest number of null values in any column is 6, which occurs in two columns: 'radius_ratio' and 'skewness_about'.
So we have two options: either drop the rows with null values or impute them.
Dropping rows is not ideal because we lose some information, but we will consider both options and see what effect each has on the model.
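
To make the two options concrete, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration, and the mean strategy is an assumption chosen to mirror the imputation function used later in this kernel.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                    "b": [10.0, np.nan, 30.0, 40.0]})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print("rows after dropping:", dropped.shape[0])
print("rows after imputing:", imputed.shape[0])
```

Dropping keeps only the fully observed rows, while imputing keeps every row at the cost of inserting estimated values.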

In [None]:
#display 5 point summary of dataframe
#vehicle_df.describe().transpose()
vehicle_df.describe().T

In [None]:
sns.pairplot(vehicle_df,diag_kind='kde', hue='class')
plt.show()

From the pair plots above we can see that many columns are correlated and that many columns have a long tail, which is an indication of outliers. Further down we will use the correlation matrix to measure the strength of these correlations and check whether outliers are actually present.

From the above we can also see that our data has missing values in some columns, so we have to handle them before building any model. We have two options: either drop the missing values or impute them. It is good practice to keep the original dataframe as it is and make such modifications on a copy; note that the null_values function below modifies the dataframe it is given in place.

In [None]:
#Correlation matrix of the numeric attributes ('class' is still a string column here)
vehicle_df.select_dtypes(np.number).corr()

#Handling Missing Values

In [None]:
#Function for Null values treatment

def null_values(base_dataset, threshold=30):
    """Drop columns with more than `threshold` percent nulls, then impute the rest in place."""
    print("Shape of DataFrame before null treatment",base_dataset.shape)
    print("Null values count before treatment")
    print("===================================")
    print(base_dataset.isna().sum(),"\n")
    ## null value percentage per column
    null_value_table=(base_dataset.isna().sum()/base_dataset.shape[0])*100
    ## columns whose null percentage is above the threshold are dropped, the rest are imputed
    drop_columns=null_value_table[null_value_table>threshold].index
    base_dataset.drop(drop_columns,axis=1,inplace=True)
    cont=list(base_dataset.select_dtypes(np.number).columns)
    cat=[i for i in base_dataset.columns if i not in cont]
    ## categorical columns: fill with the most frequent value
    for i in cat:
        base_dataset[i]=base_dataset[i].fillna(base_dataset[i].value_counts().index[0])
    ## numeric columns: fill with the column mean
    for i in cont:
        base_dataset[i]=base_dataset[i].fillna(base_dataset[i].mean())
    print("Null values counts after treatment")
    print("===================================")
    print(base_dataset.isna().sum())
    print("\nShape of DataFrame after null treatment",base_dataset.shape)

In [None]:
null_values(vehicle_df)

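The missing values in vehicle_df have now been imputed in place (the mean for numeric columns and the most frequent value for categorical columns), and all further changes are made to this dataframe.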

In [None]:
#display 5 point summary of new dataframe
#vehicle_df.describe().transpose()
vehicle_df.describe().T

In [None]:
#display the shape of dataframe
print("Shape of dataframe after missing values treatment:",vehicle_df.shape)

#Analysis of each column with the help of plots

In [None]:
#Distribution of data

vehicle_df.hist( figsize=(15,15), color='red')
plt.show()

From the above we can see that there are no obvious outliers in the compactness column and that it looks approximately normally distributed.

In [None]:
num_features=[col for col in vehicle_df.select_dtypes(np.number).columns ]

plt.figure(figsize=(20,20))
for i,col in enumerate(num_features,start=1):
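    plt.subplot(5,4,i)
    #distplot is deprecated in recent seaborn; histplot with kde=True draws the same kind of plot
    sns.histplot(vehicle_df[col], kde=True)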
plt.show()

From the above we can see that there are no obvious outliers in the circularity column and that it looks approximately normally distributed.

In [None]:
num_features=[col for col in vehicle_df.select_dtypes(np.number).columns ]

plt.figure(figsize=(20,20))
for i,col in enumerate(num_features,start=1):
    plt.subplot(5,4,i)
    sns.boxplot(x=vehicle_df[col])
plt.show()


From the above we can see that there are no outliers in the distance_circularity column, but the distribution plot shows two peaks and some right skewness, because the long tail is on the right side (mean > median).

In [None]:
num_features=[col for col in vehicle_df.select_dtypes(np.number).columns ]

plt.figure(figsize=(20,20))
for i,col in enumerate(num_features,start=1):
    plt.subplot(5,4,i)
    #recent seaborn versions require x and y to be passed as keyword arguments
    sns.boxplot(x=vehicle_df['class'], y=vehicle_df[col])
plt.show()

From the above we can see that there are outliers in the radius_ratio column and that it is right-skewed, because the long tail is on the right side (mean > median).
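
The "mean > median" rule of thumb for right skewness can be checked numerically. The sketch below uses a small made-up sample, not the vehicle data, and relies on pandas' sample skewness, where a positive value indicates a longer right tail.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are large
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20, 35], dtype=float)

mean = values.mean()
median = values.median()
skewness = values.skew()   # sample skewness; positive means a longer right tail

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
```

The two large values pull the mean well above the median, which is the pattern visible in the long-tailed columns above.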

In [None]:
#skewness of the numeric columns only ('class' is still a string column here)
vehicle_df.skew(numeric_only=True)

In [None]:
def outliers_transform_with_drop_record(base_dataset):
    num_features=[col for col in base_dataset.select_dtypes(np.number).columns ]
    print("Outliers in Dataset before Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")
        
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        #drop all rows outside the fences in one step (same result as dropping them inside a loop)
        outlier_index = base_dataset[(x < ltv) | (x > utv)].index
        base_dataset.drop(outlier_index, axis=0, inplace=True)
    
    print("\nOutliers in Dataset after Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")

In [None]:
#outliers_transform_with_drop_record(vehicle_df)

From the above we can see that there are outliers in the pr.axis_aspect_ratio column and that it is right-skewed, because the long tail is on the right side (mean > median).

In [None]:
def outliers_transform_with_replace_mean(base_dataset):
    num_features=[col for col in base_dataset.select_dtypes(np.number).columns ]
    print("Outliers in Dataset before Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")
        
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        #replace values outside the fences with the column mean and keep the rest unchanged
        base_dataset[cols]=x.mask((x<ltv)|(x>utv), x.mean())
                
    print("\nOutliers in Dataset after Treatment")
    print("====================================")
    for i,cols in enumerate(num_features,start=1):
        x = base_dataset[cols]
        qr3, qr1=np.percentile(x, [75,25])
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))
        ltv=qr1-(1.5*(iqr))
        count=(base_dataset[base_dataset[cols]>utv][cols].count())+(base_dataset[base_dataset[cols]<ltv][cols].count()) 
        print("Column ",cols,"\t has ",count," outliers")

In [None]:
outliers_transform_with_replace_mean(vehicle_df)

From the above we can see that there are outliers in the max.length_aspect_ratio column and that it is right-skewed, because the long tail is on the right side (mean > median).

In [None]:
#display how many rows are car, bus and van
vehicle_df['class'].value_counts()

In [None]:
sns.countplot(x=vehicle_df['class'])
plt.show()

From the above we can see that cars are the most frequent class, followed by buses and then vans.

So far we have analyzed each column and found outliers in some of them. The next step is to decide whether these outliers are natural or artificial. If they are natural we do not have to do anything, but if they are artificial we have to handle them.
We found outliers in 8 columns:
->radius_ratio
->pr.axis_aspect_ratio
->max.length_aspect_ratio
->scaled_variance
->scaled_variance.1
->scaled_radius_of_gyration.1
->skewness_about
->skewness_about.1

After looking at the maximum values of the columns above, the outliers appear to be natural rather than typos or other artificial errors.
Note: This is only my assumption, as there is no way to prove whether these outliers are natural or artificial.
Most algorithms are affected by outliers, including the SVM that we will apply to this data, so it is better to treat these outliers before modelling.
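
The outlier functions in this kernel flag a value when it lies more than 1.5 times the interquartile range (IQR) outside the quartiles. As a minimal sketch of that rule on a made-up sample (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
x = np.array([10, 12, 12, 13, 14, 15, 15, 16, 18, 60], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

is_outlier = (x < lower) | (x > upper)
print("fences:", lower, upper)
print("outliers:", x[is_outlier])
```

Only values outside the two fences are flagged, so moderately large values such as 18 are kept while the extreme value is marked.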

#Fix Outliers after dropping missing values

In [None]:
#find the correlation between independent variables
plt.figure(figsize=(20,5))
sns.heatmap(vehicle_df.select_dtypes(np.number).corr(),annot=True)
plt.show()

In [None]:
corr = vehicle_df.drop('class', axis=1).corr() # correlations between the independent features only
plt.figure(figsize=(12, 10))

sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);

So our objective is to recognize whether an object is a van, a bus or a car based on some input features.
Our main assumption is that there is little or no multicollinearity between the features.
If two features are highly correlated, there is little use in keeping both, and we can drop one of them.
The heatmap shows the correlation matrix, where we can see which features are highly correlated.
From the correlation matrix above we can see that many features are highly correlated. Looking carefully, scaled_variance.1 and scatter_ratio have a correlation of approximately 1, and many other pairs of features have a correlation above 0.9.
So we will drop columns whose correlation with another column is 0.9 or above in absolute value.
There are 8 such columns:
->max.length_rectangularity
->scaled_radius_of_gyration
->skewness_about.2
->scatter_ratio
->elongatedness
->pr.axis_rectangularity
->scaled_variance
->scaled_variance.1

Now we again have two options: drop the eight columns above manually, or apply PCA and let it decide how to explain the high-dimensional data with a smaller number of variables.
We will look at both approaches.
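
For the manual approach, the list of columns to drop can also be derived programmatically instead of being read off the heatmap. The following is a sketch on synthetic data; the feature names f1, f2, f3 and the 0.9 cut-off applied to the upper triangle of the correlation matrix are assumptions for illustration, not the exact procedure used to build the list above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical features: f2 is almost a copy of f1, f3 is independent noise
toy = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + rng.normal(scale=0.05, size=200),
    "f3": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair of features is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Mark a column for removal if it is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
reduced = toy.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", reduced.columns.tolist())
```

This keeps the first feature of each highly correlated pair and drops the later one, so the result depends on column order.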

Principal Component Analysis (PCA) is an unsupervised statistical technique used to explain high-dimensional data with a small number of variables called principal components. The principal components are linear combinations of the original variables in the dataset. The big disadvantage is that the model becomes harder to interpret, because each component mixes all the original features. In other words, a model built on PCA components becomes something of a black box.
In PCA we first compute the covariance matrix of the standardized features and then find its eigenvectors and eigenvalues. Each eigenvector has a corresponding eigenvalue, which measures the variance of the data along that eigenvector. We then sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues.
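
The steps just described can be sketched directly with NumPy. This is an illustration on synthetic data with made-up dimensions, not the computation scikit-learn performs internally (its PCA class is based on a singular value decomposition).

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 100 samples, 3 features, two of them strongly related
a = rng.normal(size=100)
X = np.column_stack([a, 0.8 * a + 0.2 * rng.normal(size=100), rng.normal(size=100)])

# 1. Standardize the features, then build their covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# 2. Eigen-decomposition (eigh is intended for symmetric matrices)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 3. Sort by decreasing eigenvalue and keep the top k eigenvectors
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
k = 2
components = eig_vecs[:, :k]

# 4. Project the data onto the k principal components
X_reduced = X_std @ components
explained_ratio = eig_vals / eig_vals.sum()

print(X_reduced.shape)
print(np.round(explained_ratio, 3))
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the components by how much variation they explain.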

In [None]:
#encode the target labels as integers: car=0, bus=1, van=2
vehicle_df.replace({'car':0,'bus':1,'van':2},inplace=True)

SVM Classifier (Before PCA)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        print("Train Result:\n=============")
        print(f"accuracy score: {accuracy_score(y_train, pred):.4f}\n")
        #print(f"Classification Report: \n \tPrecision: {precision_score(y_train, pred,average=None)}\n\tRecall Score: {recall_score(y_train, pred,average=None)}\n\tF1 score: {f1_score(y_train, pred,average=None)}\n")
        print(f"Confusion Matrix:\n=================\n {confusion_matrix(y_train, clf.predict(X_train))}\n")
        print("Classification Report:\n======================\n",classification_report(y_train, pred))
        
    elif train==False:
        pred = clf.predict(X_test)
        print("Test Result:\n============")        
        print(f"accuracy score: {accuracy_score(y_test, pred)}\n")
        #print(f"Classification Report: \n \tPrecision: {precision_score(y_test, pred,average=None)}\n\tRecall Score: {recall_score(y_test, pred,average=None)}\n\tF1 score: {f1_score(y_test, pred,average=None)}\n")
        print(f"Confusion Matrix:\n===============\n {confusion_matrix(y_test, pred)}\n")
        print("Classification Report:\n======================\n",classification_report(y_test, pred))

In [None]:
#now separate the dataframe into dependent and independent variables
X = vehicle_df.drop('class',axis=1)
Y = vehicle_df['class']
print("shape of X :", X.shape)
print("shape of Y :", Y.shape)

In [None]:
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=5)

Linear Kernel SVM

In [None]:
from sklearn.svm import SVC

lsvm = SVC(kernel='linear')
lsvm.fit(X_train, y_train)

print_score(lsvm, X_train, y_train, X_test, y_test, train=True)
print_score(lsvm, X_train, y_train, X_test, y_test, train=False)


lsvm_accuracy=accuracy_score(y_test, lsvm.predict(X_test))

Polynomial Kernel SVM

In [None]:
from sklearn.svm import SVC

psvm = SVC(kernel='poly', degree=2, gamma='auto')
psvm.fit(X_train, y_train)

print_score(psvm, X_train, y_train, X_test, y_test, train=True)
print_score(psvm, X_train, y_train, X_test, y_test, train=False)

psvm_accuracy=accuracy_score(y_test, psvm.predict(X_test))

Radial Kernel SVM

In [None]:
from sklearn.svm import SVC

rsvm = SVC(kernel='rbf', gamma=1)
rsvm.fit(X_train, y_train)

print_score(rsvm, X_train, y_train, X_test, y_test, train=True)
print_score(rsvm, X_train, y_train, X_test, y_test, train=False)

rsvm_accuracy=accuracy_score(y_test, rsvm.predict(X_test))

SVM on Scaled Data

In [None]:
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()
X_std = sc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_std, Y, test_size=0.3, random_state=5)
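
The cell above fits the scaler on all rows before splitting. One possible variation, shown here only as a sketch on synthetic data generated with make_classification (a stand-in, not the vehicle data), is to wrap the scaler and the classifier in a scikit-learn pipeline so that the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle features (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=5)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5)

# The scaler is fitted on the training rows only when the pipeline is fitted
model = make_pipeline(MinMaxScaler(), SVC(kernel="linear"))
model.fit(Xd_train, yd_train)

test_accuracy = model.score(Xd_test, yd_test)
scaler = model.named_steps["minmaxscaler"]
print(f"test accuracy: {test_accuracy:.3f}")
```

With this arrangement the test rows never influence the minimum and maximum used for scaling, which gives a slightly more honest estimate of test performance.
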

In [None]:
print("=======================Linear Kernel SVM==========================")

from sklearn.svm import SVC

lsvm = SVC(kernel='linear')
lsvm.fit(X_train, y_train)

print_score(lsvm, X_train, y_train, X_test, y_test, train=True)
print_score(lsvm, X_train, y_train, X_test, y_test, train=False)

lsvm_accuracy=accuracy_score(y_test, lsvm.predict(X_test))

print("=======================Polynomial Kernel SVM==========================")
from sklearn.svm import SVC

psvm = SVC(kernel='poly', degree=2, gamma='auto')
psvm.fit(X_train, y_train)

print_score(psvm, X_train, y_train, X_test, y_test, train=True)
print_score(psvm, X_train, y_train, X_test, y_test, train=False)

psvm_accuracy=accuracy_score(y_test, psvm.predict(X_test))

print("=======================Radial Kernel SVM==========================")
from sklearn.svm import SVC

rsvm = SVC(kernel='rbf', gamma=1)
rsvm.fit(X_train, y_train)

print_score(rsvm, X_train, y_train, X_test, y_test, train=True)
print_score(rsvm, X_train, y_train, X_test, y_test, train=False)

rsvm_accuracy=accuracy_score(y_test, rsvm.predict(X_test))


In [None]:
result = pd.DataFrame({'Model' : ['SVM Linear', 'SVM Polynomial', 'SVM Radial'], 
                       'Test Accuracy' : [lsvm_accuracy, psvm_accuracy, rsvm_accuracy],
                      })
result

**Support Vector Machine Hyperparameter tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100], 
              'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001], 
              'kernel': ['rbf', 'poly', 'linear']} 

#the iid argument was removed from GridSearchCV in scikit-learn 0.24
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5)

grid.fit(X_train, y_train)

print_score(grid, X_train, y_train, X_test, y_test, train=True)
print_score(grid, X_train, y_train, X_test, y_test, train=False)

**K-Fold Cross Validation**

In [None]:
from sklearn.model_selection import KFold, cross_val_score


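#shuffle=True is needed for random_state to have an effect (recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits= 10, shuffle=True, random_state = 1)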

#instantiate the object
svc = SVC(kernel='linear') 


#now we will cross-validate the model on the training data

results = cross_val_score(estimator = svc, X = X_train, y = y_train, cv = kfold)

print(results,"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean()*100, results.std()*100 * 2))

kf_accuracy=results.mean()
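
To see what K-fold cross-validation does with the rows, the splitter can be run on its own. This is a small sketch with ten made-up samples and five folds; the sizes are chosen only to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)   # ten hypothetical samples

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = []
seen_test = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # Each fold holds out a different block of rows for testing
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test.extend(test_idx.tolist())
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```

Every sample is used for testing exactly once, and the reported accuracy is the mean over the folds.
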

**Repeated Kfold Cross Validation**

In [None]:
from sklearn.model_selection import RepeatedKFold

X = vehicle_df.drop('class',axis=1).values
y = vehicle_df['class'].values

accuracies = []
#lr = LogisticRegression(random_state = 1)
svc = SVC(kernel='linear') 

rkf = RepeatedKFold(n_splits = 10, n_repeats= 3, random_state = 1)

for train_index, test_index in rkf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    svc.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, svc.predict(X_test)))

print(np.round(accuracies, 3),"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracies)*100, np.std(accuracies)*100 * 2))

rkf_accuracy=np.mean(accuracies)

In [None]:
result = pd.DataFrame({'Model' : ['Linear SVM', 'Linear SVM K-Fold', 'Linear SVM Repeated K-Fold'], 
                       'Accuracy' : [lsvm_accuracy, kf_accuracy, rkf_accuracy],
                      })
result

In [None]:
#now scale the feature attributes (the target labels were already encoded as numbers above)
X = vehicle_df.drop('class',axis=1)
y = vehicle_df['class']

X_scaled = X.apply(zscore)

#With Principal Component Analysis(PCA) 

In [None]:
#make the covariance matrix; we have 18 independent features, so our covariance matrix is an 18*18 matrix
cov_matrix = np.cov(X_scaled,rowvar=False)
print("cov_matrix shape:",cov_matrix.shape)
print("Covariance_matrix",cov_matrix)

In [None]:
#now with the help of above covariance matrix we will find eigen value and eigen vectors
pca_to_learn_variance = PCA(n_components=18)
pca_to_learn_variance.fit(X_scaled)

In [None]:
#display explained variance ratio
pca_to_learn_variance.explained_variance_ratio_

In [None]:
#display explained variance
pca_to_learn_variance.explained_variance_

In [None]:
#display principal components
pca_to_learn_variance.components_

In [None]:
plt.bar(list(range(1,19)),pca_to_learn_variance.explained_variance_ratio_)
plt.xlabel("eigen value/components")
plt.ylabel("variation explained")
plt.show()

In [None]:
plt.step(list(range(1,19)),np.cumsum(pca_to_learn_variance.explained_variance_ratio_))
plt.xlabel("eigen value/components")
plt.ylabel("cumulative variation explained")
plt.show()

From the plot above we can see that 8 dimensions are able to explain about 95% of the variance of the data, so we will use the first 8 principal components.
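
Instead of reading the number of components off the plot, it can be computed from the cumulative explained variance. The sketch below uses synthetic data with two latent factors; the 0.95 threshold matches the text, everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical data: 6 observed features driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_components = int(np.argmax(cumulative >= threshold) + 1)

print(np.round(cumulative, 3))
print("components needed:", n_components)
```

scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it selects the number of components needed to explain at least that fraction of the variance.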

In [None]:
#use first 8 principal components
pca_8c = PCA(n_components=8)
pca_8c.fit(X_scaled)

In [None]:
#transform the raw data which is in 18 dimension into 8 new dimension with pca
X_scaled_pca_8c = pca_8c.transform(X_scaled)

In [None]:
#display the shape of the PCA-transformed data
X_scaled_pca_8c.shape

Before using the 8 principal components, which explain more than 95% of the variation in the data, we will first build a model on the raw (scaled) data. After that we will build a model on the PCA data and then compare both models.

In [None]:
#now split the data into 80:20 ratio
rawdata_X_train,rawdata_X_test,rawdata_y_train,rawdata_y_test = train_test_split(X_scaled,Y,test_size=0.20,random_state=1)
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(X_scaled_pca_8c,Y,test_size=0.20,random_state=1)

In [None]:
print("shape of rawdata_X_train",rawdata_X_train.shape)
print("shape of rawdata_y_train",rawdata_y_train.shape)
print("shape of rawdata_X_test",rawdata_X_test.shape)
print("shape of rawdata_y_test",rawdata_y_test.shape)
print("--------------------------------------------")
print("shape of pca_X_train",pca_X_train.shape)
print("shape of pca_y_train",pca_y_train.shape)
print("shape of pca_X_test",pca_X_test.shape)
print("shape of pca_y_test",pca_y_test.shape)

**Without PCA**

In [None]:
from sklearn.model_selection import KFold, cross_val_score


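#shuffle=True is needed for random_state to have an effect (recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits= 10, shuffle=True, random_state = 1)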

svc = SVC() #instantiate the object

#now we will train the model with raw data

results = cross_val_score(estimator = svc, X = rawdata_X_train, y = rawdata_y_train, cv = kfold)

print(results,"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean()*100, results.std()*100 * 2))

sns.boxplot(results)
plt.show()

In [None]:
svc.fit(rawdata_X_train,rawdata_y_train)

print("Raw Data Training Accuracy :\t ", svc.score(rawdata_X_train, rawdata_y_train))

raw_train_accuracy=svc.score(rawdata_X_train, rawdata_y_train)

#Scoring the model on test_data
print("Raw Data Testing Accuracy :\t  ",  svc.score(rawdata_X_test, rawdata_y_test))

raw_test_accuracy=svc.score(rawdata_X_test, rawdata_y_test)

y_pred = svc.predict(rawdata_X_test)

In [None]:
print(classification_report(rawdata_y_test, svc.predict(rawdata_X_test)))

**With PCA**

In [None]:
#now fit the model on pca data with new dimension

from sklearn.model_selection import KFold, cross_val_score

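#shuffle=True is needed for random_state to have an effect (recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits= 10, shuffle=True, random_state = 1)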

svc = SVC() #instantiate the object

#now train the model with pca data with new dimension

pca_results = cross_val_score(estimator = svc, X = pca_X_train, y = pca_y_train, cv = kfold)

print(pca_results,"\n")

print("Accuracy: %0.2f (+/- %0.2f)" % (pca_results.mean()*100, pca_results.std()*100 * 2))

sns.boxplot(pca_results)
plt.show()

From the above we can see that even after dropping 10 of the 18 dimensions we still achieve about 94% accuracy.

In [None]:
svc.fit(pca_X_train,pca_y_train)

print("PCA data Training Accuracy :\t ", svc.score(pca_X_train, pca_y_train))

pca_train_accuracy=svc.score(pca_X_train, pca_y_train)

#Scoring the model on test_data
print("PCA data Testing Accuracy :\t  ",  svc.score(pca_X_test, pca_y_test))

pca_test_accuracy=svc.score(pca_X_test, pca_y_test)


In [None]:
print(classification_report(pca_y_test, svc.predict(pca_X_test)))

In [None]:
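#display confusion matrix of both models
#svc was last fitted on the PCA data, so refit a fresh SVC on the raw data for its predictions
rawdata_y_predict = SVC().fit(rawdata_X_train, rawdata_y_train).predict(rawdata_X_test)
pca_y_predict = svc.predict(pca_X_test)
print("Confusion matrix with raw data(18 dimension)\n",confusion_matrix(rawdata_y_test,rawdata_y_predict))
print("Confusion matrix with pca data(8 dimension)\n",confusion_matrix(pca_y_test,pca_y_predict))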

In [None]:
result = pd.DataFrame({'TrainTest' : ['raw_train_accuracy', 'raw_test_accuracy', 'pca_train_accuracy','pca_test_accuracy'], 
                       'Accuracy' : [raw_train_accuracy,raw_test_accuracy, pca_train_accuracy, pca_test_accuracy],
                      })
result

#With dropping the above mentioned columns Manually

In [None]:
#drop the columns
X_scaled.drop(['max.length_rectangularity','scaled_radius_of_gyration','skewness_about.2','scatter_ratio','elongatedness','pr.axis_rectangularity','scaled_variance','scaled_variance.1'],axis=1,inplace=True)

In [None]:
#display the shape of new dataframe
X_scaled.shape

In [None]:
dropcolumn_X_train,dropcolumn_X_test,dropcolumn_y_train,dropcolumn_y_test = train_test_split(X_scaled,Y,test_size=0.20,random_state=1)

In [None]:
print("shape of dropcolumn_X_train",dropcolumn_X_train.shape)
print("shape of dropcolumn_y_train",dropcolumn_y_train.shape)
print("shape of dropcolumn_X_test",dropcolumn_X_test.shape)
print("shape of dropcolumn_y_test",dropcolumn_y_test.shape)

In [None]:
#fit the model on dropcolumn_X_train,dropcolumn_y_train
svc.fit(dropcolumn_X_train,dropcolumn_y_train)

In [None]:
#predict the y value
dropcolumn_y_predict = svc.predict(dropcolumn_X_test)

In [None]:
#display the accuracy score and confusion matrix
print("Accuracy score with dropcolumn data(10 dimension)",accuracy_score(dropcolumn_y_test,dropcolumn_y_predict))
print("Confusion matrix with dropcolumn data(10 dimension)\n",confusion_matrix(dropcolumn_y_test,dropcolumn_y_predict))


#Conclusion:
From the above we can see that PCA is doing a very good job. Accuracy with PCA is approximately 94% and with the raw data approximately 96%, but note that the PCA model reaches 94% with only 8 dimensions, whereas the raw data has 18. Everything has two sides, though: the disadvantage of PCA is that the model is harder to interpret, because it works on components rather than on the original features. It becomes a black box.

Thanks for reading the kernel!
Happy Learning:)