## Unsupervised Learning, PCA, SVM & Cross Validation

Import basic libraries 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
df=pd.read_csv('../input/vehicle/vehicle.csv')

In [None]:
df.head()

In [None]:
df.info()

##### Our data types:
1. The features are all numeric (int and float).
2. The exception is the target column (`class`), which is of type object.

In [None]:
print("The Number of Rows in our dataset:{} & Number of columns:{}".format(df.shape[0],df.shape[1]))

Let's check for the number of missing values

In [None]:
df.isna().sum()

#### There are some missing values for 14 columns in our dataset.

In [None]:
# numeric_only=True skips the object-typed 'class' column, which has no median
df=df.fillna(df.median(numeric_only=True))

I am using the median to fill the missing values because the mean is less reliable when the data has outliers.
Outliers take extreme values that pull the mean toward them, so filling with the mean could distort our analysis.
The median is robust to such values and is therefore preferred here.

Tip: many ML algorithms are sensitive to outliers.
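
To make the point about the median concrete, here is a minimal sketch on a made-up column (the values are hypothetical and not taken from the vehicle dataset). It compares the two candidate fill values when a single extreme value is present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: five typical values, one extreme outlier, one gap.
s = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 500.0, np.nan])

mean_fill = s.mean()      # pulled upward by the single outlier
median_fill = s.median()  # stays near the bulk of the data

print("mean fill value:  ", round(mean_fill, 2))
print("median fill value:", median_fill)

# Fill the gap with the median, as done for the vehicle data above.
filled = s.fillna(median_fill)
print(filled.tolist())
```

The mean lands far from every typical value, while the median stays inside the range where most observations lie.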

In [None]:
df.describe().T

Our dataset looks reasonably good; all columns hold consistent numeric values.

In [None]:
columns=list(df)
df[columns].hist(stacked=True,density=True, bins=100,color='Orange', figsize=(16,30), layout=(10,3)); 

From the histograms we can infer that most of our independent variables are approximately normally distributed, and some appear to be mixtures of multiple Gaussians.

### Target Variable Callout

Our target variable is `class`. Given the silhouette attributes of vehicles seen from different
angles, we are going to identify the type of vehicle.

In [None]:
grp=df.groupby('class')['class'].count()
grp.plot.pie(shadow=True, startangle=120,autopct='%.2f')

##### Our class label consists of three vehicle types: van, bus and car. Cars contribute 51% of the data; buses and vans together contribute 49%.

In [None]:
df_car=df[df['class']=='car']
df_van=df[df['class']=='van']
df_bus=df[df['class']=='bus']

###### Let's see if we can distinguish our classes based on some features.

In [None]:
plt.scatter(df_bus['circularity'],np.zeros_like(df_bus['circularity']),marker='s',color='Red',alpha=0.5)
plt.scatter(df_car['circularity'],np.zeros_like(df_car['circularity']),marker='|',color='blue',alpha=0.8)
plt.scatter(df_van['circularity'],np.zeros_like(df_van['circularity']),marker='o',color='yellow',alpha=1)
plt.xlabel('Circularity')
plt.show()
plt.scatter(df_bus['distance_circularity'],np.zeros_like(df_bus['distance_circularity']),marker='s',color='Red',)
plt.scatter(df_car['distance_circularity'],np.zeros_like(df_car['distance_circularity']),marker='|',color='blue',alpha=0.4)
plt.scatter(df_van['distance_circularity'],np.zeros_like(df_van['distance_circularity']),marker='o',color='yellow',alpha=0.4)
plt.xlabel('distance_circularity')
plt.show()
plt.scatter(df_bus['hollows_ratio'],np.zeros_like(df_bus['hollows_ratio']),marker='s',color='Red',)
plt.scatter(df_car['hollows_ratio'],np.zeros_like(df_car['hollows_ratio']),marker='|',color='blue',alpha=0.4)
plt.scatter(df_van['hollows_ratio'],np.zeros_like(df_van['hollows_ratio']),marker='d',color='green',alpha=0.4)
plt.xlabel('hollows_ratio')
plt.show()

Attributes such as circularity, distance circularity and hollows ratio overlap across all three vehicle types (van, bus and car) with only slight variation, so it looks tough to differentiate the classes visually with these attributes.

In [None]:
plt.figure(figsize=(12,8))
# sns.distplot is deprecated; kdeplot draws the same density curves
sns.kdeplot(df_bus['elongatedness'],color='r',label="Bus")
sns.kdeplot(df_car['elongatedness'],color='g',label="Car")
sns.kdeplot(df_van['elongatedness'],color='b',label="Van")
plt.legend()
plt.title("elongatedness Distribution")

Distribution of elongatedness for car seems to be more compared to bus and van

In [None]:
plt.figure(figsize=(12,8))
# sns.distplot is deprecated; kdeplot draws the same density curves
sns.kdeplot(df_bus['max.length_rectangularity'],color='r',label="Bus")
sns.kdeplot(df_car['max.length_rectangularity'],color='g',label="Car")
sns.kdeplot(df_van['max.length_rectangularity'],color='b',label="Van")
plt.legend()
plt.title("max.length_rectangularity Distribution")

In [None]:
plt.figure(figsize=(8,6))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(data=df,x='scaled_radius_of_gyration',y='scaled_variance',hue='class')

Scaled Radius of gyration and scaled variance have a linear relationship for all vehicle types.

In [None]:
# Print the correlation between columns in tabular format
#core=df.corr(numeric_only=True)
#print(core)
# Use a pair plot to visualize the correlation
sns.pairplot(df,hue='class');

#### Pairplot Analysis:

##### Along the diagonal:
1. The density plots on the diagonal show that the attribute distributions differ somewhat between classes, although there is some overlap.
2. Most distributions look roughly normal; the data may have been drawn from multiple Gaussians or multiple sources.
3. We can see multiple peaks within the distribution of a single class.

##### Upper part of the diagonal:
1. In a pairplot it is enough to consider either the upper or the lower triangle, since one mirrors the other.
2. We can observe that some features have a positive linear relationship with each other.
3. Some have a negative linear relationship and some show no relationship at all (cloud-like scatter).

#### Along the diagonal, there seems to be noise in our data: some attributes have extended tails (both left and right). As we proceed, it is important to take care of these during data preprocessing.

#### We can use techniques such as the Z-score or the IQR score.

#### Box plot to identify the outliers

In [None]:
plt.figure(figsize=(18,12))
sns.boxplot(data=df) 
plt.xticks(rotation=45)

#### Box Plot Observation :

1. Most of our independent features have their median close to the middle of the box, except for some that are skewed.
2. There are outliers in columns such as radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio and scaled_radius_of_gyration. We can check during feature engineering whether these attributes affect our analysis. If they do, we should think of options to handle the outliers.
3. If these attributes do not affect our analysis, we could also drop these columns before training our model.
4. In this dataset, the attributes with outliers that also show high collinearity are dropped from the analysis. To avoid multicollinearity, I drop them in the upcoming steps.

In [None]:
# numeric_only=True skips the object-typed 'class' column
df.skew(numeric_only=True)

##### Most attributes in the dataset have a roughly symmetrical distribution

1. pr.axis_aspect_ratio, max.length_aspect_ratio and scaled_radius_of_gyration.1 have positive skewness, with a long tail to the right of the median.
2. hollows_ratio alone has negative skewness, with the tail extending to the left of the median.

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df.corr(numeric_only=True),annot=True)

#### Attribute Selection 

From the above heatmap, it is evident that there is a lot of positive correlation between the attributes.

When two variables are highly correlated, it is better to drop one of them, as keeping both may cause multicollinearity.

For this problem statement, I am using a threshold of 0.93 (0.93 allows me to delete one more column than 0.95) to identify highly correlated variables, and I am going to drop them as they contain redundant information.

In [None]:
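Cor_Matrix=df.corr(numeric_only=True).abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once.
# np.bool was removed from NumPy; the built-in bool is the replacement.
upper_tri = Cor_Matrix.where(np.triu(np.ones(Cor_Matrix.shape),k=1).astype(bool))
#print(upper_tri)
to_drop =[column for column in upper_tri.columns if any(upper_tri[column] > 0.93)]

print("These columns can be dropped as they are redundant:",to_drop[0:6])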

#### Attribute Selection For Analysis

In [None]:
df1=df.drop(['elongatedness','pr.axis_rectangularity','max.length_rectangularity','scaled_variance','scaled_variance.1'],axis=1)
print("The Number of Rows in our dataset :{} & Number of columns after removing multicollinearity:{}".format(df1.shape[0],df1.shape[1]))


In [None]:
plt.figure(figsize=(10,12))
sns.boxplot(data=df1) 
plt.xticks(rotation=45)

#### Outlier Removal: after selecting the features as above, we can see that outliers are still present in the dataset for some attributes.

###### How are we going to handle them?

###### We compute the interquartile range (IQR = Q3 - Q1) for each column and remove any row with a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
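
As a worked illustration of the IQR rule, here is a small sketch on a made-up column (the values are hypothetical, not from the vehicle data). It computes the quartiles, the fences at 1.5 times the IQR, and keeps only the values inside the fences.

```python
import pandas as pd

# Hypothetical toy column with one obvious outlier.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = s[(s >= lower) & (s <= upper)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower, upper)
print("kept:", kept.tolist())
```

Here Q1 is 3 and Q3 is 7, so the fences sit at -3 and 13 and only the value 100 is removed.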

In [None]:
#IQR Calculation 

# numeric_only=True skips the object-typed 'class' column
Q1=df1.quantile(0.25,numeric_only=True)
Q3=df1.quantile(0.75,numeric_only=True)
IQR= Q3 - Q1

IQR

In [None]:
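# Compare only the numeric columns against the IQR fences, then drop rows with any outlier
num_cols=df1.select_dtypes(include='number')
df1 = df1[~((num_cols < (Q1 - 1.5 * IQR)) |(num_cols > (Q3 + 1.5 * IQR))).any(axis=1)]
df1.shape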

#### We have now removed the outliers, and the number of rows in our data has been reduced from 846 to 815.

##### Let's plot the box plot again and check for outliers

In [None]:
plt.figure(figsize=(10,12))
sns.boxplot(data=df1) 
plt.xticks(rotation=45)

As you can see, the outliers have been removed from the dataset and our data looks clean.

### SVM on Raw Data with Scaling 

1. We are not certain about the unit of measurement for all attributes in our data; some clearly have high magnitude and some low magnitude.
2. Hence, it is better to scale our data so that all attributes are on the same scale.
3. I am using the standard scaler here.

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,RocCurveDisplay
from sklearn.preprocessing import StandardScaler

X_MNMX=df1.drop(['class'],axis=1)
Y_MNMX=df1['class']
#X_MNMX=X_MNMX.apply(zscore)
X_Train,X_Test,Y_Train,Y_Test=train_test_split(X_MNMX,Y_MNMX,test_size=0.3,random_state=23)
sc=StandardScaler()
X_Train=sc.fit_transform(X_Train)
X_Test=sc.transform(X_Test)


SVM1=SVC(C=1.0,kernel='rbf')
SVM1.fit(X_Train,Y_Train)
#print(SVM1.score(X_Train,Y_Train))
#print(SVM1.score(X_Test,Y_Test))
PRED_SVM_M1=SVM1.predict(X_Test)
CM_SVM_M1=confusion_matrix(Y_Test,PRED_SVM_M1)
#print(CM_SVM_M1)

print("#####################Classification Report & Accuracy Score#####################")
print("----SVM Model ----")
print("Model Score on Training data with selected features :{}".format(SVM1.score(X_Train,Y_Train) * 100))
print("Model Score on Testing  data with selected features :{}".format(SVM1.score(X_Test,Y_Test) * 100))
print("Accuracy Score of SVM on Test Data:{}".format(accuracy_score(Y_Test,PRED_SVM_M1)*100))
print(classification_report(Y_Test,PRED_SVM_M1))
sns.heatmap(CM_SVM_M1,annot=True,xticklabels=True,yticklabels=True,fmt='g',linewidths=.5,cmap='tab20c_r')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


Our SVM model has an accuracy score of 93.46%. The model predicts the vehicle class correctly in most cases, though there are some misclassifications.

Our precision and recall scores are also good.

### k-Fold Cross Validation

#### Cross validation is a technique to evaluate the performance of a model and see how well it performs on "unseen data".
#### K is the number of folds the data is split into. The estimator trains on k-1 folds, is tested on the held-out fold, and reports the score.
#### This process is repeated so that each fold serves as the test set once, and the accuracy score is collected for each iteration.
#### The mean and standard deviation are then calculated over the k accuracy scores.
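
The procedure described above can be sketched as an explicit loop. This is only an illustration under stated assumptions: the data comes from `make_classification` (synthetic, not the vehicle dataset), five folds are used for brevity, and `fold_scores` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the vehicle dataset).
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=5, random_state=23)

kfold = KFold(n_splits=5, shuffle=True, random_state=23)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on k-1 folds, score on the held-out fold.
    model = SVC(C=1.0, kernel='rbf')
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print("per-fold accuracy:", fold_scores)
print("mean:", fold_scores.mean(), "std:", fold_scores.std())
```

`cross_val_score` performs the same loop internally, cloning and refitting the estimator for every fold.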

In [None]:
# Applying k-Fold Cross Validation 
from sklearn.model_selection import KFold,cross_val_score
Folds=10
seed=23
kfold=KFold(shuffle=True,n_splits=Folds,random_state=seed)
accuracies = cross_val_score(estimator = SVM1, X = X_MNMX, y = Y_MNMX, cv = kfold) 
accuracies
print("K Fold score mean:{}".format(accuracies.mean()*100))
print("K Fold score standard deviation:{}".format(accuracies.std()*100))

#### As you can see from the above code, we obtained a mean accuracy score of 65.9% with a standard deviation of 2.38.
#### From this we can infer that our SVM model on the raw dataset gives an accuracy of roughly 63.5% to 68.26% (mean ± one standard deviation) on unseen data. Note that `cross_val_score` refits the model on the unscaled `X_MNMX`, whereas the train/test run above used standardized features.
###### This score is not that great. We shall try to improve our models further.
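
Because `cross_val_score` above is given the unscaled `X_MNMX`, each fold trains the SVM on unscaled features. One way to keep the scaling step inside every fold is to wrap the scaler and the classifier in a pipeline. The sketch below is illustrative only: it uses scikit-learn's bundled wine dataset as a stand-in (an assumption, not the vehicle data), and `raw_scores` and `scaled_scores` are hypothetical names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset with features on very different scales (an assumption).
X, y = load_wine(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=23)

# SVM on unscaled features, as in the cross validation above.
raw_scores = cross_val_score(SVC(C=1.0, kernel='rbf'), X, y, cv=kfold)

# Scaler + SVM in one pipeline: the scaler is refit on each training fold only.
scaled_model = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf'))
scaled_scores = cross_val_score(scaled_model, X, y, cv=kfold)

print("unscaled mean accuracy:", raw_scores.mean() * 100)
print("scaled mean accuracy:  ", scaled_scores.mean() * 100)
```

Fitting the scaler inside the pipeline also avoids leaking statistics from the held-out fold into training.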

### Principal Component Analysis 
#### The idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the data set.

In [None]:
#Import PCA from sklearn
from sklearn.decomposition import PCA
from scipy.stats import zscore

In [None]:
# Dropping those columns which had high correlation that we found during EDA
df=df.drop(['elongatedness','pr.axis_rectangularity','max.length_rectangularity','scaled_variance','scaled_variance.1'],axis=1)

#### PCA STEP 1 :
###### Scale the independent variables, since different units of measure and magnitudes affect the PCA.

In [None]:
# Separate features and target, z-score the features, then fit PCA (no train/test split yet)
X_PCA=df.drop(['class'],axis=1)
Y_PCA=df['class']
X_PCA=X_PCA.apply(zscore)

PC=PCA(n_components=10,random_state=23)
PC_DF=PC.fit_transform(X_PCA)
#X_Train_PCA=PC.transform(X_PCA_Train)

#### PCA STEP 2 :
##### Create the covariance matrix

In [None]:
covMatrix = np.cov(X_PCA,rowvar=False)
plt.subplots(figsize=(7,7))
sns.heatmap(covMatrix,annot=True,cmap='afmhot_r')

#### PCA STEP 3:
##### Find the Eigen values & Eigen vectors
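
To connect steps 2 and 3, the following sketch checks on synthetic data (an assumption, not the vehicle dataset) that the eigenvalues and eigenvectors of the covariance matrix agree with `explained_variance_` and `components_` from scikit-learn's `PCA`. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
# Synthetic correlated data: 300 samples, 4 features (an assumption).
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

# Eigen decomposition of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eig_vals)[::-1]         # sort descending, like PCA
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

pca = PCA(n_components=4).fit(X)

same_values = np.allclose(eig_vals, pca.explained_variance_)
# Rows of components_ are eigenvectors; compare up to sign.
same_vectors = np.allclose(np.abs(eig_vecs.T), np.abs(pca.components_), atol=1e-6)
print("eigenvalues match: ", same_values)
print("eigenvectors match:", same_vectors)
```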

In [None]:
print("##############Eigen Values##############")
print(PC.explained_variance_)
print("##############Eigen Vectors##############")
print(PC.components_)

#### PCA STEP 4 :
##### Find the amount of variance captured by each principal component and visualize it

In [None]:
plt.bar(list(range(1,11)),PC.explained_variance_ratio_, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()

In [None]:
plt.step(list(range(1,11)),np.cumsum(PC.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('Principal component')
plt.show()

####  Principal Components that capture about 95% of the variance in the data

In [None]:
print('Explained variation per principal component: {}'.format(PC.explained_variance_ratio_))
P_Components=PC.explained_variance_ratio_
print("The first 7 principal components explain {:.2f}% of the variance in the data".format(np.sum(P_Components[0:7])*100))
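
Rather than reading the number of components off the plot, it can also be computed from the cumulative explained variance. The sketch below is a hedged illustration on scikit-learn's bundled wine dataset (a stand-in, not the vehicle data); `n_needed` is a hypothetical name. It also shows that passing a float between 0 and 1 as `n_components` lets `PCA` pick the count itself.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption), standardized before PCA.
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit all components and find the first count whose cumulative ratio exceeds 95%.
pca_full = PCA().fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cum_var > 0.95)) + 1

# Equivalent shortcut: a float n_components is treated as a variance target.
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print("components needed for 95% variance:", n_needed)
print("components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```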

#### Creating a dataframe  and using Principal Components instead of the original data

In [None]:
PCA_DF=pd.DataFrame(data=PC_DF,columns = ['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10'])

PCA_DF['Y']=Y_PCA
PCA_DF

In [None]:
sns.pairplot(PCA_DF,diag_kind='kde');

#### As the pairplot shows, our principal components are uncorrelated with each other (PCA guarantees zero linear correlation, which is weaker than independence), and their distributions look roughly normal.
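
The absence of correlation between principal components can also be checked numerically. A minimal sketch on synthetic data (an assumption, not the vehicle dataset): the off-diagonal entries of the correlation matrix of the PCA scores should be zero up to floating-point error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
# Synthetic correlated features (an assumption).
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))

scores = PCA(n_components=5).fit_transform(X)

# Correlation matrix of the component scores; subtract the identity
# to leave only the off-diagonal correlations.
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr - np.eye(5)
max_off_diag = np.abs(off_diag).max()
print("largest off-diagonal correlation:", max_off_diag)
```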

#### Support Vector Machine on PCA data 

In [None]:
X_PCA_DF=PCA_DF.drop(['Y'],axis=1)
Y_PCA_DF=PCA_DF['Y']

In [None]:
X_PCA_Train,X_PCA_Test,Y_PCA_Train,Y_PCA_Test=train_test_split(X_PCA_DF,Y_PCA_DF,test_size=0.3,random_state=23)

In [None]:
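# Note: this 7-component PCA object is defined but never fitted or used below;
# SVM3 is trained on all 10 principal components in X_PCA_DF.
PCA_Final=PCA(n_components=7,random_state=23)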

In [None]:
SVM3=SVC(C=1.0,kernel='rbf')

In [None]:
SVM3.fit(X_PCA_Train,Y_PCA_Train)

In [None]:
PRED_PCA=SVM3.predict(X_PCA_Test)

In [None]:
CM_SVM_PCA=confusion_matrix(Y_PCA_Test,PRED_PCA)

In [None]:
print("#####################Classification Report & Accuracy Score of PCA Data on SVM#####################")
print("----SVM Model ----")
print("Model Score on Training data with PCA features :{}".format(SVM3.score(X_PCA_Train,Y_PCA_Train) * 100))
print("Model Score on Testing  data with PCA features :{}".format(SVM3.score(X_PCA_Test,Y_PCA_Test) * 100))
print("Accuracy Score of SVM on Test Data:{}".format(accuracy_score(Y_PCA_Test,PRED_PCA)*100))
print(classification_report(Y_PCA_Test,PRED_PCA))
sns.heatmap(CM_SVM_PCA,annot=True,xticklabels=True,yticklabels=True,linewidths=.5,cmap='tab20c_r',fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### k-Fold Cross Validation

In [None]:
# Applying k-Fold Cross Validation 
from sklearn.model_selection import cross_val_score,KFold
kfold_pca=KFold(shuffle=True,n_splits=10,random_state=23)
accuracies = cross_val_score(estimator = SVM3, X = X_PCA_DF, y = Y_PCA_DF, cv = kfold_pca) 
accuracies
print("K Fold score mean:{}".format(accuracies.mean()*100))
print("K Fold score standard deviation:{}".format(accuracies.std()*100))
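
In the cells above, PCA is fitted on the full dataset before the split and `SVM3` is trained on all ten components. An alternative worth considering is to put scaling, a 7-component PCA and the SVM into one pipeline, so that every cross-validation fold fits the scaler and the PCA on its training part only. The sketch below is illustrative and uses scikit-learn's bundled wine dataset as a stand-in (an assumption, not the vehicle data); `model` and `scores` are hypothetical names.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset with 13 numeric features (an assumption).
X, y = load_wine(return_X_y=True)

# Scale -> project onto 7 principal components -> classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=7, random_state=23),
    SVC(C=1.0, kernel='rbf'),
)

kfold = KFold(n_splits=10, shuffle=True, random_state=23)
scores = cross_val_score(model, X, y, cv=kfold)

print("K Fold score mean:{}".format(scores.mean() * 100))
print("K Fold score standard deviation:{}".format(scores.std() * 100))
```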

### Inference : 
1. First, we ran an SVM on this dataset with 14 attributes and got a model score of 93% on the test data. Through k-fold cross validation we estimated that model 1 gives an accuracy in the range 63.5% to 68.26% when exposed to unseen data.

2. Secondly, we did PCA on the raw data and found that just 7 components capture about 96% of the variance in the data. We then ran the same SVM algorithm on the principal components.

3. On the test data with PCA components, our model gave a score of 94%. With k-fold cross validation, the accuracy on unseen data ranged from 90.71% to 97.45%. This is a great score.

PCA plays a vital role: by ignoring low-impact directions, it focuses on the components that capture most of the variance in the data, which can enhance the performance of our model. Note that the PCA inputs were z-scored while the raw-data cross validation used unscaled features, so part of the gap may come from scaling rather than from PCA itself.


### Conclusion- "When the Rubber meets the Road"

1. Though the train and test scores of the SVM on raw and PCA data were not that different, evaluating performance on unseen data with k-fold cross validation showed a phenomenal difference in favor of the model built on the PCA dataset. Furthermore, on large datasets PCA can help reduce the computational cost and the complexity of the model.
2. Raw Data Cross Validation Range - 63.5% to 68.26%.
3. PCA Data Cross Validation Range - 90.71% to 97.45% with a confidence interval of 95%

