# Unsupervised Learning

## Data Description
The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400.
This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

## Domain
Object Recognition

## Context
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

#### Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

from sklearn.decomposition import PCA
from sklearn.model_selection import KFold,cross_val_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

#### Import dataset

In [None]:
data = pd.read_csv('../input/vehicle2/vehicle-2.csv')
data.shape

- We have **846 rows** and <font color='red'>**19 columns**</font> including target column(*class*).

#### Let's look at a sample of the data

In [None]:
data.sample(10)

- All the attributes appear to be quantitative; only the target variable (class) is categorical.

In [None]:
data.info()  # Overview of the feature dtypes and the presence of null values.

- We can clearly see that there are null values in circularity and in other features as well.
- class has dtype object, which generally represents string values.

#### Description of features of data

In [None]:
data_info=pd.read_csv('../input/attributes-vehicle-silhouettecsv/attributes_vehicle_silhouette.csv')
data_info['name']=data.columns
data_info

### PreProcessing

#### Checking and removing NAN values

In [None]:
# Checking if null values are present
data.isnull().sum()

In [None]:
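# Fill the NaN values of every feature column with that column's median.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about or ignore.
for col in data.columns[0:-1]:
    data[col] = data[col].fillna(data[col].median())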

- The target variable is categorical, so we need to convert it into numeric values.
- There are different ways to do that: one-hot encoding, label encoding and a few others.
    - Let's go with label encoding, as it doesn't increase the dataset size and we already have 19 columns.

In [None]:
# convert target variable to label encoding
label_encoder=LabelEncoder()
data['class'] = label_encoder.fit_transform(data['class'])
data['class'].value_counts()

`LabelEncoder` assigns codes in alphabetical order of the class names, so:
- 0 is Bus
- 1 is Car
- 2 is Van
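
The integer codes can be read back from the fitted encoder rather than assumed. The following is a minimal sketch on toy labels; the `toy_*` names and the label values are illustrative assumptions, not the notebook's data. `LabelEncoder` keeps the class names in sorted order in `classes_`, and each name's position in that array is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the notebook's `class` column (an assumption).
toy_labels = ['van', 'car', 'bus', 'car', 'van']

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted; the position of each name is the code assigned to it.
code_to_name = {int(code): str(name) for code, name in enumerate(toy_encoder.classes_)}
print(code_to_name)
print(toy_codes.tolist())
```

Running the same dictionary comprehension on the notebook's own `label_encoder` would show the mapping actually used for the `class` column.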

Let's check the head of the data, now with the null values filled and the target variable encoded as numbers.

In [None]:
data.head()

### Understanding the features

In [None]:
data.describe().transpose()

- scaled_variance.1 has a high standard deviation **(176)**.
- We will explore the properties of the other variables later on.

### Correlation Analysis

In [None]:
corr = data.corr()

In [None]:
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 3.5})
plt.figure(figsize=(18,7))

# Create a mask so that each correlation value is shown only once.
# Start from a boolean array with the same shape as corr, all False.
mask = np.zeros_like(corr, dtype=bool)

# Mark every element above the main diagonal as True.
mask[np.triu_indices_from(mask, 1)] = True

# sns.heatmap does not draw cells where ``mask`` is True.
# Cells with missing values are masked automatically.
a= sns.heatmap(corr, mask=mask, annot=True, fmt='.2f')
rotx= a.set_xticklabels(a.get_xticklabels(), rotation=90)
roty= a.set_yticklabels(a.get_yticklabels(), rotation=30)

1. Elongatedness is highly correlated with compactness, circularity, distance_circularity, radius_ratio, scatter_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration.
2. Hollows_ratio is highly correlated with skewness_about.2.
3. The target variable is not highly correlated with any other variable.
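
Observations like these can also be read off the correlation matrix programmatically instead of by eye. The sketch below is one way to do it under stated assumptions: `high_corr_pairs`, the 0.8 threshold and the synthetic `demo` frame are all hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

# Synthetic demo frame with hypothetical columns (not the vehicle data).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'a': base,
    'b': base * 2 + rng.normal(scale=0.1, size=200),  # almost a rescaled copy of a
    'c': -base + rng.normal(scale=0.1, size=200),     # strongly negatively related
    'd': rng.normal(size=200),                        # independent noise
})
for pair in high_corr_pairs(demo):
    print(pair)
```

Applied to the notebook's `data`, the same helper would list the feature pairs behind the heatmap observations; the threshold is a judgment call.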

#### Features to be explored more
1. **Elongatedness**: This feature is highly correlated with at least 10 other features, which means that much of the variance explained by those features is already covered by this single feature. Compared with the other features, it also has the strongest correlation with the target variable.
2. **Hollows Ratio**
3. **pr.axis_aspect_ratio**
4. **max.length_aspect_ratio**
5. **scaled_radius_of_gyration.1**
6. **skewness_about**
7. **skewness_about.1**

In [None]:
features= ['pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'elongatedness',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'hollows_ratio']

In [None]:
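# Helper function: returns the [row, column] index of every cell in a
# rows x columns grid of subplots, in row-major order.
# It uses its own arguments instead of the globals n_rows and n_columns.
def i_j_counter(rows, columns):
    return [[i, j] for i in range(rows) for j in range(columns)]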

### Detecting Outliers

In [None]:
# Use a box plot to check for outliers.
# Only the features we consider important are plotted.
plt.figure(figsize=(18,7))
sns.boxplot(data=data[features])

- **Even though the data is not scaled, we can clearly see that there are outliers in the columns we are concerned with. We need to clean these.**
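
The box plot flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a rough sketch of how the same rule can be counted per column, assuming a hypothetical helper `iqr_outlier_counts` and a small made-up `demo` frame:

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count the values outside [Q1 - k*IQR, Q3 + k*IQR] in each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper fence.
    below = frame.lt(q1 - k * iqr, axis='columns')
    above = frame.gt(q3 + k * iqr, axis='columns')
    return (below | above).sum()

# Made-up data: column x holds one extreme value, column y holds none.
demo = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    'y': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
print(iqr_outlier_counts(demo))
```

Calling it on `data[features]` before and after cleaning would give a quick check that the capping step did what was intended.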

### Cleaning outliers

In [None]:
# Cap outliers at the whisker limits (1.5 * IQR beyond the quartiles).
# The encoded target column is skipped: it holds class codes, not measurements.
for col in data.columns.drop('class'):
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    c1 = Q1 - (1.5 * IQR)
    c2 = Q3 + (1.5 * IQR)
    data[col] = data[col].clip(lower=c1, upper=c2)

In [None]:
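# Distribution plots of the selected features, each labelled with its skewness.
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is its replacement.
skew = pd.DataFrame(data[features].skew(), columns=['value'])
n_rows = 4
n_columns = 2
fig, axes = plt.subplots(n_rows, n_columns, figsize=(20, 15))
# skew_value avoids shadowing the skew DataFrame inside the loop.
for col, index, skew_value in zip(features, i_j_counter(n_rows, n_columns), skew['value']):
    sns.distplot(data[col], ax=axes[index[0], index[1]], color='c', label=f'skew: {skew_value:.2f}')
    axes[index[0], index[1]].legend(loc='upper right')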

1. **pr.axis_aspect_ratio**: Approximately normal distribution with positive skewness **(0.26)**.
2. **max.length_aspect_ratio**: Bi-modal distribution with positive skewness **(0.28)**.
3. **elongatedness**: Clearly bi-modal, with peaks at approximately 32 and 45 and positive skewness **(0.05)**.
4. **scaled_radius_of_gyration.1**: Mostly normally distributed with a slight bump on the right-hand side and positive skewness **(0.56)**.
5. **skewness_about**: Bi-modal distribution with positive skewness **(0.71)**.
6. **skewness_about.1**: Approximately normal distribution with positive skewness **(0.69)**.
7. **Hollows Ratio**: Bi-modal distribution with negative skewness **(-0.23)**.

### Groupby Analysis on the original groups

We have three labelled groups, one of which contains two vehicle models:
- Cars (two further groups inside, one per car model)
- Bus
- Vans

In [None]:
groupby=data.groupby('class')[features]

In [None]:
groupby.mean()

In [None]:
groupby.skew()

In [None]:
n_rows=2
n_columns=4
fig, axes = plt.subplots(n_rows,n_columns,figsize=(20,10))
for col,index in zip(features,i_j_counter(n_rows,n_columns)):
    sns.kdeplot(data[data['class']==0][col],ax=axes[index[0],index[1]],legend=False)
    sns.kdeplot(data[data['class']==1][col],ax=axes[index[0],index[1]],legend=False)
    sns.kdeplot(data[data['class']==2][col],ax=axes[index[0],index[1]],legend=False)
    axes[index[0],index[1]].set_xlabel(f'{col}')
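# The curves are drawn in class-code order (0, 1, 2), so they take the first three
# colours of matplotlib's default cycle; the names come from the fitted encoder.
for colour, name in zip(['Blue', 'Orange', 'Green'], label_encoder.classes_):
    print(f'{colour} is {name}')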

1. **pr.axis_aspect_ratio** - buses are roughly normally distributed with a mean near 60; cars and vans are bi-modal.
2. **max.length_aspect_ratio** - cars are multi-modal, possibly because several car types are in the data; buses and vans have almost the same variance.
3. **elongatedness** - buses have lower values, while cars and vans share a peak around 45.
4. **scaled_radius_of_gyration.1** - buses have the highest peak; cars still suggest that there should be two clusters.
5. **skewness_about** - cars have the highest peak, around a value of 5; the other vehicles also peak near the same point.
6. **skewness_about.1** - buses are bi-modal; cars and vans share a similar distribution.
7. **hollows_ratio** - cars have two peaks, near 185 and 205; buses and vans have similar distributions, with peaks near 185 and 200.

### Scaling the data

In [None]:
y=data['class']
X=data.drop(['class'],axis=1)
scaler=StandardScaler()
scaled_data=pd.DataFrame(scaler.fit_transform(X),columns=X.columns)

In [None]:
sns.pairplot(data, diag_kind='kde')

- Many features show a linear relationship with other features, i.e. high collinearity.
- We have already identified these and kept the features that can explain most of the variance in the data.
- Most of the diagonal plots have at least two peaks, which suggests that there are at least two clusters.
- The target variable shows three peaks.

### Splitting in test and train data

In [None]:
# Split the scaled data (18 feature columns) into train and test sets, with 30% held out for testing.
# Reuse the same random_state in later splits to get exactly the same division of records.
X_train,X_val, y_train,y_val = train_test_split(scaled_data,y,test_size=0.3,random_state=1)

### Training a SVM

In [None]:
svc= SVC()
svc.fit(X_train,y_train)
predict= svc.predict(X_val)
print(f'Train set accuracy {svc.score(X_train,y_train) *100 : .4f}%')
print(f'Test set accuracy {svc.score(X_val,y_val)*100: .4f}%')

- We see very high train and test score with SVM.

### Cross-Validation

In [None]:
kfold = KFold(n_splits=10, random_state=1, shuffle=True)
result = cross_val_score(svc,scaled_data,y,cv=kfold,scoring='accuracy')
print(f'Mean KFold accuracy score: {result.mean()*100 : .4f}%')

- KFold cross-validation on the whole data gives a result similar to the SVC score on the single train/test split.
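
One detail worth keeping in mind when cross-validating is that a scaler fitted on the full data has already seen every validation fold. A common way around this is to put the scaler and the classifier in one pipeline, so that scaling is refitted on the training part of each fold. The sketch below shows the idea on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `scaled_svc` and `folds` are assumptions for illustration and the scores say nothing about the vehicle data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 18 numeric features and 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=18, n_informative=6,
    n_classes=3, random_state=1,
)

# The scaler inside the pipeline is refitted on the training part of every fold.
scaled_svc = make_pipeline(StandardScaler(), SVC())
folds = KFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=folds, scoring='accuracy')
print(f'Mean accuracy: {fold_scores.mean() * 100:.2f}% (std {fold_scores.std() * 100:.2f})')
```

With only standard scaling involved the difference from scaling up front is usually small, but the pipeline form stays correct if more preprocessing steps are added later.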

### PCA [Principal Component Analysis]

In [None]:
# We want PCA to extract the components that together explain 95% of the variance of the data.
pca = PCA(n_components=.95)
pca.fit(scaled_data)
feature_ratio = pd.DataFrame(pca.explained_variance_ratio_*100, columns=['Variance Explained (%)'])
#print(pca.explained_variance_ratio_)
Xpca95 = pd.DataFrame(pca.transform(scaled_data))
feature_ratio

In [None]:
sns.pairplot(Xpca95, diag_kind='kde')

### Train SVM with PCA

In [None]:
# Split the data into train and test sets with test size = 30%
X_train_pca, X_val_pca, y_train_pca, y_val_pca = train_test_split(Xpca95,y, test_size=0.3, random_state=1)
svc_pca=SVC()
svc_pca.fit(X_train_pca,y_train_pca)
predict= svc_pca.predict(X_val_pca)
print(f'Train set accuracy {svc_pca.score(X_train_pca,y_train_pca)*100: .4f}%')
print(f'Test set accuracy {svc_pca.score(X_val_pca,y_val_pca)*100: .4f}%')

### Cross-validation with PCA

In [None]:
kfold = KFold(n_splits=10, random_state=1, shuffle=True)
# cross_val_score clones the estimator for every fold, so svc_pca is only used as a template.
result = cross_val_score(svc_pca,Xpca95,y,cv=kfold,scoring='accuracy')
print(f'Mean KFold accuracy is: {result.mean()*100: .4f}%')

Using PCA costs approximately 4% in accuracy, but it reduces the number of features from 18 to 7 while still explaining 95% of the variance.
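
The number of components kept for a given variance target can be checked by accumulating the explained-variance ratios. The sketch below does this on synthetic correlated data; `hidden`, `loadings`, `X_demo` and the other names are illustrative assumptions, so the component count it prints does not apply to the vehicle data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: 18 observed features driven by 4 hidden factors
# plus noise (a stand-in for the silhouette features, not the real data).
rng = np.random.default_rng(1)
hidden = rng.normal(size=(300, 4))
loadings = rng.normal(size=(4, 18))
X_demo = hidden @ loadings + 0.3 * rng.normal(size=(300, 18))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit all components, then accumulate the explained-variance ratios.
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%.
n_needed = int(np.searchsorted(cumulative, 0.95, side='right') + 1)

# Passing a float to n_components asks scikit-learn to make the same choice.
auto_pca = PCA(n_components=0.95).fit(X_demo)
print(f'components needed: {n_needed}, chosen by PCA(0.95): {auto_pca.n_components_}')
```

Plotting `cumulative` against the component index is a common companion to this check when choosing a different variance target.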

### Optional Task

### KMeans Clustering

In [None]:
# KMeans clustering

# Use the elbow method to choose the number of clusters.
# The analysis above suggested 2-3 clusters in the original data and about 2 in Xpca95,
# so we search the range 2-5.
clusters = np.arange(2, 6)
mean_distortions = []
for cluster in clusters:
    kmeans = KMeans(n_clusters=cluster)
    kmeans.fit(Xpca95)
    # Mean distance from each point to its nearest cluster centre.
    mean_distortions.append(sum(np.min(cdist(Xpca95, kmeans.cluster_centers_), axis=1)) / Xpca95.shape[0])

In [None]:
plt.plot(clusters,mean_distortions,'bx-')
plt.title('Elbow Estimator')
plt.xlabel('No. of clusters')
plt.ylabel('Mean Distortion')

- From the above graph we can clearly see that the elbow point is at **3**. So, we will go with 3 clusters.
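
Because an elbow can be hard to judge by eye, the silhouette score is sometimes used as a second opinion: it is higher when points sit close to their own cluster and far from the nearest other one. The sketch below illustrates this on three synthetic blobs with hand-picked centres; the data and the names `X_demo`, `scores` and `best_k` are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (not the vehicle data).
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 0], [3, 5]],
    cluster_std=0.6, random_state=1,
)

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()}, 'best k:', best_k)
```

On real data the two criteria do not always agree, so the silhouette score is best read alongside the elbow plot rather than instead of it.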

### KMeans clustering with n_clusters=3

In [None]:
kmean_pca95 = KMeans(n_clusters=3)
kmean_pca95.fit(X_train_pca)
Xpca_train_labels = kmean_pca95.labels_
Xpca_val_labels = kmean_pca95.predict(X_val_pca)

### Training the SVM to evaluate with new labels from clustering

In [None]:
svc_k = SVC()
svc_k.fit(X_train_pca, Xpca_train_labels)
predict = svc_k.predict(X_val_pca)
train_score= svc_k.score(X_train_pca, Xpca_train_labels)
test_score = svc_k.score(X_val_pca, Xpca_val_labels)
print(f'Training score is : {train_score*100: .4f}% Test score is: {test_score*100: .4f}%')

- The score improves significantly compared with the one obtained on the original data labels.
- This implies that the clusters formed by the KMeans algorithm are well separated, but their contents may differ from the real-world classes. Real-world classes can overlap, whereas the algorithm only looks at the specific features given to it, so its clusters overlap less. Classification may require more than that: in real life a Maruti Omni could be a car for one person and a van for another, and classes can overlap even in the features given.
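
How closely the clusters match the real classes can be measured directly by comparing the two sets of labels. The sketch below uses the adjusted Rand index and a contingency table on synthetic blobs; the data and the names `y_true`, `cluster_ids`, `ari` and `table` are illustrative assumptions, and the near-perfect agreement it shows comes from the blobs being well separated, not from the vehicle data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with known group membership (not the vehicle data).
X_demo, y_true = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 0], [3, 5]],
    cluster_std=0.6, random_state=1,
)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_demo)

# The adjusted Rand index ignores how clusters are numbered:
# 1.0 means identical groupings, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(y_true, cluster_ids)

# Rows are true classes, columns are cluster ids.
table = pd.crosstab(
    pd.Series(y_true, name='true class'),
    pd.Series(cluster_ids, name='cluster'),
)
print(f'Adjusted Rand index: {ari:.3f}')
print(table)
```

In the notebook, the same comparison could be run between `y_train_pca` and `Xpca_train_labels` to see which vehicle classes each cluster actually contains.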

## The END