# Distance Comparison in Clustering

##### In this notebook, we will compare two distance metrics used for clustering in unsupervised learning:
- Euclidean
- Cosine

##### We will be using two clustering algorithms:
- K-Means
- Agglomerative Hierarchical Clustering
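
To make the two metrics concrete before any clustering, here is a minimal sketch on two toy vectors (the vectors `a` and `b` are made up for illustration and are not taken from the dataset). It uses `euclidean` and `cosine` from `scipy.spatial.distance`; note that SciPy's `cosine` returns the cosine *distance*, that is, one minus the cosine similarity.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine

# Two toy vectors (assumed for illustration) that point in the same
# direction but have different lengths
a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])

d_euclid = euclidean(a, b)  # straight-line distance between the points
d_cosine = cosine(a, b)     # 1 - cosine similarity (angle-based)

print("Euclidean distance: {:.3f}".format(d_euclid))
print("Cosine distance:    {:.3f}".format(d_cosine))
```

The Euclidean distance is clearly non-zero because the points are far apart, while the cosine distance is essentially zero because the vectors have the same direction. This difference in what counts as "close" is what the rest of the notebook explores.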

We will be using a dataset from MATLAB called carbig.txt that includes various measurements of cars from 1970 to 1982.

In [2]:
# importing all the libraries used
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from sklearn.cluster import AgglomerativeClustering
from scipy.stats import mode
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import euclidean_distance, cosine_distance
from sklearn.metrics import accuracy_score
import matplotlib.cm as cm
from scipy.spatial.distance import pdist

##### Loading the data

In [6]:
# raw string so that "\s" is passed to the regex engine unchanged
df = pd.read_csv("carbig.txt", comment="%", sep=r"\s+", header=None)
df.columns = ["Acceleration", "Cylinders", "Displacement", "Horsepower", "MPG", "Weight"]
df.shape, df.head()

((406, 6),
    Acceleration  Cylinders  Displacement  Horsepower   MPG  Weight
 0          12.0        8.0         307.0       130.0  18.0  3504.0
 1          11.5        8.0         350.0       165.0  15.0  3693.0
 2          11.0        8.0         318.0       150.0  18.0  3436.0
 3          12.0        8.0         304.0       150.0  16.0  3433.0
 4          10.5        8.0         302.0       140.0  17.0  3449.0)

After some data preprocessing and preliminary exploration, we will treat this as an unsupervised dataset (that is, without labels) and try to recover the 'Cylinders' groups through clustering.

##### Preprocessing and segmentation of data

In [5]:
print(df['Cylinders'].unique())

[8. 4. 6. 3. 5.]


We have 5 cylinder types: 3, 4, 5, 6 and 8 cylinders. To keep things simple, let's restrict the data to three potential clusters (4, 6 and 8 cylinders) and drop the rest.

In [7]:
df = df[(df['Cylinders'] != 3) & (df['Cylinders'] != 5)]
print(df['Cylinders'].unique())

[8. 4. 6.]
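
As an aside, the same selection can be written as a whitelist rather than a chain of exclusions. The sketch below assumes a small made-up frame (`toy`) in place of the real data and uses `Series.isin`; it is an alternative phrasing, not the filter used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the 'Cylinders' column of the real dataset
toy = pd.DataFrame({"Cylinders": [8.0, 4.0, 6.0, 3.0, 5.0, 4.0]})

# Keep only the cylinder counts we want to cluster
keep = [4, 6, 8]
filtered = toy[toy["Cylinders"].isin(keep)]

print(filtered["Cylinders"].unique())
```

Listing the values to keep makes the intent explicit and stays correct if other cylinder counts ever appear in the data.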


Now we are left with only the data required for the 3 clusters. Let's check for any missing values.

In [9]:
print(df.isna().sum().sort_values(ascending=False))

MPG             8
Horsepower      6
Weight          0
Displacement    0
Cylinders       0
Acceleration    0
dtype: int64


There are a few missing values in MPG and Horsepower. For the purpose of this exercise, we will remove the records with any missing values.

In [10]:
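# removing records with any missing value; reassigning (rather than
# inplace=True) avoids modifying a filtered slice of the original frame
df = df.dropna(how='any')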
df.shape

(385, 6)
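
The per-column missing counts do not by themselves say how many rows are dropped, because a single row can be missing more than one value. A minimal sketch on a made-up frame (`toy` is hypothetical, not the car data) shows one way to count the affected rows directly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the last row is missing both values
toy = pd.DataFrame({
    "MPG": [18.0, np.nan, 16.0, np.nan],
    "Horsepower": [130.0, 165.0, np.nan, np.nan],
})

# Number of rows with at least one missing value
rows_with_na = toy.isna().any(axis=1).sum()

# Rows actually removed by dropna(how="any")
cleaned = toy.dropna(how="any")
rows_removed = len(toy) - len(cleaned)

print(rows_with_na, rows_removed)
```

Here the columns have two missing values each, but only three rows are removed, since one row is counted in both columns.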

Our data should now have no missing values. We are left with 385 of the original 406 records; the 21 lost records include the 3- and 5-cylinder cars dropped earlier as well as the rows with missing values. Let's check the percentage of each class in this dataset.

In [11]:
for i in [4,6,8]:
    class_percent = np.sum(df['Cylinders'] == i)/len(df['Cylinders'])
    print("Percentage of {} Cylinder Cars: {:.2%}.".format(i,class_percent))   

Percentage of 4 Cylinder Cars: 51.69%.
Percentage of 6 Cylinder Cars: 21.56%.
Percentage of 8 Cylinder Cars: 26.75%.
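
The same class shares can also be obtained without an explicit loop over the labels. The sketch below assumes a small made-up series (`toy`) and uses `value_counts(normalize=True)`, which returns the fraction of rows per distinct value:

```python
import pandas as pd

# Hypothetical cylinder labels, for illustration only
toy = pd.Series([4, 4, 6, 8, 4, 8, 6, 4], name="Cylinders")

# Fraction of rows per class, ordered by cylinder count
shares = toy.value_counts(normalize=True).sort_index()

for cyl, share in shares.items():
    print("Percentage of {} Cylinder Cars: {:.2%}.".format(cyl, share))
```

This variant picks up whatever classes are present, so the list of labels does not need to be kept in sync with the filtering step.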
