# Prediction using Unsupervised Machine Learning
GRIP @ The Sparks Foundation  
Data Science & Business Analytics Intern
### by: Pham Quang Chi

## Objective:
Given the 'Iris' dataset of three species of Iris (Iris Setosa, Iris Virginica and Iris Versicolor):
- Find the optimal number of clusters that accurately represents the species.
- Create a visual demonstration of the research.

## Approach:
- Use the `KMeans` algorithm from `sklearn` to define the clusters.
- Use the `Plotly Express` library for data visualization.

### Dataset:
- [Iris](https://bit.ly/3kXTdox)

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# load the data
iris = pd.read_csv(r"C:\Users\Steven\Downloads\Iris.csv", index_col=0)
iris.info()
iris.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 7.0+ KB


Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa


## Exploratory Data Analysis

In [None]:
# pair plot with plotly express
fig = px.scatter_matrix(iris, dimensions=["PetalWidthCm", "SepalLengthCm", "PetalLengthCm", "SepalWidthCm"], color="Species")
fig.update_layout(font=dict(size=8))
fig.show()

Feature Variance

In [3]:
# check for variance in the features
fig = px.bar(x=iris.iloc[:,[0, 1, 2, 3]].var(),
             y=iris.columns[:-1],
             title="Feature Variance")
fig.update_layout(xaxis_title='variance', yaxis_title='Feature')

fig.show()

`PetalLengthCm` has a noticeably higher variance than the other features. Let's check whether it contains any outliers.

In [4]:
# boxplot for `PetalLengthCm` using plotly
px.box(iris, y='PetalLengthCm', title='Petal Length')

No outliers appear to be present, so there is no need to trim the `PetalLengthCm` data.
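The box plot's verdict can be cross-checked numerically. The sketch below applies the common 1.5 × IQR rule to petal length. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV file, and the quartiles computed by NumPy may differ slightly from the ones Plotly draws.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
# Column order: sepal length, sepal width, petal length, petal width.
petal_length = load_iris().data[:, 2]

# Quartiles and interquartile range
q1, q3 = np.percentile(petal_length, [25, 75])
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = petal_length[(petal_length < lower) | (petal_length > upper)]

print("fences:", lower, upper)
print("number of outliers:", outliers.size)
```

An empty `outliers` array agrees with the reading of the box plot.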

In [5]:
# histogram plots of the features
fig = px.histogram(iris
                   , x=["PetalWidthCm", "SepalLengthCm", "PetalLengthCm", "SepalWidthCm"]
                   , labels={"value": "cm", "variable": "Feature"}
                   , histnorm="percent")
fig.show()

## Build Model

Create the feature dataset

In [6]:
# remove the `Species` column
data = iris.drop('Species', axis=1)
data.head()

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
1,5.1,3.5,1.4,0.2
2,4.9,3.0,1.4,0.2
3,4.7,3.2,1.3,0.2
4,4.6,3.1,1.5,0.2
5,5.0,3.6,1.4,0.2


In [7]:
# scale the data
ss = StandardScaler()
X_scaled = ss.fit_transform(data)   
# Put `X_scaled` into DataFrame
data_scaled = pd.DataFrame(X_scaled, columns=data.columns)
print("data_scaled shape:", X_scaled.shape)
data_scaled.head()

data_scaled shape: (150, 4)


Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,-0.900681,1.032057,-1.341272,-1.312977
1,-1.143017,-0.124958,-1.341272,-1.312977
2,-1.385353,0.337848,-1.398138,-1.312977
3,-1.506521,0.106445,-1.284407,-1.312977
4,-1.021849,1.26346,-1.341272,-1.312977
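For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. A minimal sketch of the same transformation by hand, assuming scikit-learn's bundled Iris data (`load_iris`) in place of the CSV file:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X = load_iris().data

# Library version
X_scaled = StandardScaler().fit_transform(X)

# Manual version: z = (x - mean) / std, with the population std (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("max abs difference:", np.abs(X_scaled - X_manual).max())
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds:", X_scaled.std(axis=0).round(6))
```

After scaling, every feature has zero mean and unit variance, so no single measurement dominates the Euclidean distances used by K-Means.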


Finding the optimal hyperparameter **k**

In [8]:
n_clusters = range(1,11)
inertia_errors = []
silhouette_scores = []

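# Train a model for each k and record its inertia and silhouette score.
for k in n_clusters:
    model = KMeans(n_clusters=k, init='k-means++', random_state=42)
    model.fit(data_scaled)
    inertia_errors.append(model.inertia_)
    # The silhouette score is undefined for a single cluster, so skip k=1.
    if k > 1:
        # Note: the score is computed on the unscaled `data`, although the model
        # was fit on `data_scaled`. Passing `data_scaled` would score the clusters
        # in the fitted feature space; the values printed below correspond to `data`.
        silhouette_scores.append(silhouette_score(data, model.labels_))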

print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])

Inertia: [600.0000000000001, 223.73200573676348, 140.965816630747]

Silhouette Scores: [0.6863930543445408, 0.5059312160513932, 0.3596055632178885]
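Inertia is the sum of squared distances from each sample to its assigned centroid. The sketch below recomputes it by hand and also shows why the value for k = 1 is 600: after standardization the single centroid is the origin, so the inertia is `n_samples * n_features = 150 * 4`. It assumes scikit-learn's bundled Iris data (`load_iris`) instead of the CSV file, and `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X = StandardScaler().fit_transform(load_iris().data)

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Squared distance of every sample to the centroid of its own cluster
diffs = X - model.cluster_centers_[model.labels_]
manual_inertia = float((diffs ** 2).sum())

# With k=1 the centroid is the column mean, which is zero after scaling,
# so the inertia is simply the sum of all squared standardized values.
k1_inertia = float((X ** 2).sum())

print("manual inertia :", manual_inertia)
print("model.inertia_ :", model.inertia_)
print("inertia for k=1:", k1_inertia)
```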


In [9]:
# Create line plot of `inertia_errors` vs `n_clusters`
fig = px.line(x=list(n_clusters), y=inertia_errors, title='K-Means Model: Inertia vs Number of Clusters')
fig.update_layout(xaxis_title='Number of Clusters (k)', yaxis_title='Inertia')
fig.show()

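Inertia always decreases as **k** grows, so rather than its minimum we look for the *elbow* where the curve starts to flatten.

Optimal **k** based on the *elbow* of the `Inertia Errors` curve: 4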

In [13]:
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(x=list(n_clusters)[1:], y=silhouette_scores, title='K-Means Model: Silhouette Score vs Number of Clusters')
fig.update_layout(xaxis_title='Number of Clusters (k)', yaxis_title='Silhouette Score')

fig.show()

Optimal **k** based on *maximum* `Silhouette Scores`: 2
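The silhouette of a sample is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is its mean distance to the nearest other cluster. The score is the average over all samples. A minimal sketch of that definition, checked against `silhouette_score`, assuming scikit-learn's bundled Iris data (`load_iris`) and scoring in the same scaled space the model is fit on:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances, silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X = StandardScaler().fit_transform(load_iris().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

D = pairwise_distances(X)          # Euclidean distance between all pairs
clusters = np.unique(labels)
s = np.zeros(len(X))

for i in range(len(X)):
    own = labels == labels[i]
    if own.sum() == 1:
        continue                   # a singleton cluster gets silhouette 0
    # Mean distance to the other members of the same cluster (D[i, i] is 0)
    a = D[i, own].sum() / (own.sum() - 1)
    # Smallest mean distance to the members of any other cluster
    b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
    s[i] = (b - a) / max(a, b)

manual_score = float(s.mean())
library_score = float(silhouette_score(X, labels))
print(manual_score, library_score)
```

Values close to 1 indicate compact, well separated clusters. Values near 0 indicate overlapping clusters.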

-> A compromise between the two criteria is the value that lies between them, **k = 3**, which also matches the three species in the dataset.

### Final model

In [14]:
model = KMeans(n_clusters=3, random_state=42)
model.fit(data_scaled)
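Because the dataset also carries the true species, the three clusters can be sanity-checked against them. The sketch below builds a contingency table of species versus cluster. It assumes scikit-learn's bundled Iris data (`load_iris`), whose integer `target` plays the role of the `Species` column. Cluster numbers are arbitrary, so only the pattern of counts matters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
iris_data = load_iris()
X = StandardScaler().fit_transform(iris_data.data)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Rows: true species (setosa, versicolor, virginica); columns: cluster ids
table = contingency_matrix(iris_data.target, labels)

print(list(iris_data.target_names))
print(table)
```

A row whose counts fall almost entirely into one column corresponds to a species that the clustering recovers well.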

## Result Communication

In [15]:
# Create a DataFrame `xgb` that contains the mean values of the features in `data` for each of the clusters in the final model
xgb = data.groupby(model.labels_).mean()
xgb

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,6.780851,3.095745,5.510638,1.97234
1,5.006,3.418,1.464,0.244
2,5.801887,2.673585,4.369811,1.413208


In [16]:
# Create side-by-side bar chart of `xgb`
fig = px.bar(xgb,
    barmode='group',
    title="Average Measurements by Cluster")
fig.update_layout(xaxis_title='Cluster', yaxis_title='cm')

fig.show()

### Principal Component Analysis (**PCA**)
Our data has 4 dimensions. Let's reduce it to 2 dimensions so that the clusters can be shown in a scatter plot.

In [17]:
# transform the data to a 2D matrix

# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Transform `data`
data_t = pca.fit_transform(data)

# Put `data_t` into DataFrame
data_pca = pd.DataFrame(data_t, columns=['PC1', 'PC2'])

print("data_pca shape:", data_pca.shape)
data_pca.head()

data_pca shape: (150, 2)


Unnamed: 0,PC1,PC2
0,-2.684207,0.326607
1,-2.715391,-0.169557
2,-2.88982,-0.137346
3,-2.746437,-0.311124
4,-2.728593,0.333925
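How much information the 2-D projection keeps can be read from the explained variance ratio of the fitted PCA. A short sketch, assuming scikit-learn's bundled Iris data (`load_iris`) and, as in the cell above, the unscaled measurements:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X = load_iris().data

pca = PCA(n_components=2, random_state=42).fit(X)

# Share of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
retained = float(ratios.sum())

print("per component:", ratios)
print("retained by PC1 + PC2:", retained)
```

If the retained share is close to 1, the scatter plot is a faithful picture of the 4-D data.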


In [18]:
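# Compute the final model's centroids in PCA space without refitting the model.
# PCA is an affine map, so the mean of each cluster's projected points equals
# the projection of that cluster's mean.
centroids = data_pca.groupby(model.labels_).mean().to_numpy()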
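Centroids can also be carried into the plotted plane by applying the fitted PCA to them directly. This only works when PCA and K-Means share one feature space, so the sketch below assumes both are fitted on the same scaled features (unlike the cells above, where PCA uses the unscaled `data`). It also assumes scikit-learn's bundled Iris data (`load_iris`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X = StandardScaler().fit_transform(load_iris().data)

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
pca = PCA(n_components=2, random_state=42).fit(X)

# Project both the samples and the 4-D centroids with the same transformer
points_2d = pca.transform(X)
centroids_2d = pca.transform(model.cluster_centers_)

# Cross-check: mean of each cluster's projected points
group_means = np.array([points_2d[model.labels_ == c].mean(axis=0)
                        for c in range(3)])

print(centroids_2d)
print(group_means)
```

The two arrays should agree closely, and the labels in the plot remain those of the model fitted on the full feature set.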

In [20]:
# Create scatter plot of transformed data
fig = px.scatter(data_pca, x='PC1', y='PC2',
                 color=model.labels_.astype(str),
                 labels={'color': 'Cluster'},
                 title="PCA Representation of Iris Clusters")
fig.add_scatter(x=centroids[:, 0], y=centroids[:, 1], mode="markers", name='Centroids', marker=dict(color="black", symbol="x", size=12))
fig.show()