# GRIP: The Sparks Foundation

## Name: Badavath Tharun

## _Data Science and Business Analytics Intern_

### Task 2 : Prediction Using Unsupervised ML

● From the given ‘Iris’ dataset, predict the optimum number of clusters
  and represent it visually.

● Dataset : https://bit.ly/3kXTdox

## Step 1: Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
plt.style.use('ggplot')

In [2]:
df = pd.read_csv("./data/Iris (1).csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df = df.drop(["Species", "Id"], axis=1)

In [None]:
df.head()

In [None]:
df.describe()

## Step 2: Data Visualizations

In [None]:
sns.pairplot(df)

In [None]:
fig = px.scatter(df, x="SepalWidthCm", y="SepalLengthCm",
                 size='PetalLengthCm', hover_data=['PetalWidthCm'])
fig.show()

In [None]:
fig = px.line(df)
fig.show()

In [None]:
for column in df.columns:
    fig = px.box(df,x=column)
    fig.show()

In [None]:
df[(df["SepalWidthCm"]> 4.0) | (df["SepalWidthCm"]<= 2.0)]

In [None]:
drop_index = df[(df["SepalWidthCm"]> 4.0) | (df["SepalWidthCm"]<= 2.0)].index
drop_index

In [None]:
df = df.drop(drop_index, axis=0)

In [None]:
for column in df.columns:
    fig = px.box(df,x=column)
    fig.show()

In [None]:
sns.heatmap(df.corr(), annot=True, cmap="viridis")

## Step 3: Prepare the data for model

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaled_df = scaler.fit_transform(df)

### Creating and Fitting a KMeans Model

A note on the `KMeans` methods we can choose from here:

* fit(X[, y, sample_weight])
    * Compute k-means clustering.

* fit_predict(X[, y, sample_weight])
    * Compute cluster centers and predict cluster index for each sample.

* fit_transform(X[, y, sample_weight])
    * Compute clustering and transform X to cluster-distance space.

* predict(X[, sample_weight])
    * Predict the closest cluster each sample in X belongs to.
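
As a rough illustration of how these methods relate, here is a minimal sketch on toy data. The two blobs, the variable names, and the `n_init`/`random_state` settings are assumptions made for this example only; they are not part of the Iris workflow in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# fit_predict fits the model and returns the cluster index of each sample
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# predict assigns new points to the closest already-fitted cluster center
new_labels = km.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))

# transform maps each sample to its distances from every cluster center
distances = km.transform(X)

print(labels.shape, distances.shape, new_labels)
```

After `fit_predict`, the same labels are also available as `km.labels_`, so calling `fit` and reading that attribute should give the same result.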

In [None]:
from sklearn.cluster import KMeans

In [None]:
model = KMeans(n_clusters=2)

In [None]:
cluster_labels = model.fit_predict(scaled_df)

In [None]:
cluster_labels

In [None]:
df['Cluster'] = cluster_labels

In [None]:
df

In [None]:
fig = px.scatter(df, x="SepalWidthCm", y="SepalLengthCm",color= "Cluster",
                 size='PetalLengthCm', hover_data=['PetalWidthCm'])
fig.show()

## Choosing K Value

In [None]:
ssd = []

for k in range(2, 10):
    model = KMeans(n_clusters=k)
    model.fit(scaled_df)

    # Sum of squared distances of samples to their closest cluster center
    ssd.append(model.inertia_)

In [None]:
plt.plot(range(2,10),ssd,'o-')
plt.xlabel("K Value")
plt.ylabel("Sum of Squared Distances")

### We then look for the K value at which the rate of reduction in SSD begins to level off.

#### Here the SSD drops sharply when k becomes 3, and beyond that the rate of reduction in SSD begins to decline
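
One way to read the elbow numerically is to print how much the SSD falls each time k increases by one. The sketch below is only an approximation of the run above: it assumes scikit-learn's bundled copy of Iris in place of the CSV, skips the outlier removal, and fixes `n_init` and `random_state` for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the CSV used in this notebook
X = StandardScaler().fit_transform(load_iris().data)

k_values = list(range(2, 10))
ssd = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in k_values]

# Reduction in SSD gained by moving from k-1 clusters to k clusters
drops = -np.diff(ssd)
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: SSD reduced by {drop:.2f}")
```

A large reduction followed by much smaller ones marks the elbow.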

In [None]:
model = KMeans(n_clusters=3)

In [None]:
cluster_labels = model.fit_predict(scaled_df)

In [None]:
cluster_labels

In [None]:
df['Cluster'] = cluster_labels

In [None]:
df

In [None]:
fig = px.scatter(df, x="SepalWidthCm", y="SepalLengthCm",color= "Cluster",
                 size='PetalLengthCm', hover_data=['PetalWidthCm'])
fig.show()

# DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

## Charting reasonable Epsilon values

In [None]:
outlier_percent = []
number_of_outliers = []

for eps in np.linspace(0.001,10,100):
    
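    # Create Model
    dbscan = DBSCAN(eps=eps)
    # Fit on the measurements only; the KMeans 'Cluster' labels are not a feature
    dbscan.fit(df.drop(columns="Cluster"))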
    
    # Log Number of Outliers
    number_of_outliers.append(np.sum(dbscan.labels_ == -1))
    
    # Log percentage of points that are outliers
    perc_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    
    outlier_percent.append(perc_outliers)

In [None]:
sns.lineplot(x=np.linspace(0.001,10,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Epsilon Value")

In [None]:
sns.lineplot(x=np.linspace(0.001,10,100),y=number_of_outliers)
plt.ylabel("Number of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.xlim(0,1)

### Do we want to think in terms of percentage targeting instead?

If so, we can "target" a percentage, for example by choosing an epsilon range that classifies 1% to 5% of the points as outliers.
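
The percentage-targeting idea can be sketched by keeping only the epsilon values whose outlier share falls inside a chosen band. This is a sketch under assumptions: it uses scikit-learn's bundled Iris data rather than the CSV, and the 1% to 5% band and the names `low`, `high`, and `candidate_eps` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Assumption: the bundled Iris data stands in for the CSV used in this notebook
X = load_iris().data

eps_values = np.linspace(0.001, 10, 100)

# DBSCAN labels noise points as -1; record their share for every epsilon
outlier_percent = np.array([
    100 * np.mean(DBSCAN(eps=eps).fit(X).labels_ == -1)
    for eps in eps_values
])

# Hypothetical target band: between 1% and 5% of points flagged as outliers
low, high = 1.0, 5.0
in_band = (outlier_percent >= low) & (outlier_percent <= high)
candidate_eps = eps_values[in_band]

print(candidate_eps)
```

If `candidate_eps` comes back empty, the epsilon grid is probably too coarse around the band and a finer `np.linspace` range would be worth trying.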

In [None]:
sns.lineplot(x=np.linspace(0.001,10,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.ylim(0,5)
plt.xlim(0,2)
plt.hlines(y=1,xmin=0,xmax=2,colors='red',ls='--')

In [None]:
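def display_categories(model, data):
    # Cluster on the measurements only, ignoring any earlier 'Cluster' label column
    features = data.drop(columns="Cluster", errors="ignore")
    labels = model.fit_predict(features)
    sns.scatterplot(data=data, x='SepalWidthCm', y='SepalLengthCm', hue=labels, palette='Set1')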

In [None]:
model = DBSCAN(eps=0.9)

In [None]:
display_categories(model, df)

## So the optimal number of clusters for this dataset is 3