## Task 2: Prediction using Unsupervised ML

* From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually.<br>

> ##### **By:** Rutuja Vaidya
> ##### **Technique used:** Unsupervised ML: K-Means
> ##### **Language used:** Python

### Importing libraries and Data set

In [None]:
#importing libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

In [None]:
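# loading data
try:
    iris_data = pd.read_csv('../input/iris-dataset/Iris.csv')
    print("Data loaded successfully!\n")
    iris_data.info()
except FileNotFoundError:
    # catch only the missing-file case; a bare `except` would also hide unrelated errors
    print("Can't load data: file not found")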

From the information above, we can conclude:
1.   All columns are filled, i.e. there are no null values.
2.   The Iris data contains 6 columns:<br>
  **`Id`** is a unique row identifier.<br>
  **`Species`** is the target.<br>
  **`SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, `PetalWidthCm`** are the features.<br>

In [None]:
# check the data by printing first 5 lines
iris_data.head()

> Target **`Species`** has categorical values

Let's check its unique values

In [None]:
iris_data['Species'].value_counts()

In [None]:
iris_data['Species']

### Exploratory Analysis

In [None]:
# Let's first see the features
iris_data.describe()

In [None]:
plt.figure(figsize=(10,6))
ax = sns.boxplot(data=iris_data.drop('Id',axis=1), orient="h", palette="Set2")

In [None]:
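# let's check the correlation between the numeric columns
# `Species` still holds text here, so drop it along with `Id` before calling corr()
corr = iris_data.drop(['Id','Species'],axis=1).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # the `np.bool` alias was removed from NumPy
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);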

> The sepal width and length are only weakly correlated.<br>
> The petal width and length are highly correlated.

In [None]:
# Correlation of Sepal Length-Width
plt.figure(figsize=(12,8))
sns.scatterplot(x='SepalLengthCm',y='SepalWidthCm',hue='Species',data=iris_data)

In [None]:
# Correlation of Petal Length-Width
plt.figure(figsize=(12,8))
sns.scatterplot(x='PetalLengthCm',y='PetalWidthCm',hue='Species',data=iris_data)

From the two graphs above, we can see that:
* Sepal length and width have low correlation
* Petal length and width have high correlation
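
To make "correlation" concrete, here is a minimal sketch of how Pearson's coefficient (the default used by `DataFrame.corr()`) can be computed by hand. The `length` and `width` arrays are invented for illustration and are not taken from `Iris.csv`; `pearson_r` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical measurements, invented for illustration (not iris data)
length = np.array([1.4, 1.5, 4.5, 4.7, 5.9, 6.1])
width = np.array([0.2, 0.3, 1.4, 1.5, 2.0, 2.3])

def pearson_r(a, b):
    """Pearson correlation: co-variation of a and b scaled by their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r = pearson_r(length, width)
print(round(r, 3))  # a value close to 1 means a strong positive linear relation
```

A value near 0 would indicate little linear relation, which is the situation the heatmap shows for the sepal measurements.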


In [None]:
# Let's look at the bivariate relation between each pair of features by plotting a pair plot
sns.pairplot(iris_data, hue="Species", height=3.2)  # `size` was renamed to `height` in seaborn

> From the pairplot, we can see that the `Iris-setosa` species is separated from the other two across all feature combinations

In [None]:
def ViolinPlot(X,Y1,Y2,data):
  # draw two side-by-side violin plots of Y1 and Y2 grouped by X
  plt.figure(figsize=(15,10))
  plt.subplot(1,2,1)
  sns.violinplot(x=X,y=Y1,data=data)  # use the `data` argument, not the global frame
  plt.title(Y1)
  plt.subplot(1,2,2)
  sns.violinplot(x=X,y=Y2,data=data)
  plt.title(Y2)

In [None]:
ViolinPlot("Species","PetalLengthCm","PetalWidthCm",iris_data)
ViolinPlot("Species","SepalLengthCm","SepalWidthCm",iris_data)

Some of the violins have long tails, so there might be outliers. <br>
Let's check with box plots.

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.boxplot(x='Species',y='PetalLengthCm',data=iris_data)
plt.subplot(2,2,2)
sns.boxplot(x='Species',y='PetalWidthCm',data=iris_data)
plt.subplot(2,2,3)
sns.boxplot(x='Species',y='SepalLengthCm',data=iris_data)
plt.subplot(2,2,4)
sns.boxplot(x='Species',y='SepalWidthCm',data=iris_data)

> We can see some Outliers
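
The points drawn beyond the whiskers follow, by default, the 1.5 × IQR rule. The sketch below reproduces that rule with NumPy so the flagged values can be listed rather than only seen. The `sample` values are hypothetical and `iqr_outliers` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Return the values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical measurements with one low and one high extreme
sample = [2.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 5.0]
outliers = iqr_outliers(sample)
print(outliers)  # [2. 5.]
```

On the real data, the same function could be applied per species, for example to `iris_data.loc[iris_data['Species'] == 'Iris-setosa', 'SepalWidthCm']`.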

#### Label Encoding of Target Variable

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

iris_data['Species'] = le.fit_transform(iris_data['Species'])
iris_data['Species']
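
As a rough illustration of what the encoding step produces: the class names are sorted and each one is replaced by its position in that sorted list. The sketch below mimics this with `np.unique` on a small hypothetical array; it is an equivalent for illustration, not a replacement for `LabelEncoder`.

```python
import numpy as np

# Hypothetical sample of species names in arbitrary order
species = np.array(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

# `classes` holds the sorted unique names, `codes` the integer label of each row
classes, codes = np.unique(species, return_inverse=True)

print(classes.tolist())  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(codes.tolist())    # [2, 0, 1, 0]
```

With the notebook's encoder, the fitted names are available as `le.classes_`, and `le.inverse_transform` maps the integers back to names.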

### Predicting Optimal Values for K

In [None]:
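# This is a clustering (unsupervised) problem, so we use the K-Means algorithm with the elbow method to find the optimal k

from sklearn.cluster import KMeans

# columns 0-4: `Id` plus the four measurements (note that `Id` is part of the clustering input)
x = iris_data.iloc[:, [0, 1, 2, 3, 4]].values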
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(13,8))
plt.plot(range(1, 11), wcss,marker='o')
plt.title('The elbow method',size=15)
plt.xlabel('Number of clusters',size=12)
plt.ylabel('WCSS',size=12) #within cluster sum of squares
plt.show()


> From K = 1 to K = 2, there is a large drop.<br>
> From K = 2 to K = 3, there is a slight drop.<br>
> After K = 3, the slope is almost constant.

Hence, **`K=3`** is the optimal number of clusters.
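
The quantity plotted on the y-axis (`kmeans.inertia_`) is the within-cluster sum of squares: the squared distance from every point to the centroid of its cluster, summed over all points. A minimal sketch on four made-up 2-D points shows why it drops sharply once the number of clusters matches the real grouping. The `wcss` helper and the toy data are assumptions for illustration.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Toy data: two obvious groups of two points each
points = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [5.0, 6.0]])

# k = 1: a single centroid at the overall mean
one_cluster = wcss(points, np.zeros(4, dtype=int), points.mean(axis=0, keepdims=True))

# k = 2: one centroid per group
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
two_clusters = wcss(points, labels, centroids)

print(one_cluster, two_clusters)  # 33.0 1.0
```

Adding further clusters to this toy data could only reduce the value from 1.0 towards 0, which is the flattening that marks the elbow.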

In [None]:
# Predicting the values using Kmeans Algorithm
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
predictions = kmeans.fit_predict(x)
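
For intuition about what fitting does, the sketch below runs the classic Lloyd iteration that K-Means is based on: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. It is a simplified illustration with hand-picked starting centroids and toy data; it does not reproduce scikit-learn's `k-means++` initialisation or its `n_init` restarts. `lloyd_kmeans` is a hypothetical name.

```python
import numpy as np

def lloyd_kmeans(points, initial_centroids, max_iter=300):
    """Alternate between nearest-centroid assignment and centroid update."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(max_iter):
        # Squared distance from every point to every centroid, shape (n_points, k)
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (unchanged if it has none)
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Toy data: two groups; deliberately poor start with both centroids in one group
points = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [5.0, 6.0]])
labels, centroids = lloyd_kmeans(points, initial_centroids=points[:2])

print(labels.tolist())     # [0, 0, 1, 1]
print(centroids.tolist())  # [[1.0, 1.5], [5.0, 5.5]]
```

Even from the poor start, the centroids settle on the two groups after a couple of iterations.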

In [None]:
#Predicted Values
predictions

In [None]:
#centroids
kmeans.cluster_centers_

In [None]:
# visualising the predicted clusters against each of the 4 features
Features = ['Sepal Length','Sepal Width','Petal Length','Petal Width']
plt.figure(figsize=(18,14))
for i in range(1,5):
    plt.subplot(2,2,i)
    # species names in the labels assume clusters 0/1/2 line up with setosa/versicolor/virginica
    plt.scatter(x[predictions == 0,0], x[predictions == 0,i], s=50, c = '#c718f2', label = 'Iris-setosa' )
    plt.scatter(x[predictions == 1,0], x[predictions == 1,i], s=50, c = '#2140ed', label = 'Iris-versicolor' )
    plt.scatter(x[predictions == 2,0], x[predictions == 2,i], s=50, c = '#2cb510', label = 'Iris-virginica' )
    # centroids of the clusters
    plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,i], s = 120, c = 'red', label = 'Centroids')
    plt.title(Features[i-1],size=16)
    plt.xlabel('Id',size=12)
    plt.ylabel(iris_data.columns[i],size=12)
    plt.legend()
plt.suptitle('Clusters w.r.t Features',fontsize=20)
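
K-Means numbers its clusters arbitrarily, so whether cluster 0 really corresponds to a given species is worth checking. One way, sketched here on hypothetical stand-ins for `predictions` and the encoded `Species` column, is to count how often each cluster id co-occurs with each species code and take the majority per cluster.

```python
import numpy as np

# Hypothetical stand-ins for `predictions` and the label-encoded `Species` column
cluster_ids = np.array([1, 1, 1, 0, 0, 2, 2, 2, 0])
species_codes = np.array([0, 0, 0, 1, 1, 2, 2, 1, 1])

n_clusters = cluster_ids.max() + 1
n_species = species_codes.max() + 1

# table[c, s] = number of rows placed in cluster c whose true species code is s
table = np.zeros((n_clusters, n_species), dtype=int)
np.add.at(table, (cluster_ids, species_codes), 1)

# Most frequent species code inside each cluster
majority_species = table.argmax(axis=1)

print(table.tolist())             # [[0, 3, 0], [3, 0, 0], [0, 1, 2]]
print(majority_species.tolist())  # [1, 0, 2]
```

In this made-up example, cluster 0 mostly contains species code 1, so labelling it with the first species name would be misleading; the same check on the real `predictions` would confirm or correct the legend.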

#### From the above graphs and the elbow curve, we can see that `K = 3` gives the optimal clusters.