# Task 02: Prediction using Unsupervised ML

## Submitted By: Yashuv Baskota
### Language: Python
### Dataset: https://bit.ly/3kXTdox

#### Description:
To find the optimum number of clusters in the given __'Iris'__ dataset and represent them visually, I chose the `K-means clustering` algorithm. K-means is an unsupervised learning algorithm that partitions a dataset into a predefined number of clusters (k). It works by iteratively assigning each data point to the nearest cluster center and then updating each center to the mean of the points assigned to it, until the assignments stop changing.
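
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is only an illustrative toy, not what scikit-learn's `KMeans` does internally: the function name `kmeans_sketch`, the random initialisation (instead of k-means++), and the toy `points` array are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-means: alternate between an assignment and an update step."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random (not k-means++)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        # (a center with no points is left where it is)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return labels, centers

# two well-separated toy blobs (hypothetical data, not the Iris measurements)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

Which blob ends up as cluster 0 depends on the random start, so only the grouping is meaningful, not the index itself.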

<center><img src="image/iris_flowers.png" width="500px"> <br>
    <u>Image source</u>: Wikipedia
</center>

* To apply K-means clustering to the *Iris dataset*, we first need to decide on the number of clusters (k). Once the optimal value of k has been determined, we can use the K-means algorithm to partition the Iris dataset into k clusters.
* To visualize the results, we can draw a scatter plot of the data points, coloring each point according to the cluster it belongs to. This lets us see the clusters the algorithm has identified and get a sense of how well the data has been partitioned.

## 1. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

## 2. Load the Iris dataset

In [None]:
# load the Iris dataset from a CSV file
data = pd.read_csv('data/Iris.csv')

## 3. EDA

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data.isnull().sum()

In [None]:
data[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']].describe()

### Unique Species

In [None]:
data["Species"].unique()

### Frequency distribution of species

In [None]:
data["Species"].value_counts()

### Box Plot

In [None]:
sns.boxplot(x="Species",y="PetalLengthCm",data=data)
plt.show()

## 4. Data Preprocessing

### Extract Feature Columns

In [None]:
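# select the four measurement columns by position
# (this assumes column 0 is an Id column and column 5 is Species)
X = data.iloc[:, [1, 2, 3, 4]].values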

### Feature Scaling

In [None]:
# scaling is left disabled here: the K-means fits below use the raw X
# # standardize the features
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)

## 5. K-Means

In [None]:
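# fit K-means to the dataset for different values of k and compute the
# within-cluster sum of squares (WCSS), exposed by scikit-learn as inertia_
wcss = []
for i in range(1, 11):
    # n_init is set explicitly to match the final model and avoid version-dependent defaults
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)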

### Determine the optimal number of clusters using the elbow method

In [None]:
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
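
Reading the elbow off a plot is a judgment call, so a second criterion can be a useful cross-check. The sketch below scores each k with the silhouette score that the notebook already imports. As an assumption for the sake of a self-contained example, it loads scikit-learn's bundled copy of Iris through `load_iris` instead of `data/Iris.csv`; the names `X_demo` and `scores` are introduced here for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# assumption: scikit-learn's bundled Iris data stands in for data/Iris.csv
X_demo = load_iris().data

scores = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = model.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

for k, score in scores.items():
    print(f'k = {k}: silhouette = {score:.4f}')
```

Higher silhouette values indicate tighter, better separated clusters. The two criteria need not point to the same k, so it is reasonable to treat both as aids rather than as a definitive answer.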

### Implementing K-Means Clustering

In [None]:
# fit K-means to the dataset with the optimal number of clusters
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 42)
predictions = kmeans.fit_predict(X)

### Visualising the clusters

In [None]:
# K-means numbers its clusters arbitrarily: the index-to-species labels below
# are specific to this run and should be checked against data["Species"]
plt.scatter(X[predictions == 1, 0], X[predictions == 1, 1], s = 90, c = 'red', label = 'Iris-setosa')
plt.scatter(X[predictions == 0, 0], X[predictions == 0, 1], s = 90, c = 'green', label = 'Iris-versicolor')
plt.scatter(X[predictions == 2, 0], X[predictions == 2, 1], s = 90, c = 'blue', label = 'Iris-virginica')

# plot the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')

plt.title("K-means Cluster: Iris Flower Species")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()

### Performance Evaluation

In [None]:
# evaluate the performance of the clustering
print(f'Within-cluster sum of squares: {kmeans.inertia_:.4f}')
print(f'Silhouette score: {silhouette_score(X, predictions):.4f}')
print(f'Calinski-Harabasz score: {calinski_harabasz_score(X, predictions):.4f}')
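
The scores above are internal measures that never look at the species column. Because K-means numbers its clusters arbitrarily, the species names attached to each cluster index in the plot are worth checking rather than assuming. A minimal sketch of such a check follows; the helper `majority_species_per_cluster` and the toy lists are hypothetical and stand in for `predictions` and `data["Species"]`.

```python
from collections import Counter

def majority_species_per_cluster(cluster_labels, species):
    """Return {cluster index: most common species among its members}."""
    members = {}
    for cluster, name in zip(cluster_labels, species):
        members.setdefault(int(cluster), []).append(name)
    return {cluster: Counter(names).most_common(1)[0][0]
            for cluster, names in sorted(members.items())}

# toy labels standing in for `predictions` and data["Species"]
toy_clusters = [1, 1, 0, 0, 2, 2, 0]
toy_species = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
               'Iris-versicolor', 'Iris-virginica', 'Iris-virginica',
               'Iris-virginica']
mapping = majority_species_per_cluster(toy_clusters, toy_species)
print(mapping)
# → {0: 'Iris-versicolor', 1: 'Iris-setosa', 2: 'Iris-virginica'}
```

In this notebook the call would be `majority_species_per_cluster(predictions, data["Species"])`, and the resulting dictionary could supply the legend labels instead of hard-coded names.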

## 6. Making Predictions

In [None]:
# predict the cluster for a new data point

new_data_point1 = [[5.7,2.5,5.0,2.0]]
new_data_point2 = [[5.0,3.6,1.4,0.2]]
new_data_point3 = [[7.2,3.2,6.0,1.8]]

def make_prediction(new_data_point):

    # kmeans was fitted on a plain NumPy array, so pass the new point as an array too
    # (column order: sepal length, sepal width, petal length, petal width, all in cm)
    new_data_point_arr = np.asarray(new_data_point)

    # predict the cluster for the new data point
    prediction = kmeans.predict(new_data_point_arr)[0]
    print(f'Predicted cluster for new data point: {prediction}')

In [None]:
make_prediction(new_data_point1)
make_prediction(new_data_point2)
make_prediction(new_data_point3)
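
For a fitted K-means model, predicting the cluster of a new point amounts to finding the closest cluster center. The sketch below shows that idea with plain NumPy. The function `nearest_centroid` and the values in `demo_centers` are hypothetical; in the notebook the real centers would come from `kmeans.cluster_centers_`.

```python
import numpy as np

def nearest_centroid(point, centers):
    """Index of the center closest (Euclidean distance) to a single point."""
    distances = np.linalg.norm(np.asarray(centers) - np.asarray(point), axis=1)
    return int(distances.argmin())

# hypothetical centers, in (sepal length, sepal width, petal length, petal width) order
demo_centers = [[5.0, 3.4, 1.5, 0.2],
                [5.9, 2.8, 4.4, 1.4],
                [6.8, 3.1, 5.7, 2.1]]

closest = nearest_centroid([5.0, 3.6, 1.4, 0.2], demo_centers)
print(closest)  # → 0
```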


---
__Thank You!__