# Species Segmentation with Cluster Analysis

The Iris flower dataset is one of the most popular datasets in machine learning. You can read a lot about it online and have probably already heard of it: https://en.wikipedia.org/wiki/Iris_flower_data_set

We didn't want to use it in the lectures, but believe that it would be very interesting for you to try it out (and maybe read about it on your own).

There are 4 features: sepal length, sepal width, petal length, and petal width.

Start by creating 2 clusters. Then standardize the data and try again. Does it make a difference?

Use the Elbow method to determine how many clusters there are.


## Import the relevant libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set the styles to Seaborn
sns.set()
# Import the KMeans module so we can perform k-means clustering with sklearn
from sklearn.cluster import KMeans

## Load the data

Load the data from the csv file: <i>'iris-dataset.csv'</i>.

In [None]:
data=pd.read_csv('iris-dataset.csv')
data

## Plot the data

For this exercise, try to cluster the iris flowers by the shape of their sepal. 

<i> Use the 'sepal_length' and 'sepal_width' variables.</i> 

In [None]:
plt.scatter(data['sepal_length'],data['sepal_width'])
plt.ylabel('sepal width')
plt.xlabel('sepal length')

## Clustering (unscaled data)

Separate the original data into 2 clusters.

In [None]:
# Work on a copy so the loaded data stays untouched (all 4 features are included)
x=data.copy()

In [None]:
kmeans=KMeans(2)
kmeans.fit(x)

In [None]:
clusters=x.copy()
# Reuse the model fitted above instead of refitting it with fit_predict
clusters['clusters_pred']=kmeans.predict(x)

In [None]:
plt.scatter(clusters['sepal_length'],clusters['sepal_width'],c=clusters['clusters_pred'],cmap='rainbow')
plt.ylabel('sepal width')
plt.xlabel('sepal length')
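
To make it clearer what `KMeans` does when it fits the data, here is a minimal sketch of the basic k-means loop in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function names `nearest` and `simple_kmeans` and the toy points are made up for illustration, and the initialization (taking the first k points) is deliberately naive; sklearn's `KMeans` uses a smarter initialization (k-means++ by default), so this is not how the library itself is implemented.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def simple_kmeans(points, k, n_iter=10):
    """Naive k-means: alternate the assignment and update steps n_iter times."""
    # Naive initialization (an assumption for this sketch): the first k points
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy 2-D points (not taken from the iris csv): two obvious groups
toy_points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]
labels, centroids = simple_kmeans(toy_points, k=2)
print(labels)  # [0, 1, 0, 1]
```

With a poor starting point, this naive version can settle on a bad split, which is one reason to rely on the library implementation in practice.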

## Standardize the variables

Import and use the <i>scale</i> function from sklearn's <i>preprocessing</i> module to standardize the data.

In [None]:
from sklearn import preprocessing
x_scaled=preprocessing.scale(x)
x_scaled
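
As a rough illustration of what standardization does to a single column, the sketch below computes z-scores by hand. It assumes the default behaviour of `preprocessing.scale`: subtract the column mean and divide by the population standard deviation. The helper `standardize_column` and the sample values are hypothetical and are not read from the csv file.

```python
import statistics


def standardize_column(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up sepal lengths, for illustration only
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
z = standardize_column(sepal_length)
print([round(v, 3) for v in z])
```

After this transformation the column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation just because of its units.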

## Clustering (scaled data)

In [None]:
kmeans_scaled=KMeans(3)
kmeans_scaled.fit(x_scaled)
clusters_scaled=x.copy()
clusters_scaled['clusters_pred']=kmeans_scaled.predict(x_scaled)
clusters_scaled

In [None]:
plt.scatter(clusters_scaled['sepal_length'],clusters_scaled['sepal_width'],c=clusters_scaled['clusters_pred'],cmap='rainbow')
plt.ylabel('sepal width')
plt.xlabel('sepal length')

## Take Advantage of the Elbow Method

### WCSS

In [None]:
# Create an empty list
wcss =[]

# Create all possible cluster solutions with a loop
# We have chosen to get solutions from 1 to 9 clusters; you can amend that if you wish
for i in range(1,10):
    # Cluster solution with i clusters
    kmeans = KMeans(i)
    # Fit the STANDARDIZED data
    kmeans.fit(x_scaled)
    # Append the WCSS for the iteration
    wcss.append(kmeans.inertia_)
    
# Check the result
wcss
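
The `inertia_` value collected above is the within-cluster sum of squares: the sum of squared distances from each point to its nearest centroid. A minimal sketch of that calculation in plain Python follows; the helper name `wcss_for`, the toy points, and the centroids are made up for illustration.

```python
def wcss_for(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


# Four toy points forming a rectangle (not taken from the iris csv)
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 1.0), (5.0, 3.0)]
one_centroid = [(3.0, 2.0)]
two_centroids = [(1.0, 2.0), (5.0, 2.0)]

print(wcss_for(points, one_centroid))   # 20.0
print(wcss_for(points, two_centroids))  # 4.0
```

Adding a centroid can only keep the WCSS the same or lower it, which is why the curve always trends downward and we look for a bend rather than a minimum.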

### The Elbow Method

In [None]:
# Plot the number of clusters vs WCSS
plt.plot(range(1,10),wcss)
# Name your axes
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

How many clusters are there?
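
If the bend is hard to judge by eye, one rough heuristic (not part of the original exercise, and only one of several possible rules) is to pick the k where the curve changes slope most sharply, measured by the largest second difference of the WCSS values. The function name `elbow_by_second_difference` and the WCSS numbers below are made up for illustration; they are not the results of the loop above.

```python
def elbow_by_second_difference(wcss, first_k=1):
    """Return the k at which the WCSS curve bends most sharply.

    The bend at position i is wcss[i-1] - 2*wcss[i] + wcss[i+1];
    wcss[0] is assumed to belong to k = first_k.
    """
    best_k, best_bend = None, float('-inf')
    for i in range(1, len(wcss) - 1):
        bend = wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        if bend > best_bend:
            best_k, best_bend = first_k + i, bend
    return best_k


# Made-up WCSS values for k = 1..6
example_wcss = [100.0, 60.0, 20.0, 17.0, 15.0, 14.0]
print(elbow_by_second_difference(example_wcss))  # 3
```

Treat the result as a hint only; the plot and your knowledge of the data should still drive the final choice.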

## Compare your solutions to the original iris dataset

The original (full) iris data is located in <i>iris-with-answers.csv</i>. Load the csv, plot the data, and compare it with your solution.

The original iris dataset contains exactly 3 species of Iris, and its species labels serve as the ground truth here.

The 2-cluster solution seemed good, but the real iris dataset has 3 species, which corresponds to a 3-cluster solution. Clustering results therefore cannot always be taken at face value: a solution with x clusters can look good even when the true number of groups is larger (or smaller).

In [None]:
real_data = pd.read_csv('iris-with-answers.csv')
real_data['species'].unique()

In [None]:
# We use the map function to encode the species names as integers: 'setosa' -> 0, 'versicolor' -> 1, 'virginica' -> 2
real_data['species'] = real_data['species'].map({'setosa':0, 'versicolor':1 , 'virginica':2})

## Scatter plots (which we will use for comparison)
### Real data

In [None]:
plt.scatter(real_data['sepal_length'], real_data['sepal_width'], c=real_data['species'], cmap='rainbow')

In [None]:
plt.scatter(real_data['petal_length'], real_data['petal_width'], c=real_data['species'], cmap='rainbow')

### Our clustering solution data

In [None]:
plt.scatter(clusters_scaled['sepal_length'],clusters_scaled['sepal_width'],c=clusters_scaled['clusters_pred'],cmap='rainbow')
plt.ylabel('sepal width')
plt.xlabel('sepal length')

In [None]:
plt.scatter(clusters_scaled['petal_length'],clusters_scaled['petal_width'],c=clusters_scaled['clusters_pred'],cmap='rainbow')
plt.ylabel('petal width')
plt.xlabel('petal length')
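
When comparing the plots, keep in mind that k-means numbers its clusters arbitrarily, so cluster 0 need not correspond to species 0 and the colours may not line up. One way to compare the two labelings is to count how often each (species, cluster) pair occurs. The sketch below does this in plain Python with made-up labels; the lists `species` and `cluster_ids` are hypothetical and are not the notebook's actual results.

```python
from collections import Counter

# Made-up labels for six flowers, for illustration only
species = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
cluster_ids = [1, 1, 0, 2, 2, 2]

# Count every (species, cluster) combination
table = Counter(zip(species, cluster_ids))
for (name, cluster), count in sorted(table.items()):
    print(name, cluster, count)
```

In the notebook itself, `pd.crosstab(real_data['species'], clusters_scaled['clusters_pred'])` should give the same kind of table, assuming the rows of the two csv files are in the same order.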