## KMeans, StandardScaler and ARI in sklearn

In [None]:
# sklearn imports
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

# pandas and numpy imports
import pandas as pd
import numpy as np

# plotting imports
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

# set sns theme and set pandas to display all rows and columns
sns.set_theme()
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Load Iris dataset - data exploration and preprocessing


- The Iris dataset is a classic and widely used dataset in machine learning and statistics. 

- The dataset consists of measurements of four attributes of three different species of iris flowers: 
  - setosa
  - versicolor
  - virginica
  
  
- The four attributes measured for each flower are sepal length, sepal width, petal length, and petal width, all in centimeters. 
  
- The dataset contains 150 observations, with 50 observations for each of the three species.
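
The figures quoted above can be checked directly against the data. The following is a minimal sketch; the variable names `iris_check`, `n_samples`, `n_features`, and `samples_per_species` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load a separate copy of the dataset for this check
iris_check = load_iris()

# Number of observations and number of measured attributes
n_samples, n_features = iris_check['data'].shape

# Count how many observations belong to each species (targets are 0, 1, 2)
samples_per_species = np.bincount(iris_check['target'])

print(n_samples, n_features)
print(dict(zip(iris_check['target_names'], samples_per_species)))
```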


In [None]:
# Load iris dataset
iris = load_iris()

print(iris.keys())

In [None]:
print(iris['data'])

In [None]:
print(iris['feature_names'])

In [None]:
print(iris['target'])

In [None]:
print(iris['target_names'])

### Data analysis

In [None]:
# Create dataset df
iris_df = pd.DataFrame(
    iris['data'],
    columns=iris['feature_names']
)

iris_df.head()

In [None]:
iris_df.shape

In [None]:
iris_df.isna().sum().T

In [None]:
# Summary statistics
iris_df.describe().T

In [None]:
# Plot feature histograms (DataFrame.hist creates its own figure)
iris_df.hist()
plt.suptitle("Iris histograms")

In [None]:
# Data boxplots
sns.boxplot(data=iris_df)

### Simplify the dataset

To keep the clustering example simple, we will retain only the 'petal length (cm)' and 'petal width (cm)' features and only the setosa and virginica species from the Iris dataset. With just two dimensions, the clustering results can be visualized directly in a 2D plot.

In [None]:
# Add label to the dataset
iris_df['label'] = [iris['target_names'][target] for target in iris['target']]

# Remove versicolor class
iris_df = iris_df[iris_df['label'] != 'versicolor']

# Keep only petal length and petal width
iris_df = iris_df.filter(
    items=[
        'petal length (cm)', 
        'petal width (cm)', 
        'label'
    ]
)

In [None]:
# Scatterplot
sns.scatterplot(
    iris_df, 
    x='petal length (cm)', 
    y='petal width (cm)', 
    hue='label'
)

plt.title('Data subset - raw features')

## Cluster the dataset without data scaling

Even though scaling the data is an important preprocessing step for K-means clustering, the following cell demonstrates that, given the nature of this data, K-means can work just fine even without scaling.
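
One way to see why the raw features are already easy to cluster is to measure the gap between the two species along each petal feature. The sketch below is only an illustration; `iris_raw`, `petals`, `species`, and `gap` are names introduced here. A positive gap means the two species do not overlap along that feature at all.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load a separate copy of the data and keep the two petal features
iris_raw = load_iris()
petals = pd.DataFrame(iris_raw['data'], columns=iris_raw['feature_names'])[
    ['petal length (cm)', 'petal width (cm)']
]

# Species name for every row
species = iris_raw['target_names'][iris_raw['target']]

setosa = petals[species == 'setosa']
virginica = petals[species == 'virginica']

# Distance between the largest setosa value and the smallest virginica value,
# computed separately for each feature
gap = virginica.min() - setosa.max()
print(gap)
```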

In [None]:
# Prepare K-means clustering input
cluster_data = iris_df[['petal length (cm)', 'petal width (cm)']]

# Run K-means clustering with k=2
kmeans = KMeans(n_clusters=2, n_init='auto')
kmeans.fit(cluster_data)

# Extract cluster id for each data point
iris_df['clusters'] = kmeans.predict(cluster_data)

# Plot clustering
sns.scatterplot(
    iris_df, 
    x='petal length (cm)', 
    y='petal width (cm)', 
    hue='clusters'
)

### Adjusted rand index for clustering comparison

The sklearn **adjusted_rand_score** function compares a clustering with reference labels. It only considers which data points are grouped together, so it works even when the two use different formats (e.g. cluster ids are integers while label ids are strings) or number the groups differently.
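
A small toy example makes this concrete. The lists below are made up for illustration and are not taken from the dataset: `same_grouping` splits the points exactly like the labels but uses swapped integer ids, while `different_grouping` splits them in an unrelated way.

```python
from sklearn.metrics import adjusted_rand_score

# Reference labels given as strings
true_labels = ['setosa', 'setosa', 'virginica', 'virginica']

# Same partition as the labels, expressed with arbitrary integer ids
same_grouping = [1, 1, 0, 0]

# A partition that mixes the two species in every cluster
different_grouping = [0, 1, 0, 1]

ari_same = adjusted_rand_score(true_labels, same_grouping)
ari_diff = adjusted_rand_score(true_labels, different_grouping)

print(ari_same, ari_diff)
```

The first score is 1 because the grouping is identical up to renaming, while the second is much lower because the grouping disagrees with the labels.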

In [None]:
# Label values
iris_df['label'].to_numpy()

In [None]:
# Cluster values
iris_df['clusters'].to_numpy()

In [None]:
# Adjusted rand index - value of 1 means that original labels and clustering results match perfectly.
adjusted_rand_score(iris_df['label'].to_numpy(), iris_df['clusters'].to_numpy())

## Clustering with data scaling

In this section, we will perform clustering on scaled data.

Standard scaling, also known as z-score normalization, is a data preprocessing technique that centers each feature around its mean and scales it to a standard deviation of one. As a result, all features contribute comparably to distance computations, and features with large numeric ranges do not dominate the analysis.
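
As a quick illustration, the z-score can be computed by hand and compared with `StandardScaler`. The array `X` below is made-up toy data, not part of the Iris dataset. Note that `StandardScaler` uses the population standard deviation, which matches the NumPy default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation with sklearn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```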

In [None]:
# Prepare the input data
cluster_data = iris_df[['petal length (cm)', 'petal width (cm)']]

# Scale the data (fit the scaler and transform in one step)
standard_scaler = StandardScaler()
cluster_data = standard_scaler.fit_transform(cluster_data)
cluster_data = pd.DataFrame(
    cluster_data, 
    columns = ['petal length (cm)', 'petal width (cm)']
)

# Run K-means clustering with k=2
kmeans = KMeans(n_clusters=2, n_init='auto')
kmeans.fit(cluster_data)


# Extract cluster assignment for each data point
cluster_data['clusters'] = kmeans.predict(cluster_data)

# Plot clustering
sns.scatterplot(
    cluster_data, 
    x='petal length (cm)', 
    y='petal width (cm)', 
    hue='clusters'
)

plt.title('Scaled data')

In [None]:
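# Adjusted Rand index - value of 1 means that original labels and clustering results match perfectly.
# Use the cluster ids from the scaled-data run (cluster_data), not the unscaled run stored in iris_df.
# cluster_data keeps the row order of iris_df, so the two arrays align position by position.
adjusted_rand_score(
    iris_df['label'].to_numpy(), 
    cluster_data['clusters'].to_numpy()
)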