# Hierarchical Clustering with Iris Dataset

In this project, we're using the Iris dataset, a classic dataset in machine learning, to perform hierarchical clustering and visualize the results. The Iris dataset consists of 150 samples from three different species of Iris flowers (Iris-setosa, Iris-versicolor, and Iris-virginica), with four features for each sample: sepal length, sepal width, petal length, and petal width.

We aim to cluster these samples based on their feature values, identify patterns, and visualize the results using dimensionality reduction techniques.

## Key Concepts Used:

- Hierarchical Clustering: A method of unsupervised learning that builds nested clusters based on the distances between data points.

- Dendrogram: A visualization that shows the hierarchy of clusters and where to cut them to form meaningful groups.

- PCA (Principal Component Analysis): A dimensionality reduction technique that simplifies data for easier visualization and interpretation.

- Convex Hull: A geometric boundary around points in a cluster to visualize the spread and coverage of each group.
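
As a rough illustration of the PCA concept listed above, the sketch below projects the four Iris features onto two principal components and reports how much of the total variance those two components retain. The variable names (`X_2d`, `retained`) are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples x 4 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # 150 samples x 2 components

# Fraction of the total variance kept by each of the two components
print(pca.explained_variance_ratio_)

# Combined fraction retained by the 2D projection
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained in 2D: {retained:.1%}")
```

A high retained fraction suggests the 2D scatter plots are a reasonable, though still lossy, view of the four-dimensional data.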

## Why Hierarchical Clustering?

Hierarchical clustering is useful for discovering the structure of data, especially when you don’t know the number of clusters in advance. Unlike methods like K-means, which require specifying the number of clusters, hierarchical clustering provides a full picture of how the data points group together at different levels.
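
A minimal sketch of that idea, assuming the same Ward linkage used later in this notebook: the merge tree is built once, then cut at several levels without re-running the clustering. The dictionary name `cuts` is illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="ward")   # build the full merge tree once

# Cut the same tree into at most 2, 3, and 4 flat clusters
cuts = {k: fcluster(Z, t=k, criterion="maxclust") for k in (2, 3, 4)}

for k, labels in cuts.items():
    print(f"requested {k} clusters, got {len(set(labels))}")
```

Because all cuts come from one tree, each cluster in a finer cut lies entirely inside one cluster of a coarser cut.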

## Project Flow:

- Data Loading: Load the Iris dataset.

- Clustering: Perform hierarchical clustering using the Ward method.

- Dendrogram: Visualize the dendrogram to explore the cluster hierarchy.

- PCA: Reduce data dimensions to make the results more interpretable.

- Cluster Visualization: Plot the clusters in a 2D plane.

- Convex Hulls: Optionally draw boundaries around clusters.

## Preparing the Environment

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import ipywidgets as widgets
from IPython.display import display
from scipy.spatial import ConvexHull
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore")



In [2]:
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target


## Agglomerative Hierarchical Clustering

In [13]:
# Perform hierarchical clustering using the Ward method
linked = linkage(X, method='ward')

# Function to interactively plot clusters and dendrogram
def plot_clusters(num_clusters):
    plt.figure(figsize=(12, 6))  # Set figure size
    plt.subplot(1, 2, 1)  # Define the layout for the dendrogram
    dendrogram(linked, truncate_mode='lastp', p=num_clusters, show_contracted=True)  # Plot the dendrogram
    plt.title("Dendrogram")  # Title for the dendrogram
    plt.xlabel('Index or (Size)')  # X-axis label
    plt.ylabel('Distance')  # Y-axis label

    # Apply PCA to reduce dimensions to 2D
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)

    plt.subplot(1, 2, 2)  # Define the layout for the cluster plot
    # Determine the clusters based on the specified number of clusters
    labels = fcluster(linked, num_clusters, criterion='maxclust')
    unique_labels = np.unique(labels)  # Find unique cluster labels

    # Plot each cluster with a unique color and draw convex hulls
    for label in unique_labels:
        points = X_pca[labels == label]  # Extract points belonging to the current cluster after PCA
        plt.scatter(points[:, 0], points[:, 1], s=30, label=f'Cluster {label}')  # Plot cluster points
        if points.shape[0] > 2:  # Draw convex hulls only if there are more than two points
            hull = ConvexHull(points)
            for simplex in hull.simplices:
                plt.plot(points[simplex, 0], points[simplex, 1], 'g--')  # Draw lines between the vertices of the convex hull

    plt.title(f'{num_clusters} Clusters (PCA Reduced)')  # Title for the cluster plot
    plt.xlabel('Principal Component 1')  # X-axis label for cluster plot
    plt.ylabel('Principal Component 2')  # Y-axis label for cluster plot
    plt.legend()  # Show legend
    plt.tight_layout()  # Adjust layout
    plt.show()

# Interactive widget to choose the number of clusters
num_clusters_slider = widgets.IntSlider(min=1, max=20, step=1, value=3, description='Number of Clusters')
widgets.interactive(plot_clusters, num_clusters=num_clusters_slider)  # Create an interactive slider widget



## Hierarchical Clustering

The cell above applies agglomerative hierarchical clustering to the Iris dataset and provides an interactive way to explore the result through a dendrogram and a PCA scatter plot with convex hulls. Here’s how each part of the code contributes to the functionality:

### Hierarchical Clustering
- **`linkage`**: This function from `scipy.cluster.hierarchy` performs hierarchical/agglomerative clustering. With the 'ward' method, each step merges the pair of clusters whose union causes the smallest increase in total within-cluster variance. This tends to produce compact clusters of similar size.

- **`dendrogram`**: Visualizes the results of the linkage as a dendrogram, which provides a tree-like diagram of the merging clusters. It helps in understanding the sequence of cluster unions and the distance at which each union occurred.
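
To make the output of `linkage` more concrete, here is a small sketch that inspects the linkage matrix for the Iris data. Each row records one merge: the indices of the two clusters joined, the distance between them, and the number of samples in the new cluster. The names `Z`, `first_merge`, and `last_merge` are illustrative.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="ward")

# One row per merge: n_samples - 1 merges, 4 columns each
print(Z.shape)  # (149, 4)

first_merge = Z[0]    # the two closest samples are joined first
last_merge = Z[-1]    # the final merge joins everything into one cluster

print("first merge distance:", first_merge[2])
print("samples in final cluster:", int(last_merge[3]))
```

Indices below `n_samples` refer to original samples; larger indices refer to clusters created by earlier rows of the matrix.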

### Interactive Visualization
- **Plotting Clusters with Convex Hulls**:
  - **`fcluster`**: Extracts flat cluster labels from the linkage matrix. With `criterion='maxclust'`, it cuts the tree so that no more than the requested number of clusters is formed, which lets the slider value set the cluster count without recomputing the linkage.
  
  - **`ConvexHull`**: Calculates and plots convex hulls around the points in each cluster if the cluster has more than two points. Convex hulls are useful for visually encapsulating the extent of each cluster, emphasizing its coherence and separation from other clusters.
  - **Scatter Plot**: Displays the data points, color-coded by their cluster labels, which helps in visualizing the distribution and grouping of data points in two-dimensional space.
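
The hull-drawing step can be illustrated on a toy point set. The sketch below is an assumption-laden alternative to looping over `hull.simplices`: it uses `hull.vertices`, which in 2D lists the boundary points in counterclockwise order, to build one closed outline. The points and the name `outline` are made up for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a unit square plus one interior point
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])

hull = ConvexHull(points)

# Indices of the points on the boundary; the interior point is excluded
print(sorted(hull.vertices.tolist()))

# Repeat the first vertex at the end to close the polygon
closed = np.append(hull.vertices, hull.vertices[0])
outline = points[closed]  # pass outline[:, 0], outline[:, 1] to plt.plot

# For 2D input, `volume` is the enclosed area and `area` is the perimeter
print(hull.volume, hull.area)
```

Drawing a single closed outline per cluster produces the same boundary as plotting each simplex separately, with one `plt.plot` call instead of several.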

### Widgets
- **`IntSlider`**: Allows interactive selection of the number of clusters to display. Adjusting the slider dynamically updates the dendrogram and the scatter plot to reflect the chosen number of clusters.

## Divisive Hierarchical Clustering

In [17]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from scipy.spatial import ConvexHull
import ipywidgets as widgets
from IPython.display import display

# Load the Iris dataset for the clustering demonstration
data = load_iris()
X = data.data
y = data.target

# Apply PCA to reduce the dataset to 2D
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

def divisive_clustering(X, num_clusters):
    n_samples = X.shape[0]
    labels = np.zeros(n_samples, dtype=int)  # Initialize labels for each point
    clusters = [np.arange(n_samples)]  # Start with indices of all points in one cluster

    # Loop to split clusters until the desired number of clusters is reached
    while len(clusters) < num_clusters:
        # Find the largest cluster by the number of points
        largest_cluster_idx = np.argmax([len(cluster) for cluster in clusters])
        largest_cluster = clusters.pop(largest_cluster_idx)  # Remove and capture the largest cluster

        if len(largest_cluster) > 1:   # Check if the cluster has more than one point to split
            # Use KMeans to split the largest cluster into two
            kmeans = KMeans(n_clusters=2, random_state=42)
            new_labels = kmeans.fit_predict(X[largest_cluster])
            # Create new clusters from the split
            cluster1 = largest_cluster[new_labels == 0]
            cluster2 = largest_cluster[new_labels == 1]
            clusters.append(cluster1)
            clusters.append(cluster2)
        else:
            clusters.append(largest_cluster)  # Largest cluster is a single point, so nothing can be split
            break  # Stop here; otherwise the loop would never terminate

    # Assign final labels for plotting
    for label_index, cluster in enumerate(clusters):
        labels[cluster] = label_index

    return labels

# Function to plot the results of divisive clustering
def plot_divisive(num_clusters):
    labels = divisive_clustering(X_reduced, num_clusters)
    plt.figure(figsize=(12, 6))
    unique_labels = np.unique(labels)
    colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))

    # Plot each cluster with points and their convex hull
    for i, color in zip(unique_labels, colors):
        points = X_reduced[labels == i]
        if len(points) > 2:  # ConvexHull needs at least 3 points
            hull = ConvexHull(points)
            for simplex in hull.simplices:
                plt.plot(points[simplex, 0], points[simplex, 1], 'k-')
        plt.scatter(points[:, 0], points[:, 1], color=color, label=f'Cluster {i+1}')

    plt.title(f'Divisive Clustering with {num_clusters} Clusters (PCA Reduced)')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend()
    plt.show()

# Interactive widget to select the number of clusters
num_clusters_slider = widgets.IntSlider(min=2, max=10, step=1, value=3, description='Number of Clusters')
widgets.interactive(plot_divisive, num_clusters=num_clusters_slider)



## Divisive Clustering

This code demonstrates an interactive visualization of divisive clustering, which is a "top-down" approach in hierarchical clustering. Here's a breakdown of its functionality:

- **Divisive Clustering Function**:
  - Initializes all data points in a single cluster.
  - Iteratively splits the largest cluster using `KMeans` clustering until the desired number of clusters is achieved.
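  - Uses `KMeans` from `sklearn.cluster` with `n_clusters=2` to split the largest cluster at each step, so every split assigns points to the nearer of two centroids. Unlike the agglomerative example, the splitting here runs on the 2D PCA projection rather than on the four original features.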

- **Visualization**:
  - For each cluster identified, plots data points and their convex hulls using `ConvexHull` from `scipy.spatial`, which helps to visually encapsulate the cluster boundaries.
  - Allows interactive exploration through an IPython widget slider, enabling the user to change the number of clusters dynamically and observe how the clustering structure changes.
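
A single iteration of the splitting loop described above can be sketched in isolation. This is a minimal illustration, assuming the PCA-reduced Iris data as input; `n_init=10` is set explicitly here so the result does not depend on the scikit-learn version's default, and the names `members`, `left`, and `right` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(load_iris().data)

# Start with every sample in one cluster, identified by row index
members = np.arange(X_2d.shape[0])

# One bisecting step: split the current cluster into two with KMeans
split = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_2d[members])

# Map the 0/1 split labels back to the original row indices
left = members[split == 0]
right = members[split == 1]

print(len(left), len(right))
```

Keeping clusters as arrays of row indices, as the notebook does, means each later split can index straight into the original data without copying it.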