# Clustering Exercises
This exercise uses the [Wholesale customers](https://archive.ics.uci.edu/dataset/292/wholesale+customers) dataset from the UCI Machine Learning Repository. The data records the annual spending of a wholesale distributor's clients on diverse product categories.

You may install the `ucimlrepo` package to access the repository's datasets:

```bash
pip install ucimlrepo
```

Then import the package in your Python code:

```python
from ucimlrepo import fetch_ucirepo 
```

To complete the exercises, you may refer to the [Data Manipulation Tutorial](data_manipulation_tutorial.ipynb) and [K-Means clustering](k-means.ipynb) notebooks.

```python
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
```

```python
# import dataset
custDf = pd.read_csv('../dataset/wholesale_customers_data.csv')
custDf
```

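| Unnamed: 0 | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 2 | 3 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 1 | 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | 2 | 3 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 435 | 1 | 3 | 29703 | 12051 | 16027 | 13135 | 182 | 2204 |
| 436 | 1 | 3 | 39228 | 1431 | 764 | 4510 | 93 | 2346 |
| 437 | 2 | 3 | 14531 | 15488 | 30243 | 437 | 14841 | 1867 |
| 438 | 1 | 3 | 10290 | 1981 | 2232 | 1038 | 168 | 2125 |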


## Exercise 1
Get basic information about the dataset and a statistical summary.

<details>
  <summary>Click for answer</summary>

  ```python
  # display basic info
  custDf.info()

  # statistical summary with 2 decimal places
  custDf.describe().round(2)
  ```
</details>

## Exercise 2
Group the dataset by `Channel` and `Region`, aggregating with `min`, `mean`, and `max`.

<details>
  <summary>Click for answer</summary>

  ```python
  # grouping and aggregation
  custDf.groupby(['Channel','Region']).agg(['min','mean','max']).round(2)
  ```
</details>

The resulting table may be too large to read comfortably. Try transposing the result.

<details>
  <summary>Click for answer</summary>

  ```python
  # transpose dataframe
  custDf.groupby(['Channel','Region']).agg(['min','mean','max']).round(2).T
  ```
</details>

## Exercise 3
Make a box plot for each product. Refer to the seaborn example at https://seaborn.pydata.org/examples/horizontal_boxplot.html.

<details>
  <summary>Click for answer</summary>
  Replace `Fresh` with another product to get the corresponding plot.

  ```python
  sns.boxplot(x=custDf['Fresh'])
  ```
</details>

Put all the box plots into one figure.

<details>
  <summary>Click for answer</summary>

  ```python
  cm = 1/2.54  # centimetres to inches
  textsize_labels = 6.5
  textsize_ticks = 6
  products = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

  fig = plt.figure(figsize=(15*cm,12*cm), dpi=300)
  fig.tight_layout()

  # grid specs containing 3 rows and 2 columns
  gs = plt.GridSpec(3, 2, hspace=0.5, wspace=0.2)

  # create axis for subplot
  axes = [fig.add_subplot(gs[i]) for i in range(len(products))]

  # iterate through products list to create subplot of boxplot
  for index, prod in enumerate(products):
      sns.boxplot(x=custDf[prod], ax=axes[index], linewidth=0.5, fliersize=3)
      axes[index].tick_params(labelsize = textsize_ticks)
      axes[index].set_xlabel(prod, fontsize=textsize_labels)
  ```
</details>
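To help read the box plots, here is a minimal standard-library sketch of the numbers a box plot summarises: the quartiles, the interquartile range (IQR), and the points beyond the whisker limits. It assumes the common 1.5 × IQR whisker rule, which is seaborn's default `whis` value. The helper name `boxplot_summary` and the toy values are hypothetical, and the quartile interpolation may differ slightly from the one seaborn uses.

```python
import statistics

def boxplot_summary(values, whis=1.5):
    """Return quartiles, whisker limits, and outliers (hypothetical helper)."""
    # 'inclusive' treats the data as the whole population and interpolates linearly
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    # points outside the whisker limits are drawn as individual fliers
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}

# toy spending values (not taken from the dataset)
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = boxplot_summary(toy_values)
print(summary)
```

On the real data, the same idea could be applied to a column such as `custDf['Fresh'].tolist()` to list the customers whose spending falls outside the whiskers.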


## Exercise 4
Fit a K-means clustering model using the product categories as features, starting with 2 clusters.

<details>
  <summary>Click for answer</summary>

  ```python
  features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
  X = custDf[features]

  # fit k-means model
  km = KMeans(n_clusters=2, random_state=42)
  km.fit(X)

  # assign each customer to its nearest cluster centre
  y_pred = km.predict(X)
  y_pred
  ```
</details>
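For intuition about what fitting a K-means model involves, the following is a minimal pure-Python sketch of the classic iteration (Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its members, and repeat until nothing changes. It is not scikit-learn's implementation, which uses k-means++ initialisation and optimised numerics. The function names `nearest` and `lloyd_kmeans`, the fixed starting centroids, and the toy points are all assumptions made for illustration.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def lloyd_kmeans(points, centroids, n_iter=10):
    """Toy k-means: alternate assignment and update steps (hypothetical helper)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        labels = [nearest(p, centroids) for p in points]
        # update step: move each centroid to the mean of its members
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# two well-separated toy groups (not taken from the dataset)
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = lloyd_kmeans(toy_points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

In this sketch the result depends on the starting centroids, which is why the exercise answers pass `random_state=42` to `KMeans` to make runs reproducible.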

Plot the clusters with `Fresh` on the x-axis and `Milk` on the y-axis.

<details>
  <summary>Click for answer</summary>

  ```python
  # make a copy dataframe of X
  df = X.copy(deep=True)
  df['cluster'] = y_pred

  # plot cluster results
  plt.figure(figsize=(8, 6))
  plt.scatter(df['Fresh'], df['Milk'], c=df['cluster'], cmap='viridis', s=50)
  # centroid columns 0 and 1 correspond to 'Fresh' and 'Milk' in `features`
  plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
  plt.xlabel('Fresh')
  plt.ylabel('Milk')
  plt.title("K-means Clustering on Wholesale Dataset")
  plt.legend()
  plt.grid(True)
  plt.show()
  ```
</details>

## Exercise 5
Evaluate the clustering with the silhouette score.

<details>
  <summary>Click for answer</summary>

  ```python
  from sklearn import metrics
  score = metrics.silhouette_score(X, y_pred)
  score
  ```
</details>
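To make the silhouette score less of a black box, here is a small pure-Python sketch that computes it from its definition. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to the members of any other cluster, and the coefficient is `(b - a) / max(a, b)`. The score is the mean over all points. The function name `silhouette_score_toy` and the toy points are hypothetical, and the sketch assumes at least two clusters and Euclidean distance.

```python
import math

def mean_distance(point, others):
    """Mean Euclidean distance from `point` to every point in `others`."""
    return sum(math.dist(point, q) for q in others) / len(others)

def silhouette_score_toy(points, labels):
    """Mean silhouette coefficient from the definition (hypothetical helper)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels)) if l == lab and j != i]
        if not same:
            scores.append(0.0)  # a cluster with a single point scores 0 by convention
            continue
        a = mean_distance(p, same)  # cohesion: distance within own cluster
        b = min(                    # separation: nearest other cluster
            mean_distance(p, [q for q, l in zip(points, labels) if l == other])
            for other in set(labels) - {lab}
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# toy one-dimensional points (not taken from the dataset)
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score_toy(toy_points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score_toy(toy_points, [0, 1, 0, 1])   # mixed-up grouping
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned points.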

Find a good number of clusters using the elbow method.

<details>
  <summary>Click for answer</summary>

  ```python
  # Calculate WCSS (Within-Cluster Sum of Squares) for Elbow method
  wcss = []

  for i in range(1, 11):
      kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
      kmeans.fit(X)
      wcss.append(kmeans.inertia_)

  # plot elbow score
  plt.figure(figsize=(10,5))
  sns.lineplot(x=range(1, 11), y=wcss, marker='o',color='red')
  plt.title('Elbow')
  plt.xlabel('Number of clusters')
  plt.ylabel('WCSS')
  plt.show()
  ```
</details>
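As a rough illustration of the quantity plotted in an elbow curve, the sketch below computes the within-cluster sum of squares (WCSS) by hand: the squared distance from each point to the mean of its cluster, summed over all points. This mirrors what a fitted model's `inertia_` reports when each centroid is the mean of its cluster. The function name `wcss_toy` and the toy points are assumptions for illustration only.

```python
def wcss_toy(points, labels):
    """Within-cluster sum of squared distances to each cluster mean (hypothetical helper)."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        # centroid = coordinate-wise mean of the cluster's members
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid))
            for p in members
        )
    return total

# two tight toy groups far apart (not taken from the dataset)
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = wcss_toy(toy_points, [0, 0, 0, 0])
two_clusters = wcss_toy(toy_points, [0, 0, 1, 1])
four_clusters = wcss_toy(toy_points, [0, 1, 2, 3])
print(one_cluster, two_clusters, four_clusters)
```

WCSS always shrinks as clusters are added, so the elbow method looks for the point where the drop becomes small: here the large fall from one to two clusters, followed by only a minor gain afterwards, marks two clusters as the elbow.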

## Exercise 6
Fit a K-means model with a different number of clusters, then plot the cluster results.

<details>
  <summary>Click for answer</summary>

  ```python
  # fit k-means model with 4 clusters
  km = KMeans(n_clusters=4, random_state=42)
  km.fit(X)

  y_pred = km.predict(X)
  y_pred

  # make a copy dataframe of X
  df = X.copy(deep=True)
  df['cluster'] = y_pred

  # plot cluster results
  plt.figure(figsize=(8, 6))
  plt.scatter(df['Fresh'], df['Milk'], c=df['cluster'], cmap='viridis', s=50)
  plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
  plt.xlabel('Fresh')
  plt.ylabel('Milk')
  plt.title("K-means Clustering on Wholesale Dataset - 4 clusters")
  plt.legend()
  plt.grid(True)
  plt.show()
  ```
</details>

Fit a K-means model on a selected subset of features, then plot the cluster results.

<details>
  <summary>Click for answer</summary>

  ```python
  # create new variable with two selected features: 'Fresh', 'Detergents_Paper'
  X_selected_features = X[['Fresh', 'Detergents_Paper']]

  # fit k-means model with 4 clusters
  km = KMeans(n_clusters=4, random_state=42)
  km.fit(X_selected_features)

  y_pred = km.predict(X_selected_features)
  y_pred

  # make an explicit copy so that adding a column does not touch X_selected_features
  df = X_selected_features.copy(deep=True)
  df['cluster'] = y_pred

  # plot cluster results
  plt.figure(figsize=(8, 6))
  plt.scatter(df['Fresh'], df['Detergents_Paper'], c=df['cluster'], cmap='viridis', s=50)
  plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
  plt.xlabel('Fresh')
  plt.ylabel('Detergents_Paper')
  plt.title("K-means Clustering on Wholesale Dataset - 4 clusters")
  plt.legend()
  plt.grid(True)
  plt.show()
  ```
</details>