# 2110446 DATA SCIENCE AND DATA ENGINEERING

## **Unit 03:** Traditional Machine Learning

- **Problem:** Clustering (`03_ml_02_2025s3`)
- **Author:** Worralop Srichainont
- **Year:** 2025 (Semester 2)

# Dependencies

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Data Resources

File URL

In [2]:
FILE_PATH = "https://drive.google.com/uc?id=13uYRVh1_724jV4ey-0wUuznNLgsgRcdY"

Load Files

In [3]:
df = pd.read_csv(FILE_PATH)

Display `DataFrame`

In [4]:
df.head()

Unnamed: 0,id,label,cap-shape,cap-surface,bruises,odor,gill-attachment,gill-spacing,gill-size,stalk-shape,...,ring-number,ring-type,spore-print-color,population,habitat,cap-color-rate,gill-color-rate,veil-color-rate,stalk-color-above-ring-rate,stalk-color-below-ring-rate
0,1,p,x,s,t,p,f,c,n,e,...,o,p,k,s,u,1.0,3.0,1.0,1.0,1.0
1,2,e,x,s,t,a,f,c,b,e,...,o,p,n,n,g,2.0,3.0,1.0,1.0,1.0
2,3,e,b,s,t,l,f,c,b,e,...,o,p,n,n,m,3.0,1.0,1.0,1.0,1.0
3,4,p,x,y,t,p,f,c,n,e,...,o,p,k,s,u,3.0,1.0,1.0,1.0,1.0
4,5,e,x,s,f,n,f,w,b,t,...,o,e,n,a,g,4.0,3.0,1.0,1.0,1.0


Get a brief summary of the `DataFrame`

In [5]:
df.describe()

Unnamed: 0,id,cap-color-rate,gill-color-rate,veil-color-rate,stalk-color-above-ring-rate,stalk-color-below-ring-rate
count,5824.0,5797.0,5703.0,5762.0,5793.0,5762.0
mean,2912.5,3.327411,5.88848,1.019264,2.310375,3.632419
std,1681.388315,1.856785,2.812418,0.14722,1.664648,2.370531
min,1.0,1.0,1.0,1.0,1.0,1.0
25%,1456.75,1.0,4.0,1.0,1.0,1.0
50%,2912.5,4.0,6.0,1.0,2.0,5.0
75%,4368.25,5.0,9.0,1.0,3.0,5.0
max,5824.0,10.0,12.0,3.0,9.0,9.0


# Problem `Q1`

Perform the following operations:
- Load the `ModifiedEdibleMushroom.csv` data.
- Keep edible mushrooms only.
- Use only the variables below to describe the distinctive characteristics of edible mushrooms:
    - `cap-color-rate`
    - `stalk-color-above-ring-rate`
- Preprocess the data as follows:
    - Fill missing values with the column mean.
    - Standardize the variables with a standard scaler.

Then report the shape of the data after standardization.

## Filter Rows

Keep only the rows for edible mushrooms (`label` equal to `e`).

In [6]:
df = df[df["label"] == "e"]

## Filter Columns

Filter only these columns.
- `cap-color-rate`
- `stalk-color-above-ring-rate`

In [7]:
df = df[["cap-color-rate", "stalk-color-above-ring-rate"]]
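The two filtering steps can be illustrated on a small frame. This is a minimal sketch with hypothetical toy data (the column names `x` and `y` and all values are made up); it shows that a boolean mask keeps the original row labels, which is why the filtered mushroom data no longer has a contiguous index.

```python
import pandas as pd

# Hypothetical toy frame, not taken from the mushroom dataset.
toy = pd.DataFrame(
    {
        "label": ["p", "e", "e", "p"],
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [5.0, 6.0, 7.0, 8.0],
    }
)

# Row filter: the comparison yields a boolean Series used as a mask.
edible = toy[toy["label"] == "e"]

# Column filter: a list of names returns a DataFrame (not a Series).
subset = edible[["x"]]

# The surviving rows keep their original index labels.
print(subset.index.tolist())
```

If a contiguous index is needed later, `reset_index(drop=True)` would renumber the rows; the notebook does not need this because the index is only used for display.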

Display modified `DataFrame`

In [8]:
df.head()

Unnamed: 0,cap-color-rate,stalk-color-above-ring-rate
1,2.0,1.0
2,3.0,1.0
4,4.0,1.0
5,2.0,1.0
6,3.0,1.0


## Fill Missing Data

Display a sample of rows that contain missing values.

In [9]:
display_indices = df.isna().any(axis=1)
df[display_indices].sample(n=10, random_state=67)

Unnamed: 0,cap-color-rate,stalk-color-above-ring-rate
62,4.0,
2897,,1.0
82,4.0,
80,1.0,
74,1.0,
68,1.0,
2916,,6.0
79,4.0,
64,5.0,
73,1.0,


Fill missing values with the mean of each column.

In [10]:
df = df.fillna(df.mean())
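As a minimal sketch of what this call does, assuming a hypothetical toy frame: `DataFrame.mean()` skips `NaN` by default and returns one mean per column, and `fillna()` with that Series aligns on column labels, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the mushroom dataset.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# One mean per column, computed over the non-missing values only.
col_means = toy.mean()

# Each NaN is replaced by the mean of its own column.
filled = toy.fillna(col_means)
print(filled)
```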

Check that the missing values have been filled.

In [11]:
df[display_indices].sample(n=10, random_state=67)

Unnamed: 0,cap-color-rate,stalk-color-above-ring-rate
62,4.0,2.410405
2897,3.467588,1.0
82,4.0,2.410405
80,1.0,2.410405
74,1.0,2.410405
68,1.0,2.410405
2916,3.467588,6.0
79,4.0,2.410405
64,5.0,2.410405
73,1.0,2.410405


## Standard Scaler

Before training a model, we need to standardize all variables using `StandardScaler`.

`StandardScaler` applies Z-score normalization:
$$z = \frac{x - \mu}{\sigma}$$
- $z$ is the normalized data.
- $x$ is the original data.
- $\mu$ is the mean.
- $\sigma$ is the standard deviation.

**NOTE**: `fit_transform()` returns `np.ndarray` type, not `pd.DataFrame` type.
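The formula can be checked by hand on a small array. The sketch below uses hypothetical toy values and relies on `StandardScaler` using the population standard deviation (`ddof=0`), which is also the default of `np.std`. A side effect worth noting: a value equal to the column mean, such as a mean-imputed entry, maps to approximately 0 after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features.
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Library result.
scaled = StandardScaler().fit_transform(toy)

# Manual z-score per column, using the population standard deviation.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(type(scaled))
```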

In [12]:
# Initialize StandardScaler object.
scaler = StandardScaler()

# Data normalization.
normalized_arr = scaler.fit_transform(df)

# Convert the numpy array back to DataFrame.
df = pd.DataFrame(normalized_arr, columns=df.columns, index=df.index)

Display `DataFrame` after normalization.

In [13]:
df[display_indices].sample(n=10, random_state=67)

Unnamed: 0,cap-color-rate,stalk-color-above-ring-rate
62,0.2666451,0.0
2897,-2.22411e-16,-0.747608
82,0.2666451,0.0
80,-1.23583,0.0
74,-1.23583,0.0
68,-1.23583,0.0
2916,-2.22411e-16,1.902724
79,0.2666451,0.0
64,0.7674701,0.0
73,-1.23583,0.0


## `DataFrame` Shape

In [14]:
print(f"The shape of the DataFrame is {df.shape}")
print(f"- It has {df.shape[0]} rows.")
print(f"- It has {df.shape[1]} columns.")

The shape of the DataFrame is (2104, 2)
- It has 2104 rows.
- It has 2 columns.


# Problem `Q2`

Perform the following operations:
- Do all operations from `Q1`.
- Implement K-means clustering with 5 clusters, using the following settings:
    - Set `n_clusters` to 5
    - Set `random_state` to 0
    - Set `n_init` to `"auto"`
- Show the maximum centroid value of each of the 2 features, rounded to two decimal places:
    - `cap-color-rate`
    - `stalk-color-above-ring-rate`

## K-Means Clustering

Initialize a K-means clustering model.

In [15]:
model = KMeans(n_clusters=5, random_state=0, n_init="auto")

Begin clustering on the dataset.

In [16]:
model.fit(df)

## Centroids

Get centroids of each cluster from the model.

In [17]:
print("Centroid points of 5 clusters:")
centroids = model.cluster_centers_
print(centroids)

Centroid points of 5 clusters:
[[ 0.69701187  0.3582104 ]
 [-1.21998586  2.30140606]
 [ 0.30323849 -0.65383856]
 [ 2.5059661  -0.74760802]
 [-1.22906193 -0.29354739]]


Find the maximum centroid value of each of the 2 features, rounded to 2 decimal places.

In [18]:
max_centroid = np.max(centroids, axis=0)
max_centroid = np.round(max_centroid, 2)

print("Maximum value of each feature:")
print(max_centroid)

Maximum value of each feature:
[2.51 2.3 ]
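Note that `axis=0` takes the maximum down each column, so the two reported values can come from different clusters (in the output above, 2.51 and 2.3 come from different rows of the centroid array). A minimal sketch with hypothetical centroid values makes this visible using `np.argmax`:

```python
import numpy as np

# Hypothetical centroids: 3 clusters, 2 features.
toy_centroids = np.array([[0.5, 2.0], [3.0, -1.0], [-2.0, 0.25]])

# Per-feature maximum across clusters.
col_max = np.max(toy_centroids, axis=0)

# Row (cluster) index where each per-feature maximum occurs.
which_cluster = np.argmax(toy_centroids, axis=0)

print(col_max)
print(which_cluster)
```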


# Problem `Q3`

Perform the following operations:
- Do all operations from `Q1` and `Q2`.
- Convert the centroid values to the original scale.

Show the minimum centroid value of each of the 2 features, rounded to 2 decimal places.

## Inverse Scaling

To interpret the centroids, we convert the normalized values back to the original scale using the same `StandardScaler` object.

`StandardScaler` maps normalized data back to the original scale through a process called **Inverse Transform**, using the formula:
$$x = (z \cdot \sigma) + \mu$$
- $x$ is the reconstructed original data.
- $z$ is the normalized data.
- $\mu$ is the mean.
- $\sigma$ is the standard deviation.

**NOTE**: `inverse_transform()` returns `np.ndarray` type, not `pd.DataFrame`.
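The inverse formula can be verified with a round trip. This is a minimal sketch on hypothetical toy values; it uses the fitted scaler's `mean_` and `scale_` attributes, which store the per-feature $\mu$ and $\sigma$ learned during `fit()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features.
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

toy_scaler = StandardScaler().fit(toy)
z = toy_scaler.transform(toy)

# Manual inverse: x = z * sigma + mu, applied per feature.
manual = z * toy_scaler.scale_ + toy_scaler.mean_

# Library inverse; should recover the original values up to float error.
restored = toy_scaler.inverse_transform(z)

print(np.allclose(restored, toy))
print(np.allclose(manual, restored))
```

The scaler must be the same fitted object that produced the normalized data, since a newly created `StandardScaler` has no stored mean or scale.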

Display the normalized centroid points.

In [19]:
print("Normalized cluster centroid points")
centroids = model.cluster_centers_
print(centroids)

Normalized cluster centroid points
[[ 0.69701187  0.3582104 ]
 [-1.21998586  2.30140606]
 [ 0.30323849 -0.65383856]
 [ 2.5059661  -0.74760802]
 [-1.22906193 -0.29354739]]


Use `inverse_transform()` to scale back into the original value.

In [20]:
centroids = scaler.inverse_transform(centroids)

Display centroid points in original scale.

In [21]:
print("Cluster centroid points in original scale")
print(centroids)

Cluster centroid points in original scale
[[4.85931559 3.08618871]
 [1.03163575 6.75213675]
 [4.07306613 1.17690136]
 [8.47126437 1.        ]
 [1.01351351 1.85661095]]


Find the minimum centroid value of each of the 2 features, rounded to 2 decimal places.

In [22]:
min_centroid = np.min(centroids, axis=0)
min_centroid = np.round(min_centroid, 2)

print("Minimum value of each feature:")
print(min_centroid)

Minimum value of each feature:
[1.01 1.  ]


# Grader

## Solution

Write the solutions to all problems inside the `Clustering` class, then submit it to the grader.

In [23]:
class Clustering:
    def __init__(self, file_path):
        # Initialization Attributes
        self.file_path = file_path
        self.df = pd.read_csv(file_path)

        # Additional Attributes
        self.scaler = None
        self.model = None
        self.centroids = None
        self.normalized_centroids = None

    def Q1(self):
        """
        Q1: Please do the following operations
            -   Choose edible mushroom only.
            -   Only the variable below have been selected to describe
                the distinctive characteristics of edible mushrooms:
                -   cap-color-rate
                -   stalk-color-above-ring-rate
            -   Provide a proper data preprocessing as follows:
                -   Fill missing values with the mean.
                -   Standardize variables with standard scaler.
            After these operations, please answer a shape after standardize variables.
        """
        # Select rows labeled as edible mushroom.
        self.df = self.df[self.df["label"] == "e"]

        # Select columns
        self.df = self.df[["cap-color-rate", "stalk-color-above-ring-rate"]]

        # Fill missing value with mean
        self.df = self.df.fillna(self.df.mean())

        # Initialize StandardScaler object
        self.scaler = StandardScaler()

        # Data normalization
        normalized_arr = self.scaler.fit_transform(self.df)

        # Convert the numpy array back to DataFrame
        self.df = pd.DataFrame(
            normalized_arr, columns=self.df.columns, index=self.df.index
        )

        # Return the shape of the normalized DataFrame
        return self.df.shape

    def Q2(self):
        """
        Q2: Please do the following operations
            -   Do all operations from Q1
            -   Implement K-means clustering with 5 clusters with following:
                -   Set n_clusters to 5
                -   Set random_state to 0
                -   Set n_init to "auto"
            Show the maximum centroid of 2 features in two decimal places.
            -   cap-color-rate
            -   stalk-color-above-ring-rate
        """
        # Ensure that all operations in Q1 are executed.
        self.Q1()

        # Initialize K-means clustering model.
        self.model = KMeans(n_clusters=5, random_state=0, n_init="auto")

        # Begin clustering on the dataset.
        self.model.fit(self.df)

        # Calculate the maximum normalized cluster centroids.
        self.normalized_centroids = self.model.cluster_centers_
        max_centroids = np.max(self.normalized_centroids, axis=0)

        # Rounded to 2 decimal places
        max_centroids = np.round(max_centroids, 2)

        # Return maximum centroids.
        return max_centroids

    def Q3(self):
        """
        Q3: Please do the following operations
        -   Do all operations from `Q1` and `Q2`.
        -   Convert the centroid values to the original scale.
        Show the minimum centroid of 2 features rounded in 2 decimal places.
        """
        # Ensure that all operations in Q1 to Q2 are executed.
        self.Q2()

        # Scale back to the original scale.
        self.centroids = self.scaler.inverse_transform(self.normalized_centroids)

        # Calculate the minimum cluster centroids.
        min_centroids = np.min(self.centroids, axis=0)

        # Rounded to 2 decimal places
        min_centroids = np.round(min_centroids, 2)

        # Return minimum centroids
        return min_centroids

## Run Test

Create a utility class `ClusteringTest` to check if each method in the class works correctly.

Use `assert`, `try` and `except` to check if the result matches.
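When comparing floating-point arrays in a test, a tolerance-based check is usually safer than exact equality. The sketch below is an illustration only: it rounds two of the normalized centroid values printed earlier and compares them with `np.testing.assert_allclose`, then shows a classic case where exact equality fails but a tolerance-based check succeeds.

```python
import numpy as np

# Expected answer and a value computed by rounding raw centroid coordinates.
expected = np.array([2.51, 2.30])
computed = np.round(np.array([2.5059661, 2.30140606]), 2)

# Raises AssertionError with a readable diff if the arrays differ beyond tolerance.
np.testing.assert_allclose(computed, expected)

# Exact equality is brittle for floats in general.
exact_equal = (0.1 + 0.2) == 0.3
close_enough = bool(np.isclose(0.1 + 0.2, 0.3))

print(exact_equal, close_enough)
```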

In [24]:
class ClusteringTest:
    @staticmethod
    def run_tests():
        TESTCASES = (
            {"func_name": "Q1", "key": (2104, 2)},
            {"func_name": "Q2", "key": np.array([2.51, 2.30])},
            {"func_name": "Q3", "key": np.array([1.01, 1.00])},
        )

        for testcase in TESTCASES:
            func_name = testcase["func_name"]
            key = testcase["key"]
            if isinstance(key, np.ndarray):
                ClusteringTest.run_np_test(func_name, key)
            else:
                ClusteringTest.run_test(func_name, key)

    @staticmethod
    def run_test(func_name, key):
        try:
            model = Clustering(FILE_PATH)
            result = getattr(model, func_name)()
            assert result == key, f"Expected {key} but got {result}"
        except Exception as e:
            print(f"{func_name} failed with {e}")
            return
        print(f"{func_name} passed.")

    @staticmethod
    def run_np_test(func_name, key):
        try:
            model = Clustering(FILE_PATH)
            result = getattr(model, func_name)()
            assert np.allclose(result, key), f"Expected {key} but got {result}"
        except Exception as e:
            print(f"{func_name} failed with {e}")
            return
        print(f"{func_name} passed.")

In [25]:
ClusteringTest.run_tests()

Q1 passed.
Q2 passed.
Q3 passed.
