![](../img/330-banner.png)

# Tutorial 6

UBC 2025-26

## Outline

During this tutorial, you will practice clustering and interpretation of clustering results.

All questions can be discussed with your classmates and the TAs - this is not a graded exercise!

In [None]:
import os
import random
import sys
import time

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

sys.path.append(os.path.join(os.path.abspath(".."), "code"))

import mglearn
#plt.style.use("seaborn")

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler


## The dataset

👻 It's Halloween 👻!

Therefore, we will work with a new dataset containing information about different types of popular Halloween candy. You can download the dataset from [Kaggle](https://www.kaggle.com/datasets/fivethirtyeight/the-ultimate-halloween-candy-power-ranking). We also recommend taking a moment to read the Attribute Information on that page, which explains how the data was collected and which features the dataset includes.

This tutorial was inspired by the [Halloween Candy Data Visualizations](https://github.com/dkhundley/halloween-candy-visualizations/blob/main/notebooks/halloween-candy-visualizations.ipynb) notebook by David Hundley (particularly the two dataset visualizations that we are going to use).

Use the cell below to read the dataset and check the first few rows (make sure the path matches the location on your computer).

In [None]:
candy_df = pd.read_csv("data/candy-data.csv")
candy_df.head()

## EDA

Let's start by familiarizing ourselves with the dataset and seeing whether we can identify possible groups of similar candy. You may use familiar functions to do so (such as `describe()`), as well as the summary plots provided below.

In [None]:
candy_df.describe()

In [None]:
# From https://github.com/dkhundley/halloween-candy-visualizations/blob/main/notebooks/halloween-candy-visualizations.ipynb

# Establishing a 3x3 grid to place all our countplots
figure, axes = plt.subplots(3, 3, figsize = (14, 14))

# Establishing x- and y-coordinate values
x_coord, y_coord = 0, 0

df_binary = candy_df.select_dtypes(int)
df_binary = df_binary.replace({0: 'no', 1: 'yes'})

# Displaying all countplots for each variable appropriately
for feature in df_binary.columns:
    
    # Displaying the countplot for the respective feature
    countplot = sb.countplot(x = df_binary[feature], order = ['no', 'yes'], ax = axes[x_coord, y_coord]);
    
    # Adjusting the y-value limit
    countplot.set_ylim(0, 80)
    
    # Setting the x- and y-label name and font-size
    countplot.set_ylabel('Counts', fontsize = 12)
    countplot.set_xlabel(feature, fontsize = 12)
    
    # Incrementing the coordinate values
    x_coord += 1
    if x_coord == 3:
        y_coord += 1
        x_coord = 0
        
# Adding title to the holistic visualization
figure.suptitle('Visualizing the Counts of Each Binary Feature as Count Plots', fontsize = 18);
figure.subplots_adjust(top = .95)

In [None]:
# From https://github.com/dkhundley/halloween-candy-visualizations/blob/main/notebooks/halloween-candy-visualizations.ipynb

# Creating a single figure to hold the scatter plot
plt.figure(figsize=(12, 12)) 

# Creating scatter plot visualization
scatterplot = sb.scatterplot(data = candy_df, x = 'winpercent', y = 'pricepercent');
    
    
# Iterating through each candy to add its respective label to the scatter plot
for i, candy in enumerate(list(candy_df['competitorname'])):
        
    # Plotting each candy label appropriately
    scatterplot.text(x = candy_df['winpercent'][i] + 0.02, y = candy_df['pricepercent'][i] + 0.02,
                     s = candy, fontsize = 'x-small')
    
# Setting the axis labels
scatterplot.set(xlabel = 'Win Percentage', ylabel = 'Price Percentage')
    
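# Adding a title to the scatter plot's own figure
# (the `figure` variable still refers to the count-plot grid from the previous cell)
scatterplot.figure.suptitle('Visualizing the Relationship between Price and Wins', fontsize = 16);
scatterplot.figure.subplots_adjust(top = .95)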

Based on this information, do you think some types of candy are likely to cluster together?

## Preprocessing

We will use K-Means to cluster this dataset. As you know, K-Means is sensitive to feature scales, so the dataset requires some basic preprocessing to ensure that all features are in comparable ranges.
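To make the scale issue concrete, here is a minimal sketch using three made-up candies (the values are hypothetical, not taken from the dataset). It assumes only that one feature lives on a 0-100 scale while the others live on a 0-1 scale, and it standardizes by hand with NumPy in the same way `StandardScaler` does (zero mean, unit variance per column).

```python
import numpy as np

# Hypothetical toy values (NOT from the candy dataset).
# Columns: sugarpercent (0-1), pricepercent (0-1), winpercent (0-100)
toy = np.array([
    [0.10, 0.10, 50.0],  # candy A
    [0.90, 0.90, 52.0],  # candy B: very different sugar/price, similar win rate
    [0.10, 0.10, 60.0],  # candy C: same sugar/price as A, different win rate
])


def euclidean(a, b):
    """Euclidean distance, the distance K-Means relies on."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Distances on the raw features: the 0-100 column dominates
d_ab_raw = euclidean(toy[0], toy[1])
d_ac_raw = euclidean(toy[0], toy[2])

# Standardize each column (population std, as StandardScaler uses)
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Distances after scaling: every feature contributes comparably
d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])

print(f"raw:    d(A,B)={d_ab_raw:.2f}  d(A,C)={d_ac_raw:.2f}")
print(f"scaled: d(A,B)={d_ab_scaled:.2f}  d(A,C)={d_ac_scaled:.2f}")
```

On the raw values, A looks closer to B only because the win-rate column swamps the other two; after scaling, A is closer to C, which shares its sugar and price values.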

In [None]:
to_scale = ["sugarpercent", "pricepercent", "winpercent"]
drop = ["competitorname"]

# Keep the remaining (binary) columns in their original, deterministic order
passthrough = [
    col for col in candy_df.columns
    if col not in to_scale and col not in drop
]

ct = make_column_transformer(
    (StandardScaler(), to_scale),
    ("passthrough", passthrough),  # no transformations on the binary features    
    ("drop", drop),  
)

In [None]:
column_names = (
    to_scale
    + passthrough
)

In [None]:
candy_transformed = ct.fit_transform(candy_df)

pd.DataFrame(candy_transformed, columns=column_names)

## Clustering with K-Means

Now that the dataset has been processed, let's cluster it using K-Means. Using the cells below as a starting point, complete the following steps:
- Use the elbow method to determine the appropriate number of clusters
- Use the Silhouette Method to evaluate the resulting clusters (you may want to try a couple of neighbouring values around the one found using the elbow method)
- Assign cluster labels to the original dataframe (before preprocessing)
- Describe the resulting clusters; to do this, you may check the `cluster_centers_`, compute means (or medians) for each feature grouped by cluster, or try some other visualization.

What do you think of the resulting clusters? Do they match your prediction?
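If you would like to see the mechanics of these steps before trying them on the candy data, here is a minimal sketch on synthetic data. It is not a solution to the exercise: the data comes from `make_blobs` with three hand-picked centers, and instead of the yellowbrick visualizers imported below it uses plain scikit-learn (`inertia_` for the elbow curve and `silhouette_score` for the silhouette method). All variable names are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (NOT the candy dataset): three well-separated blobs
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=42,
)
toy_df = pd.DataFrame(X, columns=["feat_a", "feat_b"])

# Steps 1 and 2: fit K-Means for several k, recording inertia (elbow method)
# and the average silhouette score (silhouette method)
inertias = {}
silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# Pick the k with the highest average silhouette score
best_k = max(silhouettes, key=silhouettes.get)

# Step 3: assign cluster labels back to the (untransformed) dataframe
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
toy_df["cluster"] = final_model.labels_

# Step 4: describe the clusters with per-cluster feature means
cluster_means = toy_df.groupby("cluster").mean()
print(cluster_means)
```

On real data the elbow and the silhouette scores are rarely this clear-cut, so it is worth comparing a few neighbouring values of k and checking whether the per-cluster means tell a sensible story.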

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

...

*Write your comments here*