# **CSCI 566 - Spring 2025 - Homework 1**
# **Problem 1: K-Means and KNN Algorithms (32 points)**
Adapted from a Kaggle Blog https://www.kaggle.com/code/orkunaran/detailed-eda-k-prototypes-clustering. The 'TODO' blocks are the blanks to fill in. The original data analytics process is useful and insightful, so we keep most of the analytics even though it has less to do with the algorithm implementation.

# Market Customer Personality Analysis

## **Objective**
Customer Personality Analysis is a detailed examination of a company’s ideal customers. It helps businesses understand customer preferences, behaviors, and concerns, allowing them to modify products and marketing strategies accordingly.

By analyzing different customer segments, businesses can allocate resources efficiently, targeting only those segments most likely to buy a specific product rather than marketing to the entire customer base.

---

## **Steps in This Analysis**

### **1. Data Processing (You don't need to do anything)**
- Load and clean the dataset.
- Handle missing values and perform necessary transformations.
- Normalize numerical features and encode categorical variables.
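
As a reference for the normalization step, the following is the usual z-score definition. It assumes "normalize" here means the per-column standardization that the `StandardScaler` class later in this notebook describes; $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$, and a column with $\sigma_j = 0$ needs a guard against division by zero.

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$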

### **2. Clustering Using K-Means Algorithm (You will implement the details of the K-Means Algorithm)**
- Choose an appropriate number of clusters using methods such as the Elbow Method or Silhouette Score.
- Apply the K-Means clustering algorithm to segment customers.
- Analyze the customer groups and their characteristics.
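
For reference, the two selection criteria named above are commonly defined as follows. The notation is an assumption for illustration: $\mu_{c(i)}$ is the center of the cluster that sample $x_i$ is assigned to, $a(i)$ is the mean distance from $x_i$ to the other samples in its own cluster, and $b(i)$ is the mean distance from $x_i$ to the samples of the nearest other cluster.

$$
\mathrm{SSE}(k) = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2,
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

The Elbow Method looks for the $k$ after which $\mathrm{SSE}(k)$ stops dropping sharply, while the Silhouette Score averages $s(i)$ over all samples and prefers values closer to 1.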

### **3. Clustering Using K-Prototype Algorithm (You don't need to implement the details of the algorithm, but you need to complete some evaluation functions and choose parameters for the algorithm)**
- Combine categorical and numerical data for better clustering.
- Use the K-Prototype algorithm to cluster customers based on both categorical and numerical attributes.
- Compare results with K-Means clustering for better segmentation.
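
To make "combining categorical and numerical data" concrete, here is a minimal sketch of a Huang-style mixed dissimilarity: squared Euclidean distance on the numeric features plus a weight `gamma` times the number of mismatched categorical features. The function name, the sample values, and the choice `gamma=0.5` are illustrative assumptions, not the internals of the `kmodes` library.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numeric features plus
    gamma times the count of mismatched categorical features."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = np.sum(diff ** 2)
    categorical_part = np.sum(np.asarray(x_cat) != np.asarray(proto_cat))
    return float(numeric_part + gamma * categorical_part)

# Hypothetical standardized customer (Income, age) with (Education, Marital_Status),
# compared against a hypothetical cluster prototype.
d = mixed_dissimilarity(
    x_num=[0.5, -1.0], x_cat=["PhD", "Married"],
    proto_num=[0.0, 0.0], proto_cat=["PhD", "Single"],
    gamma=0.5,
)
print(d)
```

A larger `gamma` makes categorical mismatches count more relative to numeric distance, so it controls how much the categorical columns influence the clusters.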

### **4. Classifying Using KNN Algorithm (You will implement the details of the KNN Algorithm)**
- Use the numerical data.
- Choose `Education` as the label to predict.
- Use the PCA library for analysis. ATTENTION!! There is a question about the performance of KNN; please remember to answer it within the same markdown block.

In [None]:
# Connect to Google Drive, navigate to the current folder, and pip install from requirements.txt
# If you run it locally, you don't need to run this code block
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the folder containing requirements.txt in your Google Drive
directory_name = "CSCI566-S25-Material/HW1/Part1"
%cd /content/drive/MyDrive/{directory_name}/

# Install required packages
!pip install -r ../requirements.txt

In [None]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# visuals
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from kmodes.kprototypes import KPrototypes


In [None]:
market = pd.read_csv('marketing_campaign.csv', delimiter="\t")
market = market.drop(['ID','Dt_Customer'], axis=1)

# Descriptive Statistics

In [None]:
market.head(n=10)


In [None]:
market.shape

In [None]:
market.info()

In [None]:
market.describe()

# **Introduction to Data**

## **Dataset Overview**
- The dataset contains **2,240 rows** and **28 columns**.
- Each row represents a unique customer, with **no duplicate entries**.
- No missing values or negative values were found in the dataset.
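
The claims above can be checked directly with pandas. The snippet below is a small sketch on a made-up two-column frame (the values are invented for illustration); running the same three checks on `market` would confirm or correct the bullet points.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, None, 46344.0],
    "Recency": [58, 38, 26, 38],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Missing values per column
missing_per_column = toy.isna().sum()

# Count of negative entries across numeric columns (NaN compares as False)
n_negative = int((toy.select_dtypes("number") < 0).sum().sum())

print(n_duplicates, missing_per_column.to_dict(), n_negative)
```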

## **Understanding the Columns**
Some columns require further clarification, such as:
- **AcceptedCmp**
- **Recency**
- **Z_CostContact**
- **Response**

Understanding these columns is essential for accurate analysis.

## **Key Observations**
- **Income Distribution:**
  - The highest income value is **666,666**, which appears to be an outlier.
  - Most income values fall between **35k and 68k**.

- **Website Visits:**
  - The **average number of website visits per month** is **5** for **2,000 customers**.
  - The maximum recorded value is **20**.

- **AcceptedCmp Columns:**
  - These columns contain **binary values (0 or 1)**, indicating a yes/no response.

# Creating New Columns

In [None]:
# total number of children
market['no_children'] = market['Kidhome'] + market['Teenhome']
# total items bought
market['total_items_bought'] = market['MntWines'] + market['MntFruits'] + market['MntMeatProducts'] + market[
    'MntFishProducts'] + market['MntSweetProducts'] + market['MntGoldProds']
# total number of purchases
market['total_nbr_purchases'] = market['NumDealsPurchases'] + market['NumWebPurchases'] + market[
    'NumCatalogPurchases'] + market['NumStorePurchases']
#customer age
market['age'] = 2021 - market['Year_Birth']

market = market.drop(['Year_Birth'], axis=1)

In [None]:
market.describe()

In [None]:
market[market['Income']>600000]

In [None]:
market[market['age']>90]

In [None]:
market = market[market['Income']<600000]
market = market[market['age']<90]

# Visuals

In [None]:
def hist_with_vline(column):
    """
    Plots a histogram for the given column with 100 bins.
    Draws vertical lines for both the mean and median values.

    Parameters:
    column (str): The name of the column to be visualized.

    Returns:
    None
    """

    # Set figure size
    plt.figure(figsize=(12, 6))

    # Plot histogram with 100 bins
    sns.histplot(market[column], bins=100)

    # Set plot title
    plt.title(f'Histogram of {column}')

    # Get the y-axis limits
    _, y_lim = plt.ylim()

    # Compute mean and median values
    mean_value = market[column].mean()
    median_value = market[column].median()

    # Draw a vertical line for the mean (red)
    plt.axvline(mean_value, color='r', linestyle='--', label=f'Mean: {mean_value:.2f}')
    plt.text(mean_value * 1.1, y_lim * 0.95, f"Mean: {mean_value:.2f}", color='r')

    # Draw a vertical line for the median (blue)
    plt.axvline(median_value, color='b', linestyle='--', label=f'Median: {median_value:.2f}')
    plt.text(median_value * 1.1, y_lim * 0.90, f"Median: {median_value:.2f}", color='b')

    # Display legend for better understanding
    plt.legend()

    # Show the plot
    plt.show()

# Take a look at the income

In [None]:
hist_with_vline('Income')

In [None]:
# Customers who earn more than 120k are probably outliers, so we remove them.
market = market[market['Income']<120000]

In [None]:
# Most of the customers earn between 40k to 70k.
hist_with_vline('Income')

In [None]:
hist_with_vline('total_items_bought')

In [None]:
columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
titles = ['Wines Sold', 'Fruits Sold', 'Meat Products Sold', 'Fish Products Sold', 'Sweets Sold', 'Gold Sold']
colors = ['blue', 'green', 'darkblue', 'red', 'orange', 'yellow']

fig, ax = plt.subplots(2, 3, figsize=(16, 10))
# All the sold-product histograms are right-skewed: the majority of customers buy less than a certain amount of each item.
# Wines are the most sold items (675k), followed by Meat products (364k), while Fruit and Sweet products are the least sold (58k and 59k respectively).
for i in range(len(columns)):
    sns.histplot(market[columns[i]], bins=100, ax=ax[i // 3, i % 3], color=colors[i])
    ax[i // 3, i % 3].set_title('Distribution of ' + titles[i])
    ax[i // 3, i % 3].set_xlabel(titles[i])
    ax[i // 3, i % 3].text(s=f"Total number of \n{columns[i]} sold are {market[columns[i]].sum()} ",
                           x=market[columns[i]].max() / 3.5, y=200)

In [None]:
plt.figure(figsize=(12, 6))
# There is a roughly linear relation between income and the number of items bought.
_ = sns.scatterplot(x='Income', y='total_items_bought', data=market)
_ = plt.title('Income vs Total Items Bought')
_ = plt.ylabel('Total Items Bought')

In [None]:
px.scatter(market,
           x='Income',
           y='total_items_bought',
           color='Education',
           title='Income According to Educational Status', )

In [None]:
market.Education.value_counts()

In [None]:
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 6))
# Fix the bar order to the value_counts() order so the "n" annotations below line up with the bars.
edu_counts = market['Education'].value_counts()
edu_order = list(edu_counts.index)
_ = sns.barplot(x='Education', y='Income', data=market, order=edu_order, ax=ax0)
ax0.set_title('Income According to Education')
_ = sns.barplot(x='Education', y='total_items_bought', data=market, order=edu_order, ax=ax1)
ax1.set_title('Total Items Bought by Customers by Their Educational Status')
# Customers with a PhD earn and spend more than customers with any other educational background. Not surprisingly, customers with a Basic education earn and spend the least. Given how much the number of customers differs between groups, it is worth investigating what customers with different educational backgrounds buy.
for pos, n in enumerate(edu_counts):
    _ = ax0.text(s=f"n :{n}", x=pos - 0.25, y=10000)

In [None]:
px.scatter(market,
           x='Income',
           y='total_items_bought',
           color='Marital_Status',
           title='Income According to Marital Status', )

In [None]:
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(15, 6))
# Fix the bar order to the value_counts() order so the "n" annotations below line up with the bars.
marital_counts = market['Marital_Status'].value_counts()
marital_order = list(marital_counts.index)
_ = sns.barplot(x='Marital_Status', y='Income', data=market, order=marital_order, ax=ax0)
_ = ax0.set_title('Income According to Marital Status')
_ = sns.barplot(x='Marital_Status', y='total_items_bought', data=market, order=marital_order, ax=ax1)
_ = ax1.set_title('Total Items Bought by Customers by Their Marital Status')

# Individuals labeled as “absurd” have the highest earnings, followed by widows. However, categories like “Alone,” “absurd,” and “YOLO” might not be relevant for analysis, so it’s better to focus on the other groups and their purchasing behavior.

for pos, n in enumerate(marital_counts):
    _ = ax0.text(s=f"n = \n{n}", x=pos - 0.25, y=10000)

In [None]:
# Since the number of individuals in each group varies significantly, using the total sum of purchases (e.g., Wine, Meat, etc.) wouldn’t be ideal. Instead, calculating the mean for each category would provide a clearer comparison. Let’s first examine the data using the groupby function.
market.groupby(['Education'])[market.select_dtypes(include=['number']).columns].agg(['mean', 'sum'])


# Analysis of Product Sales

## Education and Product Preferences
- Wine Consumption: There is a strong correlation between education level and wine purchases.
  - PhD holders buy the most wine.
  - Customers with only basic education (likely those who did not complete secondary school) purchase the least.
  - Other education levels have similar purchasing patterns (a statistical significance test, such as the Kruskal-Wallis test followed by pairwise Mann-Whitney U tests, could provide more insights).
- Other Products:
  - Graduates tend to buy more fruits, meat, and gold products.
  - Customers with a secondary education (2nd Cycle) purchase more fish and sweet products.
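
The testing idea mentioned for wine purchases (a Kruskal-Wallis test followed by Mann-Whitney U tests) could be sketched as below. The data are synthetic right-skewed samples for three hypothetical education groups, and the Bonferroni correction is one common choice among several; none of the numbers come from the marketing dataset.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic, right-skewed "wine spend" samples (illustrative only)
groups = {
    "Basic": rng.exponential(scale=50, size=200),
    "Graduation": rng.exponential(scale=300, size=200),
    "PhD": rng.exponential(scale=400, size=200),
}

# Omnibus test: is at least one group distributed differently from the others?
stat, p_kw = kruskal(*groups.values())
print(f"Kruskal-Wallis: H={stat:.1f}, p={p_kw:.3g}")

# Pairwise follow-up, Bonferroni-corrected for the number of comparisons
names = list(groups)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
pairwise_p = {}
for a, b in pairs:
    p = mannwhitneyu(groups[a], groups[b], alternative="two-sided").pvalue
    pairwise_p[(a, b)] = min(1.0, p * len(pairs))
    print(f"{a} vs {b}: corrected p={pairwise_p[(a, b)]:.3g}")
```

On the real data, each group would be `market.loc[market['Education'] == level, 'MntWines']` for one education level.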

---

## Sales Channels: How Are Products Sold?
We analyze product purchases across different sales channels: deals, web, catalog, and in-store purchases.

### Purchases via Deals
- Customers with a Master’s degree frequently buy products through deals (possibly while pursuing their PhD).
- However, Graduates and PhD holders also take advantage of deals significantly.

### Online Shopping Trends
- PhD holders and Graduates use online shopping slightly more than those with a Master's degree.

### Catalog Purchases
- Again, PhD holders and Graduates are the most likely to purchase via catalogs.

### In-Store Purchases
- PhD holders seem to prefer shopping in stores the most.
- Customers with a Master’s or Graduate degree also visit physical stores frequently.

### Website Visits
- Customers with basic education levels visit the store’s website more than any other group.
- This suggests that they actively follow deals and promotions but might face challenges in completing purchases.
- A potential strategy for store owners: Introduce special deals or incentives to encourage these customers to make purchases.

In [None]:
# Do children affect market shopping?
# The families with no children earn and spend more than families with children.
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 6), sharex=True)
_ = sns.barplot(x=market.no_children, y=market.Income, ax=ax0)
_ = sns.barplot(x=market.no_children, y=market.total_items_bought, ax=ax1)
ax0.text(s=f"n:{market[market['no_children'] == 0]['no_children'].count()}", x=-0.25, y=20000)
ax0.text(s=f"n:{market[market['no_children'] == 1]['no_children'].count()}", x=0.75, y=20000)
ax0.text(s=f"n:{market[market['no_children'] == 2]['no_children'].count()}", x=1.75, y=20000)
ax0.text(s=f"n:{market[market['no_children'] == 3]['no_children'].count()}", x=2.75, y=20000)

ax1.text(s=f"Mean Sales: \n{market[market['no_children'] == 0]['total_items_bought'].mean():.2f}", x=-0.35, y=50)
ax1.text(s=f"Mean Sales: \n{market[market['no_children'] == 1]['total_items_bought'].mean():.2f}", x=0.65, y=50)
ax1.text(s=f"Mean Sales: \n{market[market['no_children'] == 2]['total_items_bought'].mean():.2f}", x=1.65, y=50)
ax1.text(s=f"Mean Sales: \n{market[market['no_children'] == 3]['total_items_bought'].mean():.2f}", x=2.65, y=50)

# KMeans

In [None]:
from sklearn.metrics import silhouette_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


##################################################
# StandardScaler class
##################################################
class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X):
        """
        Calculate the mean and standard deviation for each column.

        Parameters:
            X: numpy array or DataFrame, shape (n_samples, n_features)
        """
        # If the input is a Pandas DataFrame, convert it to a numpy array
        # Prevent division by zero for features with zero standard deviation
        # 2 Points
        # TODO ===== YOUR CODE HERE =====

        # TODO ==========================
        return self

    def transform(self, X):
        """
        Standardize the data using the computed mean and standard deviation.

        Parameters:
            X: numpy array or DataFrame

        Returns:
            Standardized numpy array
        """
        if isinstance(X, pd.DataFrame):
            X = X.values
        # 2 Points
        # TODO ===== YOUR CODE HERE =====
        return 
        # TODO ==========================

    def fit_transform(self, X):
        """
        Fit to the data, then transform it.
        """
        self.fit(X)
        return self.transform(X)


##################################################
# KMeans class
##################################################
class KMeans:
    def __init__(self, n_clusters=8, max_iter=300, tol=1e-4, random_state=None):
        """
        Parameters:
            n_clusters: Number of clusters.
            max_iter: Maximum number of iterations.
            tol: Convergence threshold based on center movement.
            random_state: Random seed.
        """
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state
        self.cluster_centers_ = None
        self.labels_ = None
        self.inertia_ = None  # Sum of Squared Errors (SSE)

    def _initialize_centers(self, X):
        """
        Randomly select (use random_state) n_clusters samples as the initial cluster centers.
        """
        np.random.seed(self.random_state)
        indices = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        return X[indices]

    def fit(self, X):
        """
        Train the KMeans model.

        Parameters:
            X: numpy array or DataFrame, shape (n_samples, n_features)
        """
        if isinstance(X, pd.DataFrame):
            X = X.values
        n_samples, n_features = X.shape

        # 1. Initialize centers
        # 2 Points
        # TODO ===== YOUR CODE HERE =====

        # TODO ==========================

        for i in range(self.max_iter):
            # 2. Assign each sample to the nearest center (using Euclidean distance)
            distances = np.linalg.norm(X[:, np.newaxis] - centers, axis=2)  # shape: (n_samples, n_clusters)
            labels = np.argmin(distances, axis=1)

            # 3. Update each cluster center
            new_centers = np.zeros((self.n_clusters, n_features))
            for k in range(self.n_clusters):
                # If there are samples in cluster k, compute the mean; otherwise, randomly reinitialize the center
                # 2 Points
                # TODO ===== YOUR CODE HERE =====
                
                # TODO ==========================

            # 4. Check for convergence: stop if all centers move less than tol
            # 2 Points
            # TODO ===== YOUR CODE HERE =====
            
            # TODO ==========================

        # 5. Final assignment and computation of Sum of Squared Errors (SSE)
        distances = np.linalg.norm(X[:, np.newaxis] - centers, axis=2)
        labels = np.argmin(distances, axis=1)
        inertia = np.sum((X - centers[labels]) ** 2)

        self.cluster_centers_ = centers
        self.labels_ = labels
        self.inertia_ = inertia

        return self
    # You actually won't use this method in this assignment, but once a K-Means algorithm has been finalized, it can be used for the new data.
    def predict(self, X):
        """
        Predict the cluster labels for new data using Euclidean Distance.

        Parameters:
            X: numpy array or DataFrame, shape (n_samples, n_features)

        Returns:
            Array of cluster labels
        """
        if isinstance(X, pd.DataFrame):
            X = X.values
        # 2 Points
        # TODO ===== YOUR CODE HERE =====

        # TODO ==========================
        return labels

    def fit_predict(self, X):
        """
        Fit the data and return the cluster labels.
        """
        self.fit(X)
        return self.labels_



# List columns to exclude (categorical columns, including numeric ones that are treated as categorical)
exclude_cols = [
    'Education',
    'Marital_Status',
    'AcceptedCmp3',
    'AcceptedCmp4',
    'AcceptedCmp5',
    'AcceptedCmp1',
    'AcceptedCmp2',
    'Complain',
    'Z_CostContact',
    'Z_Revenue',
    'Response'
]

# Exclude these columns from 'market' and copy the result to df_kmeans.
# Assume that 'market' is a preloaded DataFrame. Load data if necessary.
df_kmeans = market.drop(columns=exclude_cols).copy()

# Standardize the remaining numeric columns
scaler = StandardScaler()
df_kmeans_scaled = scaler.fit_transform(df_kmeans)

# Use the Elbow Method to select the appropriate number of clusters (k)
cost_kmeans = []
K_range = range(2, 6)
# 2 Points
# TODO ===== YOUR CODE HERE =====

# TODO ==========================

# Plot the elbow curve
plt.figure(figsize=(6, 4))
sns.lineplot(x=list(K_range), y=cost_kmeans, marker='o')
plt.title('Elbow Method for KMeans')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia / SSE')
plt.show()



In [None]:
# Choose the appropriate number of clusters and perform the final clustering
# 2 Points
# TODO ===== YOUR CODE HERE =====

# TODO ==========================
kmeans_final = KMeans(n_clusters=k_final, random_state=0)
clusters_kmeans = kmeans_final.fit_predict(df_kmeans_scaled)

# Add the new cluster labels back to the original DataFrame
market['kmeans_clusters'] = clusters_kmeans

# Visualization (example: Income vs total_items_bought)
plt.figure(figsize=(12, 6))
sns.scatterplot(
    x='Income',
    y='total_items_bought',
    data=market,
    hue='kmeans_clusters',
    palette='Set2'
)
plt.title('Income vs. Total Items Bought (KMeans Clusters)')
plt.show()


In [None]:

# View statistical descriptions for each cluster (numeric columns)
cluster_stats_kmeans = (
    market
    .groupby('kmeans_clusters')[market.select_dtypes('number').columns]
    .agg(['mean', 'sum'])
)
print(cluster_stats_kmeans)



In [None]:
##################################################
# Compute Sum of Squared Errors (SSE) and Silhouette Coefficient
##################################################

# Print the final Sum of Squared Errors (SSE) saved in our custom KMeans model
print("Final SSE (Inertia):", kmeans_final.inertia_)

# Compute the Silhouette Coefficient
# Note: Silhouette score is based on sample distances, so interpret cautiously for high-dimensional data
silhouette_coef = silhouette_score(df_kmeans_scaled, clusters_kmeans)
print("Silhouette Coefficient:", silhouette_coef)

In [None]:

fig, ax = plt.subplots(2, 3, figsize=(24, 12))
_ = sns.scatterplot(x='Income', y='MntWines', data=market, hue='kmeans_clusters', palette='Set2', ax=ax[0, 0])
_ = sns.scatterplot(x='Income', y='MntFruits', data=market, hue='kmeans_clusters', palette='Set2', ax=ax[0, 1])
_ = sns.scatterplot(x='Income', y='MntMeatProducts', data=market, hue='kmeans_clusters', palette='Set2', ax=ax[0, 2])
_ = sns.scatterplot(x='Income', y='MntFishProducts', data=market, hue='kmeans_clusters', palette='Set2', ax=ax[1, 0])
_ = sns.scatterplot(x='Income', y='MntSweetProducts', data=market, hue='kmeans_clusters', palette='Set2', ax=ax[1, 1])
_ = sns.scatterplot(x='Income', y='MntGoldProds', data=market, hue='kmeans_clusters', palette='Set2', ax=ax[1, 2])

_ = ax[0, 0].set_title('Amount of Wine Products Bought by kmeans_clusters')
_ = ax[0, 1].set_title('Amount of Fruits Bought by kmeans_clusters')
_ = ax[0, 2].set_title('Amount of Meat Products Bought by kmeans_clusters')
_ = ax[1, 0].set_title('Amount of Fish Products Bought by kmeans_clusters')
_ = ax[1, 1].set_title('Amount of Sweet Products Bought by kmeans_clusters')
_ = ax[1, 2].set_title('Amount of Gold Products Bought by kmeans_clusters')

# K-Prototype

In [None]:
ss = StandardScaler()
ss_market = market.copy()

# Select all numeric columns first
numeric_cols = ss_market.select_dtypes(include='number').columns

# Then filter out the columns to exclude (even if they are numeric, they are treated as categorical and not used for scaling)
cols_to_scale = [col for col in numeric_cols if col not in exclude_cols]

# Standardize the selected columns
ss_market[cols_to_scale] = ss.fit_transform(ss_market[cols_to_scale])

In [None]:
# getting categorical columns and their indices.
catColumnsPos = [ss_market.columns.get_loc(col) for col in exclude_cols if col in ss_market.columns]
print('Categorical columns           : {}'.format(exclude_cols))


In [None]:
print('Categorical columns position  : {}'.format(catColumnsPos))

In [None]:
dfMatrix = ss_market.to_numpy()

In [None]:
cost = []
for x in range(2, 6):
    kprototype = KPrototypes(n_jobs=-1, n_clusters=x, init='Huang', random_state=0)
    clusters_kprototype = kprototype.fit_predict(dfMatrix, categorical=catColumnsPos)
    cost.append(kprototype.cost_)
    print('kprototye_clusters initiation: {}'.format(clusters_kprototype))

In [None]:
# Converting the results into a dataframe and plotting them
df_cost = pd.DataFrame()
df_cost['kprototye_clusters'] = range(2, 6)
df_cost['cost'] = cost

In [None]:
# elbow method for number of clusters
sns.lineplot(x='kprototye_clusters', y='cost', data=df_cost)

In [None]:
# K-Prototypes tuned: choose the appropriate number of clusters and perform the final clustering
# 2 Points
# TODO ===== YOUR CODE HERE =====
kprototype = KPrototypes(n_jobs=-1, n_clusters=, init='Huang', random_state=0)
# TODO ==========================
clusters_kprototype = kprototype.fit_predict(dfMatrix, categorical=catColumnsPos)

In [None]:
market['kprototye_clusters'] = clusters_kprototype

In [None]:
plt.figure(figsize=(12, 6))

# Create a scatter plot for Income vs. Total Items Bought with clusters labeled by 'kprototye_clusters'
_ = sns.scatterplot(
    x='Income',
    y='total_items_bought',
    data=market,
    hue='kprototye_clusters',
    palette='Set2'
)
_ = plt.title('Income vs Total Items Bought')
_ = plt.ylabel('Total Items Bought')

def compute_sse(df_numeric, clusters):
    """
    Compute the Sum of Squared Errors (SSE) for clusters using only numeric data. Mimic the process for K-Means

    Parameters:
        df_numeric (numpy.ndarray): 2D array containing numeric data.
        clusters (numpy.ndarray): Array of cluster labels corresponding to the rows of df_numeric.

    Returns:
        float: The total SSE across all clusters.
    """
    sse = 0.0
    # 2 Points
    # TODO ===== YOUR CODE HERE =====

    # TODO ==========================
    return sse

##################################################
# Compute Within-Cluster SSE and Silhouette Coefficient
##################################################

# 1. Get the indices of numeric columns by excluding all categorical column positions
numeric_indices = [i for i in range(dfMatrix.shape[1]) if i not in catColumnsPos]

# 2. Extract the numeric data from dfMatrix
df_numeric = dfMatrix[:, numeric_indices]

# 3. Compute the SSE for each cluster using the defined function
sse_kprototype = compute_sse(df_numeric, clusters_kprototype)
print("Final SSE (numeric columns only):", sse_kprototype)

# 4. Compute the Silhouette Coefficient
# Note: The Silhouette score relies on pairwise distances between samples, so caution is needed when interpreting
# results for high-dimensional data.
silhouette_coef_kprototype = silhouette_score(df_kmeans_scaled, clusters_kprototype)
print("Silhouette Coefficient:", silhouette_coef_kprototype)

In [None]:
# Now that our clusters are set, we need to visualize them and describe the data by cluster.
# From the results so far, we can conclude that there are 4 clusters of customers:
# Regular (cluster 2), Bronze (cluster 1), Premium (cluster 0), and Gold customers (cluster 3) (couldn't come up with better names).
# Let's look at the characteristics of these clusters:
market.groupby(['kprototye_clusters'])[market.select_dtypes(include=['number']).columns].agg(['mean', 'sum'])


In [None]:
fig, ax = plt.subplots(2, 3, figsize=(24, 12))
_ = sns.scatterplot(x='Income', y='MntWines', data=market, hue='kprototye_clusters', palette='Set2', ax=ax[0, 0])
_ = sns.scatterplot(x='Income', y='MntFruits', data=market, hue='kprototye_clusters', palette='Set2', ax=ax[0, 1])
_ = sns.scatterplot(x='Income', y='MntMeatProducts', data=market, hue='kprototye_clusters', palette='Set2', ax=ax[0, 2])
_ = sns.scatterplot(x='Income', y='MntFishProducts', data=market, hue='kprototye_clusters', palette='Set2', ax=ax[1, 0])
_ = sns.scatterplot(x='Income', y='MntSweetProducts', data=market, hue='kprototye_clusters', palette='Set2',
                    ax=ax[1, 1])
_ = sns.scatterplot(x='Income', y='MntGoldProds', data=market, hue='kprototye_clusters', palette='Set2', ax=ax[1, 2])

_ = ax[0, 0].set_title('Amount of Wine Products Bought by Kprototye_Clusters')
_ = ax[0, 1].set_title('Amount of Fruits Bought by Kprototye_Clusters')
_ = ax[0, 2].set_title('Amount of Meat Products Bought by Kprototye_Clusters')
_ = ax[1, 0].set_title('Amount of Fish Products Bought by Kprototye_Clusters')
_ = ax[1, 1].set_title('Amount of Sweet Products Bought by Kprototye_Clusters')
_ = ax[1, 2].set_title('Amount of Gold Products Bought by Kprototye_Clusters')

# Observations

## Scatters
- **Cluster 3 (Gold Customers)** clearly earn the highest among all clusters, even though they are not the top buyers for some items (e.g., Fruits, Gold products).
- In almost every product category, the highest buyers are Gold or Premium customers.

### Suggestions:
1. **Retain Gold and Premium Customers:**
   - Continue serving their purchasing needs with exclusive deals.
   - For example, offer special promotions such as deals on high- or moderate-quality wine paired with cheese.

2. **Attract Lower-Tier Customers (Regular and Bronze):**
   - Consider expanding the product range to include a spectrum of quality levels (low, medium, and high) so that products are accessible to everyone.
   - Alternatively, create targeted deals to attract low- to middle-income customers.

## Grouped Table - Number of Purchases
- **Cluster 1** predominantly responds to deals.
  - It is advisable to monitor their purchasing behavior closely and adjust deals as their buying patterns change.
- Bronze, Gold, and Premium customers prefer online shopping.
  - To reach these groups, consider establishing email subscriptions, Instagram accounts (if not already in place), and other digital marketing channels.
  - Ensure that current customers are informed about deals, discounts, and promotions.
- **Premium and Gold customers** also tend to purchase through catalogs.
  - Since catalog buying involves ordering via email and shipping items to customers, offering shipping deals or free shipping could be very attractive.
  - Additionally, ensure that items are delivered on time.
- In-store purchases show similar trends, so the same suggestions could be applied.

## Website Visits and Online Shopping Insights
- **Regular Customers:**
  - They visit the website frequently, yet they do not always complete purchases.
- **Premium and Gold Customers:**
  - They tend to visit the website less, likely because they know what they want and spend less time online.
- Regular customers, despite their high web traffic, are predominantly buying wine and meat products, which suggests potential for improving conversion rates for these segments.


# KNN
In this section, you will implement the kNN method using the dataset and predict the customers' education level.

Remember to also write a short response to the analysis section after you finish implementing kNN.
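
As a small illustration of the voting step in kNN classification (not the full algorithm), the sketch below picks the most frequent label among a sample's nearest neighbors. The helper name `majority_vote` and the example labels are hypothetical.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most frequent label among the k nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Hypothetical labels of the 5 nearest training samples for one test customer
print(majority_vote(["PhD", "Master", "PhD", "Graduation", "PhD"]))
```

With `Counter`, ties go to the label that was seen first, so an odd `k` or a distance-weighted vote is a common way to make the outcome less arbitrary.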

In [None]:
from collections import Counter
class KNN:
    def __init__(self, k=5):
        """
        Parameters:
            k: Number of nearest neighbors to consider (stored as self.n_neighbors).
        """
        self.n_neighbors = k
        self.X_train = None
        self.Y_train = None

    def fit(self, X, Y):
        """
        Store the training data.

        Parameters:
            X: numpy array or DataFrame, shape (n_samples, n_features)
            Y: numpy array or Series, shape (n_samples,)
        """
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(Y, pd.Series):
            Y = Y.values

        self.X_train = X
        self.Y_train = Y
        return self

    def _compute_distances(self, X):
        """
        Compute the Euclidean distance between X and training samples.

        Parameters:
            X: numpy array, shape (n_samples, n_features)

        Returns:
            Distances: numpy array, shape (n_samples, n_train_samples)
        """
        # 2 Points
        # TODO ===== YOUR CODE HERE =====
        return 
        # TODO ==========================


    def predict(self, X):
        """
        Predict class labels for the given input.

        Parameters:
            X: numpy array or DataFrame, shape (n_samples, n_features)

        Returns:
            Predicted labels: numpy array, shape (n_samples,)
        """
        if isinstance(X, pd.DataFrame):
            X = X.values
        # 2 Points
        # TODO ===== YOUR CODE HERE =====

        # TODO ==========================
        return predictions

    def score(self, X, Y):
        """
        Compute accuracy of the classifier.

        Parameters:
            X: numpy array or DataFrame, shape (n_samples, n_features)
            Y: numpy array or Series, shape (n_samples,)

        Returns:
            Accuracy score (float)
        """
        if isinstance(Y, pd.Series):
            Y = Y.values

        predictions = self.predict(X)
        return np.mean(predictions == Y)

    def fit_predict(self, X, Y):
        """
        Fit the data and return predictions.
        """
        self.fit(X, Y)
        return self.predict(X)


In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

scal = StandardScaler()  # Using your custom StandardScaler
le = LabelEncoder()

knn_market = market.copy()

exclude_c = [
    'Education',
    'Marital_Status',
    'AcceptedCmp3',
    'AcceptedCmp4',
    'AcceptedCmp5',
    'AcceptedCmp1',
    'AcceptedCmp2',
    'Complain',
    'Z_CostContact',
    'Z_Revenue',
    'Response',
    'kprototye_clusters',
    'kmeans_clusters'
]

# Encode the target variable (Education)
knn_market['Education'] = le.fit_transform(knn_market['Education'])
label_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
inverse_label_mapping = {v: k for k, v in label_mapping.items()}  # Reverse mapping

# Build the feature matrix: drop the target (Education) and the other categorical or cluster-label columns
X = knn_market.drop(columns=exclude_c)
Y = knn_market['Education']  # Target variable

# Every remaining column is a numeric feature to standardize
cols_to_scale = list(X.columns)

# Standardize the numeric features using the custom StandardScaler
X[cols_to_scale] = scal.fit_transform(X[cols_to_scale])

# Split into train and test sets (80% train, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Display dataset shapes
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"Y_train shape: {Y_train.shape}, Y_test shape: {Y_test.shape}")


In [None]:
# Instantiate the kNN model
# 2 Points
# TODO ===== YOUR CODE HERE =====

# TODO ==========================

# Train the model on the training data
# 2 Points
# TODO ===== YOUR CODE HERE =====

# TODO ==========================

# Compute accuracy
# 2 Points
# TODO ===== YOUR CODE HERE =====

# TODO ==========================
print(f"Prediction Accuracy: {accuracy:.4f}")


In [None]:
from sklearn.decomposition import PCA

# Fit PCA on the training set, then project the test set to 2D for visualization
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)  # Fit on train set
X_test_pca = pca.transform(X_test)  # Transform test set

import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot with actual labels
plt.figure(figsize=(10, 6))
Y_test_labels = np.array([inverse_label_mapping[label] for label in Y_test])
sns.scatterplot(x=X_test_pca[:, 0], y=X_test_pca[:, 1], hue=Y_test_labels, palette='coolwarm', edgecolor='k')

# Labels and title
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Visualization of Testing Set with Correct Education Labels')
plt.legend(title="Education Level", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()


# Analysis (2 Points)
#### In **3-4 sentences**, describe the performance of kNN in this situation and why it faces issues when encountering high dimensionality.