# Credit Approval Data Clustering with k-means, k-modes, and k-prototypes

*Objective:*
This assignment involves clustering the Credit Approval Dataset using k-means, k-modes, and k-prototypes algorithms, implemented from scratch. The goal is to group credit applicants into clusters based on numeric, categorical, and mixed attributes. Each algorithm will be applied to the dataset to understand how clustering performance is influenced by the different types of features.

I deliberately chose an algorithm where we have the target variable available. So we can compare the performance of the algorithms.

[Here](https://archive.ics.uci.edu/dataset/27/credit+approval) is a detailed look at the data

## Step 1: Data Loading and Preprocessing

Objective: Load the dataset, handle missing values, and encode categorical features manually for use in the k-means, k-modes, and k-prototypes algorithms.

As you can see no column names are provided. So you have no idea what these variables stand for. Here is the information that is provided:

    A1:	b, a.
    A2:	continuous.
    A3:	continuous.
    A4:	u, y, l, t.
    A5:	g, p, gg.
    A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
    A7:	v, h, bb, j, n, z, dd, ff, o.
    A8:	continuous.
    A9:	t, f.
    A10:	t, f.
    A11:	continuous.
    A12:	t, f.
    A13:	g, p, s.
    A14:	continuous.
    A15:	continuous.
    A16: +,-         (class attribute)
  

Your task is:

1.  Read in the dataset, label the columns appropriately, make sure you label the target column as "target" to avoid any confusion.
2. Deal with the missing values. For this dataset, the missing values are encoded as "?". You can tell `pd.read_csv()` how missing values are encoded using the `na_values=` argument. Then it will read the missing values correctly. To keep things simple, you should drop the rows with missing values since there are not that many of them.



In [114]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
pd.options.mode.copy_on_write = True
np.set_printoptions(suppress=True,precision=4)

In [115]:
# Load the Credit Approval Dataset using file import
from google.colab import files
uploaded = files.upload()

In [116]:
credit = pd.read_csv('/content/crx.data')

In [117]:
# add the column names
credit.columns = ["A" + str(x) for x in range(1, 17)]
credit.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+


In [118]:
# fill code to drop missing rows
credit = credit.dropna()

In [119]:
credit.replace('?', np.nan, inplace=True)
credit['A2'] = credit['A2'].astype(float)
credit['A14'] = credit['A14'].str.strip().astype(float)

In [120]:
# lets reset the index since we dropped some rows
credit.reset_index(inplace=True, drop=True)

In [121]:
# fill code to check missing values
credit.tail()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
684,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260.0,0,-
685,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200.0,394,-
686,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200.0,1,-
687,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280.0,750,-
688,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0.0,0,-


In [122]:
# check number of missing values in credit
credit.isnull().sum()

Unnamed: 0,0
A1,12
A2,12
A3,0
A4,6
A5,6
A6,9
A7,9
A8,0
A9,0
A10,0


In [123]:
credit = credit.dropna()

## Step 2: Basic Data Exploration [5 points]

Use the methods and techniques we learned in class and previous assignments to do some basic EDA. This should involve some summary tables, some charts, and some insights that you gather from the exploration.

In [124]:
# lets look at the continuous variables, describe them
credit_numeric = credit[['A2', 'A3', 'A8', 'A11', 'A14', 'A15']]
credit_numeric.describe()

Unnamed: 0,A2,A3,A8,A11,A14,A15
count,652.0,652.0,652.0,652.0,652.0,652.0
mean,31.504847,4.83694,2.245821,2.504601,180.326687,1015.315951
std,11.847327,5.027369,3.373483,4.971962,168.423883,5257.161359
min,13.75,0.0,0.0,0.0,0.0,0.0
25%,22.58,1.04,0.165,0.0,72.25,0.0
50%,28.375,2.855,1.0,0.0,160.0,5.0
75%,38.25,7.5,2.625,3.0,272.0,400.0
max,76.75,28.0,28.5,67.0,2000.0,100000.0


In [125]:
# rename the 'A16' columne to target
credit.rename(columns={'A16': 'target'}, inplace=True)
credit.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,target
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360.0,0,+


In [126]:
# Map the target variable so that it is 0/1
credit['target'] = credit['target'].map({'+': 1, '-': 0})
credit.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,target
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,1
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360.0,0,1


In [127]:
# Find the correlations between the numeric variables
correlation_matrix = credit_numeric.corr()
correlation_matrix

Unnamed: 0,A2,A3,A8,A11,A14,A15
A2,1.0,0.217752,0.41765,0.198141,-0.084608,0.029062
A3,0.217752,1.0,0.300398,0.269598,-0.217043,0.119557
A8,0.41765,0.300398,1.0,0.327233,-0.064728,0.052076
A11,0.198141,0.269598,0.327233,1.0,-0.116051,0.058324
A14,-0.084608,-0.217043,-0.064728,-0.116051,1.0,0.073425
A15,0.029062,0.119557,0.052076,0.058324,0.073425,1.0


In [128]:
credit.head()


Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,target
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,1
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360.0,0,1


### The scale of the variables seems to be very different. We will need to do some scaling.




In [129]:
# fill in code to standardize the numeric variables
num_features = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']

for feature in num_features:
    mean = credit[feature].mean()
    std_dev = credit[feature].std()
    credit[feature] = (credit[feature] - mean) / std_dev

credit.head()


Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,target
0,a,2.292935,-0.074978,u,g,q,h,0.235418,t,t,0.703022,f,g,-0.815364,-0.086609,1
1,a,-0.59126,-0.862666,u,g,q,h,-0.221083,t,f,-0.503745,f,g,0.5918,-0.036391,1
2,b,-0.310184,-0.655798,u,g,w,v,0.445883,t,t,0.501894,t,g,-0.476932,-0.192559,1
3,b,-0.956743,0.156754,u,g,w,v,-0.158833,t,f,-0.503745,f,s,-0.358184,-0.19313,1
4,b,0.048547,-0.166477,u,g,m,v,0.075346,t,f,-0.503745,t,g,1.066792,-0.19313,1


In [130]:
credit.describe()

Unnamed: 0,A2,A3,A8,A11,A14,A15,target
count,652.0,652.0,652.0,652.0,652.0,652.0,652.0
mean,8.718316e-17,6.538737e-17,-2.179579e-17,-2.179579e-17,0.0,0.0,0.452454
std,1.0,1.0,1.0,1.0,1.0,1.0,0.498116
min,-1.498637,-0.9621215,-0.6657275,-0.503745,-1.070672,-0.19313,0.0
25%,-0.7533216,-0.7552539,-0.6168167,-0.503745,-0.641695,-0.19313,0.0
50%,-0.2641817,-0.3942301,-0.369298,-0.503745,-0.120688,-0.192179,0.0
75%,0.5693397,0.5297124,0.1124,0.09963848,0.544301,-0.117043,1.0
max,3.819018,4.607392,7.782514,12.97182,10.804129,18.828542,1.0


Do any other EDA that you think will help you understand the data better

## Step 3: Implementing K-Means Algorithm [20 points]

### Now lets implement the Euclidean distance [3 points]

In [131]:
def euclidean_distance(x: np.array, y: np.array) -> float:
    """Calculates the Euclidean distance between two points.

    Args:
        x (numpy.ndarray): The first point.
        y (numpy.ndarray): The second point.

    Returns:
        float: The Euclidean distance between the two points.
    """
    # fill code here
    return np.sqrt(np.sum((x - y) ** 2))

### Initialize the centroids [2 points]

At first you just pick initial random points as your centroids. Write code to do that below

In [132]:
def initialize_centroids(
    X: pd.DataFrame, k: int, random_state: int = 12
) -> pd.DataFrame:
    """Randomly selects k initial centroids from the dataset.

    Args:
        X: The dataset from which to select centroids.
        k (int): The number of centroids to initialize.

    Returns:
        pd.DataFrame The initial centroids.
    """
    # set seed to reproduce results.
    np.random.seed(random_state)
    # Fill code here
    initial_centroids = X.sample(n=k).reset_index(drop=True)
    return initial_centroids

### Assign clusters to each point (each person in the data) [5 points]

Now that we have chosen initial points, lets write a function that takes in the dataset, and for each person (row) in the dataframe, it assigns a cluster.

In [133]:
def assign_clusters(X: pd.DataFrame, centroids: pd.DataFrame) -> pd.DataFrame:
    """Calculates each data point's distance all centroids based on Euclidean distance.
    Assumes that the first row in centroids is the 0th cluster, the second row
    is the 1st cluster, and so on

    Args:
        X (pd.DataFrame): The dataset where each row is a data point.
        centroids (pd.DataFrame): The current centroids.

    Returns:
        pd.DataFrame: X with the appended columns of cluster to which it belongs
    """
    # lets first convert them into numpy arrays, this will make calculations faster
    # if not set(X.columns).issubset(set(centroids.columns)):
    #     centroids = centroids[X.columns]  # Filter to ensure matching columns
    # X_numpy = X.values
    # centroids_numpy = centroids.values

    # make sure to return the dataframe with a new column called "cluster"
    # Check if 'cluster' column exists, if so, drop it
    if 'cluster' in X.columns:
        X = X.drop(columns=['cluster'])

    # Create a copy of X to avoid altering the original DataFrame
    X_copy = X.copy()
    X_numpy = X_copy.values
    centroids_numpy = centroids.values
    clusters = []

    # Calculate distances and assign clusters
    for point in X_numpy:
        distances = [euclidean_distance(point, centroid) for centroid in centroids_numpy]
        clusters.append(np.argmin(distances))

    # Add the 'cluster' column to the DataFrame
    X_copy['cluster'] = clusters
    return X_copy

### Update the centroids based on the clusters formed [5 points]

The new centroid is just the mean of all the points in that cluster. Note that this new centroid may not be a row in the dataframe, but some new values for all your features that don't exist in the credit dataframe

In [134]:
def update_centroids(X: pd.DataFrame, k: int) -> pd.DataFrame:
    """Updates the centroids based on the mean of the points assigned to each cluster.

    Args:
        X (pd.DataFrame): The dataset where each row is a data point.
        k (int): The number of clusters.

    Returns:
        pd.DataFrame: The updated centroids with the right columnnames.
    """
    # fill code here
    return X.groupby('cluster').mean()

### Run the K-Means algorithmn [5 points]

In [135]:
def k_means(X: pd.DataFrame, k: int, max_iters: int = 100) -> tuple:
    """Performs k-means clustering.

    Args:
        X (pd.DataFrame): The dataset where each row is a data point.
        k (int): The number of clusters.
        max_iters (int, optional): The maximum number of iterations. Defaults to 100.

    Returns:
        tuple: A tuple containing the final centroids and the cluster assignments.
    """
    # first initialize the centroids
    # centroids = initialize_centroids(X, k)
    # then write a loop. within that loop:
    # 1. assign_clusters
    # 2. update centroids
    # 3. Keep repeating until the centroids dont change OR you hit max_iters.
    # fill code here

    X_original = X.copy()
    centroids = initialize_centroids(X, k)

    for _ in range(max_iters):
        # Reset X to the original DataFrame
        X = X_original.copy()

        # Assign clusters based on current centroids
        X = assign_clusters(X, centroids)

        # Update centroids based on the new cluster assignments
        centroids = update_centroids(X, k)

        # Check for convergence (if centroids remain the same)
        if centroids.equals(update_centroids(X, k)):
            break

    return centroids, X

Since k-means works only on the numeric data, lets create a new dataframe that contains only numeric

In [136]:
# Lets run the k-means algorithm!
credit_numeric = credit[['A2', 'A3', 'A8', 'A11', 'A14', 'A15']]

In [137]:
centroids_formed, clusters_formed = k_means(credit_numeric, k=2)

Now we will check how well the algorithm performed. We have our original target variable, the source of truth, that we coded as 0/1.

We now have clustered our dataset into two different clusters. We have called them clusters 0 and 1, but these are not the same as the 0/1 from the target variable.

So check how accurate we are, we need to check if we have mistakenly called taret 0 as cluster 1. So I create a cluster2 variable, and then calculate accuracy score with both cluster, and cluster2

In [138]:
# lets add on to this the original target variable
clusters_formed['target'] = credit['target']
clusters_formed['cluster2'] = clusters_formed['cluster']
clusters_formed['cluster2'] = clusters_formed['cluster2'].map({0: 1, 1: 0})

In [139]:
accuracy_score(clusters_formed.target, clusters_formed.cluster)

0.6165644171779141

In [140]:
accuracy_score(clusters_formed.target, clusters_formed.cluster2)

0.3834355828220859

This does not look very accurate, and heavily imbalanced. Lets see if we can improve by adding categorical variables.

## Step 4: k-prototypes Clustering (Mixed Data) (25 points)

 Implement k-prototypes from scratch, using numeric and categorical columns. Remember, instead of clusters, we call them prototypes, and the centroid of a prototype has two components: the numeric component of means, and the categorical component of modes.

### Define a function to calculate simple matching dissimilarity [1 points]

In [141]:
def simple_matching_dissimilarity(x: np.array, y: np.array) -> float:
    """Calculates the simple matching dissimilarity between two categorical samples.

    Args:
        x (numpy.ndarray): the categorical features of the first sample.
        y (numpy.ndarray): the categorical features of the second sample.

    Returns:
        int: The number of mismatches between the two points.
    """
    # fill code here
    return np.sum(x != y)


### Define a function to calculate mixed dissimilarity [4 points]


In [142]:
def mixed_dissimilarity(
    x_num: np.array, y_num: np.array, x_cat: list, y_cat: list, gamma: float
) -> float:
    """Calculates the mixed dissimilarity between two sample with numeric and categorical features.
    This would be given by  num_diff + gamma * cat_diff

    Args:
        x_num (numpy.ndarray): The numeric part of the first sample.
        y_num (numpy.ndarray): The numeric part of the second sample.
        x_cat (numpy.ndarray): The categorical part of the first sample.
        y_cat (numpy.ndarray): The categorical part of the second sample.
        gamma (float): The weighting factor for categorical features.

    Returns:
        float: The combined dissimilarity score.
    """
    return euclidean_distance(x_num, y_num) + gamma * simple_matching_dissimilarity(x_cat, y_cat)

### Initialize the prototypes [2 point]

At first you just pick initial random points as your prototypes. Write code to do that below:


In [143]:
def initialize_centroids_mixed(
    X: pd.DataFrame, k: int, random_state: int = 12
) -> pd.DataFrame:
    """Randomly selects k initial centroids from the dataset.

    Args:
        X: The dataset from which to select centroids.
        k (int): The number of centroids to initialize.

    Returns:
        pd.DataFrame The initial centroids.
    """
    # set seed to reproduce results.
    np.random.seed(random_state)
    random_indices = np.random.choice(X.index, size=k, replace=False)
    centroids = X.loc[random_indices].reset_index(drop=True)
    return centroids


### Assign prototypes to each point (each person in the dataset) [6 points]

This function assigns each sample to the nearest prototype based on mixed dissimilarity.



In [144]:
def assign_prototypes_mixed(
    X: pd.DataFrame,
    cat_features: list,
    num_features: list,
    centroids: pd.DataFrame,
    gamma: float,
) -> pd.DataFrame:
  """Assigns each sample to the nearest prototype based on mixed dissimilarity.

  Args:
      X (pd.DataFrame): The dataset containing all your data
      cat_features (list): list of all categorical features
      num_features (list): list of all num_features features
      centroids (pd.DataFrame): A dataframe with k rows, corresponding to the
        centroid of each prototype
      gamma (float): The weighting factor for categorical features.

  Returns:
      pd.DataFrame: X with the appended columns of cluster to which it belongs
  """
  # Lets split the data into categorical and numerical numpy arrays
  X_num = X[num_features].values
  X_cat = X[cat_features].values

  # Fill code here
  centroids_num = centroids[num_features].values
  centroids_cat = centroids[cat_features].values
  clusters = []
  for i in range(len(X)):
      distances = [mixed_dissimilarity(X_num[i], centroids_num[j], X_cat[i], centroids_cat[j], gamma) for j in range(len(centroids))]
      clusters.append(np.argmin(distances))
  X = X.copy()
  X['prototype'] = clusters
  return X

### Update the centroids based on the prototypes formed [6 points]

The new centroid is just the mean of all the numeric featurse in that prototypes, and the mode of all the categorical features in the prototypes.

Note that this new centroid may not be a row in the dataframe, but some new values for all your features that don't exist in the credit dataframe

In [146]:
def update_centroids_mixed(
    X: pd.DataFrame, cat_features: list, num_features: list, k: int
) -> pd.DataFrame:
  """Updates the centroids for each prototype in k-prototypes.

  Args:
      X (pd.DataFrame): The dataset containing all your data
      cat_features (list): list of all categorical features
      num_features (list): list of all num_features features
      k (int): The number of prototypes.

  Returns:
      pd.DataFrame: The updated centroids with the right columnnames.
  """
  new_centroids_num = X.groupby('prototype')[num_features].mean()
  new_centroids_cat = X.groupby('prototype')[cat_features].agg(lambda x: x.mode()[0])

  new_centroids = pd.concat([new_centroids_num, new_centroids_cat], axis=1)
  return new_centroids.reset_index(drop=True)

### Implement the k-prototypes [6 points]

In [147]:
def k_prototypes(
    X: pd.DataFrame,
    cat_features: list,
    num_features: list,
    k: int,
    gamma: float,
    max_iters: int = 100,
) -> tuple:
    """Performs k-prototypes clustering.

    Args:
        X (numpy.ndarray): The numeric part of the dataset.
        k (int): The number of clusters.
        gamma (float): The weighting factor for categorical features.
        max_iters (int, optional): The maximum number of iterations. Defaults to 100.

    Returns:
        tuple: A tuple containing the final prototypes and the cluster assignments.
    """
    centroids = initialize_centroids_mixed(X, k)
    centroids = centroids[num_features + cat_features]

    for _ in range(max_iters):
        X = assign_prototypes_mixed(X, cat_features, num_features, centroids, gamma)
        updated_centroids = update_centroids_mixed(X, cat_features, num_features, k)

        # Use np.allclose for convergence check on numeric part
        if np.allclose(updated_centroids[num_features].values, centroids[num_features].values) and \
            updated_centroids[cat_features].equals(centroids[cat_features]):
            break

        centroids = updated_centroids

    return centroids, X

In [148]:
num_features = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
cat_features = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

In [149]:
X = credit.drop('target', axis=1)

In [150]:
initial_centroids = initialize_centroids_mixed(X, 2)

In [151]:
centroids, prototypes = k_prototypes(X, cat_features, num_features, 2, gamma=0.5)

In [152]:
# lets add on to this the original target variable
prototypes['target'] = credit['target']
prototypes['prototype2'] = prototypes['prototype']
prototypes['prototype2'] = prototypes['prototype2'].map({0: 1, 1: 0})

In [153]:
accuracy_score(prototypes.target, prototypes.prototype)

0.799079754601227

In [154]:
accuracy_score(prototypes.target, prototypes.prototype2)

0.200920245398773

*Wow*, we went up a lot in accuracy! Not too bad with a simple unsupervised learning algorithm.