In [None]:
#EX 1
#a)
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv('accounts.csv', delimiter=',')
data.head()
X_raw = data.iloc[:, :8]
#Removing duplicates and null values
X_raw = X_raw.drop_duplicates()
X_raw = X_raw.dropna()
X = pd.get_dummies(X_raw, drop_first=True) # Convert categorical variables to dummy/indicator variables
k_values = list(range(2, 9))  # candidate cluster counts: 2..8
MAX_ITER = 500
RANDOM_STATE = 42

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

sse = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, max_iter=MAX_ITER, random_state=RANDOM_STATE)
    kmeans.fit(X_scaled)
    sse.append(kmeans.inertia_)
    print(f"k={k}: inertia={kmeans.inertia_:.2f}")

plt.figure(figsize=(10, 6))
plt.plot(k_values, sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.title('SSE vs Number of Clusters')
plt.grid(True)
plt.show()



### Explanation ex1 b)
Based on the SSE (Sum of Squared Errors) vs. Number of Clusters (k) plot, we should look for the point where the rate of SSE reduction starts to slow down. This point suggests the optimal number of clusters, as adding more clusters beyond this point does not significantly improve model performance but increases its complexity.

As the plot shows, SSE drops by 1572 when moving from k = 2 to k = 3 and by a further 2197 from k = 3 to k = 4, so each of these transitions substantially improves clustering quality. From k = 4 onwards, the decrease slows down: it is at most 865 (from k = 4 to k = 5) and keeps shrinking gradually for higher k values, without abrupt changes.

Using the elbow method, we conclude that the elbow is at k = 4, where the rate of SSE decrease changes abruptly. Beyond this value, adding more clusters yields little gain relative to the added complexity. k = 4 therefore offers a good balance between reducing inertia and avoiding unnecessary complexity, which could otherwise lead to overfitting.
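As a rough cross-check of the visual reading, the elbow can also be located numerically by comparing consecutive SSE decreases. The sketch below is only illustrative: the helper names are hypothetical, and the absolute values in `example_sse` are made up so that the first three decreases match the 1572, 2197, and 865 quoted above; the remaining values are placeholders, not results from the notebook.

```python
import numpy as np

def sse_decreases(sse):
    """Decrease in SSE between each pair of consecutive k values."""
    sse = np.asarray(sse, dtype=float)
    return sse[:-1] - sse[1:]

def elbow_k(k_values, sse):
    """Return the k at which the SSE decrease slows down the most."""
    drops = sse_decreases(sse)
    # slowdown[i] compares the drop into k_values[i + 1] with the drop out of it
    slowdown = drops[:-1] - drops[1:]
    return k_values[int(np.argmax(slowdown)) + 1]

# Hypothetical SSE curve (NOT the notebook's output); only the first three
# differences are taken from the explanation above.
example_k = [2, 3, 4, 5, 6, 7, 8]
example_sse = [10000, 8428, 6231, 5366, 4700, 4100, 3600]

drops = sse_decreases(example_sse)
best_k = elbow_k(example_k, example_sse)
print("decreases:", drops)
print("elbow at k =", best_k)
```

In the notebook itself, the same helpers could be called with the `k_values` and `sse` lists built in the loop of exercise 1 a).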

### Explanation ex1 c)

When using k-means, the algorithm is based on calculating distances between points in multidimensional space, usually using Euclidean distance. This works well for numerical variables, where differences between values have real meaning (e.g., the difference between 30 and 50 years). However, for categorical variables (such as "job type" or "education level"), using k-means can be problematic because it requires transforming these categories into numerical values, for example, through get_dummies(), which increases dimensionality.

Moreover, calculating means (used by k-means to update centroids) does not make sense for categorical variables, because it assumes continuity in the data that does not exist in this case.

On the other hand, k-modes works with the mode of the categories, i.e., it selects the most frequent category in each cluster. This makes much more sense when dealing with categorical variables, as it does not require calculating means but rather directly compares categories within each feature. However, its calculation is not suitable for numerical variables since we lose the notion of proximity between values in Euclidean space.
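To make the contrast with k-means concrete, the following minimal sketch shows the two ingredients k-modes relies on: a simple matching dissimilarity (the number of features on which two records differ) and a per-feature mode acting as the cluster center. It is written from scratch for illustration and is not the API of any k-modes library; the function names and the toy records are assumptions.

```python
import pandas as pd

def matching_dissimilarity(a, b):
    """Count the features on which two categorical records differ."""
    return int(sum(x != y for x, y in zip(a, b)))

def cluster_mode(df):
    """Most frequent category per feature; the k-modes analogue of a centroid."""
    return df.mode().iloc[0]

# Toy cluster with hypothetical records (not taken from accounts.csv)
toy = pd.DataFrame({
    "job": ["retired", "retired", "student"],
    "education": ["primary", "secondary", "primary"],
})

center = cluster_mode(toy)
dists = [matching_dissimilarity(row, center)
         for row in toy.itertuples(index=False)]
print(center.to_dict())
print(dists)
```

Because the dissimilarity only checks equality, a numerical feature such as age would be treated as a set of unrelated labels, which is the loss of proximity described above.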

The subset we are analyzing consists mostly of categorical features, such as job, marital, and education, plus two numerical features (age and balance). On that basis alone, k-modes would be the more suitable clustering method, since it compares categories directly without distorting distances.
However, since we do not know the importance of each feature, we cannot disregard the numerical variables, whose ordering and magnitude k-modes would ignore.

Thus, it is probably better to keep k-means, encoding the categorical features with get_dummies() and normalizing all features, so that the numerical and dummy variables lie on a comparable scale in the distance calculations.
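The effect of combining get_dummies() with min-max normalization can be seen on a tiny example. This is a sketch under assumptions: the three rows below are hypothetical and only mimic the kind of columns mentioned above, and `dtype=float` is passed so the dummy columns are numeric from the start.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows (not taken from accounts.csv)
toy = pd.DataFrame({
    "age": [25, 40, 65],
    "balance": [-200, 1500, 30000],
    "marital": ["single", "married", "married"],
})

# One-hot encode the categorical column; drop_first removes one redundant dummy
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Rescale every column to [0, 1] so no single feature dominates the distances
scaled = MinMaxScaler().fit_transform(encoded)

print(list(encoded.columns))
print(np.round(scaled, 3))
```

Before scaling, the balance column spans tens of thousands of units while each dummy spans only one, so Euclidean distances would be driven almost entirely by balance.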

In conclusion, despite the greater computational complexity and the issues with k-means mentioned above, the drawbacks of k-modes may prove more detrimental for this dataset, so we cannot say that k-modes would be better than k-means.

In [None]:
#EX 2 a)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler_std = StandardScaler()
X_scaled_std = scaler_std.fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled_std)
X_pca = pca.transform(X_scaled_std)

total_variance_explained = pca.explained_variance_ratio_.sum()  # n_components=2, so this sums both ratios
percentage_variance_explained = total_variance_explained * 100
print(f"Total variance explained by top 2 components: {percentage_variance_explained:.2f}%")




In [None]:
# b)
from sklearn.cluster import KMeans
import seaborn as sns

k_means = KMeans(n_clusters=3, random_state=42)
assigned_clusters = k_means.fit_predict(X_scaled_std)

plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=assigned_clusters, palette='bright')
plt.title("K-means Clustering (k=3) on First 2 Principal Components")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Cluster")
plt.show()



### Explanation ex2 b)
Analyzing the plot, we notice a relative separation between the 3 clusters and a tendency for the points to shift to the right from one cluster to the next. However, in several regions points from different clusters almost overlap; that is, if we drew an ellipse around the points of each cluster, the ellipses would overlap. So, although some separation is visible, we cannot say that the clusters are clearly separated in this view. Note that the clusters were fit on all standardized features, while the plot shows only their projection onto two principal components, which together explain just 22.76% of the variance. This is far below the commonly cited rule of thumb of about 85% for retaining most of the data's structure when reducing dimensionality. We therefore conclude that two principal components are not sufficient to preserve the data's complexity, which explains why the clusters cannot be clearly separated in the plot.
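To see how many components a variance threshold such as 85% would require, one can inspect the cumulative explained variance. The sketch below assumes synthetic random data in place of the notebook's `X_scaled_std`, and `components_needed` is a hypothetical helper name; it illustrates the calculation only and says nothing about the actual dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_needed(X, threshold=0.85):
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    pca = PCA().fit(X)  # keep all components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # index of the first cumulative value >= threshold, converted to a count
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return n, cumulative

# Synthetic stand-in data: 200 samples, 10 uncorrelated features (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))

n_needed, cumulative = components_needed(X_demo)
print("components needed:", n_needed)
print("cumulative variance:", cumulative.round(3))
```

In the notebook, passing `X_scaled_std` instead of `X_demo` would show how far beyond two components the projection would need to go to reach the threshold.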


In [None]:
#2c)
import seaborn as sns
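# Read the categorical columns straight from X_raw (the rows the clusters were
# fit on) instead of overwriting X, which holds the encoded feature matrix.
df = pd.DataFrame({
     'job': X_raw['job'],
     'education': X_raw['education'],
     'cluster': assigned_clusters  # Cluster assignment from previous question
})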

# Plot for the "job" feature

sns.displot(data=df, x="job", hue="cluster",
            multiple="dodge", stat="density",
            shrink=0.8, common_norm=False, aspect=2)
plt.title('Density Plot of Job by Cluster', fontsize=14)
plt.xlabel("Job", fontsize=13)
plt.ylabel("Density", fontsize=13)
plt.xticks(rotation=45) 
plt.show()

# Plot for the "education" feature
sns.displot(data=df, x="education", hue="cluster",
            multiple="dodge", stat="density",
            shrink=0.8, common_norm=False, aspect=2)
plt.title('Density Plot of Education by Cluster', fontsize=14)
plt.xlabel("Education", fontsize=13)
plt.ylabel("Density", fontsize=13)
plt.xticks(rotation=45) 
plt.show()


### Explanation ex2 c)
By analyzing the job density plot by cluster, we notice that the 'retired' category appears exclusively in cluster 2 and that all observations in cluster 2 have the job category 'retired'. These are two separate observations: the first describes where 'retired' observations end up, the second describes what cluster 2 contains. Looking at the education density plot by cluster, we infer that observations with 'retired' as job predominantly have lower or intermediate education levels, i.e., primary or secondary education.
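The two readings of the job plot (how a cluster is composed versus where a job category ends up) correspond to two different normalizations of a contingency table. The following sketch uses `pd.crosstab` on a small hypothetical frame, not the notebook's `df`, to show the difference; the counts are invented for illustration.

```python
import pandas as pd

# Hypothetical assignments (not the notebook's results)
toy = pd.DataFrame({
    "job": ["retired", "retired", "student", "technician", "blue-collar", "retired"],
    "cluster": [2, 2, 1, 1, 0, 0],
})

# Share of each job within a cluster (columns sum to 1); this matches what the
# density plots with common_norm=False display
within_cluster = pd.crosstab(toy["job"], toy["cluster"], normalize="columns")

# Share of each cluster within a job (rows sum to 1)
within_job = pd.crosstab(toy["job"], toy["cluster"], normalize="index")

print(within_cluster.round(2))
print(within_job.round(2))
```

In this toy frame, cluster 2 consists entirely of 'retired' records, yet only two thirds of the 'retired' records fall in cluster 2, so the two statements are not interchangeable.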

Regarding the other clusters, they show a similar distribution in the frequencies of different jobs, with the main differences being that cluster 0 has a significantly higher density of 'blue-collar' workers compared to cluster 1, while cluster 1 has a significantly higher density of 'student' and 'technician' than cluster 0.
For the education plot, in clusters 0 and 1, there is a lower frequency of 'primary' education and a higher frequency of 'secondary' and 'tertiary' compared to cluster 2. However, cluster 0 has a higher density in 'secondary' compared to cluster 1, while cluster 1 has a higher density in 'tertiary'.
On the other hand, cluster 2 shows the highest density for 'primary' education.

In conclusion, observations with the job 'retired' are more likely to belong to cluster 2, which is characterized by lower or intermediate education (primary or secondary). The categories 'student' and 'technician' are more likely to belong to cluster 1, which has more advanced education (secondary or tertiary), and the 'blue-collar' category is more likely to belong to cluster 0, which has more intermediate education (mostly secondary, with a smaller difference between tertiary and primary).
It is worth noting that, except for cluster 2, the distinctions are not very pronounced. This is consistent with the previous exercise, where the clusters overlapped noticeably in the projection onto the first two principal components.