# Credit Card Dataset for Clustering

In this project, we'll use the 'Credit Card Dataset for Clustering' provided by Kaggle.


Dataset description: This dataset was derived and simplified for learning purposes. It summarizes the usage behaviour of about 9,000 active credit card holders over a six-month period. The task is to develop a customer segmentation that can inform a marketing strategy.

➡️ Dataset link 

https://i.imgur.com/gAT5gVg.jpg

**Columns explanation:**

- CUST_ID: Identification of Credit Card holder (Categorical)
- BALANCE_FREQUENCY: How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from account 
- CASH_ADVANCE: Cash in advance given by the user
- CREDIT_LIMIT: Limit of Credit Card for user 
- PAYMENTS: Amount of Payment done by user 



### Instructions

- Import your data and perform a basic data exploration phase
- Perform the necessary data preparation steps (corrupted and missing values handling, encoding, outliers handling ...)
- Perform hierarchical clustering to identify the inherent groupings within your data. Then, plot the clusters. (Use only 2 features. For example, try to cluster the customer base with respect to 'PURCHASES' and 'CREDIT_LIMIT'.)
- Perform partitional clustering using the K-means algorithm. Then, plot the clusters
- Find the best k value and plot the clusters again.
- Interpret the results

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import necessary libraries
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
from sklearn.preprocessing import StandardScaler  
from sklearn.cluster import AgglomerativeClustering  
from scipy.cluster.hierarchy import dendrogram, linkage  

In [None]:
df = pd.read_csv("Credit_card_dataset.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df = df.dropna()
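
Dropping incomplete rows is the approach this notebook takes. As a hedged alternative, the sketch below compares it with median imputation, which keeps every customer. The `toy` frame and its values are made up for illustration; they are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (an assumption for illustration only)
toy = pd.DataFrame({
    "PURCHASES": [95.4, 0.0, 773.17, 1499.0],
    "CREDIT_LIMIT": [1000.0, np.nan, 7500.0, 7500.0],
})

# Option used in this notebook: drop every row that has a missing value
dropped = toy.dropna()

# Alternative: fill each numeric column with its own median
filled = toy.fillna(toy.median(numeric_only=True))

print(len(toy), len(dropped), len(filled))
```

Which option is preferable depends on how many rows are affected; `df.isnull().sum()` above gives that count for the real data.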

In [None]:
df.duplicated().sum()

In [None]:
df.info()

Apart from the identifier CUST_ID, which is not used as a feature, there are no categorical columns to encode.

In [None]:
# Extract the relevant features
X = df[["PURCHASES", "CREDIT_LIMIT"]]

In [None]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
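
To make the standardization step concrete, here is a minimal sketch on a made-up array (`X_toy` is hypothetical). It checks that `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so that PURCHASES and CREDIT_LIMIT contribute on a comparable scale to the distance computations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 plays the role of PURCHASES, column 1 of CREDIT_LIMIT
X_toy = np.array([[100.0, 1000.0],
                  [200.0, 3000.0],
                  [600.0, 8000.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```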

In [None]:
linked = linkage(X_scaled, method='ward')

In [None]:
# Create a new figure with a specific size for the dendrogram
plt.figure(figsize=(10, 7))

dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)

# Add a title to the plot to describe it as a dendrogram for agglomerative clustering
plt.title('Dendrogram for Agglomerative Clustering')

# Label the x-axis as 'Data Points' since the horizontal axis represents the individual data points or clusters
plt.xlabel('Data Points')

# Label the y-axis as 'Euclidean Distance' because the vertical axis represents the distance between merged clusters
plt.ylabel('Euclidean Distance')

# Display the plot
plt.show()
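
With several thousand leaves, a full dendrogram can be hard to read. The sketch below, run on synthetic blobs rather than the credit card data, shows two SciPy options that may help: truncating the tree to its last merges, and cutting it into a fixed number of flat clusters with `fcluster`. The data, the value `p=10` and the choice of 3 clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs (made-up data)
blob_centers = ([0, 0], [3, 3], [0, 4])
pts = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in blob_centers])

Z = linkage(pts, method="ward")

# Keep only the last 10 merged groups; no_plot=True returns the layout
# data without drawing. Drop no_plot to draw the truncated tree.
info = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")

print(len(info["ivl"]), np.unique(labels))
```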

In [None]:
agglom = AgglomerativeClustering(n_clusters=6, metric='euclidean', linkage='ward')

In [None]:
clusters = agglom.fit_predict(X_scaled)

In [None]:
df['Cluster'] = clusters

In [None]:
# Visualize the clusters using a scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['PURCHASES'], y=df['CREDIT_LIMIT'], hue=df['Cluster'], palette='Set1')
plt.title('Credit Card holder Segmentation based on Purchases and Credit Limit (Agglomerative Clustering)')
plt.xlabel('Purchases')
plt.ylabel('Credit Limit')
plt.legend()
plt.show()

### K-means Clustering

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler  # used below to standardize the features
from sklearn.cluster import KMeans

In [None]:
df = pd.read_csv("Credit_card_dataset.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df = df.dropna()

In [None]:
# Extract the relevant features
X = df[["PURCHASES", "CREDIT_LIMIT"]]

In [None]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
inertia = []  # List to store the inertia values for each number of clusters

for i in range(1, 11):  # Loop over cluster numbers from 1 to 10
    kmeans = KMeans(n_clusters=i, random_state=42)  # Initialize KMeans with the current number of clusters (i)
    kmeans.fit(X_scaled)  # Fit the KMeans model on the standardized data
    inertia.append(kmeans.inertia_)  # Append the inertia (sum of squared distances) to the list

# Plot the Elbow curve
plt.figure(figsize=(10, 6))  # Set the figure size for better visibility
plt.plot(range(1, 11), inertia, 'ro-')  # Plot the number of clusters against inertia with red markers and lines
# 'r': Specifies the color of the plot, in this case, red (r stands for red).
# 'o': Specifies the marker style, which in this case is a circle (o).
# '-': Specifies the line style, which in this case is a solid line (-).

plt.title('Elbow Method for Optimal Number of Clusters')  # Add a title to the plot
plt.xlabel('Number of Clusters')  # Label the x-axis as 'Number of Clusters'
plt.ylabel('Inertia')  # Label the y-axis as 'Inertia' (within-cluster sum of squares)
plt.show()  # Display the plot
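
The elbow on an inertia curve can be ambiguous. One complementary check, not part of the original notebook, is the silhouette score, where higher values suggest better-separated clusters. The sketch below runs on synthetic blobs (`X_demo` is made up); on the real data, `X_scaled` would take its place.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Three tight synthetic blobs (made-up data)
blob_centers = [[0, 0], [5, 5], [0, 5]]
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 2)) for c in blob_centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Candidate k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k)
```

On around 9,000 rows the score is slower to compute than inertia, since it relies on pairwise distances; the `sample_size` argument of `silhouette_score` can reduce that cost.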

In [None]:
# Fit K-Means with 6 clusters
kmeans = KMeans(n_clusters=6, random_state=42) 
kmeans.fit(X_scaled)

In [None]:
df['Cluster'] = kmeans.labels_

In [None]:
# Visualize the clusters using a scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['PURCHASES'], y=df['CREDIT_LIMIT'], hue=df['Cluster'], palette='Set1')
plt.title('Credit Card holder Segmentation based on Purchases and Credit Limit (K-Means Clustering)')
plt.xlabel('Purchases')
plt.ylabel('Credit Limit')
plt.legend()
plt.show()
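
The instructions also ask for an interpretation of the results. One possible starting point, sketched here on a tiny made-up frame, is to summarize each cluster in the original units: per-cluster means and sizes, plus the K-means centroids mapped back through the scaler. The `demo` values and the choice of 2 clusters are assumptions for illustration; in the notebook, `df`, `scaler` and `kmeans` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["PURCHASES", "CREDIT_LIMIT"]

# Made-up customers: three low spenders and three high spenders
demo = pd.DataFrame({
    "PURCHASES": [50.0, 80.0, 120.0, 4000.0, 4500.0, 5200.0],
    "CREDIT_LIMIT": [1000.0, 1200.0, 1500.0, 9000.0, 10000.0, 12000.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(demo[features])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)
demo["Cluster"] = km.labels_

# Per-cluster mean and size in the original units
summary = demo.groupby("Cluster")[features].agg(["mean", "count"])

# Centroids converted from standardized units back to the original units
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=features)

print(summary)
print(centers)
```

Reading the two tables side by side makes it possible to describe each segment in plain terms, for example low purchases with a low credit limit versus high purchases with a high credit limit.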