### **Problem Statement**
FundFusion does not yet have a sound strategy for offering the products that best match each segment of prospective customers it plans to acquire.

### **Objective**
Build a clustering model that reveals product-ownership patterns based on the demographics of customers who currently use FundFusion services, with a silhouette score > 0.7.
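
For reference, the silhouette score used as the target is the mean of the per-sample silhouette coefficients. The standard definition is sketched below; the symbols are introduced here for illustration and do not appear elsewhere in this notebook. For sample $i$, let $a(i)$ be the mean distance to the other members of its own cluster and $b(i)$ the mean distance to the members of the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad S = \frac{1}{n}\sum_{i=1}^{n} s(i)
$$

$S$ lies between $-1$ and $1$, so a target above 0.7 asks for clusters that are both compact and clearly separated.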

### **Available Variables**
The dataset provides the following variables:

1. GCIF: Unique customer identifier
2. Area: Customer location (Jakarta, Bogor, Bandung, Surabaya, Jogja, Solo)
3. Jalur_Pembukaan: Touch point through which the customer opened the product (Cabang [branch], Telemarketing, Aplikasi Digital [digital app], Internet Banking)
4. Vintage: Tenure as a customer (since account opening)
5. Usia: Customer age
6. Jenis_Kelamin: Gender, stored as a binary code (male / female)
7. Status_Perkawinan: Marital status: single (0), married (1), divorced (2), widowed (3)
8. Jumlah_Anak: Number of children (numeric)
9. Pendidikan: Highest education level: no formal education [0], SD (elementary) [1], SMP (junior high) [2], SMA (senior high) [3], Bachelor's [4], Master's [5], Doctorate [6]
10. Produk_Tabungan: Savings product ownership (Yes/1, No/0)
11. Produk_Deposito: Time deposit ownership (Yes/1, No/0)
12. Produk_Kartu_Kredit: Credit card ownership (Yes/1, No/0)
13. Produk_Kredit_Rumah: Home loan ownership (Yes/1, No/0)
14. Produk_Kredit_Kendaraan: Vehicle loan ownership (Yes/1, No/0)
15. Produk_Kredit_Dana_Tunai: Cash loan ownership (Yes/1, No/0)
16. Total_Kepemilikan_Produk: Number of products owned (sum of the product flags)
17. Pendapatan_Tahunan: Average annual income
18. Total_Relationship_Balance: Total customer assets as of the observation-month cutoff

### **Experiment**
Point of view:
1. Group customers by demographics, then look for product-ownership patterns.
2. Group customers by product ownership, then look for demographic patterns.

### **Import Package**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
pd.set_option('display.max_columns', None)

### **Data For Clustering**

In [None]:
path_1 = "clustering_data.csv"
data = pd.read_csv(path_1)
data.dtypes

### **Data Understanding**

In [None]:
data.groupby('Area')['Area'].count()

In [None]:
# Select only numeric columns for aggregation
numeric_cols = data.select_dtypes(include=[np.number]).columns
data.groupby('Area')[numeric_cols].mean()

In [None]:
data.groupby('Vintage')['Vintage'].count()

In [None]:
data.groupby('Vintage')[numeric_cols].mean()

In [None]:
data.groupby('Jalur_Pembukaan')['Jalur_Pembukaan'].count()

In [None]:
data.groupby('Jalur_Pembukaan')[numeric_cols].mean()

In [None]:
data.groupby('Status_Perkawinan')['Status_Perkawinan'].count()


In [None]:
data.groupby('Status_Perkawinan')[numeric_cols].mean()

### **Data Preparation**
Check for duplicate rows and missing values.

In [None]:
data.isnull().sum()

In [None]:
data = data.dropna()

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()
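
The checks above can also be gathered into a single report. The following is a minimal sketch on a hypothetical toy frame (not the FundFusion data); `quality_report` is a helper name introduced here for illustration. Unlike the cells above, which only count duplicates, the sketch also drops them, which is an assumption about the desired behavior.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.Series:
    """Summarise row count, rows with missing values, and duplicated rows."""
    return pd.Series({
        "rows": len(df),
        "rows_with_missing": int(df.isnull().any(axis=1).sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    })


# Hypothetical toy data: row 1 duplicates row 0, row 2 has a missing age.
toy = pd.DataFrame({
    "Usia": [25, 25, np.nan, 40],
    "Jumlah_Anak": [0, 0, 2, 1],
})

report_before = quality_report(toy)

# Drop incomplete rows first, then exact duplicates (assumed cleaning order).
clean = toy.dropna().drop_duplicates().reset_index(drop=True)
report_after = quality_report(clean)

print(report_before)
print(report_after)
```

Running the report before and after cleaning makes it easy to see how many rows each step removed.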

### Outlier Check

In [None]:
data.count()

In [None]:
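%pip install scipy

from scipy import stats

# Z-scores of the continuous columns; keep rows within 3 standard deviations
# on both sides of the mean (the absolute value also catches low outliers).
z_scores = stats.zscore(data[['Usia', 'Pendapatan_Tahunan', 'Total_Relationship_Balance']])
data = data[(np.abs(z_scores) < 3).all(axis=1)]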

In [None]:
data.count()
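
To illustrate why the sign of the z-score matters in this filter, here is a small sketch on synthetic data. The values are invented for illustration, and only the column name is borrowed from the dataset. A one-sided test `z < 3` keeps extreme low values, whereas a two-sided test on `|z|` removes both tails.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic income-like column (hypothetical values, not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Pendapatan_Tahunan": rng.normal(100.0, 10.0, size=200)})
toy.loc[0, "Pendapatan_Tahunan"] = 1000.0   # extreme high value
toy.loc[1, "Pendapatan_Tahunan"] = -800.0   # extreme low value

# Compute z-scores on a plain NumPy array so the result type is unambiguous.
z = np.asarray(stats.zscore(toy[["Pendapatan_Tahunan"]].to_numpy()))

one_sided = toy[(z < 3).all(axis=1)]          # only removes the high outlier
two_sided = toy[(np.abs(z) < 3).all(axis=1)]  # removes both outliers

print(len(toy), len(one_sided), len(two_sided))
```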

### Filtering Data Telemarketing Only

In [None]:
data0 = data[data['Jalur_Pembukaan'] == 'Telemarketing']
data0

In [None]:
data0 = data0.drop(columns=['GCIF', 'Jalur_Pembukaan']).reset_index(drop=True)

In [None]:
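# reset_index() without drop=True adds an 'index' column at position 0;
# the iloc slices used for the experiment split start at 1 to skip it.
data0 = data0.reset_index()
data0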

### **Experiment Dataset Split**
1. Experiment 0 = all variables
2. Experiment 1 = demographic variables only
3. Experiment 2 = financial variables only

In [None]:
data1 = data0.iloc[:, 1:8]
data1

In [None]:
data2 = data0.iloc[:, 8:17]
data2
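
The positional slices above depend on the exact column order of the CSV. As a hedged alternative, the same split can be expressed by column name. The sketch below assumes the columns match the variable list at the top of this notebook; `split_experiments` and the two constants are names introduced here for illustration.

```python
import pandas as pd

# Assumed grouping of the documented columns (hypothetical constants).
DEMOGRAPHIC_COLS = [
    "Area", "Vintage", "Usia", "Jenis_Kelamin",
    "Status_Perkawinan", "Jumlah_Anak", "Pendidikan",
]
FINANCIAL_COLS = [
    "Produk_Tabungan", "Produk_Deposito", "Produk_Kartu_Kredit",
    "Produk_Kredit_Rumah", "Produk_Kredit_Kendaraan", "Produk_Kredit_Dana_Tunai",
    "Total_Kepemilikan_Produk", "Pendapatan_Tahunan", "Total_Relationship_Balance",
]


def split_experiments(df: pd.DataFrame):
    """Return (demographic, financial) frames, failing loudly on missing columns."""
    missing = [c for c in DEMOGRAPHIC_COLS + FINANCIAL_COLS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[DEMOGRAPHIC_COLS].copy(), df[FINANCIAL_COLS].copy()


# Toy frame with the columns deliberately in a different order.
toy = pd.DataFrame({c: [0, 1] for c in ["index"] + FINANCIAL_COLS + DEMOGRAPHIC_COLS})
demo, fin = split_experiments(toy)
print(demo.shape, fin.shape)
```

Selecting by name keeps the split correct even if a column is added or reordered upstream.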

### **Encoding Categorical Data**

In [None]:
data1 = pd.get_dummies(data1, columns = ['Area', 'Jenis_Kelamin', 'Status_Perkawinan', 'Pendidikan', 'Vintage'])
data1
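
As a small illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a hypothetical three-row frame. Passing `dtype=int` is an optional choice made here so the indicator columns are 0/1 integers rather than booleans; the cell above uses the default.

```python
import pandas as pd

# Hypothetical toy rows using column names from the dataset.
toy = pd.DataFrame({
    "Area": ["Jakarta", "Bogor", "Jakarta"],
    "Status_Perkawinan": [0, 1, 0],
    "Usia": [30, 45, 52],
})

# Each listed column becomes one indicator column per distinct value;
# columns that are not listed (here Usia) pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Area", "Status_Perkawinan"], dtype=int)
print(encoded)
```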

### **Standardizing Numeric Data**

In [None]:
predname_num = data2.columns
predname_num

In [None]:
from sklearn.preprocessing import StandardScaler
pt = StandardScaler()
X_num = pd.DataFrame(pt.fit_transform(data2))
X_num.head()

In [None]:
X_num.columns = predname_num
X_num.head()
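
The scaling and column renaming above can be combined into one step. This is a minimal sketch on hypothetical values; it also carries the original index over, which matters if the scaled frame is later concatenated with other frames by index.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values with a non-default index.
toy = pd.DataFrame(
    {
        "Pendapatan_Tahunan": [50.0, 100.0, 150.0],
        "Total_Relationship_Balance": [1.0, 2.0, 6.0],
    },
    index=[10, 11, 12],
)

scaler = StandardScaler()
# fit_transform returns a NumPy array, so restore the labels explicitly.
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```

After scaling, every column has mean 0 and unit (population) standard deviation, so no single variable dominates the distance computation.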

### **Correlation Check**

In [None]:
corrtest1 = data1.corr().abs()
corrtest2 = X_num.corr().abs()

In [None]:
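# Keep only the upper triangle so each pair of columns is checked once
upper = corrtest1.where(np.triu(np.ones(corrtest1.shape), k=1).astype(bool))
upper1 = corrtest2.where(np.triu(np.ones(corrtest2.shape), k=1).astype(bool))

# Columns whose absolute correlation with an earlier column exceeds 0.7
to_drop = [column for column in upper.columns if any(upper[column] > 0.7)]
to_drop1 = [column for column in upper1.columns if any(upper1[column] > 0.7)]

data1 = data1.drop(to_drop, axis=1)
data2 = data2.drop(to_drop1, axis=1)
# X_num is the scaled copy of data2 that the models use, so drop the same columns there
X_num = X_num.drop(to_drop1, axis=1)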

In [None]:
data1

In [None]:
X_num
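
The correlation filter can be wrapped in a reusable function. The sketch below uses the same upper-triangle logic on a hypothetical toy frame; `highly_correlated` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return columns whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Mask everything except the strict upper triangle, so each pair is
    # inspected once and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data: b is an exact multiple of a, c is only weakly related to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

to_drop = highly_correlated(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Of each correlated pair, the later column is dropped and the earlier one is kept.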

In [None]:
# Combine the encoded demographic features with the scaled numeric features
data_combined = pd.concat([data1, X_num], axis=1, join='inner')
data_combined

### **Modeling & Evaluation**
The models are built with two algorithms:
1. K-Means
2. K-Medoids

Each run is evaluated with the silhouette score.


### **K-Means**

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
# Experiment 0: all variables
for n_clusters in range(3, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmeans.fit_predict(data_combined)
    silhouette_avg = silhouette_score(data_combined, cluster_labels)
    print(f"Silhouette score (K-Means) - {n_clusters}: {silhouette_avg}")
    data0[f'Clustering_KMeans_Exp0_{n_clusters}'] = cluster_labels

In [None]:
# Experiment 1: demographic variables
for n_clusters in range(3, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmeans.fit_predict(data1)
    silhouette_avg = silhouette_score(data1, cluster_labels)
    print(f"Silhouette score (K-Means) - {n_clusters}: {silhouette_avg}")
    data0[f'Clustering_KMeans_Exp1_{n_clusters}'] = cluster_labels

In [None]:
# Experiment 2: financial variables
for n_clusters in range(3, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmeans.fit_predict(X_num)
    silhouette_avg = silhouette_score(X_num, cluster_labels)
    print(f"Silhouette score (K-Means) - {n_clusters}: {silhouette_avg}")
    data0[f'Clustering_KMeans_Exp2_{n_clusters}'] = cluster_labels
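
Rather than reading the scores from printed lines, they can be collected into a table and the best `n_clusters` picked programmatically. The following is a sketch on synthetic blobs, not on the notebook's data. `kmeans_silhouette_table` is a hypothetical helper, and `n_init=10` is set explicitly here as an assumption, whereas the loops above rely on the library default.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def kmeans_silhouette_table(X, k_values, random_state=0) -> pd.DataFrame:
    """Fit K-Means for each k and return one row per k with its silhouette score."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        rows.append({"n_clusters": k, "silhouette": silhouette_score(X, labels)})
    return pd.DataFrame(rows)


# Three well-separated synthetic clusters (illustrative data only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = kmeans_silhouette_table(X, range(3, 6))
best = scores.loc[scores["silhouette"].idxmax()]
print(scores)
print("best n_clusters:", int(best["n_clusters"]))
```

The same table can then be compared directly against the 0.7 target from the objective.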

### **K-Medoids**

In [None]:
%pip install numpy==1.26.4 scikit-learn==1.5.0 scikit-learn-extra==0.3.0


In [None]:
from sklearn_extra.cluster import KMedoids


In [None]:
# Experiment 0: all variables
for n_clusters in range(3, 6):
    kmedoids = KMedoids(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmedoids.fit_predict(data_combined)
    silhouette_avg = silhouette_score(data_combined, cluster_labels)
    print(f"Silhouette score (K-Medoids) - {n_clusters}: {silhouette_avg}")
    data0[f'Clustering_KMedoids_Exp0_{n_clusters}'] = cluster_labels

In [None]:
# Experiment 1: demographic variables
for n_clusters in range(3, 6):
    kmedoids = KMedoids(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmedoids.fit_predict(data1)
    silhouette_avg = silhouette_score(data1, cluster_labels)
    print(f"Silhouette score (K-Medoids) - {n_clusters}: {silhouette_avg}")
    data0[f'Clustering_KMedoids_Exp1_{n_clusters}'] = cluster_labels

In [None]:
# Experiment 2: financial variables
for n_clusters in range(3, 6):
    kmedoids = KMedoids(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmedoids.fit_predict(X_num)
    silhouette_avg = silhouette_score(X_num, cluster_labels)
    print(f"Silhouette score (K-Medoids) - {n_clusters}: {silhouette_avg}")
    data0[f'Clustering_KMedoids_Exp2_{n_clusters}'] = cluster_labels

### **Results Analysis**

In [None]:
sns.scatterplot(data=data0, x='Usia', y='Total_Kepemilikan_Produk', hue='Clustering_KMeans_Exp1_3', palette='Set1')
plt.title('Clustering Results: Age vs. Number of Products Owned')
plt.show()

In [None]:
sns.scatterplot(data=data0, x='Total_Relationship_Balance', y='Total_Kepemilikan_Produk', hue='Clustering_KMeans_Exp1_3', palette='Set1')
plt.title('Clustering Results: Relationship Balance vs. Number of Products Owned')
plt.show()

In [None]:
# Mean of each profiling variable per cluster
profile_cols = ['Usia', 'Jumlah_Anak', 'Produk_Tabungan', 'Produk_Deposito', 'Produk_Kartu_Kredit',
                'Produk_Kredit_Rumah', 'Produk_Kredit_Kendaraan', 'Produk_Kredit_Dana_Tunai',
                'Total_Kepemilikan_Produk', 'Total_Relationship_Balance']
cluster_means = data0.groupby('Clustering_KMeans_Exp1_3')[profile_cols].mean()
print(cluster_means)

### **Group 0**

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==0][['Usia','Jumlah_Anak','Total_Kepemilikan_Produk','Total_Relationship_Balance']].describe(include="all")

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==0]['Area'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==0]['Vintage'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==0]['Pendidikan'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==0]['Jenis_Kelamin'].value_counts(normalize=True)

### **Group 1**

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==1][['Usia','Jumlah_Anak','Total_Kepemilikan_Produk','Total_Relationship_Balance']].describe(include="all")

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==1]['Area'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==1]['Vintage'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==1]['Pendidikan'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==1]['Jenis_Kelamin'].value_counts(normalize=True)

### **Group 2**

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==2][['Usia','Jumlah_Anak','Total_Kepemilikan_Produk','Total_Relationship_Balance']].describe(include="all")

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==2]['Area'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==2]['Vintage'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==2]['Pendidikan'].value_counts(normalize=True)

In [None]:
data0[data0['Clustering_KMeans_Exp1_3']==2]['Jenis_Kelamin'].value_counts(normalize=True)
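
The per-group cells above repeat the same queries for each cluster id, which makes copy-paste slips easy. A loop over `groupby` produces the same profiles for every cluster in one pass. The sketch below uses a hypothetical toy frame, and `profile_clusters` is a helper name introduced here for illustration.

```python
import pandas as pd


def profile_clusters(df, label_col, numeric_cols, categorical_cols):
    """Build describe() and normalized value_counts() summaries per cluster."""
    profiles = {}
    for cluster_id, group in df.groupby(label_col):
        profiles[cluster_id] = {
            "numeric": group[numeric_cols].describe(),
            "categorical": {
                col: group[col].value_counts(normalize=True)
                for col in categorical_cols
            },
        }
    return profiles


# Hypothetical toy data reusing column names from the notebook.
toy = pd.DataFrame({
    "Clustering_KMeans_Exp1_3": [0, 0, 1, 2, 2, 2],
    "Usia": [25, 35, 50, 41, 43, 45],
    "Area": ["Jakarta", "Bogor", "Solo", "Jogja", "Jogja", "Solo"],
})

profiles = profile_clusters(toy, "Clustering_KMeans_Exp1_3", ["Usia"], ["Area"])
for cluster_id, summary in profiles.items():
    print(f"Group {cluster_id}")
    print(summary["numeric"])
    print(summary["categorical"]["Area"])
```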