Perform and explain clustering on the mall_customer.csv dataset, following the requirements and steps below

DATA PREPROCESSING

1. Load the dataset, drop the features that are not needed, and tidy up the dataset

In [None]:
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from matplotlib import pyplot as plt
from itertools import combinations
from sklearn.decomposition import PCA
import seaborn as sns
%matplotlib inline

In [None]:
# load dataset
df = pd.read_csv('../../../../../Mall_Customers.csv')
df

In [None]:
# drop features that are not needed
df.drop(['CustomerID'], axis=1, inplace=True)

2. Check the data type of each feature and encode where necessary

In [None]:
# check the data type of each feature
df.info()

In [None]:
# encode the string-valued feature (Genre)
df_encode = df.copy()
df_encode['Genre'] = LabelEncoder().fit_transform(df_encode['Genre'])
df_encode.head()
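
LabelEncoder assigns integer codes in sorted order of the class labels, so it helps to inspect the mapping before interpreting the encoded column. The sketch below uses a small illustrative series in place of the real 'Genre' column; the `genre` and `mapping` names are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for the 'Genre' feature (illustrative values only).
genre = pd.Series(['Male', 'Female', 'Female', 'Male'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(genre)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

The same `encoder.classes_` lookup can be applied to the encoder fitted on the real data, provided the fitted encoder object is kept in a variable.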

3. Check for null values and duplicated rows

In [None]:
# check for null values
df_encode.isnull().sum()

In [None]:
# check for duplicated rows
df_encode.duplicated().sum()

4. Handle missing values and duplicated data if necessary
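
If the checks in step 3 report nulls or duplicates, one possible way to handle them is sketched below. It runs on a small made-up frame (the `toy` data is an assumption, not taken from the dataset) and simply drops the offending rows; imputing missing values would be an equally valid choice.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one duplicated row (illustrative data).
toy = pd.DataFrame({
    'Age': [19, 21, 21, np.nan, 35],
    'Annual Income (k$)': [15, 16, 16, 18, 20],
})

# Drop exact duplicate rows first, then drop rows that still contain a null.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```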

5. Check for outliers by visualizing each feature in a boxplot

In [None]:
df_boxplot = df_encode.plot.box(figsize=(10,5), showmeans=True, meanline=True, grid=True)

6. Handle outliers if necessary

In [None]:
# handling outliers
qnl = df_encode['Annual Income (k$)'].quantile(0.25)  # first quartile (Q1)
qnh = df_encode['Annual Income (k$)'].quantile(0.75)  # third quartile (Q3)
mqr = (qnh - qnl)  # interquartile range (IQR)

In [None]:
# cap values below the lower IQR fence at the fence itself
df_encode['Annual Income (k$)'] = df_encode['Annual Income (k$)'].mask(df_encode['Annual Income (k$)'] < qnl - 1.5 * mqr, qnl - 1.5 * mqr)
df_encode

In [None]:
df_new = df_encode.copy()
df_new = df_new[(df_encode['Annual Income (k$)'] > qnl - 1.5 * mqr) & (df_encode['Annual Income (k$)'] < qnh + 1.5 * mqr)]

df_boxplot = df_new.plot.box(figsize=(10,5), showmeans=True, meanline=True, grid=True)
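
The outlier rule used above is the usual 1.5 × IQR fence. As a minimal sketch of the same idea in reusable form, the helper below (`iqr_bounds` is a hypothetical name, and the income values are made up) computes both fences and filters a series with them.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative income values; the last one is far from the rest.
income = pd.Series([15, 20, 40, 60, 62, 70, 78, 137])

lower, upper = iqr_bounds(income)
# Keep only the values strictly inside the fences.
inside = income[(income > lower) & (income < upper)]
print(lower, upper)
print(inside.tolist())
```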

7. Visualize the correlation between features with a heatmap

In [None]:
# feature correlation
corr = df_new.corr()
corr

In [None]:
sns.heatmap(corr, annot=True, cmap='RdYlGn', linewidths=0.2, annot_kws={'size':8}, fmt='.2f')

8. Normalize the data using a normalization method of your choice (standard or min-max)


In [None]:
# minmax scaler
scaler = MinMaxScaler()
df_columns = list(df_new.columns)

# work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
df_new = df_new.copy()
df_new[df_columns] = scaler.fit_transform(df_new[df_columns])
df_new
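
Step 8 allows either min-max or standard scaling. The sketch below compares the two on a small made-up array (the values in `X` are assumptions for illustration): min-max maps each column to [0, 1] via (x - min) / (max - min), while standard scaling centers each column at 0 with unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative two-column data (not taken from the dataset).
X = np.array([[15.0, 39.0], [16.0, 81.0], [17.0, 6.0], [18.0, 77.0]])

# Min-max scaling, by the library and by the formula (x - min) / (max - min).
X_minmax = MinMaxScaler().fit_transform(X)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard scaling: subtract the column mean, divide by the column std.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```

Because K-means relies on Euclidean distance, either scaler keeps a feature with a large numeric range from dominating the distance computation.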

9. Show scatter plots before clustering for combinations of the 4 attributes

In [None]:
# one scatter plot per pair of features
for x_col, y_col in combinations(df_new.columns, 2):
    plt.scatter(df_new[x_col], df_new[y_col])
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.show()

MODELLING

10. Cluster the dataset with K = 3, 4, and 5 using the attributes 'Annual Income (k$)' and 'Spending Score (1-100)'

11. Show scatter plots after clustering, with the centroid of each cluster

In [None]:
data_model = df_new.copy()
columns = data_model[['Annual Income (k$)', 'Spending Score (1-100)']]

for x in [3,4,5]:
    # a fixed random_state and explicit n_init keep the clusters reproducible between runs
    kmeans = KMeans(n_clusters=x, n_init=10, random_state=0)
    predict = kmeans.fit_predict(columns)
    plt.scatter(columns['Annual Income (k$)'], columns['Spending Score (1-100)'], c=predict)
    plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='red', s=200, label='centroid')
    plt.title(f'K = {x}')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend()
    plt.show()

12. Determine which K is best based on visual analysis
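
Visual inspection can be cross-checked with a numeric score. The sketch below is not part of the original steps: it uses `silhouette_score` from scikit-learn on synthetic blobs (the four centers are an assumption chosen so the answer is unambiguous) to show how the candidate values of K could be compared.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative, not the mall data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate K; a higher silhouette means tighter, better-separated clusters.
scores = {}
for k in [3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the real data, the same loop could be run on the scaled 'Annual Income (k$)' and 'Spending Score (1-100)' columns and compared against the scatter plots from step 11.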

OPTIONAL

13. Implement the elbow method to find the best K

In [None]:
# elbow method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data_model)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', color='red')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # within cluster sum of squares
plt.show()
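
The WCSS plotted above is what scikit-learn exposes as `inertia_`: the sum of squared distances from each sample to the centroid of its assigned cluster. As a small check of that definition, the sketch below recomputes it by hand on synthetic blobs (the data is an assumption for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled features.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each sample, then sum the squared distances.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((X - assigned_centroids) ** 2).sum()

print(wcss_manual, kmeans.inertia_)
```

WCSS always shrinks as K grows, which is why the elbow method looks for the point where the decrease flattens rather than for the minimum.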

14. Apply PCA for feature reduction, from all features down to 2 features

In [None]:
# pca feature reduction
pca = PCA(n_components=2)
pca_data = pd.DataFrame(pca.fit_transform(data_model))
pca_data.columns = ['PC1', 'PC2']
pca_data

In [None]:
plt.scatter(pca_data['PC1'], pca_data['PC2'])
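
When reducing to two components, it is worth checking how much of the original variance they retain. The sketch below reads `explained_variance_ratio_` from a fitted PCA; it runs on random data with four columns (an assumption standing in for the four scaled features), so the actual ratios for the mall data will differ.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a scaled dataset with 4 features.
rng = np.random.default_rng(0)
X = rng.random((100, 4))

pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each component, in descending order.
ratio = pca.explained_variance_ratio_
print(ratio)
print(ratio.sum())
```

A low total suggests that clusters seen in the PC1/PC2 scatter plot may not reflect all of the structure in the full feature space.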

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(pca_data)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', color='red')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # within cluster sum of squares

In [None]:
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
predict = kmeans.fit_predict(pca_data)
plt.scatter(pca_data['PC1'], pca_data['PC2'], c=predict)

15. Create a 3D scatter plot visualization after clustering

In [None]:
# 3D scatter plot
pca = PCA(n_components=3)
pca_data = pd.DataFrame(pca.fit_transform(data_model))
pca_data.columns = ['PC1', 'PC2', 'PC3']

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
predict = kmeans.fit_predict(pca_data)

plt.figure(figsize=(20,10))
ax = plt.axes(projection='3d')
ax.scatter(pca_data['PC1'], pca_data['PC2'], pca_data['PC3'], c=predict)

centroid = kmeans.cluster_centers_
ax.scatter(centroid[:,0], centroid[:,1], centroid[:,2], c='red', s=200, label='centroid')
ax.view_init(30, 225)

# the axes are principal components, not the original features
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')

plt.legend()
