**Using Gaussian Mixture Clustering and PCA to Cluster the Credit Card Data**

This notebook walks through fitting a Gaussian mixture model, trained with expectation-maximization, to a credit card data set, and then validates the resulting clusters with the silhouette score and the Davies-Bouldin index.
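
As a quick orientation before the real data is loaded, the sketch below fits a two-component mixture to synthetic data. The blobs, the seed, and the names `X_demo` and `gmm_demo` are assumptions made for illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two well-separated 2-D blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
])

# fit() runs expectation-maximization until the gain in the likelihood
# lower bound falls below `tol` or `max_iter` is reached
gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

hard_labels = gmm_demo.predict(X_demo)        # most likely component per point
soft_labels = gmm_demo.predict_proba(X_demo)  # per-component membership probabilities

print("converged:", gmm_demo.converged_, "after", gmm_demo.n_iter_, "iterations")
print("component means:\n", gmm_demo.means_)
```

Unlike k-means, the mixture model also returns soft assignments (`predict_proba`), which can be useful for customers that sit between two segments.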

## Importing the data and necessary libraries 

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from pandas import DataFrame 
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture 
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
df = pd.read_csv('../input/ccdata/CC GENERAL.csv')

In [None]:
df.head()

## Preprocessing the data and applying PCA

1. Drop `CUST_ID`, as it carries no useful information for clustering, and back-fill missing values.
2. Standardise and then normalise the data, so that no attribute dominates the clustering simply because of its scale.
3. Convert the scaled array to a DataFrame.
4. Apply PCA to reduce the data to 3 dimensions, then plot the result.
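
The two scaling steps do different things, which a toy matrix makes easy to see. The values below are hypothetical and chosen only to show that `StandardScaler` works per column while `normalize` works per row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy matrix (hypothetical values): the second column is on a much larger scale
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 5000.0],
                  [6.0, 3000.0]])

X_std = StandardScaler().fit_transform(X_toy)  # per column: zero mean, unit variance
X_unit = normalize(X_std)                      # per row: unit Euclidean (L2) norm

print("column means:", X_std.mean(axis=0))
print("column stds: ", X_std.std(axis=0))
print("row norms:   ", np.linalg.norm(X_unit, axis=1))
```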

In [None]:
df1 = df.drop('CUST_ID', axis = 1) 
# Back-fill missing values (fillna(method=...) is deprecated in recent pandas)
df1 = df1.bfill()

In [None]:
# Standardize data
scaler = StandardScaler() 
scaled_df = scaler.fit_transform(df1) 

In [None]:
# Normalize the standardized data (each row is rescaled to unit norm)
normalized_df = normalize(scaled_df)

In [None]:
# Converting the numpy array into a pandas DataFrame 
normalized_df = pd.DataFrame(normalized_df) 

In [None]:
# Reducing the dimensions of the data 
pca = PCA(n_components = 3) 
pcadf = pca.fit_transform(normalized_df) 
pcadf = pd.DataFrame(pcadf) 
pcadf.columns = ['Principal Component 1', 'Principal Component 2', 'Principal Component 3'] 
  
pcadf.head(10)

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
scatter = ax.scatter(pcadf['Principal Component 1'],
                     pcadf['Principal Component 2'], 
                     c = pcadf['Principal Component 3'],
                     alpha=0.6)
plt.title('Plotting the 3-Dimensional data after PCA is applied', fontsize = 20)
plt.xlabel('Principal Component 1', fontsize = 15)
plt.ylabel('Principal Component 2', fontsize = 15)
plt.legend(*scatter.legend_elements(), loc="best", title="Principal\nComponent 3")
ax.grid()
plt.show()

In [None]:
pca.explained_variance_ratio_
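
To judge whether three components are enough, it can help to look at the cumulative explained variance. The sketch below uses a synthetic matrix as a stand-in (an assumption, not the credit card data); the same two lines at the end apply to the fitted `pca` object above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 samples, 6 features with decreasing spread
X_syn = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 2.0, 0.5, 0.3, 0.1])

pca_demo = PCA(n_components=3).fit(X_syn)

ratios = pca_demo.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(ratios)               # share retained by the first n components

print("per component:", ratios)
print("cumulative:   ", cumulative)
```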

## Elbow Method 

In [None]:
import seaborn as sns; sns.set()
from sklearn.cluster import KMeans

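sse = []
K = range(1,10)
for k in K:
    # Fixed seed and explicit n_init so the curve is reproducible
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeanModel.fit(df1)
    sse.append(kmeanModel.inertia_)  # within-cluster sum of squared distances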
plt.figure(figsize=(15,6))
plt.plot(K, sse, 'bx-')
plt.xlabel('Number of Clusters (k)', fontsize = 15)
plt.ylabel('Sum of Squared Error', fontsize = 15)
plt.title('The Elbow Method showing the optimal k \n(For data without PCA applied)', fontsize = 20)
plt.show()

Reading the elbow of the curve, we choose k = 3 here.

In [None]:
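# seaborn and KMeans are already imported in the previous elbow cell
sse = []
K = range(1,10)
for k in K:
    # Fixed seed and explicit n_init so the curve is reproducible
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeanModel.fit(pcadf)
    sse.append(kmeanModel.inertia_)  # within-cluster sum of squared distances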
plt.figure(figsize=(15,6))
plt.plot(K, sse, 'bx-')
plt.xlabel('Number of Clusters (k)', fontsize = 15)
plt.ylabel('Sum of Squared Error', fontsize = 15)
plt.title('The Elbow Method showing the optimal k \n(For data with PCA applied)', fontsize = 20)
plt.show()

We choose k = 3 here as well.
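
The elbow curves above are computed with k-means inertia. As a complementary check that is not part of the original analysis, the number of mixture components can also be compared with the Bayesian information criterion (BIC) that `GaussianMixture` exposes. The sketch below runs on synthetic blobs (an assumption for illustration), so it says nothing about which k the credit card data would favour.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data (illustrative only): three 2-D blobs
centres = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X_blobs = np.vstack([c + rng.normal(scale=0.5, size=(150, 2)) for c in centres])

# Fit one mixture per candidate k and record its BIC (lower is better)
bic_scores = {
    k: GaussianMixture(n_components=k, random_state=0).fit(X_blobs).bic(X_blobs)
    for k in range(1, 7)
}
best_k = min(bic_scores, key=bic_scores.get)

print(bic_scores)
print("k with lowest BIC:", best_k)
```

To try this on the notebook's data, `X_blobs` would be replaced with `pcadf`.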

In [None]:
# Restore matplotlib's default style after sns.set()
import matplotlib
matplotlib.rc_file_defaults()

## Applying the Gaussian Mixture Model to cluster our data (PCA-applied) into 3 clusters

In [None]:
gmm = GaussianMixture(n_components = 3) 
gmm.fit(pcadf)

In [None]:
fig = plt.figure(figsize = (7, 7))
plt.suptitle("Plotting the 3 components with colors representing clusters", fontsize = 15)
             
x, s1 = pcadf['Principal Component 1'], "Principal Component 1"
y, s2 = pcadf['Principal Component 2'], "Principal Component 2"
z, s3 = pcadf['Principal Component 3'], "Principal Component 3"

# The model is already fitted above, so predict without refitting
c = gmm.predict(pcadf)

ax = fig.add_subplot(111, projection = '3d')
ax.scatter(x, y, z, c = c, s=0.5, alpha = 1)
plt.title('x axis : P1 | y axis : P2 | z axis : P3')
ax.set_xlabel(s1, fontsize = 13)
ax.set_ylabel(s2, fontsize = 13)
ax.set_zlabel(s3, fontsize = 13)

plt.show()

## Applying the Gaussian Mixture Model to cluster our data (without PCA) into 3 clusters

In [None]:
gmm1 = GaussianMixture(n_components = 3) 
gmm1.fit(df1)

## Internal Evaluation Metrics

#### A. For data with PCA applied

In [None]:
y_pred = gmm.predict(pcadf)
pred = pd.DataFrame(y_pred)
pred.columns = ['Type']

prediction = pd.concat([pcadf, pred], axis = 1)

clus0 = prediction.loc[prediction.Type == 0]
clus1 = prediction.loc[prediction.Type == 1]
clus2 = prediction.loc[prediction.Type == 2]

cluster_list = [clus0.values, clus1.values, clus2.values]

#### B. For data without PCA applied

In [None]:
y_pred1 = gmm1.predict(df1)
pred1 = pd.DataFrame(y_pred1)
pred1.columns = ['Type']

prediction1 = pd.concat([df1, pred1], axis = 1)

clus10 = prediction1.loc[prediction1.Type == 0]
clus11 = prediction1.loc[prediction1.Type == 1]
clus12 = prediction1.loc[prediction1.Type == 2]

cluster_list1 = [clus10.values, clus11.values, clus12.values]

### 1. Silhouette Score

#### A. For data with PCA applied

In [None]:
from sklearn.metrics import silhouette_score

X = pcadf
print("Clusters\tSilhouette Score\n")
for n_components in range(2, 11):
    # Fit a fresh model for each k and score the labels that model produces
    labels = GaussianMixture(n_components=n_components, random_state=0).fit_predict(X)
    sil_coeff = silhouette_score(X, labels, metric='euclidean')
    print("k = {} \t--> \t{}".format(n_components, sil_coeff))

#### B. For data without PCA applied

In [None]:
from sklearn.metrics import silhouette_score

X = df1
print("Clusters\tSilhouette Score\n")
for n_components in range(2, 11):
    # Fit a fresh model for each k and score the labels that model produces
    labels = GaussianMixture(n_components=n_components, random_state=0).fit_predict(X)
    sil_coeff = silhouette_score(X, labels, metric='euclidean')
    print("k = {} \t--> \t{}".format(n_components, sil_coeff))

The silhouette score ranges from -1 to 1, and higher is better. Thus, the data with PCA applied gives better clustering results.

### 2. Davies-Bouldin (DB) Index

#### A. For data with PCA applied

In [None]:
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(pcadf, y_pred)

#### B. For data without PCA applied

In [None]:
davies_bouldin_score(df1, y_pred1) 

The lower the DB index, the better. Thus, the data with PCA applied gives better results.
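
The two metrics point in opposite directions: a higher silhouette score and a lower DB index both indicate better-separated clusters. The sketch below illustrates this on synthetic blobs; the helper `two_blobs` and all data are hypothetical and used only to show how the scores move as clusters overlap.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
labels_demo = np.repeat([0, 1], 200)  # known cluster membership for the toy data

def two_blobs(separation):
    """Two unit-variance 2-D blobs whose centres are `separation` apart (synthetic)."""
    centres = np.array([[0.0, 0.0], [separation, 0.0]])
    return centres[labels_demo] + rng.normal(size=(400, 2))

scores = {}
for name, separation in [("far", 8.0), ("near", 2.0)]:
    X_pair = two_blobs(separation)
    scores[name] = {
        "silhouette": silhouette_score(X_pair, labels_demo),
        "db_index": davies_bouldin_score(X_pair, labels_demo),
    }
    print(name, scores[name])
```

Both scores depend on distances in the feature space they are computed in, so comparisons between the PCA-reduced data and the raw data are best read as a rough guide.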