----------------------------
## Objective: 
-----------------------------

Identify distinct segments in the existing customer base, based on customers' spending patterns and their past interactions with the bank.

--------------------------
## About the data:
--------------------------
The data covers customers of a bank. For each customer it records the credit limit, the total number of credit cards held, and how often the customer contacted the bank with queries through each of three channels: visiting the bank, going online, and calling the call centre.

- Sl_No - Customer serial number
- Customer Key - Customer identification
- Avg_Credit_Limit - Average credit limit (the currency is not specified; you can make an assumption about it)
- Total_Credit_Cards - Total number of credit cards
- Total_visits_bank - Total bank visits
- Total_visits_online - Total online visits
- Total_calls_made - Total calls made

## Importing libraries and overview of the dataset

In [None]:
#Import all the necessary packages

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z-score 
from sklearn.preprocessing import StandardScaler

#importing clustering algorithms
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


#if the KMedoids import below gives an error, uncomment the next line to install the scikit-learn-extra library
# !pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids

import warnings
warnings.filterwarnings("ignore")

#### Loading data

In [None]:
data = pd.read_excel('Credit Card Customer Data.xlsx')
data.head()

#### Check the info of the data

In [None]:
data.info()

**Observations:**

- There are 660 observations and 7 columns in the dataset.
- All columns have 660 non-null values i.e. there are no missing values.
- All columns are of int64 data type.

**There are no missing values. Let us now count the unique values in each column.** 

In [None]:
data.nunique()

- Customer Key, which is an identifier, has repeated values. We should remove these duplicates before applying any algorithm.

## Data Preprocessing and Exploratory Data Analysis

In [None]:
# Identify the duplicated customer keys
duplicate_keys = data.loc[data.duplicated(subset=['Customer Key'], keep='last'), 'Customer Key']
duplicate_keys

In [None]:
# Drop duplicated keys
data=data.drop_duplicates(subset=['Customer Key'],keep='last')
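To make the effect of `keep='last'` concrete, here is a minimal sketch on a toy frame. The customer keys and credit limits are hypothetical values chosen for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): "Customer Key" 101 appears twice
toy = pd.DataFrame({
    "Customer Key": [101, 102, 101, 103],
    "Avg_Credit_Limit": [5000, 8000, 6000, 9000],
})

# keep='last' flags every occurrence of a key except the last one
is_dup = toy.duplicated(subset=["Customer Key"], keep="last")

# drop_duplicates with the same arguments removes exactly the flagged rows
deduped = toy.drop_duplicates(subset=["Customer Key"], keep="last")

print(is_dup.tolist())
print(deduped)
```

Only the first row for key 101 is flagged, so the most recent record for each customer is the one that survives.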

In [None]:
data.drop(columns = ['Sl_No', 'Customer Key'], inplace = True)

In [None]:
data[data.duplicated()]

These rows are exact duplicates across the remaining features, so we can drop them from the data.

In [None]:
data=data[~data.duplicated()]

In [None]:
data.shape

- After removing the duplicated keys, the duplicated rows, and the two identifier columns, there are 644 unique observations and 5 columns in our data.

#### Summary Statistics

In [None]:
data.describe().T

**Observations:** After dropping the duplicates and the two identifier columns above, we observe that:
- The number of observations reduces to 644.
- The mean credit limit of users is 34869.56.
- The average number of online visits is higher than the average number of bank visits and the average number of calls made.

In [None]:
for col in data.columns:
    print(col)
    print('Skew :',round(data[col].skew(),2))
    plt.figure(figsize=(15,4))
    plt.subplot(1,2,1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(x=data[col])
    plt.show()

**Observation:**
- The histograms and boxplots show that Avg_Credit_Limit and Total_visits_online are highly right-skewed, with many outliers.
- Total_Credit_Cards, Total_visits_bank, and Total_calls_made do not appear to have any outliers.

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()

**Observation:**

- Avg_Credit_Limit is positively correlated with Total_Credit_Cards and Total_visits_online, which makes sense.
- Avg_Credit_Limit is negatively correlated with Total_calls_made and Total_visits_bank.
- Total_visits_bank, Total_visits_online, and Total_calls_made are negatively correlated with one another, which suggests that most customers rely mainly on one of these channels to contact the bank.

#### Scaling the data

In [None]:
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

In [None]:
data_scaled.head()

In [None]:
data_scaled_copy = data_scaled.copy(deep=True)
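As a sanity check on what z-score scaling does, the sketch below compares `StandardScaler` with the manual formula (x - mean) / std on a small toy frame. The values are hypothetical; note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with two features on very different scales
toy = pd.DataFrame({
    "Avg_Credit_Limit": [5000.0, 10000.0, 50000.0, 100000.0],
    "Total_calls_made": [0.0, 2.0, 4.0, 10.0],
})

# Scaling with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Manual z-score using the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Both approaches should agree up to floating point error
print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, each column has mean 0 and unit standard deviation, so no single feature dominates the distance calculations used by the clustering algorithms.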

## K-Means

Let us now fit the k-means algorithm on our scaled data and find the optimal number of clusters to use.

We will do this in 3 steps:
1. Initialize a dictionary to store the SSE for each k
2. Run for a range of Ks and store SSE for each run
3. Plot the SSE vs K and find the elbow

In [None]:
# step 1
sse = {} 

# step 2 - iterate for a range of Ks and fit the scaled data to the algorithm. Use inertia attribute from the clustering object and 
# store the inertia value for that k 
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=1).fit(data_scaled)
    sse[k] = kmeans.inertia_

# step 3
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE")
plt.show()

- Looking at the plot, we can say that elbow point is achieved for k=3.
- We will fit the k-means again with k=3 to get the labels.
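The SSE used for the elbow plot is what scikit-learn exposes as `inertia_`: the sum of squared distances from each point to its assigned cluster center. The following is a minimal sketch that checks this on synthetic data; the blobs from `make_blobs` and the explicit `n_init=10` are assumptions made for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled customer features (assumption)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=1)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_sse = np.sum((X - assigned_centers) ** 2)

# The manual SSE should match the fitted model's inertia
print(np.isclose(manual_sse, km.inertia_))
```

Because every additional cluster can only lower this quantity, the elbow method looks for the k after which the decrease becomes small rather than for the minimum SSE.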

In [None]:
kmeans = KMeans(n_clusters=3, max_iter=1000, random_state=1)  # Initialize K-Means with k=3
kmeans.fit(data_scaled)  # Fit K-Means on the scaled data

# Predict the cluster labels once and add them to both the scaled copy and the original data
kmeans_labels = kmeans.predict(data_scaled)
data_scaled_copy['Labels'] = kmeans_labels
data['Labels'] = kmeans_labels

**Observation:**

- The SSE drops steeply up to k=3 and flattens afterwards (the elbow), which is why we choose k=3 as the number of clusters.

We have generated the labels with k-means. Let us look at the various features based on the labels.

In [None]:
#Number of observations in each cluster
data.Labels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis=0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T

In [None]:
#Visualizing different features w.r.t K-means labels
data_scaled_copy.boxplot(by = 'Labels', layout = (1,5),figsize=(20,7))
plt.show()

**Cluster Profiles:**
- From the summary statistics:
    - The mean and median of the average credit limit are very high for group 2.
    - The mean and median of the average credit limit are lowest for group 0.
    - Group 2 also has the highest mean and median for total number of credit cards and total online visits.

- From the box plots:
    - Cluster 0 has outliers for average credit limit, whereas the other clusters have none.
    - Cluster 2 has an outlier for total credit cards.
    - Total calls made and total bank visits have no outliers in any cluster.
    - Cluster 0 has outliers for total online visits.



## Gaussian Mixture

Let's create clusters using Gaussian Mixture Models

In [None]:
gmm = GaussianMixture(n_components = 3,random_state = 1) #Apply the Gaussian Mixture algorithm
gmm.fit(data_scaled) #Fit the gmm function on the scaled data

data_scaled_copy['GmmLabels'] = gmm.predict(data_scaled)
data['GmmLabels'] = gmm.predict(data_scaled)

In [None]:
#Number of observations in each cluster
data.GmmLabels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
original_features = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made"]

mean = data.groupby('GmmLabels').mean()
median = data.groupby('GmmLabels').median()
df_gmm = pd.concat([mean, median], axis=0)
df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_gmm[original_features].T

In [None]:
# plotting boxplots with the new GMM based labels

features_with_labels = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made","GmmLabels"]

data_scaled_copy[features_with_labels].boxplot(by = 'GmmLabels', layout = (1,5),figsize=(20,7))
plt.show()

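**Cluster Profiles:**
- The summary statistics and box plots are similar to those from K-Means.

**Comparing Clusters:**
- The results are identical to K-Means in this scenario. In general, GMM can produce different clusters because it fits a covariance for each component and assigns points by probability rather than by distance to the nearest centroid.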
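One way to check how closely two clusterings agree, regardless of how the cluster numbers happen to be assigned, is a cross-tabulation of the two label sets together with the adjusted Rand index. This is a sketch on synthetic, well-separated blobs; the data, the explicit `n_init` values, and the variable names are assumptions for illustration, not results from the customer data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic, well-separated blobs standing in for the scaled data (assumption)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=1,
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, n_init=5, random_state=1).fit_predict(X)

# Rows are K-Means clusters, columns are GMM clusters;
# one dominant cell per row means the two partitions match
overlap = pd.crosstab(km_labels, gmm_labels, rownames=["KMeans"], colnames=["GMM"])

# Adjusted Rand index: 1.0 for identical partitions, ignoring label numbering
ari = adjusted_rand_score(km_labels, gmm_labels)

print(overlap)
print(round(ari, 3))
```

The same two calls could be applied to `data['Labels']` and `data['GmmLabels']` to quantify the agreement seen in the notebook.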

## K-Medoids

In [None]:
kmedo = KMedoids(n_clusters = 3, random_state=1) #Apply the K-Medoids algorithm
kmedo.fit(data_scaled) #Fit the kmedo function on the scaled data

data_scaled_copy['kmedoLabels'] = kmedo.predict(data_scaled)
data['kmedoLabels'] = kmedo.predict(data_scaled)

In [None]:
#Number of observations in each cluster
data.kmedoLabels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
mean = data.groupby('kmedoLabels').mean()
median = data.groupby('kmedoLabels').median()
df_kmedoids = pd.concat([mean, median], axis=0)
df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmedoids[original_features].T

In [None]:
#plotting boxplots with the new K-Medoids based labels

features_with_labels = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made","kmedoLabels"]

data_scaled_copy[features_with_labels].boxplot(by = 'kmedoLabels', layout = (1,5),figsize=(20,7))
plt.show()

Let's compare the clusters from K-Means and K-Medoids 

In [None]:
comparison = pd.concat([df_kmedoids, df_kmeans], axis=1)[original_features]
comparison

**Cluster Profiles:**
- Cluster 0 consists of users with the least variance in average credit limit, the lowest number of credit cards, the highest number of calls made, the narrowest range of bank visits, and a moderate number of online visits. This cluster contains outliers for average credit limit and total online visits.
- Cluster 1 consists of users with the highest average credit limit, the largest number of credit cards, and the most online visits.
- Cluster 2 consists of users with the most bank visits and the fewest online visits.

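**Comparing Clusters:**
- K-Medoids produces more evenly distributed observation counts across clusters than K-Means.
- A likely reason is that K-Medoids is less affected by outliers: each cluster center is an actual data point (the medoid), so extreme values pull it less than they pull a mean.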
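The robustness argument can be illustrated with a small NumPy sketch that compares the mean (the center K-Means uses) with the medoid (the center K-Medoids uses) for a single group of points containing one outlier. The points are hypothetical, and the medoid is computed by brute force here rather than with `sklearn_extra`.

```python
import numpy as np

# Four nearby points plus one extreme outlier (hypothetical values)
points = np.array([
    [1.0, 1.0],
    [1.0, 2.0],
    [2.0, 1.0],
    [2.0, 2.0],
    [50.0, 50.0],
])

# K-Means style center: the arithmetic mean, which the outlier drags away
centroid = points.mean(axis=0)

# K-Medoids style center: the data point with the smallest total
# Euclidean distance to all other points
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)
```

The mean lands far from the bulk of the points, while the medoid remains one of the four nearby points, which is why a few extreme customers distort medoid-based clusters less.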