## Unsupervised Learning
----------------------------------------

## Context: 
-----------------------------
AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.


----------------------------
## Objective: 
-----------------------------

Identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank.

--------------------------
## About the data:
--------------------------
The data covers various customers of a bank, with their credit limit, the total number of credit cards each customer has, and the channels through which the customer has contacted the bank with queries. The channels are visiting the bank, going online, and calling the call centre.

- Sl_No - Customer serial number
- Customer Key - Customer identification
- Avg_Credit_Limit - Average credit limit (currency is not specified; you can make an assumption about it)
- Total_Credit_Cards - Total number of credit cards
- Total_visits_bank - Total bank visits
- Total_visits_online - Total online visits
- Total_calls_made - Total calls made

## Importing libraries and overview of the dataset

In [None]:
#Import all the necessary packages

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z-score 
from sklearn.preprocessing import StandardScaler

#importing clustering algorithms
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


#installing and importing the sklearn_extra library
!pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids

import warnings
warnings.filterwarnings("ignore")

#### Loading data

In [None]:
data = pd.read_csv('../input/credit-card-customer-data/Credit Card Customer Data.csv')
data.head()

#### Check the info of the data

In [None]:
data.info()

**Observations:**

- There are 660 observations and 7 columns in the dataset.
- All columns have 660 non-null values i.e. there are no missing values.
- All columns are of int64 data type.

**Let us now figure out the uniques in each column** 

In [None]:
data.nunique()

- Customer Key, which is an identifier, has repeated values. We should handle these duplicates before applying any algorithm.

## Data Preprocessing and Exploratory Data Analysis

#### Identify and drop the rows with duplicated customer keys

In [None]:
# Boolean mask: True for every repeated occurrence of a customer key (the first occurrence is False)
duplicate_keys = data.duplicated('Customer Key')

In [None]:
# Drop the rows with duplicated keys; .copy() gives an independent DataFrame for the later in-place edits
data = data[~duplicate_keys].copy()

We have done some basic checks. Now, let's drop the variables that are not required for our analysis.

In [None]:
data.drop(columns = ['Sl_No', 'Customer Key'], inplace = True)

Now that we have dropped the unnecessary columns, we can check for duplicates again. Duplicates would mean customers with identical features.

In [None]:
data[data.duplicated()]

We can drop these duplicated rows from the data

In [None]:
data=data[~data.duplicated()]

In [None]:
data.shape

- After removing duplicated keys and rows and unnecessary columns, there are 644 unique observations and 5 columns in our data.
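
The two de-duplication passes above can be sketched on a small frame. This is a minimal illustration, assuming a hypothetical `toy` DataFrame that is not part of the bank data; it shows that `duplicated(subset=...)` flags repeated keys, while `drop_duplicates()` on the remaining columns removes customers with identical features.

```python
import pandas as pd

# Hypothetical toy data: key 101 appears twice, and two customers share identical features.
toy = pd.DataFrame({
    "Customer Key": [101, 102, 101, 103, 104],
    "Avg_Credit_Limit": [5000, 8000, 7000, 8000, 9000],
    "Total_Credit_Cards": [2, 3, 4, 3, 5],
})

# Pass 1: flag repeated keys (the first occurrence of each key is kept).
dup_key_mask = toy.duplicated(subset="Customer Key")
toy_unique_keys = toy[~dup_key_mask]

# Pass 2: drop the identifier, then remove rows whose remaining features are identical.
features = toy_unique_keys.drop(columns=["Customer Key"])
features_dedup = features.drop_duplicates()

print(dup_key_mask.tolist())
print(features_dedup.shape)
```

Both calls keep the first occurrence by default; the `keep` parameter changes that if a different row should survive.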

#### Summary Statistics

In [None]:
data.describe().T

**Observations:**

- The average credit limit is around 35K, while 50% of customers have a credit limit below 18K. A mean that far above the median implies strong positive skewness.
- The standard deviation shows considerably high variation in credit limits as well.
- On average, each customer owns ~5 credit cards. Some customers have 10.
- On average, customers contact the bank most often through calls, then online. Also, some customers never contacted or visited the bank.
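
The link between a mean well above the median and positive skewness can be checked on a few numbers. The sketch below uses a hypothetical series of credit limits (not taken from the dataset) with a handful of very large values.

```python
import pandas as pd

# Hypothetical credit limits: most are small, a few are very large.
limits = pd.Series([5000, 8000, 10000, 12000, 15000, 18000, 20000,
                    100000, 150000, 200000])

mean_limit = limits.mean()      # pulled upwards by the large values
median_limit = limits.median()  # unaffected by how large the top values are
skewness = limits.skew()        # positive when the right tail is long

print(mean_limit, median_limit, round(skewness, 2))
```

Here the mean is more than three times the median, and the skewness is positive, which is the same pattern the summary statistics show for Avg_Credit_Limit.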

#### Now let's explore each variable at hand. We will check the distribution and outliers for each variable in the data.

In [None]:
for col in data.columns:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    # Left panel: histogram of the variable
    plt.subplot(1, 2, 1)
    data[col].hist()
    plt.ylabel('count')
    # Right panel: boxplot to highlight outliers
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()

**Observation:**

- There are many outliers in the average credit limit. High-credit customers are causing the skewness.
- Online visits are mostly between 1 and 4, with some outliers above 7.

**Now, let's check the correlation among different variables.**

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()

**Observation:**

- Avg_Credit_Limit is positively correlated with Total_Credit_Cards and Total_visits_online, which makes sense.
- Avg_Credit_Limit is negatively correlated with Total_calls_made and Total_visits_bank.
- Total_visits_bank, Total_visits_online, and Total_calls_made are negatively correlated with one another, which suggests that the majority of customers mainly use one of these channels to contact the bank.
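
To see why a preference for one channel shows up as negative correlation, consider a rough sketch with hypothetical contact counts (nine made-up customers, three per preferred channel). The arrays are assumptions for illustration only.

```python
import numpy as np

# Hypothetical contact counts: each customer mostly uses a single channel.
visits_bank   = np.array([5, 4, 5, 0, 1, 0, 1, 0, 0])
visits_online = np.array([0, 1, 0, 6, 7, 6, 1, 0, 1])
calls_made    = np.array([1, 0, 1, 0, 1, 0, 8, 9, 8])

# Rows of the input are variables, so the result is a 3x3 Pearson correlation matrix.
corr = np.corrcoef([visits_bank, visits_online, calls_made])

print(np.round(corr, 2))
```

Every off-diagonal entry is negative: when a customer's count is high on one channel, it tends to be low on the other two.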

#### Scaling the data

In [None]:
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

In [None]:
data_scaled.head()

In [None]:
#Creating copy of the data to store labels from each algorithm
data_scaled_copy = data_scaled.copy(deep=True)
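
For reference, the z-score that `StandardScaler` applies can be written out by hand. The sketch below assumes a small hypothetical matrix `X`; it subtracts each column's mean and divides by the column's population standard deviation (`ddof=0`), which is the estimator `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature matrix: rows are customers, columns are features on very different scales.
X = np.array([[10000.0, 2.0],
              [20000.0, 4.0],
              [60000.0, 9.0]])

# z-score per column: (value - column mean) / column standard deviation.
# np.std defaults to ddof=0, the population standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # each column is centred on 0
print(X_scaled.std(axis=0).round(6))   # each column has unit standard deviation
```

After scaling, the credit limit no longer dominates distance calculations simply because its raw values are thousands of times larger than the visit counts.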

## K-Means

Let us now fit the K-Means algorithm on our scaled data and find the optimum number of clusters to use.

We will do this in 3 steps:
1. Initialize a dictionary to store the SSE for each k
2. Run for a range of Ks and store SSE for each run
3. Plot the SSE vs K and find the elbow

In [None]:
# step 1
sse = {} 

# step 2 - iterate for a range of Ks and fit the scaled data to the algorithm. Use inertia attribute from the clustering object and 
# store the inertia value for that k 
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=1).fit(data_scaled)
    sse[k] = kmeans.inertia_

# step 3
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

- Looking at the plot, the elbow appears at k=3.
- We will fit K-Means again with k=3 to get the labels.
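
The SSE plotted on the elbow curve is the sum of squared distances from each point to the centre of its cluster. As a minimal sketch, the hypothetical helper `sse` below computes that quantity for a given labelling, taking each cluster's centre to be the mean of its members (which is where K-Means centroids sit at convergence). The four points are made up for illustration.

```python
import numpy as np

def sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Hypothetical data: two tight pairs of points that lie far apart.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = sse(points, np.array([0, 0, 0, 0]))
two_clusters = sse(points, np.array([0, 0, 1, 1]))

print(one_cluster, two_clusters)
```

Going from one cluster to two drops the SSE from 104 to 4. Adding more clusters after the natural groups are separated yields only small further reductions, which is what produces the elbow.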

#### Fit the K-Means algorithm on the scaled data

In [None]:
#Initialize the K-Means model with 3 clusters
kmeans = KMeans(n_clusters=3, max_iter=1000, random_state=1) 

#Fit the model on the scaled data
kmeans.fit(data_scaled)

#Add the predicted labels to the scaled copy and to the original data
data_scaled_copy['Labels'] = kmeans.predict(data_scaled) #Labels stored alongside the scaled features
data['Labels'] = kmeans.predict(data_scaled) #Same labels stored alongside the original features

We have generated the labels with k-means. Let us look at the various features based on the labels.

In [None]:
#Number of observations in each cluster
data.Labels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis=0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
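
As an alternative to concatenating two group-by results and naming the rows by hand, the same profile can be sketched with a single `agg` call. The `toy` frame below is hypothetical; the point is that the row and column labels come from the data, so they cannot drift out of order.

```python
import pandas as pd

# Hypothetical labelled data with three clusters of two customers each.
toy = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 14000, 30000, 40000, 130000, 150000],
    "Total_Credit_Cards": [2, 3, 5, 6, 9, 10],
    "Labels": [0, 0, 1, 1, 2, 2],
})

# One row per cluster; columns form a (feature, statistic) MultiIndex.
profile = toy.groupby("Labels").agg(["mean", "median"])

print(profile.T)
```

Transposing with `.T` gives one column per cluster, which is the same orientation as the summary table above.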

In [None]:
#Visualizing different features w.r.t K-means labels
data_scaled_copy.boxplot(by = 'Labels', layout = (1,5),figsize=(20,7))
plt.show()

**Cluster Profiles:**


Group 0:

- Customers with the lowest credit limits (~12K on average).
- They also have the lowest average number of credit cards (~2 cards each).
- They tend to make phone calls rather than use online or bank visits.

Group 1:

- Customers with mid-range credit limits (~34K on average).
- They also have a mid-range average number of credit cards (~6 cards each).
- They tend to visit the bank rather than make calls or go online.

Group 2:

- Customers with the highest credit limits (~140K on average).
- They also have the highest average number of credit cards (~9 cards each).
- They tend to go online rather than make phone calls or visit the bank.

## Gaussian Mixture

Let's create clusters using Gaussian Mixture Models

In [None]:
#Initialize the Gaussian Mixture model with 3 components
gmm = GaussianMixture(n_components=3, random_state=1) 

#Fit the model on the scaled data
gmm.fit(data_scaled)

#Add the predicted labels to the scaled copy and to the original data
data_scaled_copy['GmmLabels'] = gmm.predict(data_scaled)
data['GmmLabels'] = gmm.predict(data_scaled)

In [None]:
#Number of observations in each cluster
data.GmmLabels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
original_features = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made"]

mean = data.groupby('GmmLabels').mean()
median = data.groupby('GmmLabels').median()
df_gmm = pd.concat([mean, median], axis=0)
df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_gmm[original_features].T

In [None]:
# Plotting boxplots of the scaled features grouped by the GMM labels

features_with_labels = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made","GmmLabels"]

data_scaled_copy[features_with_labels].boxplot(by = 'GmmLabels', layout = (1,5),figsize=(20,7))
plt.show()

**Cluster Profiles:**

Group 0:

- Customers with the lowest credit limits (~12K on average).
- They also have the lowest average number of credit cards (~2 cards each).
- They tend to make phone calls rather than use online or bank visits.

Group 1:

- Customers with mid-range credit limits (~34K on average).
- They also have a mid-range average number of credit cards (~6 cards each).
- They tend to visit the bank rather than make calls or go online.

Group 2:

- Customers with the highest credit limits (~140K on average).
- They also have the highest average number of credit cards (~9 cards each).
- They tend to go online rather than make phone calls or visit the bank.


**Comparing the clusters, we can clearly see that both algorithms produce clusters with the same profiles.**
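
One way to check this kind of agreement directly, rather than by reading the profiles, is a contingency table of the two label columns. The sketch below uses hypothetical label lists (not the notebook's results) in which the two algorithms find the same groups but number them differently.

```python
import pandas as pd

# Hypothetical labels: same grouping, but clusters 0 and 1 are swapped between the two algorithms.
kmeans_labels = [0, 0, 0, 1, 1, 1, 2, 2]
gmm_labels    = [1, 1, 1, 0, 0, 0, 2, 2]

# Rows: K-Means labels, columns: GMM labels, cells: number of shared customers.
table = pd.crosstab(pd.Series(kmeans_labels, name="KMeans"),
                    pd.Series(gmm_labels, name="GMM"))

# The partitions match exactly when every row and every column has a single non-zero cell.
nonzero = table > 0
same_partition = bool((nonzero.sum(axis=1) == 1).all() and (nonzero.sum(axis=0) == 1).all())

print(table)
print(same_partition)
```

Because cluster numbers are arbitrary, comparing the label columns with `==` would understate the agreement; the table is insensitive to renumbering. In the notebook, the inputs would be `data['Labels']` and `data['GmmLabels']`.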

## K-Medoids

In [None]:
#Initialize the K-Medoids model with 3 clusters
kmedo = KMedoids(n_clusters=3, max_iter=1000, random_state=1)

#Fit the model on the scaled data
kmedo.fit(data_scaled)

#Add the predicted labels to the scaled copy and to the original data
data_scaled_copy['kmedoLabels'] = kmedo.predict(data_scaled)
data['kmedoLabels'] = kmedo.predict(data_scaled)

In [None]:
#Number of observations in each cluster
data.kmedoLabels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
mean = data.groupby('kmedoLabels').mean()
median = data.groupby('kmedoLabels').median()
df_kmedoids = pd.concat([mean, median], axis=0)
df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmedoids[original_features].T

In [None]:
#Plotting boxplots of the scaled features grouped by the K-Medoids labels

features_with_labels = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made","kmedoLabels"]

data_scaled_copy[features_with_labels].boxplot(by = 'kmedoLabels', layout = (1,5),figsize=(20,7))
plt.show()

Let's compare the clusters from K-Means and K-Medoids 

In [None]:
comparison = pd.concat([df_kmedoids, df_kmeans], axis=1)[original_features]
comparison

**Cluster Profiles:**

Group 0:

- Customers with the lowest credit limits (~12K on average).
- They also have the lowest average number of credit cards (~2 cards each).
- They tend to make phone calls rather than use online or bank visits.

Group 1:

- Customers with the highest credit limits (~85K on average).
- They also have the highest average number of credit cards (~7 cards each).
- They tend to go online rather than make phone calls or visit the bank.

Group 2:

- Customers with mid-range credit limits (~28K on average).
- They also have a mid-range average number of credit cards (~5 cards each).
- They tend to visit the bank rather than make calls or go online.

**Comparing Clusters:**

- Both algorithms produced one identical cluster (cluster 0), with the same profile.
- K-Medoids grouped the data points differently from K-Means. A likely reason is that K-Medoids centres each cluster on an actual data point (the medoid) rather than on the mean, as K-Means does. This has apparently driven K-Means to expand cluster 2, as its centroid kept moving towards outliers.
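
The difference between a mean-based centre and a medoid can be sketched with a few hypothetical points, one of which is an outlier. Here the medoid is taken to be the data point with the smallest total Euclidean distance to all other points in the group, which is an assumption about the distance metric rather than a statement about the fitted model above.

```python
import numpy as np

# Hypothetical cluster: four nearby points and one far outlier.
points = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [20.0, 20.0]])

# K-Means style centre: the coordinate-wise mean, which need not be a real data point.
centroid = points.mean(axis=0)

# K-Medoids style centre: the actual point with the smallest summed distance to the others.
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))
medoid = points[dist.sum(axis=1).argmin()]

print(centroid)
print(medoid)
```

The single outlier drags the mean to (5.2, 5.2), outside the dense group, while the medoid stays at (2, 2) among the bulk of the points. That sensitivity to extreme values is the mechanism the comparison above refers to.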
