## Project: Unsupervised Learning
----------------------------------------
**Marks: 30**
-----------------------------------------

Welcome to the project on Unsupervised Learning. We will be using the Credit Card Customer Data for this project.

----------------------------
## Context: 
-----------------------------
AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.


----------------------------
## Objective: 
-----------------------------

Identify different segments in the existing customer base, based on their spending patterns as well as their past interactions with the bank.

--------------------------
## About the data:
--------------------------
The data covers customers of a bank. For each customer it records the credit limit, the total number of credit cards held, and how often the customer has contacted the bank for queries through each channel: visiting the bank, going online, and calling the call centre.

- Sl_No - Customer serial number
- Customer Key - Customer identification
- Avg_Credit_Limit - Average credit limit (the currency is not specified; you can make an assumption about it)
- Total_Credit_Cards - Total number of credit cards
- Total_visits_bank - Total bank visits
- Total_visits_online - Total online visits
- Total_calls_made - Total calls made

## Importing libraries and overview of the dataset

In [1]:
#Import all the necessary packages

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z-score 
from sklearn.preprocessing import StandardScaler

#importing clustering algorithms
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


#if the import below gives an error, uncomment the following line to install the scikit-learn-extra library
# !pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids

import warnings
warnings.filterwarnings("ignore")



#### Loading data

In [2]:
data = pd.read_excel('Credit+Card+Customer Data.xlsx')
data.head()


#### Check the info of the data

In [None]:
data.info()

**Observations:**

- There are 660 observations and 7 columns in the dataset.
- All columns have 660 non-null values i.e. there are no missing values.
- All columns are of int64 data type.

**There are no missing values. Let us now count the unique values in each column.**

In [None]:
data.nunique()

- Customer Key, which is an identifier, has repeated values. We should handle these duplicates before applying any algorithm.

## Data Preprocessing and Exploratory Data Analysis

#### **Question 1: Identify and drop the rows with duplicated customer keys (2 Marks)**

In [None]:
# Identify the duplicated customer keys - There should be 5
duplicate_keys = pd.concat(g for _, g in data.groupby("Customer Key") if len(g) > 1)
duplicate_keys

In [None]:
# Drop duplicated keys

data = data.drop_duplicates(subset=['Customer Key'])
data.nunique() # just to check that there were 5 dropped
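As a cross-check on the two cells above, here is a minimal sketch on a toy frame. It uses `duplicated(keep=False)` to list every row that shares a key and counts how many rows `drop_duplicates` removes. The values are made up for illustration; only the `Customer Key` column name comes from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Customer Key": [101, 102, 101, 103, 102],
    "Avg_Credit_Limit": [5000, 8000, 6000, 9000, 8000],
})

# keep=False flags every row whose key occurs more than once
dup_rows = toy[toy.duplicated(subset="Customer Key", keep=False)].sort_values("Customer Key")

# The default keep='first' flags only the repeats, i.e. the rows drop_duplicates removes
n_removed = int(toy.duplicated(subset="Customer Key").sum())

deduped = toy.drop_duplicates(subset="Customer Key")

print(dup_rows)
print("rows removed:", n_removed, "| rows kept:", len(deduped))
```

Running the same two checks on `data` before dropping the keys should report the 5 repeated keys mentioned above.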

We have done some basic checks. Now, let's drop the variables that are not required for our analysis.

In [None]:
data.drop(columns = ['Sl_No', 'Customer Key'], inplace = True)

Now that we have dropped the unnecessary columns, we can check for duplicates again. Duplicates here would mean customers with identical features.

In [None]:
data[data.duplicated()]

We can drop these duplicated rows from the data.

In [None]:
data=data[~data.duplicated()]

In [None]:
data.shape

- After removing duplicated keys and rows and unnecessary columns, there are 644 unique observations and 5 columns in our data.

#### Summary Statistics

#### **Question 2: Write your observations on the summary statistics of the data (1 Mark)**

In [None]:
data.describe().T

**Observations:**

Credit Card Limit: Average is 34,543 +/- 37,428.70 dollars (assuming the currency is dollars). The median is 18,000 dollars (IQR = 11,000 to 48,000), well below the mean, so the data is right-skewed rather than normally distributed. The range is 3,000 to 200,000 dollars.

Total Number of Credit Cards each Customer has: Average is 4.69 +/- 2.18 cards. The median is 5 (IQR = 3 - 6). The range is 1 to 10.

Total Bank Visits: Average is 2.40 +/- 1.63. The median is 2 (IQR = 1 - 4). The range is 0 to 5.

Total Online Visits: Average is 2.63 +/- 2.96. The median is 2 (IQR = 1 - 4). The range is 0 to 15.

Total Calls Made Through the Call Center: Average is 3.61 +/- 2.88. The median is 3 (IQR = 1 - 5.25). The range is 0 to 10.

#### Now let's explore each variable at hand. We will check the distribution and outliers for each variable in the data.

#### Question 3:
- **Check the distribution of all variables (use the .hist() method) (2 Marks)**
- **Check outliers for all variables (use sns.boxplot()) (2 Marks)**
- **Write your observations (1 Mark)**

In [None]:
# For each variable, print its skewness and plot a histogram next to a boxplot

for col in data.columns:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)  # left panel: histogram
    data[col].hist()
    plt.ylabel('count')
    plt.subplot(1, 2, 2)  # right panel: boxplot
    sns.boxplot(x=data[col])
    plt.show()

**Observation:**

Avg_Credit_Limit: Skew : 2.19 (Highly Right Skewed)

- Many outliers above a credit limit of 100,000 dollars (more than 35; they overlap in the boxplot and are hard to count by eye)

Total_Credit_Cards: Skew : 0.17 (Relatively symmetric)

- no outliers

Total_visits_bank: Skew : 0.15 (Relatively symmetric)

- no outliers

Total_visits_online: Skew : 2.21 (Highly Right Skewed)

- 7 outliers above 8 online visits

Total_calls_made: Skew : 0.65 (Slightly Right Skewed)

- no outliers
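Counting boxplot outliers by eye is error-prone. Here is a minimal sketch for counting them in code. It assumes the usual 1.5 * IQR rule that box plots apply by default for their whiskers. The helper name `count_iqr_outliers` and the toy series are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy data: nine small values and one clearly extreme value
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

Applied column by column, for example `data.apply(count_iqr_outliers)`, this would give one count per variable to compare against the visual estimates above.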

**Now, let's check the correlation among different variables.**

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()

**Observation:**

- Avg_Credit_Limit is positively correlated with Total_Credit_Cards and Total_visits_online, which makes sense.
- Avg_Credit_Limit is negatively correlated with Total_calls_made and Total_visits_bank.
- Total_visits_bank, Total_visits_online, and Total_calls_made are negatively correlated with one another, which suggests that the majority of customers use only one of these channels to contact the bank.

#### Scaling the data

In [None]:
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

In [None]:
data_scaled.head()

In [None]:
#Creating copy of the data to store labels from each algorithm
data_scaled_copy = data_scaled.copy(deep=True)
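`StandardScaler` replaces each value with its z-score, z = (x - mean) / std, computed per column using the population standard deviation (`ddof=0`). The sketch below checks this by hand on a toy frame. The numbers are made up, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical values on very different scales
toy = pd.DataFrame({
    "Avg_Credit_Limit": [5000.0, 18000.0, 50000.0, 100000.0],
    "Total_Credit_Cards": [2.0, 4.0, 6.0, 9.0],
})

scaled = StandardScaler().fit_transform(toy)

# Manual z-score; ddof=0 matches the population std used by StandardScaler
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so the credit limit no longer dominates the distance calculations used by the clustering algorithms.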

## K-Means

Let us now fit the k-means algorithm on our scaled data and find the optimal number of clusters to use.

We will do this in 3 steps:
1. Initialize a dictionary to store the SSE (sum of squared errors) for each k
2. Run k-means for a range of k values and store the SSE for each run
3. Plot the SSE against k and find the elbow

In [None]:
# step 1
sse = {} 

# step 2 - iterate for a range of Ks and fit the scaled data to the algorithm. Use inertia attribute from the clustering object and 
# store the inertia value for that k 
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=1).fit(data_scaled)
    sse[k] = kmeans.inertia_

# step 3
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

- Looking at the plot, the elbow occurs at k=3: the SSE drops steeply up to k=3 and only gradually after that.
- We will fit k-means again with k=3 to get the labels.
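The SSE plotted above is the `inertia_` attribute: the sum of squared distances from each point to the centre of its assigned cluster. Here is a minimal sketch that recomputes it by hand. It uses synthetic blobs from `make_blobs` as a stand-in for the scaled customer data, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled data (hypothetical, not the bank dataset)
X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Look up the centre assigned to each point, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_sse = float(((X - assigned_centers) ** 2).sum())

print(round(manual_sse, 4), round(km.inertia_, 4))
```

Because adding clusters can only reduce this quantity, the SSE curve always decreases with k. That is why we look for the elbow rather than the minimum.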

#### Question 4: 

- **From the above elbow plot, state the reason for choosing k=3 with random_state=1 (1 Mark)**
- **Fit the K-means algorithm on the scaled data with the number of clusters equal to 3 (2 Marks)**
- **Store the predictions as 'Labels' in the 'data_scaled_copy' and 'data' dataframes (2 Marks)**

In [None]:
kmeans = KMeans(init="random", n_clusters=3, random_state=1)  # Initialize K-Means with 3 clusters
kmeans.fit(data_scaled)  # Fit K-Means on the scaled data

# Add the predicted labels to both the scaled data and the original data
labels = kmeans.predict(data_scaled)
data_scaled_copy['Labels'] = labels  # Labels next to the scaled features
data['Labels'] = labels  # Same labels next to the original (unscaled) features

We have generated the labels with k-means. Let us look at the various features based on the labels.

#### **Question 5: Create cluster profiles using the below summary statistics and box plots for each label (6 Marks)**

In [None]:
#Number of observations in each cluster
data.Labels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis=0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T

In [None]:
#Visualizing different features w.r.t K-means labels
data_scaled_copy.boxplot(by = 'Labels', layout = (1,5),figsize=(20,7))
plt.show()

**Cluster Profiles:**
All relationships are explained relative to the median of each group.

GROUP 0: Fewest cards and lowest credit limits (one high outlier), and many more calls made to the bank than the other two groups. Bank visits are fewer than in group 1 and about the same as in group 2. Online visits are few, similar to group 1 but fewer than group 2, with one outlier. Group 0 is the only group with outliers, although it is unknown whether both outliers are the same person.

GROUP 1: This group is in the middle of the three for credit limit and total cards. It has the most bank visits of the three groups, makes a similar number of calls to group 2, and has the fewest online visits of the three, close to group 0.

GROUP 2: This group has the highest credit limit (relatively large IQR) and the most cards of the three groups. It makes the most online visits (relatively large IQR), but the fewest visits and calls to the bank of the three groups.

## Gaussian Mixture

Let's create clusters using Gaussian Mixture Models.

#### Question 6: 

- **Apply the Gaussian Mixture algorithm on the scaled data with random_state=1 (2 Marks)** 
- **Create cluster profiles using the below summary statistics and box plots for each label (2 Marks)**
- **Compare the clusters from both algorithms - K-means and Gaussian Mixture (1 Mark)**

In [None]:
gmm = GaussianMixture(n_components=3, random_state=1) #Apply the Gaussian Mixture algorithm
gmm.fit(data_scaled) #Fit the gmm function on the scaled data

data_scaled_copy['GmmLabels'] = gmm.predict(data_scaled)
data['GmmLabels'] = gmm.predict(data_scaled)

In [None]:
#Number of observations in each cluster
data.GmmLabels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
original_features = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made"]

mean = data.groupby('GmmLabels').mean()
median = data.groupby('GmmLabels').median()
df_gmm = pd.concat([mean, median], axis=0)
df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_gmm[original_features].T

In [None]:
# plotting boxplots with the new GMM based labels

features_with_labels = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made","GmmLabels"]

data_scaled_copy[features_with_labels].boxplot(by = 'GmmLabels', layout = (1,5),figsize=(20,7))
plt.show()

**Cluster Profiles:**
All relationships are explained relative to the median of each group.

GROUP 0: Lowest credit limit (one high outlier), fewest number of cards. Highest number of calls made. Visits to the bank similar to group 2 and visits online most similar to group 1 (one high outlier). The only group with an outlier.

GROUP 1: Between group 0 and group 2 for the number of cards and credit limit (although most similar to group 0 for number of cards). Highest number of visits to the bank, but total calls were low (similar to group 2) and total visits online were the lowest of the three groups (similar to group 0).

GROUP 2: Highest credit limit (relatively large IQR) and number of cards. Fewest calls made and visits to the bank, with the most visits online (relatively large IQR).

**Comparing Clusters:**

The two methods produced the same descriptions of the characteristics of the three groups.
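Matching descriptions suggest, but do not prove, that the two algorithms grouped the same customers. One way to check this is a contingency table of the two label sets together with the adjusted Rand index, which is 1.0 for identical partitions regardless of how the cluster ids are numbered. The sketch below runs on synthetic blobs, an assumption made to keep it self-contained.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic, well-separated blobs (hypothetical stand-in for the scaled data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.6, random_state=1)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=1).fit_predict(X)

# Rows: K-Means clusters, columns: GMM clusters, cells: customers in both
table = pd.crosstab(km_labels, gmm_labels, rownames=["KMeans"], colnames=["GMM"])
ari = adjusted_rand_score(km_labels, gmm_labels)

print(table)
print("Adjusted Rand index:", round(ari, 3))
```

In this notebook the same check would be `pd.crosstab(data['Labels'], data['GmmLabels'])`. A table with a single non-zero cell in each row and column means both methods found the same members.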

## K-Medoids

#### Question 7: 

- **Apply K-Medoids on the scaled data with random_state=1 (2 Marks)** 
- **Create cluster profiles using the below summary statistics and box plots for each label (2 Marks)**
- **Compare the clusters from both algorithms - K-Means and K-Medoids (2 Marks)**

In [None]:
kmedo = KMedoids(metric="euclidean", n_clusters=3, random_state=1)  # Initialize K-Medoids with 3 clusters
kmedo.fit(data_scaled)  # Fit K-Medoids on the scaled data

data_scaled_copy['kmedoLabels'] = kmedo.predict(data_scaled)
data['kmedoLabels'] = kmedo.predict(data_scaled)

In [None]:
#Number of observations in each cluster
data.kmedoLabels.value_counts()

In [None]:
#Calculating summary statistics of the original data for each label
mean = data.groupby('kmedoLabels').mean()
median = data.groupby('kmedoLabels').median()
df_kmedoids = pd.concat([mean, median], axis=0)
df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmedoids[original_features].T

In [None]:
#plotting boxplots with the new K-Medoids based labels

features_with_labels = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made","kmedoLabels"]

data_scaled_copy[features_with_labels].boxplot(by = 'kmedoLabels', layout = (1,5),figsize=(20,7))
plt.show()

Let's compare the clusters from K-Means and K-Medoids 

In [None]:
# Side-by-side summary: each feature appears twice, first from K-Medoids and then from K-Means
comparison = pd.concat([df_kmedoids, df_kmeans], axis=1)[original_features]
comparison

**Cluster Profiles:**

All relationships are explained relative to the median of each group.

GROUP 0: Lowest average credit limit (one higher outlier), fewest number of credit cards. Highest number of calls, lowest number of bank visits, highest median of visits online (one high outlier).

GROUP 1: Highest credit limit (largest IQR) and most cards (one low outlier). Low number of calls (similar to group 2), low number of bank visits (just more than group 0, with one high outlier), and in the middle of the three groups for online visits (with a large IQR).

GROUP 2: In the middle of the three for credit limit and number of cards. Low number of calls to the bank, similar to group 1, highest visits to the bank, but lowest online visits.

**Comparing Clusters:**

GROUP 0: The two clustering techniques, K-Medoids and K-Means, described this group almost identically with respect to means and exactly the same with respect to medians, and each found the same outliers. I conclude that the two methods assigned the same people to GROUP 0.

GROUP 1 and GROUP 2: This is where the two methods differ.

GROUP 1: K-Medoids had a higher mean and median than K-Means for Average Credit Limit, Credit Cards, and Online Visits, while for Bank Visits the K-Medoids mean and median were lower. The two techniques reported the same mean and median for this group for total Calls Made to the bank.

GROUP 2: K-Medoids returned a higher mean and median than K-Means for Visits to the Bank and Calls Made, and a lower mean and median for Average Credit Limit, Number of Credit Cards, and Visits Online.

K-Medoids and K-Means found different members for GROUPS 1 and 2.
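One caveat when comparing groups by number: cluster ids are arbitrary, so "group 1" from one algorithm need not correspond to "group 1" from another. Here is a minimal sketch of matching the ids before comparing. It uses a contingency table and SciPy's `linear_sum_assignment` to pair the clusters that share the most members. The two label arrays are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Hypothetical labels from two algorithms for the same 8 customers
labels_a = np.array([0, 0, 1, 1, 1, 2, 2, 2])
labels_b = np.array([0, 0, 2, 2, 2, 1, 1, 2])

# Rows: ids from A, columns: ids from B, cells: number of shared members
overlap = pd.crosstab(labels_a, labels_b)

# Negate the counts so that minimising the cost maximises the total overlap
rows, cols = linear_sum_assignment(-overlap.to_numpy())
b_to_a = {int(overlap.columns[c]): int(overlap.index[r]) for r, c in zip(rows, cols)}

# Relabel B with the matched ids from A and measure the agreement
labels_b_aligned = np.array([b_to_a[label] for label in labels_b])
agreement = float((labels_a == labels_b_aligned).mean())

print(b_to_a)
print(agreement)
```

Here ids 1 and 2 are swapped between the two label sets. After relabelling, 7 of the 8 customers fall in the same group. Running `pd.crosstab(data['Labels'], data['kmedoLabels'])` in this notebook would show whether a similar swap affects the comparison above.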