## Check out the interactive plots at the bottom.

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
Path.ls = lambda x: list(x.iterdir())
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import math
from scipy import stats
from sklearn.cluster import KMeans

In [None]:
path = Path('/kaggle/input/ccdata/')
path.ls()

In [None]:
df = pd.read_csv(path/'CC GENERAL.csv')
df.head()

We are going to use PCA and KMeans clustering to perform customer segmentation with credit card data in this notebook.
We have the following features: 

* CUST_ID : Identification of the credit card holder (Categorical)
* BALANCE : Balance amount left in their account to make purchases
* BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
* PURCHASES : Amount of purchases made from the account
* ONEOFF_PURCHASES : Maximum purchase amount done in one go
* INSTALLMENTS_PURCHASES : Amount of purchases done in installments
* CASH_ADVANCE : Cash in advance given by the user
* PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
* ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
* PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
* CASH_ADVANCE_FREQUENCY : How frequently the cash in advance is being paid
* CASH_ADVANCE_TRX : Number of transactions made with "Cash in Advance"
* PURCHASES_TRX : Number of purchase transactions made
* CREDIT_LIMIT : Credit card limit for the user
* PAYMENTS : Amount of payments made by the user
* MINIMUM_PAYMENTS : Minimum amount of payments made by the user
* PRC_FULL_PAYMENT : Percent of full payment paid by the user
* TENURE : Tenure of the credit card service for the user

In [None]:
df.describe()

#### Observations: 

* We have skewed data. We can see 0's in `25th` and `50th` percentile. We will have to plot the histograms to find out more.
* The Tenure looks like a categorical column which makes sense. 


Let's check for NAs.

In [None]:
df.info(), df.isna().sum(), df.isna().sum()/len(df)

#### Observations:
* We can see there are missing values in `CREDIT_LIMIT` and `MINIMUM_PAYMENTS`. 
* We can also see that the missing values account for only about 3 percent of the data in `MINIMUM_PAYMENTS` and there is only one missing value in `CREDIT_LIMIT`. 
* We can easily use median to fill the NAs.
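
The median is a reasonable fill value here because the columns are skewed: a few very large values pull the mean away from where most customers sit. A minimal sketch on a toy frame (the values below are made up for illustration and are not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy, made-up data: one right-skewed column with a missing value
toy = pd.DataFrame({"MINIMUM_PAYMENTS": [100.0, 120.0, 150.0, np.nan, 20000.0]})

mean_fill = toy["MINIMUM_PAYMENTS"].mean()      # pulled up by the single large value
median_fill = toy["MINIMUM_PAYMENTS"].median()  # stays near the bulk of the data

# Fill each column's NaNs with that column's median
filled = toy.fillna(toy.median())
print(mean_fill, median_fill)
```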


Let's fill the NAs with the median.

In [None]:
na_cols = df.columns[df.isna().sum() > 0].tolist()
df.loc[:,na_cols] = df.loc[:,na_cols].fillna(df[na_cols].median())

In [None]:
df.isna().sum().sum()

We have handled the null values. Let's move on and list all the columns by their type.

* `categorical_cols`: 'TENURE'
* `continuous_cols`: 'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT'.
* `index_col`: 'CUST_ID'

# EDA

In [None]:
# I am too lazy to write down all the columns for cont cols ;) 
cat_cols = ['TENURE']
cont_cols = df.columns.tolist()
cont_cols.remove(cat_cols[0])
cont_cols.remove('CUST_ID')

In [None]:
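def plot_univariable_plots(df, cat_cols, cont_cols):
    """Plot a histogram per continuous column and a count plot per categorical column."""
    total_cols = len(cat_cols)+len(cont_cols)
    fig, axes = plt.subplots(math.ceil(total_cols/3),3, figsize=(20,20),constrained_layout=True)
    axes = axes.flatten()
    fig.suptitle('Univariate Plots',fontsize=18)
    
    # Continuous columns: histogram with a KDE overlay (sns.distplot is deprecated)
    for col, ax in zip(cont_cols, axes):
        sns.histplot(df[col], kde=True, ax=ax)
        ax.set_title(f'Histogram of {col}')
    
    # Categorical columns use the axes that follow the continuous ones
    for col, ax in zip(cat_cols, axes[len(cont_cols):]):
        sns.countplot(x=df[col], ax=ax)
        ax.set_title(f'Count plot of {col}')
        
    plt.show()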

## Looking at the data distribution

In [None]:
plot_univariable_plots(df, cat_cols, cont_cols)

#### Observations: 
* We can see most of the features are heavily skewed. We can try a power transform (log or Box-Cox) to reduce the skew.
* We can see that some features are left skewed and most are right skewed. 

Let's transform the skewed data with the Box-Cox transform from `scipy.stats`.

In [None]:
transformed_df = df.copy()
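# Box-Cox needs strictly positive input, so shift by 1 to handle zeros; [0] keeps the transformed values
transformed_df.loc[:,cont_cols] = transformed_df[cont_cols].apply(lambda x: stats.boxcox(x+1)[0], axis=0)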
plot_univariable_plots(transformed_df, cat_cols, cont_cols)
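
For reference, Box-Cox maps a positive value x to (x^λ − 1)/λ when λ ≠ 0 and to log(x) when λ = 0, and `scipy.stats.boxcox` fits λ by maximum likelihood when none is given. The sketch below uses a synthetic log-normal sample (an assumption for illustration, not the credit card data) to show how the transform affects skewness:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a column such as PURCHASES
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# With lmbda left unset, boxcox returns (transformed values, fitted lambda)
x_bc, lam = stats.boxcox(x + 1)

skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)
print(f"lambda={lam:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```

A skewness close to zero after the transform suggests the distribution has become roughly symmetric.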

# Normalizing Data for PCA

PCA is driven by variance, so features measured on larger scales would dominate the components. To put all features on a comparable footing, we scale each one to the range 0 to 1 using MinMaxScaler from scikit-learn.

In [None]:
scaler = MinMaxScaler()
scaler.fit(transformed_df[cont_cols+cat_cols])
scaled = scaler.transform(transformed_df[cont_cols+cat_cols])
scaled_df = pd.DataFrame(scaled, columns=cont_cols+cat_cols)
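
To see why scaling matters before PCA, here is a small sketch with two independent synthetic features on very different scales (the data and scales are made up for illustration). Without scaling, the first component is almost entirely the large-scale feature:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales
X = np.column_stack([
    rng.uniform(0, 10000, size=1000),  # e.g. an amount in currency units
    rng.uniform(0, 1, size=1000),      # e.g. a frequency score
])

# Explained variance ratios without and with min-max scaling
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(MinMaxScaler().fit_transform(X)).explained_variance_ratio_
print(raw_ratio, scaled_ratio)
```

After scaling, the two features contribute roughly equally, which is what we would expect from independent features of similar spread.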

# PCA

Now that our data is scaled, we are ready to apply PCA

In [None]:
N_COMPONENTS = 15
pca = PCA(n_components=N_COMPONENTS)
pca.fit(scaled_df)
pca.explained_variance_ratio_[:4].sum()

We get about 84% explained variance with just 4 components. We have reduced the data from 17 features (16 continuous variables plus TENURE) to 4, which makes it much easier to visualize.
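
Instead of fixing the number of components up front, one option is to pick the smallest number whose cumulative explained variance passes a threshold. This is a sketch on synthetic data; `pca_demo`, the data, and the 0.8 threshold are assumptions for illustration and not part of the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for scaled_df: 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca_demo = PCA().fit(X)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.8  # illustrative cut-off, not taken from the notebook
# argmax on a boolean array gives the first index where the condition holds
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(n_keep, cumulative.round(3))
```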

In [None]:
pca_data = pca.transform(scaled_df)
pca_df = pd.DataFrame(pca_data).iloc[:,:4]
pca_df.columns = list(map(lambda x: f'pca_{x+1}', pca_df.columns))
# pca_df['TENURE'] = df.TENURE

In [None]:
fig = px.scatter_3d(pca_df,x='pca_1',y='pca_2',z='pca_3',opacity=0.3,color='pca_4')
fig.show()

We can already see the clusters in the 3D scatterplot. **The plots are interactive. Try playing with them.**

# KMeans Clustering

In [None]:
cost = []
ks = []
for i in range(3,30):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(pca_df)
    cost.append(kmeans.inertia_)
    ks.append(i)
sns.lineplot(x=np.array(ks), y=np.array(cost))
plt.xticks(ks)
plt.show()

In [None]:
kmeans = KMeans(n_clusters=5)
kmeans.fit(pca_df)
out = kmeans.predict(pca_df)

In [None]:
fig = px.scatter_3d(pca_df,x='pca_1',y='pca_2',z='pca_3',color=out,opacity=0.5,
                    title='KMeans cluster with k=5')
fig.show()

Please have a look at the 3d plots to understand the clusters. 

# Understanding the clusters

In [None]:
def display_component(v, features_list, component_num, ax):
    """Bar plot of the five largest-magnitude feature weights of one PCA component."""
    # v is built from pca.components_, so each ROW is one component
    # and each column is one original feature.
    weights = v.iloc[component_num, :].values
    
    comps = pd.DataFrame({'weights': weights, 'features': features_list})
    
    # Rank features by absolute weight and keep the top five
    comps['abs_weights'] = comps['weights'].abs()
    sorted_weight_data = comps.sort_values('abs_weights',ascending=False).head()
    
    sns.barplot(data=sorted_weight_data,
                   x="weights",
                   y="features",
                   palette="Blues_d",ax=ax)
    ax.set_title("PCA Component Makeup, Component #" + str(component_num), fontsize=20)


In [None]:
features_list = np.array(cont_cols+cat_cols)
v = pd.DataFrame(pca.components_)

In [None]:
fig, axes = plt.subplots(2,2,figsize=(20,8),constrained_layout=True)
axes=axes.flatten()
for i,ax in enumerate(axes):
    display_component(v, features_list, i,ax=ax)
plt.show()

We can see how the PCA components correspond to the actual features in the dataset. 

**The 1st PCA component has a high positive correlation with BALANCE_FREQUENCY and a negative correlation with ONEOFF_PURCHASES, and so on.**
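
In scikit-learn, `pca.components_` has shape `(n_components, n_features)`, so each row holds one component's weights on the original features. The sketch below uses synthetic data and hypothetical feature names (`f_a` to `f_d`) to show how to read off the features that weigh most on a component:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = ["f_a", "f_b", "f_c", "f_d"]  # hypothetical names for illustration
X = rng.normal(size=(300, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)  # make f_a and f_b move together

pca_demo = PCA(n_components=2).fit(X)

# Rows are components, columns are original features
loadings = pd.DataFrame(pca_demo.components_, columns=feature_names,
                        index=["pca_1", "pca_2"])

# Features with the largest absolute weight on the first component
top = loadings.loc["pca_1"].abs().sort_values(ascending=False).head(2)
print(top)
```

Because `f_a` and `f_b` vary together, they should carry the largest weights on the first component.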

# Finally Understanding Customer behaviour

We will now take the cluster centers, which live in the **PCA** feature space, and project them back onto the original features to visualize customer behaviour.

In [None]:
cluster_centers = kmeans.cluster_centers_
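# Project the centers from PCA space back onto the original feature axes,
# using only the first 4 components (the ones the clustering was run on)
behaviours = cluster_centers.dot(v.iloc[:4].values)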
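
Multiplying the centers by the component rows gives each cluster's offset from the feature means that PCA subtracted during fitting. Adding `mean_` back should match scikit-learn's `inverse_transform`. A sketch on synthetic data (`pca_demo`, `km` and the data are assumptions for illustration; here all fitted components are used, whereas the notebook keeps only the first 4 of 15):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 6))  # synthetic stand-in for scaled_df

pca_demo = PCA(n_components=3).fit(X)
Z = pca_demo.transform(X)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Z)
centers = km.cluster_centers_             # shape (4, 3), in PCA space

# Offsets from the feature means, one row per cluster
offsets = centers @ pca_demo.components_  # shape (4, 6)
# Adding the mean back gives the centers in the original (scaled) feature units
centers_original = offsets + pca_demo.mean_

print(offsets.shape)
```

The offsets are what the bar plots in this section visualize: positive values mean a cluster sits above the average on that feature, negative values mean below.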

In [None]:
fig, axes = plt.subplots(3,2,figsize=(15,12),constrained_layout=True)
axes=axes.flatten()
threshold = 0.2
for i,behaviour in enumerate(behaviours):
    thresh_mask = np.nonzero(np.abs(behaviour)>threshold)[0].tolist()
    sns.barplot(x=behaviour[thresh_mask], y=features_list[thresh_mask],ax=axes[i])
    axes[i].set_title(f'Cluster {i+1} features')
plt.show()


## Behaviours observed:

* **Cluster 1** : Customers who use their credit card for installment purchases. They do not make one-off purchases at all. 
* **Cluster 2** : Customers who use their credit card for all types of purchases and pay their bills in advance.  
* **Cluster 3** : Customers who have a strong tendency to make one-off purchases and do so frequently. They also make high-value purchases.
* **Cluster 4** : Customers who don't make large purchases on the credit card. They also pay the bill in advance. 
* **Cluster 5** : Customers who use the credit card mostly for one-off purchases. They also don't pay bills in advance.

## Please upvote the notebook if you like my work. Keeps me motivated.