# Credit Card Customer Segmentation using k-means clustering

Not all customers are alike, and consumers show a wide variety of behaviors. Segments used in businesses are often threshold based.
With a growing number of features and a general move toward personalized products, there is a need for a scientific methodology to group customers together.
Clustering on behavioral data addresses this need.
The aim of this analysis is to sort credit card holders into appropriate groups, to better understand their needs and behaviors and to serve them with relevant marketing offers.

We will use the k-means algorithm to create the segmentation.
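
For intuition, k-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal NumPy sketch of that loop on synthetic data; the helper names `assign` and `lloyd_kmeans` and the toy data are illustrative assumptions, and the analysis itself relies on scikit-learn's `KMeans`.

```python
import numpy as np

def assign(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's implementation)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment so that labels match the returned centroids
    return assign(X, centroids), centroids

# Toy data: two well-separated blobs (assumed for illustration)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids = lloyd_kmeans(X_toy, k=2)
print(toy_centroids.round(2))
```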

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
df_original = pd.read_csv('/kaggle/input/ccdata/CC GENERAL.csv', index_col='CUST_ID')
df = df_original.copy()

In [None]:
df.shape

In [None]:
df.columns

It's worth understanding what these features correspond to:

* **BALANCE** : Balance amount left in the customer's account to make purchases
* **BALANCE_FREQUENCY** : How frequently the balance is updated, score between 0 and 1
* **PURCHASES** : Amount of purchases made from the account
* **ONEOFF_PURCHASES** : Maximum purchase amount done in one go
* **INSTALLMENTS_PURCHASES** : Amount of purchases done in installments
* **CASH_ADVANCE** : Cash in advance given by the user
* **PURCHASES_FREQUENCY** : How frequently purchases are being made, score between 0 and 1
* **ONEOFF_PURCHASES_FREQUENCY** : How frequently purchases are made in one go
* **PURCHASES_INSTALLMENTS_FREQUENCY** : How frequently purchases in installments are made
* **CASH_ADVANCE_FREQUENCY** : How frequently cash in advance is being paid
* **CASH_ADVANCE_TRX** : Number of transactions made with "Cash in Advance"
* **PURCHASES_TRX** : Number of purchase transactions made
* **CREDIT_LIMIT** : Credit card limit for the user
* **PAYMENTS** : Amount of payments made by the user
* **MINIMUM_PAYMENTS** : Minimum amount of payments made by the user
* **PRC_FULL_PAYMENT** : Percent of full payment paid by the user
* **TENURE** : Tenure of credit card service for the user

For the frequency scores, 1 = frequent and 0 = not frequent.

In [None]:
df.sample(10)

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
# Filling out all the null values using median 
# More appropriate strategies might be required depending on the context
df.fillna(df.median(), inplace=True)

In [None]:
for col in df.columns:
    print('{:33} : {:6} : {:}'.format(col, df[col].nunique(), df[col].dtype))

In [None]:
(1e2*df['TENURE'].value_counts().sort_index()/len(df)).plot(kind='barh')
plt.title('Tenure Distribution')
plt.xlabel('% Distribution');

In [None]:
sns.boxplot(x="TENURE", y="BALANCE", data=df)
plt.ylim(-10**3, 10**4)
plt.title('Balance distribution with Tenure');

In [None]:
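# Plot the distribution of the first 16 columns on a 4x4 grid
fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(15, 15))
for ax, col in zip(axs.ravel(), df.columns):
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(df[col], kde=True, stat='density', ax=ax)
plt.show()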

In [None]:
df.shape

In [None]:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters=k, random_state=1)
df['k_5_label'] = kmeans.fit_predict(df)

Inertia, the within-cluster sum of squared distances from each sample to its nearest centroid, is one measure of how compact the clusters are.
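
Inertia can be reproduced by hand as the sum of squared distances from each sample to the centre of its assigned cluster. A minimal sketch on synthetic data (the `X_demo` array and the choice of 4 clusters are assumptions for illustration, not the credit card data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

km_demo = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X_demo)

# Difference between each sample and the centre of its own cluster
diffs = X_demo - km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = float((diffs ** 2).sum())

# The two values should agree up to floating-point error
print(manual_inertia, km_demo.inertia_)
```

Because inertia always decreases as clusters are added, it is more useful for comparing runs with the same number of clusters, or for spotting an "elbow", than as an absolute score.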

In [None]:
kmeans.inertia_

In [None]:
profile = df.groupby('k_5_label').mean().T

In [None]:
round(profile)

In [None]:
# round(profile.apply(lambda x: (max(x) - min(x))/x.median(), axis=1))

In [None]:
round(pd.DataFrame(kmeans.cluster_centers_.T))

## Minibatch Clustering

In [None]:
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5, random_state=1)
# Exclude the k-means label column so that it is not used as a feature
df['k_5_batch'] = minibatch_kmeans.fit_predict(df.drop(columns='k_5_label'))

In [None]:
pd.crosstab(df['k_5_label'], df['k_5_batch'])

In [None]:
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Evaluation of clustering metrics

To determine how many clusters the dataset supports, we can evaluate a set of indices or scores.

1. Silhouette score
2. Calinski Harabasz score
3. Davies Bouldin score
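
As a rough guide to reading these scores: higher silhouette and Calinski-Harabasz values indicate more compact, better-separated clusters, whereas a lower Davies-Bouldin value is better. The sketch below computes all three on synthetic blobs generated with `make_blobs`; the toy data and variable names are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Synthetic data drawn around 3 centres (illustrative only)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for n_clust in (2, 3, 4, 5):
    labels = KMeans(n_clusters=n_clust, n_init=10, random_state=0).fit_predict(X_blobs)
    scores[n_clust] = {
        # Range [-1, 1]; higher is better
        'silhouette': silhouette_score(X_blobs, labels),
        # Ratio of between- to within-cluster dispersion; higher is better
        'calinski_harabasz': calinski_harabasz_score(X_blobs, labels),
        # Average similarity of each cluster to its closest one; lower is better
        'davies_bouldin': davies_bouldin_score(X_blobs, labels),
    }

for n_clust, s in scores.items():
    print(n_clust, {name: round(val, 3) for name, val in s.items()})
```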


In [None]:
def evaluate_metrics(df, min_clust=2, max_clust=10, rand_state=1):
    inertias = []
    silhouette = []
    ch_score = []
    db_score = []
    for n_clust in range(min_clust, max_clust):
        kmeans = KMeans(n_clusters=n_clust, random_state=rand_state)
        y_label = kmeans.fit_predict(df)
        inertias.append(kmeans.inertia_)
        silhouette.append(silhouette_score(df, y_label))
        ch_score.append(calinski_harabasz_score(df, y_label))
        db_score.append(davies_bouldin_score(df, y_label))        

    fig, ax = plt.subplots(2, 2, figsize=(15, 10))
    ax[0][0].plot(range(min_clust, max_clust), inertias, '-x', linewidth=2)
    ax[0][0].set_xlabel('No. of clusters')
    ax[0][0].set_ylabel('Inertia')
    
    ax[0][1].plot(range(min_clust, max_clust), silhouette, '-x', linewidth=2)
    ax[0][1].set_xlabel('No. of clusters')
    ax[0][1].set_ylabel('Silhouette Score')
    
    ax[1][0].plot(range(min_clust, max_clust), ch_score, '-x', linewidth=2)
    ax[1][0].set_xlabel('No. of clusters')
    ax[1][0].set_ylabel('Calinski Harabasz Score')
    
    ax[1][1].plot(range(min_clust, max_clust), db_score, '-x', linewidth=2)
    ax[1][1].set_xlabel('No. of clusters')
    ax[1][1].set_ylabel('Davies Bouldin Score')
    fig.suptitle('Metrics to evaluate the number of clusters')
    plt.show()

In [None]:
evaluate_metrics(df.iloc[:, :-2], min_clust=2, max_clust=15, rand_state=0)

# Scaling of features

In [None]:
df = df_original.copy()
df.fillna(df.median(), inplace=True)

In [None]:
from sklearn.preprocessing import StandardScaler
df_scaled = StandardScaler().fit_transform(df)

In [None]:
evaluate_metrics(df_scaled, min_clust=2, max_clust=15, rand_state=0)

In [None]:
from yellowbrick.cluster.silhouette import SilhouetteVisualizer

In [None]:
plt.style.use('seaborn-paper')
fig, axs = plt.subplots(2, 3, figsize=(20, 15))
axs = axs.reshape(6)
for i, k in enumerate(range(7, 13)):
    ax = axs[i]
    sil = SilhouetteVisualizer(KMeans(n_clusters=k, random_state=1), ax=ax)
    sil.fit(df_scaled)
    sil.finalize()

In [None]:
plt.style.use('fivethirtyeight')

Based on the general intuition obtained from the methods above, 8 appears to be an appropriate number of clusters.

In [None]:
df.T

In [None]:
kmeans = MiniBatchKMeans(n_clusters=8, random_state=1)
df['k_8_label'] = kmeans.fit_predict(df)

Let's look at how the population is distributed across the clusters.

In [None]:
round(1e2 * df['k_8_label'].value_counts().sort_index()/len(df), 2)

Whether this uneven distribution across clusters is desirable depends on the business context.
If similarly sized clusters are needed, some of the smaller clusters can be merged into others.
If, however, the business ask is to create an anomaly/fraud detection strategy (for example, identification of gamers), then having a small number of customers in a particular segment is not an issue.
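
One way to act on this is to compute each cluster's share of the population and flag those below a chosen cut-off for review, whether for merging or for closer inspection. A minimal sketch with made-up labels; the 10% threshold and the `labels_demo` series are assumptions for illustration:

```python
import pandas as pd

# Hypothetical cluster labels for 20 customers (not taken from the analysis)
labels_demo = pd.Series([0] * 9 + [1] * 7 + [2] * 3 + [3] * 1, name='segment')

# Share of the population in each cluster, in percent
share = 100 * labels_demo.value_counts(normalize=True).sort_index()

min_share = 10  # assumed business threshold, in percent
small_segments = share[share < min_share].index.tolist()

print(share.round(1).to_dict())
print('Segments below threshold:', small_segments)
```

The appropriate threshold is a business decision; in a fraud or anomaly setting the small segments may be exactly the ones worth keeping separate.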

Let us see the profiles of the customers in different groups.

In [None]:
round(df.groupby('k_8_label').mean().T, 2)

In [None]:
#fig, ax = plt.subplots(figsize=(6, 4))
df.mean()

Let us look at some of the main features of the clusters. **Balances, Purchases, Cash advances, Credit Limit, Payments** are some of the most important features at play for credit card products. It is important, however, to keep in mind the percentage distribution of the clusters.

In [None]:
round(1e2 * df['k_8_label'].value_counts().sort_index()/len(df))

In [None]:
(df[['BALANCE', 'PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'PAYMENTS', 'k_8_label']]
 .groupby('k_8_label').mean().plot.bar(figsize=(15, 5)))
plt.title('Purchase Behavior of various segments')
plt.xlabel('SEGMENTS');

In [None]:
(df[['PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'k_8_label']]
 .groupby('k_8_label').mean().plot.bar(figsize=(15, 5)))
plt.title('Frequency behavior of various segments')
plt.xlabel('SEGMENTS');

# Observations: 

## Large segments:
* **Cluster 6**: This cluster shows low balances but average activity. It is an appropriate cluster for spend campaign targeting.
* **Cluster 0**: This cluster shows slightly higher balances and purchase activity, along with more one-off purchase behavior.
* **Cluster 4**: This cluster has the highest activity, balances, and purchases. Interestingly, these customers also have higher credit lines, suggesting that an increase in credit limit leads to an increase in purchase activity. (This hypothesis should be rigorously tested.)

## Small segments:
* **Cluster 2**: This group of customers is in dire need of a credit limit increase. They also have the highest activity among all the clusters.
* **Cluster 3**: This group of customers, on the other hand, is not fully utilizing the credit line assigned to them. Additional investigation is needed to understand why these consumers are not utilizing their lines, or whether their credit lines could in the future be assigned to a different set of consumers.