# Credit Card Dataset for Clustering
## Table of Contents
<ul>
<li><a href="#Dictionary">Data Dictionary</a></li>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#cluster">Clustering</a></li>
</ul> 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt  # for visualization 
%matplotlib inline
import seaborn as sns           # for visualization 

In [None]:
# read the data
df = pd.read_csv('/kaggle/input/ccdata/CC GENERAL.csv')
df.head()

<a id='Dictionary'></a>
## Data Dictionary

**The data dictionary for the credit card dataset is as follows:**

- **CUST_ID**: Identification of the credit card holder (categorical)
- **BALANCE**: Balance amount left in the account to make purchases
- **BALANCE_FREQUENCY**: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- **PURCHASES**: Amount of purchases made from the account
- **ONEOFF_PURCHASES**: Maximum purchase amount done in one go
- **INSTALLMENTS_PURCHASES**: Amount of purchases done in installments
- **CASH_ADVANCE**: Cash in advance given by the user
- **PURCHASES_FREQUENCY**: How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- **ONEOFF_PURCHASES_FREQUENCY**: How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- **PURCHASES_INSTALLMENTS_FREQUENCY**: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- **CASH_ADVANCE_FREQUENCY**: How frequently cash in advance is being paid
- **CASH_ADVANCE_TRX**: Number of transactions made with "Cash in Advance"
- **PURCHASES_TRX**: Number of purchase transactions made
- **CREDIT_LIMIT**: Credit card limit for the user
- **PAYMENTS**: Amount of payments made by the user
- **MINIMUM_PAYMENTS**: Minimum amount of payments made by the user
- **PRC_FULL_PAYMENT**: Percent of full payment paid by the user
- **TENURE**: Tenure of the credit card service for the user

<a id='intro'></a>
## Introduction



**This case requires developing a customer segmentation to define a marketing strategy. The sample dataset summarizes the usage behavior of about 9,000 active credit card holders during the last 6 months. The file is at the customer level, with 18 behavioral variables.**


<a id='wrangling'></a>
## Data wrangling

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df.nunique()

# Missing Values

In [None]:
def missing_percentage(df):
    """Return the count and percentage of missing values for columns that have any."""
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total != 0]
    percent = total / len(df) * 100
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])


missing_data = missing_percentage(df)

fig, ax = plt.subplots(figsize=(16, 6))

# Draw on the axes created above so the title is applied to the same plot
sns.barplot(x=missing_data.index,
            y='Percent',
            data=missing_data,
            ax=ax)

ax.set_title('Missing Values')
plt.show()

## Checking Variables Before Imputing

Before imputing, we check the distributions of the two variables that have missing values.

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
# sns.histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df.MINIMUM_PAYMENTS, kde=True, color='#fdc029')
plt.subplot(1,2,2)
sns.histplot(df.CREDIT_LIMIT, kde=True, color='#fdc029')
plt.show()

**Having located the missing values and inspected the distributions of the affected features, we fill them with the median, which is less sensitive to extreme values than the mean.**
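
To illustrate why the median is the safer fill value when a few large values are present, here is a minimal sketch on a small made-up series. The numbers and the name `payments` are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one very large value and two missing entries
payments = pd.Series([100.0, 120.0, 150.0, 180.0, 200.0, 5000.0, np.nan, np.nan])

mean_value = payments.mean()      # pulled upward by the single large value
median_value = payments.median()  # stays close to the bulk of the data

# Fill the gaps with the median, mirroring the approach used in this notebook
filled = payments.fillna(median_value)

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print("missing after fill:", filled.isna().sum())
```

Filling with the mean here would place the imputed rows far above almost every observed value, while the median keeps them among typical customers.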

## Filling missing values

In [None]:
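# Assign the result back instead of using inplace=True on a column, which may not update df in recent pandas versions
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(df['MINIMUM_PAYMENTS'].median())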

In [None]:
print('MINIMUM_PAYMENTS FEATURE HAS',df.MINIMUM_PAYMENTS.isna().sum(),'MISSING VALUE')

In [None]:
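# Assign the result back instead of using inplace=True on a column, which may not update df in recent pandas versions
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(df['CREDIT_LIMIT'].median())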

In [None]:
print('CREDIT_LIMIT FEATURE HAS',df.CREDIT_LIMIT.isna().sum(),'MISSING VALUE')

<a id='eda'></a>
## EDA

In [None]:
g = sns.PairGrid(df)
g.map(plt.scatter)
# Use a figure-level title; plt.title would only label the last subplot
g.figure.suptitle('Relations between features', y=1.02)
plt.show()

In [None]:
def scatter_purchases(x):
    sns.scatterplot(y='PURCHASES',x=x,data = df,color='#171820',alpha=0.7)

In [None]:
scatter_purchases('BALANCE')

In [None]:
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.lineplot(x='TENURE',y='PURCHASES',data=df)
plt.title('Purchases by tenure of credit card service')
plt.subplot(1,2,2)
scatter_purchases('TENURE')


**The plots suggest that the purchase amount increases with the tenure of the card, most noticeably at 12 months, where purchases are clearly higher than at shorter tenures.**
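
A grouped summary is one way to check this kind of reading numerically. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the real data, and reports the median next to the mean because a few very large purchases can inflate the mean.

```python
import pandas as pd

# Hypothetical customers: tenure in months and total purchases
toy = pd.DataFrame({
    "TENURE":    [6, 6, 10, 10, 12, 12, 12],
    "PURCHASES": [100.0, 300.0, 400.0, 600.0, 500.0, 900.0, 4000.0],
})

# Count, mean and median of purchases for each tenure value
by_tenure = toy.groupby("TENURE")["PURCHASES"].agg(["count", "mean", "median"])
print(by_tenure)
```

On the notebook's data, the same call would be applied to `df` instead of `toy`; the group sizes in the `count` column help judge how much weight each tenure value deserves.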

In [None]:
plt.hist(df.CREDIT_LIMIT)
plt.title('credit limit distribution')
plt.show()

In [None]:
col = list(df.drop('CUST_ID',axis=1).columns)

In [None]:
plt.figure(figsize=(30,30))
for idx,val in enumerate(col):
    plt.subplot(6,3,idx+1)
    sns.boxplot(x=val,data=df)

## The data contains many outliers that we have to deal with

**But what if we split the data into clusters? The outliers may then end up in a cluster of their own.**
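
To put a number on "many outliers", one option is the same 1.5 × IQR rule that box plots use for their whiskers. The helper `iqr_outlier_share` and the frame `toy` below are hypothetical names introduced for this sketch only, and the values are made up.

```python
import pandas as pd

def iqr_outlier_share(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Fraction of rows per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper bound
    outside = numeric.lt(q1 - k * iqr, axis=1) | numeric.gt(q3 + k * iqr, axis=1)
    return outside.mean().sort_values(ascending=False)

# Hypothetical data: one extreme purchase amount, no extreme tenure
toy = pd.DataFrame({
    "PURCHASES": [10, 12, 11, 13, 12, 11, 10, 500],
    "TENURE":    [6, 8, 10, 12, 12, 12, 12, 12],
})

shares = iqr_outlier_share(toy)
print(shares)
```

Applied to the notebook's `df`, the resulting series would rank the features by how heavily they are affected, which helps decide whether to leave the outliers to the clustering step.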

<a id='cluster'></a>

# Clustering

## K-means

In [None]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X = scale.fit_transform(df.drop('CUST_ID',axis=1))

In [None]:
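from sklearn.cluster import KMeans
n_clusters=30
cost=[]
# Fit K-means for k = 1 .. n_clusters-1 and record the inertia (within-cluster sum of squares)
for i in range(1,n_clusters):
    kmean= KMeans(n_clusters=i, n_init=10, random_state=42)  # fixed seed so reruns give the same curve
    kmean.fit(X)
    cost.append(kmean.inertia_)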

In [None]:
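# Plot against the actual number of clusters so the x-axis starts at k = 1, not 0
plt.plot(range(1, n_clusters), cost, 'bx-')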

In [None]:
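n_clusters=10
cost=[]
# Zoom in on k = 1 .. 9, where the elbow is expected
for i in range(1,n_clusters):
    kmean= KMeans(n_clusters=i, n_init=10, random_state=42)  # fixed seed so reruns give the same curve
    kmean.fit(X)
    cost.append(kmean.inertia_)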

In [None]:
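# Plot against the actual number of clusters so the x-axis starts at k = 1, not 0
plt.plot(range(1, n_clusters), cost, 'gx-')
plt.title('Elbow Criterion')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()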

## 6 clusters looks like a reasonable choice
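
Reading an elbow plot is a judgment call, so it can help to cross-check the choice with a second criterion. The sketch below shows the silhouette score on synthetic data generated with `make_blobs`. The data, the variable names and the range of `k` are assumptions made for illustration, and the result says nothing about the credit card data itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated groups at the corners of a square
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                       cluster_std=0.5,
                       random_state=0)

# The silhouette score needs at least two clusters, so start at k = 2
scores = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_demo = model.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_demo)

# Higher is better; values lie between -1 and 1
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On the notebook's scaled matrix `X`, the same loop could be run alongside the inertia curve. If both criteria point to a similar number of clusters, the choice is easier to defend.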

In [None]:
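# Fix the seed so the cluster labels are stable across reruns
kmean= KMeans(n_clusters=6, n_init=10, random_state=42)
kmean.fit(X)
labels=kmean.labels_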

In [None]:
clusters=pd.concat([df, pd.DataFrame({'cluster':labels})], axis=1)
clusters.head()

In [None]:
clusters.info()

In [None]:
for c in clusters.iloc[:,1:]:
    grid= sns.FacetGrid(clusters.iloc[:,1:], col='cluster')
    grid.map(plt.hist, c)

- **Cluster0**: People with an average to high credit limit who make all types of purchases
- **Cluster1**: This group has more people with due payments who take cash advances more often
- **Cluster2**: Low spenders with an average to high credit limit who purchase mostly in installments
- **Cluster3**: People with a high credit limit who take more cash in advance
- **Cluster4**: High spenders with a high credit limit who make expensive purchases
- **Cluster5**: People who don't spend much money and who have an average to high credit limit
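
Descriptions like these can be backed by a per-cluster summary table in addition to the histograms. The following is a minimal sketch on a made-up frame (`toy_clusters`, hypothetical values) that has a `cluster` column like the `clusters` frame built above.

```python
import pandas as pd

# Hypothetical customers that have already been assigned to three clusters
toy_clusters = pd.DataFrame({
    "PURCHASES":    [100.0, 300.0, 2000.0, 2400.0, 50.0, 150.0],
    "CASH_ADVANCE": [0.0, 200.0, 100.0, 300.0, 1500.0, 2500.0],
    "cluster":      [0, 0, 1, 1, 2, 2],
})

# Average value of every feature within each cluster
profile = toy_clusters.groupby("cluster").mean()

# Number of customers in each cluster
sizes = toy_clusters["cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

In this toy example, cluster 1 stands out on purchases and cluster 2 on cash advances. For the real data, the `CUST_ID` column would need to be dropped before taking the mean, since it is not numeric.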

# Visualization of Clusters

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
dist = 1 - cosine_similarity(X)

pca = PCA(2)
pca.fit(dist)
X_PCA = pca.transform(dist)
X_PCA.shape

In [None]:
x, y = X_PCA[:, 0], X_PCA[:, 1]

colors = {0: 'red',
          1: 'blue',
          2: 'green', 
          3: 'yellow', 
          4: 'orange',  
          5:'purple'}

names = {0: 'who make all types of purchases', 
         1: 'more people with due payments', 
         2: 'who purchase mostly in installments', 
         3: 'who take more cash in advance', 
         4: 'who make expensive purchases',
         5: "who don't spend much money"}
  
# Use a separate frame so the original df is not overwritten
plot_df = pd.DataFrame({'x': x, 'y':y, 'label':labels}) 
groups = plot_df.groupby('label')

fig, ax = plt.subplots(figsize=(20, 13)) 

for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=5,
            color=colors[name],label=names[name], mec='none')

# Hide ticks and tick labels; booleans are required here, the string 'off' does not hide them
ax.set_aspect('auto')
ax.tick_params(axis='x',which='both',bottom=False,top=False,labelbottom=False)
ax.tick_params(axis='y',which='both',left=False,right=False,labelleft=False)
    
ax.legend()
ax.set_title("Customer segmentation based on credit card usage behaviour")
plt.show()
