# 1. Introduction: Business Goal & Problem Definition

IF YOU LIKE IT OR IF IT HELPS YOU SOMEHOW, COULD YOU PLEASE UPVOTE? THANK YOU VERY MUCH!!!

The goal of this project is an exploratory data analysis (EDA) of credit card data, to help the company make proactive offers and services to customers, gain market share and minimize customer churn. The analysis summarizes the main characteristics of the dataset using several visual methods, primarily to see what the data can tell us. The dataset features available for analysis are:

* Customer
* Age
* City
* Product
* Limit
* Company
* Segment
* Spend Month
* Spend Type
* Spend Amount
* Payment Month
* Payment Amount

# 2. Importing Basic Libraries

In [None]:
!pip install openpyxl
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 3. Data Collection

In [None]:
customer_ds = pd.read_csv("../input/credit-card-exploratory-data-analysis/Customer Acqusition.csv", sep=",")
spend_ds = pd.read_csv("../input/credit-card-exploratory-data-analysis/spend.csv", sep=",")
repayment_ds = pd.read_csv("../input/credit-card-exploratory-data-analysis/Repayment.csv", sep=",")

customer_spend_ds = customer_ds.merge(spend_ds, on="Customer", how="left")
customer_payment_ds = customer_ds.merge(repayment_ds, on="Customer", how="left")

customer_spend_ds
customer_payment_ds
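
A left merge like the ones above keeps every customer and repeats the customer attributes once per matching spend or repayment row. A minimal sketch of how that behaviour could be checked, using made-up toy tables (the names `customers`, `spend` and all values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for the customer and spend tables (values are made up).
customers = pd.DataFrame({"Customer": ["A1", "A2", "A3"], "Limit": [100, 200, 300]})
spend = pd.DataFrame({"Customer": ["A1", "A1", "A2"], "Amount": [10.0, 20.0, 5.0]})

# validate="one_to_many" raises MergeError if "Customer" is duplicated on the left;
# indicator=True adds a "_merge" column saying where each row came from.
merged = customers.merge(
    spend, on="Customer", how="left", validate="one_to_many", indicator=True
)

# Customers without any spend rows survive the left join with NaN amounts.
unmatched = merged.loc[merged["_merge"] == "left_only", "Customer"].tolist()
print(len(merged), unmatched)
```

If `unmatched` is not empty on the real data, those customers would show up later as missing values in the spend columns.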

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format

print("Customer Spend:")
customer_spend_ds.sample(n=10, random_state=0)
print("Customer Payment:")
customer_payment_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

print("Customer Spend:")
customer_spend_ds.info(verbose=True, show_counts=True)
print("")
print("Customer Payment:")
customer_payment_ds.info(verbose=True, show_counts=True)

In [None]:
#Counting zero values per feature

print("Customer Spend:")
(customer_spend_ds==0).sum(axis=0).to_excel("customer_spend_ds_zeros_per_feature.xlsx")
(customer_spend_ds==0).sum(axis=0)
print("Customer Payment:")
(customer_payment_ds==0).sum(axis=0).to_excel("customer_payment_ds_zeros_per_feature.xlsx")
(customer_payment_ds==0).sum(axis=0)

In [None]:
#Checking the existence of duplicated rows

print("Customer Spend:")
customer_spend_ds.duplicated().sum()
print("Customer Payment:")
customer_payment_ds.duplicated().sum()

In [None]:
#Checking basic statistical data by feature

print("Customer Spend:")
customer_spend_ds.describe(include="all")
print("Customer Payment:")
customer_payment_ds.describe(include="all")

# 5. Data Cleaning

    We'll perform the following:


    1. Rename some columns for better interpretability:
        1.1 Month (spend data): Spend_Date
        1.2 Type: Spend_Type
        1.3 Amount (spend data): Spend_Amount
        1.4 Month (repayment data): Payment_Date
        1.5 Amount (repayment data): Payment_Amount


    2. Create calculated features that could bring relevant information to the analysis:
        2.1 Spend Month
        2.2 Spend Year
        2.3 Payment Month
        2.4 Payment Year
        2.5 Spend Amount to Limit Ratio


    3. Remove features that are irrelevant to the analysis:
        3.1 No
        3.2 Company
        3.3 Sl No:
        3.4 SL No:
        3.5 Unnamed: 4


    4. Encode categorical features as numbers so we can analyze their correlations in step 7:
        4.1 Product (as an ordinal Product_Level)
        4.2 Segment (as dummy variables)
        4.3 Spend Type (not converted in this exercise, to keep the analysis simple)


    * No duplications found
    * No missing, zero or invalid values to treat
    * No outliers found
    * The entire dataset will be used
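
The notes above do not say how outliers were checked. One common option is the 1.5 x IQR rule; the sketch below applies it to made-up numbers. The helper `iqr_outliers` and the sample values are hypothetical and only illustrate the idea, not the method used in this notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up amounts with one obviously extreme value.
amounts = pd.Series([10, 12, 11, 13, 12, 11, 100])
flagged = iqr_outliers(amounts)
print(flagged.tolist())
```

The same function could be called on columns such as `Spend_Amount` or `Limit`; an empty result would be consistent with the "no outliers" note under this particular rule.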

In [None]:
#1

customer_spend_ds.rename({"Month": "Spend_Date", "Type": "Spend_Type", "Amount": "Spend_Amount"}, axis=1, inplace=True)
customer_payment_ds.rename({"Month": "Payment_Date", "Amount": "Payment_Amount"}, axis=1, inplace=True)

#2

customer_spend_ds["Spend_Month"] = pd.DatetimeIndex(customer_spend_ds["Spend_Date"]).month
customer_spend_ds["Spend_Year"] = pd.DatetimeIndex(customer_spend_ds["Spend_Date"]).year
customer_payment_ds["Payment_Month"] = pd.DatetimeIndex(customer_payment_ds["Payment_Date"]).month
customer_payment_ds["Payment_Year"] = pd.DatetimeIndex(customer_payment_ds["Payment_Date"]).year
customer_spend_ds["Spend Amount to Limit Ratio"] = customer_spend_ds["Spend_Amount"] / customer_spend_ds["Limit"]

#3

customer_spend_ds.drop(["No", "Company", "Sl No:"], axis=1, inplace=True)
customer_payment_ds.drop(["SL No:", "Unnamed: 4"], axis=1, inplace=True)

#4

customer_spend_ds["Product_Level"] = customer_spend_ds["Product"].apply(lambda x: ["Silver", "Gold", "Platimum"].index(x))+1
customer_payment_ds["Product_Level"] = customer_payment_ds["Product"].apply(lambda x: ["Silver", "Gold", "Platimum"].index(x))+1

customer_spend_ds = pd.concat([customer_spend_ds, pd.get_dummies(customer_spend_ds["Segment"], prefix="Segment")], axis=1)
customer_payment_ds = pd.concat([customer_payment_ds, pd.get_dummies(customer_payment_ds["Segment"], prefix="Segment")], axis=1)

# customer_spend_ds = pd.concat([customer_spend_ds, pd.get_dummies(customer_spend_ds["Spend_Type"], prefix="Spend_Type")], axis=1)


customer_spend_ds.to_excel("customer_spend_ds_clean.xlsx")
customer_payment_ds.to_excel("customer_payment_ds_clean.xlsx")
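
The `Product_Level` lookup above uses `list.index`, which raises a `ValueError` as soon as a product name is not in the list. A hedged alternative sketch maps the names through a dictionary, so unexpected values become `NaN` and can be inspected. The name `PRODUCT_LEVELS` and the sample series are assumptions for illustration; the spelling "Platimum" follows the lookup list in the cell above.

```python
import pandas as pd

# Ordinal levels, using the same names as the lookup list in the cleaning cell.
PRODUCT_LEVELS = {"Silver": 1, "Gold": 2, "Platimum": 3}

# Made-up sample; "Bronze" stands in for an unexpected product name.
products = pd.Series(["Gold", "Silver", "Platimum", "Bronze"])

# Unknown names map to NaN instead of raising an exception.
levels = products.map(PRODUCT_LEVELS)

# List the product names that had no level assigned.
unknown = products[levels.isna()].unique().tolist()
print(levels.tolist(), unknown)
```

On data where every product is known, `unknown` is empty and `levels` matches the result of the original `apply` call.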

# 6. Data Exploration

# 6.1 Checking Top Customers by Spend Amount

In [None]:
#Checking Top Customers by Spend Amount

pivot = customer_spend_ds.pivot_table(index=["Customer", "Age", "City", "Product", "Limit", "Segment"], columns=["Spend_Year"], values=["Spend_Amount"], aggfunc=np.sum, margins=True).reset_index()
pivot.reindex(pivot["Spend_Amount"].sort_values(by="All", ascending=False).index)

# 6.2 Checking Top Customers by Payment Amount

In [None]:
#Checking Top Customers by Payment Amount

pivot = customer_payment_ds.pivot_table(index=["Customer", "Age", "City", "Product", "Limit", "Segment"], columns=["Payment_Year"], values=["Payment_Amount"], aggfunc=np.sum, margins=True).reset_index()
pivot.reindex(pivot["Payment_Amount"].sort_values(by="All", ascending=False).index)

# 6.3 Checking Customers by Product and Limit Using TreeMap

In [None]:
#Checking Customers by Product and Limit

import matplotlib
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 12}
matplotlib.rc('font', **font)

pivot = customer_spend_ds.pivot_table(index=["Customer", "Product"], columns=[], values=["Limit"], aggfunc=np.mean).reset_index()
fig = px.treemap(pivot, path=["Customer", "Product"], values="Limit", color="Product", title="Customers by Product and Limit")
fig.show()

# 6.4 Checking Customers by City and Spend Amount Using TreeMap

In [None]:
#Checking Customers by City and Spend Amount

pivot = customer_spend_ds.pivot_table(index=["Customer", "City"], columns=[], values=["Spend_Amount"], aggfunc=np.sum).reset_index()
fig = px.treemap(pivot, path=["Customer", "City"], values="Spend_Amount", color="City", title="Customers by City and Spend Amount")
fig.show()

# 6.5 Checking Customers by Segment and Spend Amount Using TreeMap

In [None]:
#Checking Customers by Segment and Spend Amount

pivot = customer_spend_ds.pivot_table(index=["Customer", "Segment"], columns=[], values=["Spend_Amount"], aggfunc=np.sum).reset_index()
fig = px.treemap(pivot, path=["Customer", "Segment"], values="Spend_Amount", color="Segment", title="Customers by Segment and Spend Amount")
fig.show()

# 6.6 Checking Customers by Spend Type and Amount Using TreeMap

In [None]:
#Checking Customers by Spend Type and Amount

pivot = customer_spend_ds.pivot_table(index=["Customer", "Spend_Type"], columns=[], values=["Spend_Amount"], aggfunc=np.sum).reset_index()
fig = px.treemap(pivot, path=["Customer", "Spend_Type"], values="Spend_Amount", color="Spend_Type", title="Customers by Spend Type and Amount")
fig.show()

# 6.7 Checking Customers by Age and Spend Amount Using Bubble Chart

In [None]:
#Checking Customers by Age and Spend Amount

# !pip install ipywidgets
# !jupyter nbextension enable --py widgetsnbextension --sys-prefix

%matplotlib inline
import plotly.offline as offline
import ipywidgets as widgets

def update_map(customer_spend_ds, year, month):
    print(f"Year range: {year}")
    print(f"Month range: {month}")
    global update_ds, pivot, fig
    update_ds = customer_spend_ds[(customer_spend_ds["Spend_Year"] >= year[0]) & (customer_spend_ds["Spend_Year"] <= year[-1]) &
    (customer_spend_ds["Spend_Month"] >= month[0]) & (customer_spend_ds["Spend_Month"] <= month[-1])][["Spend_Year", "Spend_Month", "Age", "Spend_Amount", "Spend_Type", "Customer"]]
    pivot = update_ds.pivot_table(index=["Age", "Customer"], columns=[], values=["Spend_Amount"], aggfunc=np.sum).reset_index()
    fig = px.scatter(pivot, x="Age", y="Spend_Amount", size="Spend_Amount", color="Customer", hover_name="Customer", size_max=30, title="Customers by Age and Spend Amount")
    fig.show()
    
year_select = widgets.SelectionRangeSlider(
              options=sorted(customer_spend_ds["Spend_Year"].unique().tolist()),
              index=(0,1),
              description="Select year:",
              disabled=False)

month_select = widgets.SelectionRangeSlider(
              options=sorted(customer_spend_ds["Spend_Month"].unique().tolist()),
              index=(0,1),
              description="Select month:",
              disabled=False)

widgets.interactive(update_map, customer_spend_ds=widgets.fixed(customer_spend_ds), year=year_select, month=month_select)

# 6.8 Checking Spend Amount by City Using Geographic Map

In [None]:
# Checking Spend Amount by City

coordinates_ds = pd.DataFrame({"City": ["COCHIN", "BANGALORE", "CHENNAI", "CALCUTTA", "BOMBAY", "PATNA", "TRIVANDRUM", "DELHI"], "Lat": [9.9312, 12.9716, 13.0827, 22.5726, 19.0760, 25.5941, 8.5241, 28.7041], "Lon": [76.2673, 77.5946, 80.2707, 88.3639, 72.8777, 85.1376, 76.9366, 77.1025]})
plotly_ds = customer_spend_ds.merge(coordinates_ds, on="City", how="left") 
pivot = plotly_ds.pivot_table(index=["Lat", "Lon", "City"], columns=[], values=["Spend_Amount"], aggfunc=np.sum).reset_index()

fig = px.scatter_mapbox(pivot,
                       lat="Lat",
                       lon="Lon",
                       color="City",
                       size="Spend_Amount",
                       hover_name="Spend_Amount",
                       color_continuous_scale=px.colors.cyclical.IceFire,
                       size_max=50,
                       zoom=3.5)

fig.update_layout(mapbox_style="open-street-map", height=600, margin={"r":0, "t":0, "l":0, "b":0})

# 6.9 Checking Dataset Behaviour Over Time Using Line Chart

In [None]:
#Checking Dataset Behaviour Over Time

sns.set(font_scale=1.2)

fig, axarr = plt.subplots(1, 1, figsize=(30, 10))
sns.lineplot(data=customer_spend_ds, x="Spend_Year", y="Spend_Amount", estimator="sum")
fig.suptitle("Spend Amount Behaviour Over Time", fontsize=25)

fig, axarr = plt.subplots(1, 1, figsize=(30, 10))
sns.lineplot(data=customer_payment_ds, x="Payment_Year", y="Payment_Amount", estimator="sum")
fig.suptitle("Payment Amount Behaviour Over Time", fontsize=25)

# 6.10 Checking Categorical Variables Using Bar and Pie Charts

In [None]:
#Plotting Categorical Variables

fig, ax = plt.subplots(1, 2, figsize=(15,5))
customer_spend_ds["City"].value_counts().plot.bar(color="purple", ax=ax[0])
customer_spend_ds["City"].value_counts().plot.pie(autopct='%1.1f%%', shadow=True,ax=ax[1])
fig.suptitle("City Frequency", fontsize=25)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 2, figsize=(15,5))
customer_spend_ds["Product"].value_counts().plot.bar(color="purple", ax=ax[0])
customer_spend_ds["Product"].value_counts().plot.pie(autopct='%1.1f%%', shadow=True,ax=ax[1])
fig.suptitle("Product Frequency", fontsize=25)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 2, figsize=(15,5))
customer_spend_ds["Segment"].value_counts().plot.bar(color="purple", ax=ax[0])
customer_spend_ds["Segment"].value_counts().plot.pie(autopct='%1.1f%%', shadow=True,ax=ax[1])
fig.suptitle("Segment Frequency", fontsize=25)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 2, figsize=(15,5))
customer_spend_ds["Spend_Type"].value_counts().plot.bar(color="purple", ax=ax[0])
customer_spend_ds["Spend_Type"].value_counts().plot.pie(autopct='%1.1f%%', shadow=True,ax=ax[1])
fig.suptitle("Spend Type Frequency", fontsize=25)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

# 6.11 Checking Numerical Variables Using Histogram, Boxplot and Violinplot

In [None]:
#Plotting Numerical Variables

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("Age Distribution", fontsize=25)
sns.histplot(customer_spend_ds["Age"], ax=ax[0])
sns.boxplot(x=customer_spend_ds["Age"], ax=ax[1])
sns.violinplot(x=customer_spend_ds["Age"], ax=ax[2])
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("Limit Distribution", fontsize=25)
sns.histplot(customer_spend_ds["Limit"], ax=ax[0])
sns.boxplot(x=customer_spend_ds["Limit"], ax=ax[1])
sns.violinplot(x=customer_spend_ds["Limit"], ax=ax[2])
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("Spend Amount Distribution", fontsize=25)
sns.histplot(customer_spend_ds["Spend_Amount"], ax=ax[0])
sns.boxplot(x=customer_spend_ds["Spend_Amount"], ax=ax[1])
sns.violinplot(x=customer_spend_ds["Spend_Amount"], ax=ax[2])
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("Payment Amount Distribution", fontsize=25)
sns.histplot(customer_payment_ds["Payment_Amount"], ax=ax[0])
sns.boxplot(x=customer_payment_ds["Payment_Amount"], ax=ax[1])
sns.violinplot(x=customer_payment_ds["Payment_Amount"], ax=ax[2])
plt.xticks(rotation=90)
plt.yticks(rotation=45)

In [None]:
#Alternatively using Profile Report to see variables statistics and correlations

# from pandas_profiling import ProfileReport
# profile = ProfileReport(customer_spend_ds, title="Credit Card Exploratory Data Analysis")
# profile.to_file(output_file="Credit Card Exploratory Data Analysis.html")

# 7. Correlations Analysis

In [None]:
#Deleting categorical columns

customer_spend_ds2 = customer_spend_ds.drop(["Customer", "City", "Product", "Segment", "Spend_Type", "Spend_Date", "Spend_Month", "Spend Amount to Limit Ratio"], axis=1)
customer_payment_ds2 = customer_payment_ds.drop(["No", "Customer", "City", "Product", "Company", "Segment", "Payment_Date", "Payment_Month"], axis=1)


#Plotting a Heatmap

sns.set(font_scale=1)

fig, ax = plt.subplots(1, figsize=(20,20))
sns.heatmap(customer_spend_ds2.corr(), annot=True, fmt=",.2f")
plt.title("Heatmap Correlation - Customer Spend", fontsize=20)
plt.tick_params(labelsize=12)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

fig, ax = plt.subplots(1, figsize=(20,20))
sns.heatmap(customer_payment_ds2.corr(), annot=True, fmt=",.2f")
plt.title("Heatmap Correlation - Customer Payment", fontsize=20)
plt.tick_params(labelsize=12)
plt.xticks(rotation=90)
plt.yticks(rotation=45)


#Plotting a Pairplot

sns.pairplot(customer_spend_ds2)
sns.pairplot(customer_payment_ds2)

# 8. Conclusions


The dataset contains information about 100 customers.


Silver clients: they represent 21% of our customers, are on average 49 years old, have an average Limit of 167k, and their Spend Amount grew 7% from 2004 to 2006, from 27.389M to 29.324M. The average spend per client in 2006 is 0.978M. The Payment Amount in 2006 is 27.923M, or 95% of the Spend Amount, which could be a yellow light in terms of default, so we should review their Limits. Their main jobs are in Government (26%) and in Multinational Corporations (24%), which suggests some job and salary stability. Their main spends are on Transportation Tickets (27%), Petro (14%) and Camera (13%), indicating this is probably a niche with high interest in leisure trips. The strategic advice for this niche is to keep it as a stable client base with potential to grow, offering them trip-related products.


Gold clients: they represent 41% of our customers, are on average 47 years old, have an average Limit of 500k, and their Spend Amount grew 33% from 2004 to 2006, from 39.424M to 52.287M. The average spend per client in 2006 is 1.376M. The Payment Amount in 2006 is 58.515M, or 112% of the Spend Amount, which probably means they're good payers, so we should consider increasing their Limits. Their main jobs are Self Employed (24%) and Normal Job (24%), making this the group that can oscillate most in terms of job and salary stability, which is also a point to consider when increasing their Limits. Their main spends are on Transportation Tickets (28%), Petro (14%) and Camera (12%). This is also probably a niche with high interest in leisure trips, so we could increase advertising related to it. This group will probably grow the most in spend in times of economic growth and shrink the most in times of economic depression, but it is currently growing fast, so the company should take the opportunity to attract this type of customer and gain market share.


Platinum clients: they represent 38% of our customers, are on average 43 years old, have an average Limit of 140k, and their Spend Amount grew 33% from 2004 to 2006, from 37.679M to 50.044M. The average spend per client in 2006 is 1.564M. The Payment Amount in 2006 is 54.410M, or 109% of the Spend Amount, which probably means they're good payers as well, so we should consider increasing their Limits, but not as much as for the Gold clients. Their main jobs are Normal Salary (41%), followed by Government and Salaried Pvt with 17% each, indicating a group that can oscillate a bit in terms of job stability and salary. Their main spends are on Transportation Tickets (21%), Petro (13%) and Food (12%). This niche has the highest spend per client, plus some salary stability and good payment behaviour, so these customers should be treated with special attention as premium clients. The company strategy here should be to retain them by investing in the relationship, offering products they want and maybe increasing their Limits.
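
The growth and payment percentages quoted above follow from simple ratios. As a small sketch, the figures from the three paragraphs (in millions) can be plugged into two helper functions; the names `growth_pct` and `coverage_pct` are chosen here for illustration only.

```python
# Figures in millions, copied from the conclusions above.
figures = {
    "Silver": {"spend_2004": 27.389, "spend_2006": 29.324, "payment_2006": 27.923},
    "Gold": {"spend_2004": 39.424, "spend_2006": 52.287, "payment_2006": 58.515},
    "Platinum": {"spend_2004": 37.679, "spend_2006": 50.044, "payment_2006": 54.410},
}

def growth_pct(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100

def coverage_pct(payment: float, spend: float) -> float:
    """Payment amount expressed as a percentage of spend amount."""
    return payment / spend * 100

# (spend growth 2004 to 2006, payment as % of spend in 2006), rounded to whole percent.
summary = {
    name: (
        round(growth_pct(f["spend_2004"], f["spend_2006"])),
        round(coverage_pct(f["payment_2006"], f["spend_2006"])),
    )
    for name, f in figures.items()
}
print(summary)
```

A coverage value below 100% means that payments in 2006 did not fully cover that year's spend, which is the "yellow light" mentioned for the Silver clients.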