## Exploratory Data Analysis – Credit Card Customer Churn

**Author:** Shahana  
**Duration:** 5 Days  
**Tools:** Python, pandas, numpy, matplotlib, seaborn  

## Overview

A bank is experiencing an increase in customers discontinuing their credit card services. Customer attrition results in revenue loss and higher customer acquisition costs.

The objective of this analysis is to understand customer behavior and identify the key factors associated with churn, so that the bank can engage at-risk customers early and take preventive action.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Description

In [3]:
# Load the BankChurners dataset (local path on the author's machine)
df = pd.read_csv(r"C:\Users\shaha\Downloads\BankChurners.csv\BankChurners.csv")
df

|  | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 0.000093 | 0.999910 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 0.000057 | 0.999940 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 0.000021 | 0.999980 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | ... | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 | 0.000134 | 0.999870 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 | 0.000022 | 0.999980 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 | 0.000191 | 0.999810 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | Unknown | Divorced | $40K - $60K | Blue | 25 | ... | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 | 0.995270 | 0.004729 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 | 0.997880 | 0.002118 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 36 | ... | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 | 0.996710 | 0.003294 |


In [8]:
# Report the dataset dimensions, then the column dtypes and non-null counts
print("Rows & Columns", df.shape)
df.info()

Rows & Columns (10127, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                                                                                                                              Non-Null Count  Dtype  
---  ------                                                                                                                              --------------  -----  
 0   CLIENTNUM                                                                                                                           10127 non-null  int64  
 1   Attrition_Flag                                                                                                                      10127 non-null  object 
 2   Customer_Age                                                                                                                        10127 non-null  int64  
 3   Gender                                                

In [4]:
df.describe(include="object")

|  | Attrition_Flag | Gender | Education_Level | Marital_Status | Income_Category | Card_Category |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 10127 | 10127 | 10127 | 10127 |
| unique | 2 | 2 | 7 | 4 | 6 | 4 |
| top | Existing Customer | F | Graduate | Married | Less than $40K | Blue |
| freq | 8500 | 5358 | 3128 | 4687 | 3561 | 9436 |
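
The `freq` row above already implies the overall churn rate: 8,500 of the 10,127 customers are labelled `Existing Customer`, so the remaining 1,627 (about 16%) are `Attrited Customer`. The sketch below works through that arithmetic. The helper `churn_rate` is a hypothetical name introduced here for illustration, and it assumes the label strings match those shown in the data preview.

```python
import pandas as pd

# Counts taken from the describe() output above.
total_customers = 10127
existing_customers = 8500

attrited_customers = total_customers - existing_customers
print(attrited_customers)                              # 1627
print(round(attrited_customers / total_customers, 4))  # 0.1607


def churn_rate(frame, target="Attrition_Flag", churn_label="Attrited Customer"):
    """Share of rows whose target equals the churn label (hypothetical helper)."""
    return float((frame[target] == churn_label).mean())
```

Calling `churn_rate(df)` on the loaded data should give the same ratio. A split of roughly 84% to 16% is worth keeping in mind when comparing groups later, because the attrited segment is much smaller.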


In [15]:
# Preview the last five rows (a notebook cell displays only its final expression)
df.tail()

|  | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 | 0.000191 | 0.99981 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | Unknown | Divorced | $40K - $60K | Blue | 25 | ... | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 | 0.99527 | 0.004729 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.0 | 0.99788 | 0.002118 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 36 | ... | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.0 | 0.99671 | 0.003294 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | ... | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 | 0.99662 | 0.003377 |


In [10]:
df.isnull().sum()

CLIENTNUM                                                                                                                             0
Attrition_Flag                                                                                                                        0
Customer_Age                                                                                                                          0
Gender                                                                                                                                0
Dependent_count                                                                                                                       0
Education_Level                                                                                                                       0
Marital_Status                                                                                                                        0
Income_Category                                 

In [12]:
df.duplicated().sum()

np.int64(0)

In [14]:
df.nunique()

CLIENTNUM                                                                                                                             10127
Attrition_Flag                                                                                                                            2
Customer_Age                                                                                                                             45
Gender                                                                                                                                    2
Dependent_count                                                                                                                           6
Education_Level                                                                                                                           7
Marital_Status                                                                                                                            4
Income_Category     

In [18]:
# Split the columns by dtype: numeric, categorical (object), and datetime
numerical_col = df.select_dtypes(include=['int64', 'float']).columns
categorical_col = df.select_dtypes(include=["object"]).columns
date_col = df.select_dtypes(include=['datetime']).columns
print("numerical_col", numerical_col)
print("categorical_col", categorical_col)
print("date_col", date_col)

numerical_col Index(['CLIENTNUM', 'Customer_Age', 'Dependent_count', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
      dtype='object')
categorical_col Index(['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category'],
      dtype='object')
date_col Index([], dtype='object')


## Data Quality Issue Log

| Column Name | Data Quality Issue | Description | Business / Analytical Impact | Recommended Action |
|------------|-------------------|-------------|------------------------------|--------------------|
| CLIENTNUM | Non-Analytical Identifier | Unique customer identifier with no predictive value | Adds no analytical insight and may introduce noise | Exclude from analysis and modeling |
| Attrition_Flag | Categorical Target Variable | Target variable stored as text labels | Cannot be directly used in modeling | Encode into binary values during preprocessing |
| Education_Level | Unknown Category | Contains "Unknown" values | Reduces clarity of customer segmentation | Retain as separate category or group under "Other" |
| Marital_Status | Unknown Category | Presence of "Unknown" marital status | Impacts demographic analysis | Treat as distinct category |
| Income_Category | Unknown / Missing Information | Income category includes "Unknown" values | Limits income-based behavioral insights | Keep as category; avoid imputation assumptions |
| Credit_Limit | High Variability | Wide range of values across customers | Can skew summary statistics and models | Apply scaling or outlier treatment if required |
| Total_Revolving_Bal | Potential Outliers | Some customers exhibit unusually high balances | May distort mean-based analysis | Inspect distribution; cap if necessary |
| Avg_Utilization_Ratio | Skewed Distribution | Values concentrated near extremes (0 and 1) | Affects model stability | Consider transformation during modeling |
| Naive_Bayes_Classifier_* | Data Leakage Risk | Model-generated probability features present | Can lead to biased and unrealistic results | Remove from analysis before modeling |
| Date/Time Columns | Missing Feature | Dataset lacks time-based variables | Prevents time-series or trend analysis | Proceed with cross-sectional EDA |
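
Three of the recommended actions in the log are mechanical: excluding `CLIENTNUM`, removing the `Naive_Bayes_Classifier_*` columns, and encoding `Attrition_Flag`. A minimal sketch of how they might be applied is shown below. The function name `apply_quality_actions`, the new column name `Churn`, and the 1/0 encoding are assumptions made for illustration, not steps taken in the notebook itself.

```python
import pandas as pd

# Column-name prefix of the pre-built model outputs flagged in the log above.
LEAKAGE_PREFIX = "Naive_Bayes_Classifier_"


def apply_quality_actions(frame):
    """Return a cleaned copy: drop identifier and leakage columns, encode the target."""
    leakage_cols = [c for c in frame.columns if c.startswith(LEAKAGE_PREFIX)]
    # errors="ignore" keeps the call safe if a column has already been removed.
    cleaned = frame.drop(columns=["CLIENTNUM", *leakage_cols], errors="ignore")
    # Assumed encoding: 1 = churned ("Attrited Customer"), 0 = retained.
    cleaned["Churn"] = (cleaned["Attrition_Flag"] == "Attrited Customer").astype(int)
    return cleaned
```

The function returns a new frame and leaves the input untouched, so `df_clean = apply_quality_actions(df)` can be run repeatedly without altering the raw data.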


### Data Quality Summary
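- The dataset has no null values and no duplicate rows. Missing information is instead encoded as an "Unknown" category, so the zero null count does not by itself mean every field is populated.
- `Education_Level`, `Marital_Status`, and `Income_Category` contain "Unknown" categories that require careful business interpretation.
- `CLIENTNUM` is an identifier and the two `Naive_Bayes_Classifier_*` columns are pre-built model outputs; they should be excluded to avoid noise and data leakage.
- Numerical features such as `Credit_Limit` vary widely in range and may need scaling or transformation at later stages.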
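
To judge how much the "Unknown" categories matter, it helps to measure their share in each affected column. The sketch below is one way to do that. `unknown_share` is a hypothetical helper, and it assumes the placeholder is spelled exactly `Unknown`, as in the data preview.

```python
import pandas as pd


def unknown_share(frame, columns, label="Unknown"):
    """Fraction of rows equal to `label` in each listed column."""
    return frame[columns].eq(label).mean()
```

For example, `unknown_share(df, ["Education_Level", "Marital_Status", "Income_Category"])` returns one fraction per column, which indicates whether treating "Unknown" as its own category is a minor or a major modeling decision.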
