# Credit Card Customers Dataset

## Introduction

The dataset in this project describes the credit card customers of a bank. The bank's management wants to understand the behavior and demographics of its customers and to gauge how trustworthy they are, using the average utilization ratio as an indicator, in order to direct loan advertising and marketing toward the right customers.




## Utilization Ratio
Credit scoring models often consider your credit utilization rate when calculating your credit score. Depending on the scoring model being used, it can account for up to 30% of the score, which makes it one of the more influential factors.

A low credit utilization rate shows you're using less of your available credit. Credit scoring models generally interpret this as an indication that you're doing a good job managing credit by not overspending, and keeping your spending in check can help you reach higher credit scores. Having higher credit scores can make it easier to secure additional credit, such as an auto loan or a mortgage.
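As a rough illustration of the ratio described above, the sketch below divides a revolving balance by a credit limit. The function name `utilization_ratio` and the sample figures are hypothetical and are not taken from the dataset.

```python
def utilization_ratio(revolving_balance, credit_limit):
    """Return the share of the available credit that is currently in use."""
    if credit_limit <= 0:
        raise ValueError("credit_limit must be positive")
    return revolving_balance / credit_limit

# Hypothetical customer: 1,500 owed on a card with a 5,000 limit
ratio = utilization_ratio(1500, 5000)
print(f"Utilization ratio: {ratio:.0%}")  # prints "Utilization ratio: 30%"
```

A ratio close to 0 means most of the credit line is unused, while a ratio close to 1 means the card is nearly maxed out.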

## Questions

1- What are the segments in the data, and what are the customers' demographics?

    a- Which card categories have the most customers?

    b- How are the age, gender and income categories distributed?

    c- What is the relation between income and education level?

2- Which income category has the highest utilization ratio?

3- How does family size (dependents) affect the utilization ratio?

### Data Wrangling and Exploration

In [None]:
#Load necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
#Read The Data
pd.set_option('display.max_columns', None)
df = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')
df.head()

## Data Wrangling 

In [None]:
#Look at the data dimensions

print('Number of rows in the data = {}'.format(df.shape[0]))
print('Number of Columns in the data = {}'.format(df.shape[1]))




In [None]:
# Value Types
df.info()

In [None]:
#check for null values
df.isna().sum()
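
A null check like the one above can come back clean even when values are effectively missing, because this dataset uses the string 'Unknown' as a placeholder (it is filtered out of `Income_Category` later on). The sketch below uses a small hypothetical frame, not the real file, to show how such placeholders could be counted; that `Education_Level` also contains 'Unknown' is an assumption here.

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the 'Unknown' placeholder mirrors the real data
sample = pd.DataFrame({
    'Income_Category': ['Less than $40K', 'Unknown', '$60K - $80K', 'Unknown'],
    'Education_Level': ['Graduate', 'Unknown', 'High School', 'Graduate'],
    'Customer_Age': [45, 51, 38, 60],
})

# isna() finds nothing because 'Unknown' is an ordinary string, not NaN
nan_counts = sample.isna().sum()

# Count the placeholder explicitly, column by column
unknown_counts = (sample == 'Unknown').sum()
print(unknown_counts)
```

On the real data, the same comparison could be run on `df` to see which categorical columns carry the placeholder.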

In [None]:
#Clone the data to work on it
df_clean = df.copy()

In [None]:
#Drop unnecessary columns
df_clean.drop(columns=['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
                      , 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1'], inplace = True)


In [None]:
df_clean.columns

In [None]:
#Divide columns into Categorical and Numerical
#(np.object was removed from recent NumPy releases, so select by dtype name instead)
df_categorical = df_clean.select_dtypes(include='object')
print('Categorical Columns are : {}'.format(df_categorical.columns))

df_numerical = df_clean.select_dtypes(exclude='object')
print('Numerical Columns are : {}'.format(df_numerical.columns))

In [None]:
#Make the client number the index to avoid correlating with it
df_clean.set_index('CLIENTNUM', inplace=True)
df_clean.head(2)

## Univariate Exploration

### Demographics of Bank Customers

#### The Ratio Between Churning and Non-Churning Customers

In [None]:
plt.figure(figsize=[10,10])
sorted_counts= df_clean.Attrition_Flag.value_counts()
plt.pie(sorted_counts, explode=(0.1,0),labels=['Non-Churners', 'Churners'],  
       colors=['#009ACD', '#ADD8E6'], autopct='%1.0f%%', 
       shadow=False, startangle=0,   
       pctdistance=1.2,labeldistance=1.4)
plt.axis('equal')
plt.title("Number of Credit Card Churners vs Non-Churners");


Comment:
About 16% of the credit card holders have churned.
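
The percentage shown on the pie chart can also be read off directly with `value_counts(normalize=True)`. The sketch below uses a tiny hypothetical series rather than the real column; only the two labels, 'Existing Customer' and 'Attrited Customer', come from the dataset.

```python
import pandas as pd

# Hypothetical sample: five existing customers and one attrited customer
flags = pd.Series(['Existing Customer'] * 5 + ['Attrited Customer'], name='Attrition_Flag')

# normalize=True returns shares instead of raw counts
shares = flags.value_counts(normalize=True)
churn_rate = shares['Attrited Customer']
print(f"Churn rate: {churn_rate:.1%}")
```

Applied to the notebook's data, the same call would be `df_clean['Attrition_Flag'].value_counts(normalize=True)`.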

#### Demographic: Age Distribution

In [None]:
#Now let's look at the demographics of our dataset
#Age Distribution
plt.rcParams['font.size']=14
plt.figure(figsize=[10,6])
max_age, min_age = df_clean.Customer_Age.max(), df_clean.Customer_Age.min()
#histplot replaces the deprecated distplot; bin edges run to max_age + 1 so every age gets its own bin
sb.histplot(df_clean['Customer_Age'], bins=np.arange(min_age, max_age + 2), color='#009ACD')
plt.xticks(ticks=np.arange(20,80,10))
plt.title('Age Distribution Among Bank Customers')
plt.ylabel('Number Of Customers')
plt.xlabel('Age in Years');





The age distribution among bank customers is approximately normal, with a mean age between 40 and 50 years.

#### Demographic: Gender Distribution Among Bank Customers

In [None]:
#Males vs Females
#Pass the column as x and fix the bar order so the tick labels below always match the bars
sb.countplot(x=df_clean['Gender'], order=['M', 'F'], palette=['#009ACD', '#f21ff2']);
sb.despine()
plt.title('Number of Males vs Females')
plt.xticks(ticks=[0,1], labels= ['Males', 'Females'])
plt.xlabel('Gender')
plt.ylabel('Number of Customers');

Comment:

The bank has more female than male credit card customers.

In [None]:
#Let's take a look at the education level and income
#Copy the filtered frame so that adding a column does not trigger SettingWithCopyWarning
df_income = df_clean[df_clean.Income_Category != 'Unknown'].copy()
order_dict={'Less than $40K':1, '$40K - $60K':2, '$60K - $80K':3, '$80K - $120K':4,
       '$120K +':5}

#Map each income category to its rank
df_income['Income_Level'] = df_income['Income_Category'].map(order_dict).astype('float64')

In [None]:
#Plot
sb.catplot(data=df_income, y='Education_Level', x='Income_Level', kind='boxen');
sb.despine()
plt.title('Income Level by Education Level')
plt.ylabel('Education Level')
plt.xlabel('Income Level')
#Set tick positions, labels and rotation in one call so the rotation is not lost
plt.xticks(ticks=[1,2,3,4,5], labels=['Less than $40K', '$40K - $60K', '$60K - $80K',  '$80K - $120K', 
       '$120K +'], rotation=90);

Comment: In our sample we can't distinguish a relationship between a customer's education and their earnings, so when we advertise we can bypass this segmentation.
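
One way to back up a visual impression like this is a row-normalized cross-tabulation, which shows the share of each income category within every education level. The sketch below uses a small hypothetical frame; the column names follow the dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names and category labels follow the dataset
sample = pd.DataFrame({
    'Education_Level': ['Graduate', 'Graduate', 'High School',
                        'High School', 'Doctorate', 'Doctorate'],
    'Income_Category': ['Less than $40K', '$80K - $120K', 'Less than $40K',
                        '$40K - $60K', 'Less than $40K', '$120K +'],
})

# normalize='index' turns each row into shares that sum to 1
income_share = pd.crosstab(sample['Education_Level'],
                           sample['Income_Category'],
                           normalize='index')
print(income_share.round(2))
```

If the rows of the real table look alike across education levels, that would support the comment that education does not separate customers by income.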

In [None]:
#Now for the final demographic we look at the income level
plt.rcParams['font.size'] = 16
plt.figure(figsize=[10,10])
sorted_counts = df_clean['Income_Category'].value_counts(sort=True)
#Take the labels from the counts themselves: value_counts sorts by frequency,
#so a hand-written list in income order would mislabel the slices
plt.pie(sorted_counts, colors= sb.color_palette('muted'), labels=sorted_counts.index,
       autopct='%1.0f%%' , shadow=False, startangle=0,   
       pctdistance=0.5,labeldistance=1.2);
sb.despine()
plt.title('Income Category of Customers');

The most represented income category among the bank's customers is those earning less than USD 40K a year.

### Bivariate Exploration

In this section we look at how pairs of variables relate to each other.


In [None]:
#First let's take a look at how our numerical variables relate to each other
#and see which of them has the highest correlation with Attrition
#df_numerical['Attrition'] = pd.Series(np.searchsorted(['Existing Customer', 'Attrited Customer'], df_clean.Attrition_Flag.values), df_clean.index)
#df_numerical.set_index('CLIENTNUM', inplace=True)

#Use df_clean here: CLIENTNUM is its index, so it stays out of the correlation matrix
corr = df_clean.select_dtypes(include='number').corr()
plt.figure(figsize=[15,15])
plt.rcParams['font.size']=10
sb.heatmap(corr, cmap='YlGnBu', annot= True, linewidths=0.5 );



Comment: There is a strong relationship between the average utilization ratio, the credit limit and the average open-to-buy amount.
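
Part of this relationship is likely structural. The sketch below assumes that open-to-buy is the credit limit minus the revolving balance, and that utilization is the revolving balance divided by the credit limit; both definitions are assumptions about how the columns were built, and the numbers are hypothetical. `Credit_Limit` and `Avg_Open_To_Buy` are assumed column names.

```python
import pandas as pd

# Hypothetical customers; the column names are assumed to match the dataset
sample = pd.DataFrame({
    'Credit_Limit': [1000, 2000, 5000, 10000, 20000],
    'Total_Revolving_Bal': [500, 800, 1000, 1500, 1000],
})

# Assumed definitions of the two derived columns
sample['Avg_Open_To_Buy'] = sample['Credit_Limit'] - sample['Total_Revolving_Bal']
sample['Avg_Utilization_Ratio'] = sample['Total_Revolving_Bal'] / sample['Credit_Limit']

corr_sample = sample.corr()
print(corr_sample.round(2))
```

Under these assumptions a high correlation between the credit limit and open-to-buy is expected by construction, so it says little about customer behavior on its own.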

#### Now we dig deeper into the relationship between utilization, attrition and other factors

In [None]:
#Some of the numerical columns are integers and most of those are not financial variables,
#so we separate out the float columns and add the remaining financial ones
df_num_float = df_clean.select_dtypes(include=['float64']).copy() #float variables
df_num_float['a_flag'] = df_clean['Attrition_Flag']
df_num_float['revolving_balance'] = df_clean['Total_Revolving_Bal']
df_num_float['total_trans_amt'] = df_clean['Total_Trans_Amt']
df_num_float['total_trans_count'] = df_clean['Total_Trans_Ct']
sb.pairplot(df_num_float, hue='a_flag', palette='magma');


Comments:

There is a strong relationship between the credit limit and the average open-to-buy amount.

Existing customers have higher values of Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1, which suggests these variables are associated with attrition.

Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1 are somewhat related.

The bigger the credit limit, the lower the average utilization ratio.

In [None]:
#Let's investigate the relationships between demographics and financial factors
#Let's compare women and men in terms of utilization
order= ['Less than $40K', '$40K - $60K', '$60K - $80K',  '$80K - $120K', 
       '$120K +', 'Unknown']
plt.rcParams['font.size']= 16
#catplot creates its own figure; fix the gender order so the tick labels match the violins
plot = sb.catplot(data=df_clean, x='Gender', y='Avg_Utilization_Ratio', col='Income_Category', kind='violin', col_wrap=3,
           col_order= order, order=['M', 'F'])
plot.set_ylabels('Average Utilization Ratio')
#Label every facet through the FacetGrid instead of only the last axes
plot.set_xticklabels(['Males', 'Females']);





Comments:

Females and males have nearly the same utilization ratio distributions, except in the income category below 40K, where females have a much higher utilization ratio.

People with higher incomes have lower utilization ratios.

Another interesting finding is that there are no female clients with an income above 60K.
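
The second question, which income category has the highest utilization ratio, can also be answered numerically with a grouped summary. The sketch below runs on a small hypothetical frame; the column names follow the dataset, while the values are invented.

```python
import pandas as pd

income_order = ['Less than $40K', '$40K - $60K', '$60K - $80K',
                '$80K - $120K', '$120K +']

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    'Income_Category': ['Less than $40K', 'Less than $40K', '$40K - $60K',
                        '$60K - $80K', '$80K - $120K', '$120K +'],
    'Avg_Utilization_Ratio': [0.60, 0.40, 0.30, 0.20, 0.15, 0.10],
})

# An ordered categorical keeps the summary in income order rather than alphabetical order
sample['Income_Category'] = pd.Categorical(sample['Income_Category'],
                                           categories=income_order, ordered=True)

summary = (sample.groupby('Income_Category', observed=True)['Avg_Utilization_Ratio']
                 .agg(['mean', 'median', 'count']))
top_category = summary['mean'].idxmax()
print(summary)
print('Highest mean utilization:', top_category)
```

Reporting the count next to the mean helps show whether a category's average rests on enough customers to be meaningful.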

In [None]:
#Check Card Categories
df_clean.Card_Category.value_counts()

In [None]:
#Now we look to see the relation between card type and utilization 

sb.catplot(data=df_clean, x='Card_Category', y='Avg_Utilization_Ratio' , kind='point')
plt.xlabel('Card Categories')
plt.ylabel('Utilization Ratio')
plt.title('Utilization Ratios for Different Card Categories');




As expected, the highest utilization ratio is found in the lowest card category.


In [None]:
#Dependents Count vs Avg Utilization Ratio
sb.relplot(data=df_clean, x='Dependent_count', y='Avg_Utilization_Ratio', kind='line' )
plt.xlabel('Dependent Count')
plt.ylabel('Utilization Ratio')
plt.title('Relation Between the Number of Dependents and Utilization Ratio');

The plot suggests that customers with more dependents tend to use their credit cards more wisely.
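
A line plot of this kind draws the mean per dependent count with a confidence band, so it is worth checking how large the differences between groups are and how many customers sit behind each point. The sketch below is one way to do that on a hypothetical frame; the column names follow the dataset and the values are invented.

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    'Dependent_count': [0, 0, 1, 1, 2, 2, 3, 3],
    'Avg_Utilization_Ratio': [0.30, 0.20, 0.25, 0.35, 0.10, 0.40, 0.28, 0.22],
})

# Mean, spread and group size per number of dependents
by_dependents = (sample.groupby('Dependent_count')['Avg_Utilization_Ratio']
                       .agg(['mean', 'std', 'count']))

# Gap between the highest and lowest group mean
spread = by_dependents['mean'].max() - by_dependents['mean'].min()
print(by_dependents)
print(f"Largest gap between group means: {spread:.2f}")
```

If the gap between group means is small relative to the within-group standard deviation, the trend in the plot should be read with caution.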

## Conclusion:
The bank has more female than male credit card customers.

The Blue card category has the most customers; however, customers in the higher card categories make better candidates for loans.

Education level can't be used as an indicator of income level or utilization, so it can be skipped as a targeting factor.

On average, the bank's female customers have a much lower income level than its male customers.

Finally, if we want to market a new set of loans using the utilization ratio as our indicator of trustworthy customers, we should target high earners of all ages and genders.

The bank should offer more incentives to attract high-earning women.



In [None]:
!jupyter nbconvert "Visualization Project.ipynb" --to slides --template output-toggle.tpl --post serve