**Task**: An in-depth exploratory data analysis (EDA) and visualization of the differences between churning and non-churning customers, covering all features other than income.   

*Churn rate* refers to the rate at which customers stop doing business with an entity (Investopedia.com).  

**Findings:**   

**Churned:**   

Attrited customers account for 1627 of the 10127 entries, or about **16%** of the client database.   
Most entries in the *churned* dataset fall under the 'Less than $40K' income category.   
Most of these clients identify as **female, about 47 years old, with a graduate degree, married, with 3 dependents, and with a Blue-Level card**.   
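
The churned share can be checked directly from the two counts quoted above. A minimal sketch (the variable names are illustrative and do not appear elsewhere in this notebook):

```python
# Counts quoted in the findings above
n_total = 10127
n_churned = 1627
n_active = n_total - n_churned

churn_share = n_churned / n_total
print(f"active: {n_active}, churned share: {churn_share:.1%}")
```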

The **most influential factor** behind churned clients' decision to end their relationship with this business is **Total Transaction Amount**, followed by **revolving balance** and **credit limit**.  
**Active clients have higher credit limits, make larger transactions, and carry higher revolving balances than churned clients**.   
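
One way to rank numeric features by how strongly they separate the two groups is a standardized mean difference (Cohen's d). The sketch below is an assumption about how such a ranking could be produced, not necessarily the method behind the findings above. It runs on a small made-up frame, and `standardized_diff` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def standardized_diff(df, group_col, a, b):
    """Cohen's d per numeric column: (mean_a - mean_b) / pooled std."""
    num = df.select_dtypes("number")
    ga = num[df[group_col] == a]
    gb = num[df[group_col] == b]
    na, nb = len(ga), len(gb)
    pooled = np.sqrt(((na - 1) * ga.var() + (nb - 1) * gb.var()) / (na + nb - 2))
    d = (ga.mean() - gb.mean()) / pooled
    # Largest absolute effect first
    return d.reindex(d.abs().sort_values(ascending=False).index)

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 4 + ["Attrited Customer"] * 4,
    "Total_Trans_Amt": [4500, 4800, 5100, 4700, 2900, 3100, 3300, 3000],
    "Credit_Limit": [9000, 7000, 12000, 5000, 8000, 6500, 11000, 4500],
})
ranking = standardized_diff(toy, "Attrition_Flag",
                            "Existing Customer", "Attrited Customer")
print(ranking)
```

A positive value means the first group (here, active clients) has the higher mean. On the real data the same helper could be called on `df1` after dropping `CLIENTNUM`.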

**Active:**  

These clients represent 8500 of the 10127 entries, or about 84% of the dataset.   

Like the other group, customers here are mostly female, about 46 years old, with graduate degrees, married, with an income less than $40K, and who have Blue-level cards.
However, they have one fewer dependent (2) compared to churned clients (3).   
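
The "typical client" descriptions above read like the most frequent value of each categorical column. A minimal sketch of that lookup on made-up rows (the toy values are illustrative only, and whether the profile was derived this way is an assumption):

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Existing Customer", "Existing Customer",
                       "Attrited Customer", "Attrited Customer", "Attrited Customer"],
    "Gender": ["F", "F", "M", "F", "M", "F"],
    "Dependent_count": [2, 2, 3, 3, 3, 1],
})

# Most frequent value of every column within each group
profile = toy.groupby("Attrition_Flag").agg(lambda s: s.mode().iloc[0])
print(profile)
```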


In [None]:
# Using a Python 3 environment with analytics libraries installed,
# as defined by the kaggle/python Docker image

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df1= pd.read_csv("../input/credit-card-customers/BankChurners.csv")

# drop the last 2 columns (Naive Bayes classifier outputs shipped with the CSV, not customer features)
df1=df1.iloc[:,:-2]
df1

In [None]:
#are there NaNs?
df1.isnull().values.any()

In [None]:
#description of all columns
pd.set_option('display.max_columns', 21)
df1.describe(include='all')

In [None]:
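df2=df1.drop(['CLIENTNUM', 'Income_Category'], axis=1)
# correlate numeric columns only; recent pandas versions raise on text columns such as Gender
corrALL=df2.select_dtypes('number').corr()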

In [None]:
#graph correlation
sns.heatmap(corrALL, cmap="cubehelix", annot=True)

In [None]:
sns.lmplot(x='Customer_Age', y='Months_on_book', hue= 'Attrition_Flag', data=df2, logistic=False, palette="Set2",  markers=["o", "x"])

In [None]:
sns.lmplot(x='Total_Trans_Ct', y='Total_Trans_Amt', hue= 'Attrition_Flag', data=df2, logistic=False,  markers=["o", "x"])

In [None]:
sns.lmplot(x='Total_Relationship_Count', y='Total_Trans_Amt', hue= 'Attrition_Flag', data=df2, palette="Set1", logistic=False,  markers=["o", "x"])

In [None]:
sns.lmplot(x='Credit_Limit', y='Avg_Utilization_Ratio', hue= 'Attrition_Flag', palette="Set3", data=df2, logistic=False,  markers=["o", "x"])

In [None]:
sns.lmplot(x='Total_Revolving_Bal', y='Avg_Utilization_Ratio', hue= 'Attrition_Flag', data=df2, logistic=False,  markers=["o", "x"])

In [None]:
#splitting the dataframe into churned and active customers
active=df1[(df1.Attrition_Flag == 'Existing Customer')]
churned=df1[(df1.Attrition_Flag == 'Attrited Customer')]           

## Attrited Customers 'Churned'

In [None]:
#dataframe of churned clients
churned.head()

In [None]:
#drop columns: CLIENTNUM and Attrition_Flag
churned=churned.drop(['CLIENTNUM', 'Attrition_Flag'], axis=1)
#view
churned['Income_Category'].value_counts()

In [None]:
#drop the Income column
churned2=churned.drop(['Income_Category'], axis=1)
#reset indexes
churned2=churned2.reset_index(drop=True)
churned2.describe(include='all')

In [None]:
#possible correlations among numeric columns (text columns are excluded)
corr1=churned2.select_dtypes('number').corr()
#graph correlation
sns.heatmap(corr1, cmap="magma", annot=True)

In [None]:
# plot the customer rows (churned2), not the correlation matrix (corr1)
sns.lmplot(x='Total_Trans_Ct', y='Total_Trans_Amt', data=churned2, logistic=False, markers=["x"])

In [None]:
sns.lmplot(x='Credit_Limit', y='Avg_Open_To_Buy', data=churned2, logistic=False,  markers=["o"])

In [None]:
sns.lmplot(x='Avg_Utilization_Ratio', y='Avg_Open_To_Buy', data=churned2, logistic=False,  markers=["+"])

In [None]:
# Density plot and histogram of customer age
# (histplot replaces the deprecated distplot; stat='density' keeps the same scale)
sns.histplot(data=churned2, x='Customer_Age', kde=True, stat='density',
             bins=int(180/5), color='darkred',
             edgecolor='black',
             line_kws={'linewidth': 4})

In [None]:
# Density plot and histogram of credit limit
sns.histplot(data=churned2, x='Credit_Limit', kde=True, stat='density',
             bins=int(180/5), color='darkorange',
             edgecolor='black',
             line_kws={'linewidth': 3})

In [None]:
colors='gray','darkgreen', 'purple', 'sienna', 'mediumblue', 'goldenrod', 'mediumorchid'
churned2.Education_Level.value_counts().plot(kind='barh',title='Education level',figsize=(6,5),edgecolor=(0,0,0), color=colors)

In [None]:
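#Gender representation
gender_counts=churned2.Gender.value_counts()
# derive labels from the counts so each label always matches its slice
mylabels=gender_counts.index.map({'F': 'Female', 'M': 'Male'})
mycolors='maroon', 'dodgerblue'
plt.pie(gender_counts,
        labels=mylabels,autopct='%1.1f%%',
       colors=mycolors)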

In [None]:
#Number of dependents per client
sns.countplot(x='Dependent_count', data=churned2, palette = "Set2", edgecolor=(0,0,0))
plt.xticks(rotation=45)

In [None]:
#civil status 
status_counts=churned2.Marital_Status.value_counts()
mycolors='gray', 'pink', 'darkcyan', 'goldenrod'
# labels come from the counts' own index so they cannot drift out of order
plt.pie(status_counts,
        labels=status_counts.index,autopct='%1.1f%%',
       colors=mycolors)

In [None]:
cdetails=churned2[['Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1']]

In [None]:
# one row per selected column, so no empty panels are reserved
cdetails.plot(kind='density', subplots=True, layout=(len(cdetails.columns),1), sharex=False, figsize=(10,8))
plt.show()

In [None]:
sns.regplot(x='Credit_Limit', y='Total_Revolving_Bal', data=churned2, logistic=False, color='rosybrown')

In [None]:
sns.countplot(x='Card_Category', data=churned2)

In [None]:
status_count = churned2['Marital_Status'].value_counts()
sns.set(style="darkgrid")
sns.barplot(x=status_count.index, y=status_count.values, alpha=0.9)
plt.title('Frequency of civil status')
plt.ylabel('Occurrence', fontsize=12)
plt.xlabel('status', fontsize=12)
plt.show()

## Active Customers
The majority of clients in the dataset are active clients.   
Like the other group, customers here are mostly female, about 46 years old, with graduate degrees, married, with an income less than $40K, and who have Blue-level cards.   
However, they have **one fewer dependent** (2) compared to churned clients (3).   

The average active client has 36 months on book and a total relationship count of 3 to 4, spends about 2 months inactive and has 2 contacts over a 12-month period, has a **credit limit averaging 8727**, and carries **1256 in Total_Revolving_Bal**. Their Avg_Open_To_Buy is about 7470, with a Total_Amt_Chng_Q4_Q1 of 0.7, a Total_Trans_Amt averaging 4654.65, a total transaction count of about 69, a Total_Ct_Chng_Q4_Q1 of also 0.7, and an **average utilization ratio of about 0.3**.
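
Averages like these are easiest to compare when both groups sit side by side. A minimal sketch on made-up numbers (not the dataset's values); with the real data the same `groupby` call could be applied to `df1`:

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 3 + ["Attrited Customer"] * 3,
    "Credit_Limit": [9000.0, 8000.0, 10000.0, 8500.0, 7500.0, 8000.0],
    "Total_Revolving_Bal": [1200, 1300, 1250, 600, 700, 800],
})

# One row per group, one column per numeric feature
means = toy.groupby("Attrition_Flag").mean(numeric_only=True).round(2)
# Transpose so features are rows and the two groups are columns
print(means.T)
```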

In [None]:
active.describe(include='all')

In [None]:
#drop columns: CLIENTNUM and Attrition_Flag
active2=active.drop(['CLIENTNUM', 'Attrition_Flag', 'Income_Category'], axis=1)

In [None]:
# numeric columns only; recent pandas versions raise on text columns
corr2=active2.select_dtypes('number').corr()

In [None]:
sns.heatmap(corr2, annot=True, cmap="YlGnBu")

In [None]:
# plot the customer rows (active2), not the correlation matrix (corr2)
sns.lmplot(x='Avg_Utilization_Ratio', y='Avg_Open_To_Buy', data=active2, logistic=False,  markers=["s"])

In [None]:
sns.lmplot(x='Customer_Age', y='Months_on_book', data=active2, logistic=False, markers=["o"])

In [None]:
sns.lmplot(x='Credit_Limit', y='Avg_Utilization_Ratio', data=active2, logistic=False, markers=["+"])

In [None]:
#compare results between churned and active clients
fig, (ax, ax2) = plt.subplots(ncols=2, figsize=(10,8))

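# sort by credit limit first so the line runs left to right instead of zigzagging
churned2.sort_values('Credit_Limit').plot(x='Credit_Limit', y=['Avg_Utilization_Ratio'], ax=ax, color='darkorange')
active2.sort_values('Credit_Limit').plot(x='Credit_Limit', y=['Avg_Utilization_Ratio'], ax=ax2, ls="--", color='mediumvioletred')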

plt.show()

In [None]:
sns.regplot(x='Credit_Limit', y='Total_Revolving_Bal', data=active2, logistic=False, color='olivedrab')

In [None]:
x=active2['Customer_Age']
y=active2['Total_Revolving_Bal']

sns.kdeplot(x, label="Age")
sns.kdeplot(y, label="Total Revolving Balance")
plt.legend()

In [None]:
#Kernel density estimation of bivariate distribution
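# recent seaborn versions require x and y as keyword arguments
sns.jointplot(x='Customer_Age', y='Total_Revolving_Bal', data=active2, kind="kde", color='salmon')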

In [None]:
mycolors='salmon', 'cadetblue'
grp = active2.groupby(['Customer_Age', 'Gender']).size().unstack()
grp.plot(kind = 'bar', color=mycolors, title='Age by gender')

In [None]:
mycolors='blueviolet', 'forestgreen'
grp = active2.groupby(['Dependent_count', 'Gender']).size().unstack()
grp.plot(kind = 'bar', color=mycolors, title='Dependents by gender')

In [None]:
mycolors='darkorange', 'gray'
grp = active2.groupby(['Card_Category', 'Gender']).size().unstack()
grp.plot(kind = 'bar', color=mycolors, title='Card type by gender')

In [None]:
mycolors='sienna', 'mediumseagreen'
grp = active2.groupby(['Education_Level', 'Gender']).size().unstack()
grp.plot(kind = 'bar', color=mycolors, title='Education by gender')

In [None]:
mycolors='gold', 'coral'
grp = active2.groupby(['Marital_Status', 'Gender']).size().unstack()
grp.plot(kind = 'bar', color=mycolors, title='Civil status by gender')

In [None]:
#plot 
ax = sns.countplot(y="Dependent_count", hue="Education_Level", data=active2, palette="Set3")

In [None]:
#side-by-side plots
g = sns.catplot(x="Card_Category", hue="Education_Level", col="Gender",
                data=active2, kind="count",
                height=4, aspect=2)

In [None]:
#plot columns from two dataframes
# plt.gca() no longer accepts keyword arguments, so set the title separately
ax = plt.gca()
ax.set_title('Card types')
active2.Card_Category.value_counts().T.plot(kind='bar', stacked=True, color='green', ax=ax, width=0.05, position=0)
churned2.Card_Category.value_counts().T.plot(kind='bar', stacked=True, ax=ax, color='darkred', width=0.10, position=1)
plt.show()

In [None]:
x = active2['Credit_Limit']
y= active2['Total_Trans_Amt']
colors = 'navy'
 
# Plot
plt.scatter(x, y, c=colors, alpha=0.2)
plt.title('Scatter plot')
plt.xlabel('Credit Limit')
plt.ylabel('Total Transaction Amount')
plt.show()

In [None]:
#plot several columns as pie charts

fig, (ax1,ax2, ax3, ax4, ax5) = plt.subplots(1,5, figsize=(10,10))
# 1,5 denotes 1 row, 5 columns 

#colors per label
colors= 'palevioletred', 'tan', 'steelblue', 'darkseagreen', 'gold', 'sienna', 'darkolivegreen'

# labels are taken from value_counts().index so each label matches its slice;
# unique() returns categories in order of appearance, not in order of frequency
values = active2['Gender'].value_counts()
ax1.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%') 

values = active2['Dependent_count'].value_counts()
ax2.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%') 

values = active2['Education_Level'].value_counts()
ax3.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%')

values = active2['Marital_Status'].value_counts()
ax4.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%') 

values = active2['Card_Category'].value_counts()
ax5.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%') 

ax1.set(aspect="equal", title='gender')
ax2.set(aspect="equal", title='dependent count')
ax3.set(aspect="equal", title='education level')
ax4.set(aspect="equal", title='marital status')
ax5.set(aspect="equal", title='card type')
plt.show()
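
A note on labeling pie slices: `value_counts()` orders categories by frequency, while `unique()` keeps the order of first appearance, so the two orderings can disagree. A small sketch on made-up values:

```python
import pandas as pd

s = pd.Series(["M", "F", "F", "F", "M"])

by_appearance = list(s.unique())             # order of first appearance
by_frequency = list(s.value_counts().index)  # most frequent first

print(by_appearance)  # ['M', 'F']
print(by_frequency)   # ['F', 'M']
```

Taking the labels from `value_counts().index` keeps each label attached to its own count.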



In [None]:
#linear regression between columns

from sklearn.linear_model import LinearRegression
x=pd.DataFrame(active2['Credit_Limit'])
y=pd.DataFrame(active2['Total_Revolving_Bal'])

model = LinearRegression().fit(x, y)

plt.scatter(x, y,  color='gray')
plt.plot(x, model.predict(x), color='blue', linewidth=2)
plt.xticks(())
plt.yticks(())
plt.show()



In [None]:
#age comparison

ch = churned2['Customer_Age']
av = active2['Customer_Age']
dfAge = df1['Customer_Age']
dfgn= df1['Gender'].value_counts()

# Plotting all the subplots
fig, axes = plt.subplots(2, 3)
axes[0, 0].hist(ch, color='red')
axes[0, 1].hist(av, color='green')
axes[0, 2].hist(dfAge, color='gray')


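# share of customers at each age, per group (the raw Series cannot be subtracted
# from one another because their indexes do not line up)
ch_share = ch.value_counts(normalize=True).sort_index()
av_share = av.value_counts(normalize=True).sort_index()
axes[1, 0].plot(ch_share.index, ch_share.values, color='goldenrod')
axes[1, 1].plot(av_share.index, av_share.values, color='sienna')
# gender counts are just two numbers, so draw them as bars rather than a histogram
axes[1, 2].bar(dfgn.index, dfgn.values, color='lightseagreen')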
plt.tight_layout()

plt.show()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 10))

fig.suptitle('Compare by Gender')
sns.countplot(ax=axes[0, 0], y="Customer_Age", hue="Gender", data=df1, palette="Set1")
sns.countplot(ax=axes[0, 1], y="Customer_Age", hue="Gender", data=active2, palette="Set3")
sns.countplot(ax=axes[1, 0], y="Customer_Age", hue="Gender", data=churned2, palette="Set2")


### Factors differing between active and churned

In [None]:
# Factor with the MOST difference
#credit limits plot
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
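# histplot replaces the deprecated distplot; stat="density" keeps the same scale
sns.histplot(data=df1, x="Credit_Limit", kde=True, stat="density", color="steelblue", ax=axes[0, 0])
sns.histplot(data=churned2, x="Credit_Limit", kde=True, stat="density", color="palevioletred", ax=axes[0, 1])
sns.histplot(data=active2, x="Credit_Limit", kde=True, stat="density", color="olive", ax=axes[1, 0])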

In [None]:
#count of transactions
TimesUsed=pd.concat([df1['Total_Trans_Ct'], churned2['Total_Trans_Ct'], active2['Total_Trans_Ct']], axis=1, keys=['overall', 'churned','active'])

TimesUsed.head()

In [None]:
# three columns, so three stacked panels
TimesUsed.plot(kind='density', subplots=True, layout=(3,1), sharex=True, figsize=(10,10))
plt.show()

In [None]:
#total transaction amount
a=active2['Total_Trans_Amt']
c=churned2['Total_Trans_Amt']
t=df1['Total_Trans_Amt']

sns.kdeplot (a, label="Active Total Transactions")
sns.kdeplot(c, label="Churned Total Transactions")
sns.kdeplot(t, label="Overall Total Transactions")
plt.legend()

In [None]:
#dependent factor
fig, (ax1,ax2, ax3) = plt.subplots(1,3, figsize=(11,11))
# 1,3 denotes 1 row of 3 columns plotted

#colors per label
colors= 'lightcoral', 'wheat', 'dodgerblue', 'mediumseagreen', 'gold', 'peru'

# labels come from value_counts().index so they always match the slice order
values = df1['Dependent_count'].value_counts()
ax1.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%')

values = churned2['Dependent_count'].value_counts()
ax2.pie(values, labels = values.index,colors = colors,autopct = '%1.1f%%') 

values = active2['Dependent_count'].value_counts()
ax3.pie(values,labels = values.index,colors = colors,autopct = '%1.1f%%')


In [None]:
#contact factor

churnedcall = churned2['Contacts_Count_12_mon']
actvcall = active2['Contacts_Count_12_mon']
AllCall = df1['Contacts_Count_12_mon']

# Plotting all the subplots
fig, axes = plt.subplots(1, 3)
axes[0].hist(AllCall, color='dimgray')
axes[1].hist(actvcall, color='goldenrod')
axes[2].hist(churnedcall, color='darkorange')

plt.tight_layout()

plt.show()