# Loading the data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
from plotly.subplots import make_subplots
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import LabelEncoder


import xgboost as xgb

pyo.init_notebook_mode()

In [None]:
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
df = df.iloc[:, :-2]
df.head()


# Basic Information

I begin by printing some basic information before doing any real analysis.

## Non nulls and types

In [None]:
df.info()

These columns can be split into:
- Customer Information: `CLIENTNUM`, `Customer_Age`, `Gender`, `Dependent_count`, `Education_Level`, `Marital_Status`, `Income_Category`.
- Account Information: `Attrition_Flag`, `Card_Category`, `Months_on_book`, `Total_Relationship_Count`, `Credit_Limit`, `Avg_Open_To_Buy`.
- Activity Information: `Months_Inactive_12_mon`, `Contacts_Count_12_mon`, `Total_Revolving_Bal`, `Total_Amt_Chng_Q4_Q1`, `Total_Trans_Amt`, `Total_Trans_Ct`, `Total_Ct_Chng_Q4_Q1`, `Avg_Utilization_Ratio`.

I ignored the last two columns of the original file (dropped with `df.iloc[:, :-2]` when loading the data), as they are not meaningful to us.

## Unique Counts

In [None]:
df.nunique()

# Data Transformation

## Deleting unnecessary columns

The last two columns in the original dataset have to be deleted. I've already done this when loading the data, to get cleaner output from the previous two commands.

## Using `CLIENTNUM` as index

By looking at the number of unique values, we can see that `CLIENTNUM` doesn't contain duplicates. Moreover,
being a client identifier, it's the perfect choice as index.

In [None]:
df.set_index('CLIENTNUM', inplace=True)

## Duplicated Records

Having detected that there are no duplicated clients -- `CLIENTNUM` (which is the only column that identifies clients)
has nothing but unique values -- there's no need for us to check for duplicates.
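
The uniqueness claim can also be checked directly instead of being read off `nunique`. Here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and only stands in for the real dataset):

```python
import pandas as pd

# Hypothetical toy data; the last row repeats the previous one.
toy = pd.DataFrame({
    'CLIENTNUM': [101, 102, 103, 103],
    'Customer_Age': [45, 51, 38, 38],
})

# True only when every identifier appears exactly once.
ids_are_unique = toy['CLIENTNUM'].is_unique

# Number of rows that repeat an earlier row in every column.
n_duplicated_rows = int(toy.duplicated().sum())

print(ids_are_unique, n_duplicated_rows)
```

On the notebook's `df`, where `CLIENTNUM` is already the index, the equivalent checks would be `df.index.is_unique` and `df.duplicated().sum()`.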

## Missing Data Analysis

As anticipated by the output of the `info` method, none of the columns contain missing data. You can double-check
this by running:

In [None]:
df.isna().sum()

However, this is misleading. Some columns contain cells with `Unknown`. We can count the records affected by this situation
with:

In [None]:
(df == 'Unknown').sum()

This tells us that we have three columns that need some work. They are: `Education_Level`, `Marital_Status` and `Income_Category`.

Rows with an `Unknown` field amount to 3046 (a third of the whole dataset), as we can see by executing the following:

In [None]:
len(df[(df == 'Unknown').any(axis=1)].index)
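
To make the two counts above easier to tell apart (cells per column versus affected rows), here is a small sketch on a made-up frame. The `toy` data is hypothetical; only the `'Unknown'` marker comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same 'Unknown' marker.
toy = pd.DataFrame({
    'Education_Level': ['Graduate', 'Unknown', 'College', 'Graduate'],
    'Marital_Status': ['Married', 'Single', 'Unknown', 'Unknown'],
    'Customer_Age': [45, 51, 38, 60],
})

is_unknown = toy == 'Unknown'

# Cells marked 'Unknown' in each column.
per_column = is_unknown.sum()

# Rows with at least one 'Unknown' cell, and their share of the data.
affected_rows = int(is_unknown.any(axis=1).sum())
share = affected_rows / len(toy)

print(per_column.to_dict(), affected_rows, share)
```

The same `share` expression applied to `df` gives the proportion of the dataset that would be lost by simply dropping those rows.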

Given the number of records affected by this -- and the nature of the columns -- none of the more standard techniques for
dealing with missing data seem appropriate. However, we can use an `IterativeImputer` from
the `scikit-learn` library to replace the `Unknown` values with estimates produced by a model
(a `RandomForestClassifier` in this case).

In [None]:
categorical = ['Education_Level', 'Marital_Status', 'Income_Category']

encoders = {}

for cat in categorical:
    encoder = LabelEncoder()
    encoders[cat] = encoder
    values = df[cat]
    known_values = values[values != 'Unknown']
    df[cat] = pd.Series(encoder.fit_transform(known_values), index=known_values.index)

imp_cat = IterativeImputer(estimator=RandomForestClassifier(),
                           initial_strategy='most_frequent',
                           max_iter=10, random_state=0)


df[categorical] = imp_cat.fit_transform(df[categorical])

for cat in categorical:
    df[cat] = encoders[cat].inverse_transform(df[cat].astype(int))

It's important to note that we need to keep the `LabelEncoder` instances, so we can call the `inverse_transform` on them
after imputing the `Unknown` values, converting the values from numeric back to string.
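
The encode, fill and decode round trip can be seen in isolation on a single made-up column. This is only a sketch: the `status` values are hypothetical, and a plain most-frequent fill stands in for the `IterativeImputer` step so the example stays small.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with one 'Unknown' entry.
status = pd.Series(['Married', 'Unknown', 'Single', 'Divorced', 'Married'])
known = status[status != 'Unknown']

# Encode only the known values; 'Unknown' rows become NaN after reindexing.
encoder = LabelEncoder()
codes = pd.Series(encoder.fit_transform(known), index=known.index)
encoded = codes.reindex(status.index)

# Stand-in for the imputer: fill the gaps with the most frequent code.
filled = encoded.fillna(encoded.mode()[0])

# Decode back to strings with the same encoder instance.
decoded = encoder.inverse_transform(filled.astype(int).to_numpy())
print(list(decoded))
```

Without the stored `encoder`, the numeric codes could not be mapped back to their original labels.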

## Formatting Columns and Enforcing Types

I proceed to format columns with the appropriate types.

In [None]:
def make_categorical(data: pd.DataFrame, column: str, categories: list, ordered: bool = False):
    # Use the frame passed in (not the global `df`) so the helper works on any DataFrame.
    data[column] = pd.Categorical(data[column],
                                  categories=categories,
                                  ordered=ordered)

### Attrited Customer (`Attrition_Flag`)

I convert the values for `Attrition_Flag` into booleans, where `True` marks a customer that closed their account (`Attrited Customer`).

In [None]:
df['Attrition_Flag'] = df['Attrition_Flag'] == 'Attrited Customer'

### Gender

It's not really necessary, but I'm making the `Gender` column categorical because why not.

In [None]:
make_categorical(df, 'Gender', ['F', 'M'])

### Education Level

I add order to the `Education_Level` column.

In [None]:
make_categorical(df, 'Education_Level', ['Uneducated', 'High School', 'Graduate', 'College', 'Post-Graduate', 'Doctorate'], True)

### Marital Status

This is another one of those columns that I don't really have to make Categorical, but I do it for the sake of expressiveness.

In [None]:
make_categorical(df, 'Marital_Status', ['Married', 'Single', 'Divorced'])

### Income Category

I add order to `Income_Category`.

In [None]:
make_categorical(df, 'Income_Category', ['Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +'], True)

### Card Category

I also add order to `Card_Category`.

In [None]:
make_categorical(df, 'Card_Category', ['Blue', 'Silver', 'Gold', 'Platinum'], True)
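
Adding order to a categorical column is what makes comparisons, sorting and `min`/`max` follow the tier order instead of the alphabet. A minimal sketch with made-up card values (the `cards` series is hypothetical):

```python
import pandas as pd

# Hypothetical values using the same category order as above.
cards = pd.Series(pd.Categorical(
    ['Gold', 'Blue', 'Platinum', 'Silver', 'Blue'],
    categories=['Blue', 'Silver', 'Gold', 'Platinum'],
    ordered=True,
))

# Comparisons follow the declared order, not the alphabet.
above_silver = (cards > 'Silver').tolist()

# Sorting and max also respect the declared order.
sorted_cards = cards.sort_values().tolist()
highest = cards.max()

print(above_silver, sorted_cards, highest)
```

The same ordering is what the plotting helpers later read from `.cat.categories` to lay out the axes.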

## Adding additional columns

Next, I augment the dataset with additional columns that will help extract some useful information.

### Age Range

Customers vary in age from 26 to 73. I create a new column with 20-year bins.

In [None]:
age_bins = [20, 40, 60, 80]
age_labels = ['20 - 40', '40 - 60', '60 - 80']
df['Age_Range'] = pd.cut(df['Customer_Age'], age_bins, labels=age_labels, ordered=True)
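
One detail worth knowing about `pd.cut` is that, by default, each bin includes its right edge but not its left one. The sketch below uses a few made-up ages (the `ages` series is hypothetical) to show where the boundary values 40 and 60 land:

```python
import pandas as pd

# Hypothetical ages, including the two boundary values.
ages = pd.Series([26, 40, 41, 60, 73])

age_bins = [20, 40, 60, 80]
age_labels = ['20 - 40', '40 - 60', '60 - 80']

# Bins are (20, 40], (40, 60] and (60, 80] with the default right=True.
ranges = pd.cut(ages, age_bins, labels=age_labels, ordered=True)
print(list(ranges))
```

So a 40-year-old customer falls in the `20 - 40` group, and a 60-year-old in `40 - 60`.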

### Revolving Balance

I add a flag for detecting those customers that carry no revolving balance, i.e. those that pay their balance in full.

In [None]:
df['No_Revolving_Bal'] = df['Total_Revolving_Bal'] == 0

### New Customers

I set another flag for detecting those customers that opened their account 2 years ago or less.

In [None]:
df['New_Customer'] = df['Months_on_book'] <= 24

### Average Utilization Ratio

Supposedly, keeping your utilization ratio below 30% helps to improve your credit score. Therefore, I added a column
for flagging customers that keep their utilization ratio at or below that threshold.

In [None]:
df['Optimal_Utilization'] = df['Avg_Utilization_Ratio'] <= 0.3
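
A convenient property of boolean flags like these is that their mean is the share of customers for which the flag is `True`. A small sketch with made-up values (the `toy` frame is hypothetical; the thresholds are the ones used above):

```python
import pandas as pd

# Hypothetical toy data for the three source columns.
toy = pd.DataFrame({
    'Total_Revolving_Bal': [0, 1500, 0, 800],
    'Months_on_book': [12, 36, 24, 48],
    'Avg_Utilization_Ratio': [0.0, 0.45, 0.0, 0.3],
})

flags = pd.DataFrame({
    'No_Revolving_Bal': toy['Total_Revolving_Bal'] == 0,
    'New_Customer': toy['Months_on_book'] <= 24,
    'Optimal_Utilization': toy['Avg_Utilization_Ratio'] <= 0.3,
})

# The mean of a boolean column is the fraction of True values.
shares = flags.mean()
print(shares.to_dict())
```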

### Types

Having formatted the columns, added all the additional information, and imputed the missing values, we can begin
analysing the data. I start by printing the data types to make sure everything looks good.

In [None]:
df.dtypes

# Descriptive Statistics

Now, let's take a look at some statistics for our data.

In [None]:
df.describe().T

In [None]:
df.describe(include=[bool, 'category']).T

In [None]:
LABELS = {
    'Age_Range': 'Age',
    'Card_Category': 'Card',
    'Dependent_count': 'Dependents',
    'Income_Category': 'Income',
    'Months_Inactive_12_mon': 'Months Inactive',
    'Contacts_Count_12_mon': 'Contacts',
    'Total_Revolving_Bal': 'Revolving Bal'
}

def format_label(label: str) -> str:
    if label in LABELS:
        return LABELS[label]
    else:
        return ' '.join(w.capitalize() for w in label.split('_'))

def format_labels(labels: list) -> dict:
    return {l: format_label(l) for l in labels}

def group_and_count_by(data, column_names:list, reset_index:bool = True) -> pd.DataFrame:
    index_name = data.index.name
    _df = data.filter(column_names).reset_index().groupby(column_names).count().rename(columns={index_name: 'count'})
    if reset_index:
        _df = _df.reset_index()
    return _df.sort_values(by=['count'], ascending=False)

def plot_bars_with_color(data, x:str, color:str, barmode:str = 'group', width:int=-1, height:int=-1):
    categories = [x, color]
    categories_orders = {v: list(data[v].cat.categories) for v in categories if data[v].dtype.name == 'category'}
    fig = px.bar(
        data,
        x=x,
        y='count',
        color=color,
        barmode=barmode,
        category_orders=categories_orders,
        labels=format_labels(['count', x, color])
    )
    if height > 0 and width > 0:
        fig.update_layout(width=width, height=height)
    fig.show()

def pie_plot(data, fig, row, col, top:int = -1):
    labels = data['labels']
    values = data['values']
    if top > 0:
        labels = data['labels'][:top]
        labels.loc[labels.index.max() + 1] = "Others"
        values = data['values'][:top]
        values.loc[values.index.max() + 1] = data['values'][top:].sum()
    fig.add_trace(
        go.Pie(labels=labels,
               values=values, automargin=False,
               name=data['name']),
        row, col
    )

def pie_plots(plots_data:list, height:int, width:int, top:int = -1):
    list_size = len(plots_data)
    rows = int(list_size / 2)
    if list_size % 2 == 1:
        rows = rows + 1
    fig = make_subplots(
        rows=rows,
        cols=2,
        specs= np.full((rows, 2), {"type": "domain"}).tolist(),
        vertical_spacing = 0.05,
        subplot_titles=[plot_data['name'] for plot_data in plots_data]
    )
    for index, plot_data in enumerate(plots_data):
        pie_plot(plot_data, fig, 1 + int(index / 2), 1 + (index % 2), top)
    
    fig.update_layout(
        width=width, 
        height=height
    )
    fig.show()
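
For readers who want to see the shape of the data the plotting helpers expect, the sketch below builds a comparable "group and count" table on a made-up frame. It is not the `group_and_count_by` helper itself: it uses `groupby(...).size()` as a stand-in, and the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns.
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'F', 'M'],
    'Card_Category': ['Blue', 'Blue', 'Silver', 'Blue', 'Blue'],
})

# One row per combination, with the number of records in a 'count' column.
counts = (
    toy.groupby(['Gender', 'Card_Category'])
       .size()
       .reset_index(name='count')
       .sort_values(by='count', ascending=False)
)
print(counts)
```

The resulting long-format table (one row per group plus a `count` column) is the layout that `px.bar` and the other Plotly Express calls in this notebook consume.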

# Customers

I start by looking at the information available regarding customers. A deep understanding of this data can help the bank launch better-targeted advertising campaigns.

## Univariate Analysis

We can begin by plotting a histogram for each customer-related column, to see how the values are distributed.

In [None]:
columns = ['Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status']
fig = make_subplots(rows=int(len(columns)/2) + len(columns) % 2, cols= 2)

for index, column in enumerate(columns):
    fig.append_trace(
        go.Histogram(x=df[column], name= format_label(column)),
        1 + int(index / 2), 1 + (index % 2))
    
fig.update_layout(height=900, width=900)
fig.show()

Obviously, a univariate analysis doesn't tell us anything about the relationship between features.

## The big picture

With the following, I plot the number of customers grouped by income, education, marital status and gender.
Given the number of subplots, the information in each of them might not be very clear, and that's fine.
At this point I'm only interested in getting some insight on the demographics.

Note: if you are not familiar with Plotly, know that you can click on the different legends at the bottom of the plot to
hide individual groups. You can also hover over the different bars to get a tooltip with the information for that
particular bar.

In [None]:
columns = ['Education_Level', 'Gender', 'Income_Category', 'Marital_Status']
labels = format_labels(columns)

customers = group_and_count_by(df, columns).dropna()
fig = px.bar(customers,
             x='Gender',
             y='count',
             color='Marital_Status',
             category_orders={
                 'Income_Category': list(df['Income_Category'].cat.categories),
                 'Education_Level': list(df['Education_Level'].cat.categories)
             },
             labels=labels,
             facet_row='Education_Level',
             facet_col='Income_Category')
fig.update_layout(width=900,
                  height=1200,
                  legend=dict(
                    orientation="h",
                    yanchor="bottom",
                    xanchor="center",
                    x=0.5
                  )
)
fig.update_xaxes(automargin=True)
fig.show()

With the previous plot we can see some interesting facts:

- There are no women customers with education higher than "College" and incomes above $60K.
- Of the total customers, a good number are women in the lowest income category.
- Most men are in the higher income categories.

The same information can be plotted using a `sunburst` chart. While having multiple bar charts is useful for looking at individual groups, sunbursts help us see the big picture.

**Note**: if you are not familiar with Plotly's sunburst, know that you can click on individual sections to change the aggregation level. 

In [None]:
fig = px.sunburst(customers,
                  path=columns,
                  values='count'
)
fig.update_layout(width=900,
                  height=1200
)
fig.show()

Yet another way of visualizing the same information is a parallel categories plot, which allows us to detect
the most relevant combinations of attributes really easily.

In [None]:
fig = px.parallel_categories(
    customers,
    dimensions=columns,
    color='count',
    color_continuous_scale='deep',
    labels=labels)

fig.update_layout(width=900,
                  height=900
)

fig.show()

In [None]:
df_by_age_and_gender = group_and_count_by(df, ['Age_Range', 'Gender'])
plot_bars_with_color(df_by_age_and_gender, 'Age_Range', 'Gender')


## Customers by Education and Age

In [None]:
df_by_education_and_age = group_and_count_by(df, ['Age_Range', 'Education_Level'])
crosstab = pd.crosstab(df['Age_Range'], df['Education_Level'])
crosstab

In [None]:
fig = px.imshow(crosstab, color_continuous_scale='Viridis')
fig.show()

From the previous plot, it's clear that the Graduates between 40 and 60 years old are, by far, the most frequent customers.


## Customers by Age and Income

Similar to the previous plot, we can check the income by age:

In [None]:
df_by_income_and_age = group_and_count_by(df, ['Age_Range', 'Income_Category'])
crosstab = pd.crosstab(df['Age_Range'], df['Income_Category'])
fig = px.imshow(crosstab, color_continuous_scale='Viridis')
fig.show()

With the heatmap above we can see that most customers seem to be between their forties and fifties, with incomes heavily skewed toward the lower categories.

## Customers by Dependent Count and Marital Status


In [None]:
x = 'Dependent_count'
y = 'count'
color = 'Marital_Status'

df_by_dependents_and_status = group_and_count_by(df, [x, color])
# plot_bars_with_color(df_by_dependents_and_status, 'Dependent_count', 'Marital_Status')
fig = px.scatter(
    df_by_dependents_and_status.sort_values(by=[x]),
    x=x,
    y=y,
    color=color,
    size=y,
    labels=format_labels([x, y, color])
)
fig.show()

With the previous plot we learn that most of the customers have 2 or 3 dependents, which probably speaks of families
with a couple of kids.


# Conclusion on Customers

Having created different plots for customers, we can conclude that a good number seem to be in their forties and fifties, with a graduate-level education, low incomes and 2 or 3 people in their care.

In [None]:
relevant = (
    (df['Income_Category'] == 'Less than $40K')
    & df['Dependent_count'].isin([2, 3])
    & (df['Age_Range'] == '40 - 60')
    & (df['Education_Level'] == 'Graduate')
)
relevant.name = 'Relevancy'
px.histogram(relevant, color='value')

Although the previous chart might not look like much, it tells us that out of the 540 possible combinations (3 age bins, 6 dependent counts, 6 education levels, 5 income categories), the group we selected accounts for 8% of the customers.
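
To put the 8% in context, it helps to compare it with what an even spread over all combinations would give. The sketch below redoes the arithmetic from the paragraph above. Note that the filter accepts two dependent counts (2 or 3), so it spans two of the 540 cells. The 8% figure is taken from the text, not recomputed here.

```python
# Number of possible combinations, as listed in the text.
n_age_bins, n_dependents, n_education, n_income = 3, 6, 6, 5
n_combinations = n_age_bins * n_dependents * n_education * n_income

# The filter keeps two dependent counts (2 or 3), so it covers 2 cells.
cells_selected = 2
uniform_share = cells_selected / n_combinations

# Share quoted in the text for the selected group.
observed_share = 0.08

# How many times larger the group is than an even spread would predict.
lift = observed_share / uniform_share
print(n_combinations, round(uniform_share, 4), round(lift, 1))
```

On the real data, `relevant.mean()` would give the observed share directly, since `relevant` is a boolean Series.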


# Accounts

In [None]:
columns = ['Attrition_Flag', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Credit_Limit']
fig = make_subplots(rows=int(len(columns)/2) + len(columns) % 2, cols= 2)

for index, column in enumerate(columns):
    fig.append_trace(
        go.Histogram(x=df[column], name= format_label(column)),
        1 + int(index / 2), 1 + (index % 2))
    
fig.update_layout(height=900, width=900)
fig.show()

## Card Category by Income Category

One of the first interesting things we can look at while studying customers and accounts is which category of card customers with different incomes choose most frequently.

In [None]:
x = 'Income_Category'
y = 'count'
color = 'Card_Category'

data = group_and_count_by(df, [x, color])
plot_bars_with_color(data, x, color, 'stack', 900, 600)

As we can see, for all the different income categories, the card that predominates is the Blue Card. One interesting thing I note is that the proportion of higher-tier cards doesn't seem to increase with the income category. To further analyze this, we can plot a series of pie charts.

In [None]:
pd.crosstab(df[x], df[color])

In [None]:
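plots_data = []
for income in data['Income_Category'].cat.categories:
    # Sort by card category so each value lines up with its label.
    income_rows = data[data['Income_Category'] == income].sort_values(by='Card_Category')
    plots_data.append({
            'labels': income_rows['Card_Category'].astype(str).values,
            'values': income_rows['count'],
            'name': income
    })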

pie_plots(plots_data, 800, 800)

The pie plots above confirm that regardless of income:
- Blue cards are about 90% of the total.
- Silver cards are around 5% ~ 6%.
- Gold cards account for only 1% or 2% of the total cards.
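
The percentages read off the pie charts can also be obtained as a table by normalizing the crosstab by row. A minimal sketch on made-up data (the `toy` frame and its `Low`/`High` income labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 4 low-income and 5 high-income customers.
toy = pd.DataFrame({
    'Income_Category': ['Low'] * 4 + ['High'] * 5,
    'Card_Category': ['Blue', 'Blue', 'Blue', 'Silver',
                      'Blue', 'Blue', 'Blue', 'Blue', 'Gold'],
})

# normalize='index' turns each row into shares that sum to 1.
shares = pd.crosstab(toy['Income_Category'], toy['Card_Category'],
                     normalize='index')
print(shares)
```

Applied to the notebook's data, this would be `pd.crosstab(df[x], df[color], normalize='index')`, giving the share of each card category within each income category.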

## Months on Book

I now turn to study how accounts have been opened and closed through time.

In [None]:
trace0 = go.Histogram(x=df[df['Attrition_Flag'] == False]['Months_on_book'], name='Accounts Created')
trace1 = go.Histogram(x=df[df['Attrition_Flag'] == True]['Months_on_book'], name='Accounts Closed')
fig = go.Figure()
fig.add_trace(trace0)
fig.add_trace(trace1)
fig.update_layout(barmode='overlay')
fig.show()

The chart above clearly shows that something happened 36 months ago, probably a massive ad campaign. There was also a significant number of accounts closed that same month; maybe some people opened their account and quickly changed their minds.

With the next chart, I visualize how new customers have joined the bank historically (segregated by card category). The chart reveals the huge impact of the new Blue Cards 36 months ago.

In [None]:
new_by_card = group_and_count_by(
    df[df['Attrition_Flag'] == False],
    ['Months_on_book', 'Card_Category']
).sort_values(by=['Card_Category', 'Months_on_book'])
fig = px.line(new_by_card, x='Months_on_book', y='count', color='Card_Category')
fig.update_layout(
    height=600,
    width=900
)
fig.show()

## Credit Limits by Card Category

Another interesting piece of information we can extract is how the credit limits are affected by the card category.

In [None]:
fig = px.box(df, y='Credit_Limit', color='Card_Category')
fig.show()

The plot clearly indicates that, although Blue Cards tend to have lower credit limits, the other categories offer pretty much the same limits.

## Products held by Income

The following chart is used to check if there's any relationship between the number of products held by the customers and their income. The idea is to determine if people that earn more use more products. Unfortunately, there doesn't seem to be any relationship between these two columns (we could always calculate the correlation, but I won't bother).

In [None]:
columns = ['Total_Relationship_Count', 'Income_Category']
df_products_by_income = group_and_count_by(df, columns)
plot_bars_with_color(df_products_by_income,
                    x='Total_Relationship_Count', color='Income_Category')
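
For completeness, the correlation mentioned above could be estimated with a rank correlation, since `Income_Category` is ordinal. The sketch below computes Spearman's coefficient as a Pearson correlation on ranks. The `toy` values are made up and say nothing about the real dataset.

```python
import pandas as pd

income_order = ['Less than $40K', '$40K - $60K', '$60K - $80K',
                '$80K - $120K', '$120K +']

# Hypothetical toy data: one customer per income category.
toy = pd.DataFrame({
    'Income_Category': pd.Categorical(income_order,
                                      categories=income_order,
                                      ordered=True),
    'Total_Relationship_Count': [3, 5, 2, 4, 3],
})

# Ordered categories map to integer codes 0..4 in the declared order.
income_codes = toy['Income_Category'].cat.codes

# Spearman's rho is the Pearson correlation of the ranks.
rho = income_codes.rank().corr(toy['Total_Relationship_Count'].rank())
print(round(rho, 3))
```

A value close to zero on the real data would support the reading of the bar chart; a value far from zero would suggest the chart is hiding a trend.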

## Customers that left the bank owing money

In [None]:
fig = px.box(df[df['Attrition_Flag'] == True], y='Total_Revolving_Bal', color='Card_Category')
fig.show()

The preceding plot shows how clients tend not to accumulate much debt, especially those with the Platinum Card.

# Account Activity

In [None]:
columns = ['Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Total_Revolving_Bal', 
           'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 
           'Avg_Utilization_Ratio', 'Avg_Open_To_Buy']
fig = make_subplots(rows=int(len(columns)/2) + len(columns) % 2, cols= 2)

for index, column in enumerate(columns):
    fig.append_trace(
        go.Histogram(x=df[column], name= format_label(column)),
        1 + int(index / 2), 1 + (index % 2))

fig.update_layout(height=1200, width=900)
fig.show()


## Spending habits for new customers (Q4 over Q1)

Let's now take a closer look at "new customers" (those that joined the bank 2 years ago or less) and see how much of their credit they are spending.

The next chart makes it evident that customers with lower credit limits use almost all of their credit, while all clients with more than $13K in credit spend less than 20% of that amount. Keeping the utilization ratio this low supposedly helps these customers' credit scores.

In [None]:
fig = px.scatter(df[df['New_Customer'] == True], x='Credit_Limit', y='Avg_Utilization_Ratio', color='Card_Category')
fig.show()

In [None]:
fig = px.scatter(df[df['New_Customer'] == True], x='Total_Ct_Chng_Q4_Q1', y='Total_Amt_Chng_Q4_Q1', color='Card_Category', labels=format_labels(['Total_Ct_Chng_Q4_Q1', 'Total_Amt_Chng_Q4_Q1']))
fig.show()

The previous plot shows how expenditure has, in general, decreased in both the number of transactions and the total amount spent. This suggests that customers use their cards for about the same kind of purchases, but not as frequently as before.

When compared against the rest of the customers, we see that the New Customers seem to be more conservative (i.e., they are largely piled up together in the same region).

In [None]:
fig = px.scatter(df, x='Total_Ct_Chng_Q4_Q1', y='Total_Amt_Chng_Q4_Q1', color='New_Customer', labels=format_labels(['Total_Ct_Chng_Q4_Q1', 'Total_Amt_Chng_Q4_Q1', 'New_Customer']))
fig.show()

## Credit limits and debts by Gender

With the next plot we can see how men tend to have higher credit limits.

In [None]:
fig = px.density_contour(df, x="Total_Revolving_Bal", y="Credit_Limit", color="Gender")
fig.show()

## Transactions

We can take the information related to transactions (counts and amounts) to further analyze spending habits. The chart shows 3 groups. In each group the amount spent doesn't change much, regardless of the number of transactions. Most of the New Customers seem to fall under the lower group: those who operate and spend less.

In [None]:
fig = px.scatter(df[df['Attrition_Flag'] == False], x='Total_Trans_Ct', y='Total_Trans_Amt', color='New_Customer')
fig.show()

## Contact made to people with debt (and closed accounts)

The last plot shows how the bank contacts customers that closed their account and are no longer active. I use the `Optimal_Utilization` flag for the colors. Thus, we can see how the bank tends to contact bad customers repeatedly after 2 months of inactivity. This chart also shows how even some customers who don't use their cards much (less than 30% of the available credit) default on their debt too.

In [None]:
fig = px.strip(
    df[df['Attrition_Flag'] == True].sort_values(by=['Contacts_Count_12_mon']), 
    x="Total_Revolving_Bal", 
    y="Months_Inactive_12_mon", 
    color="Optimal_Utilization", 
    facet_col="Contacts_Count_12_mon",
    labels=format_labels(['Total_Revolving_Bal', 'Months_Inactive_12_mon', 'Optimal_Utilization', 'Contacts_Count_12_mon'])
)
fig.show()