In this notebook, I introduce the strategies I often use in basic EDA to visualize data of different attribute types.

![](https://media.geeksforgeeks.org/wp-content/uploads/Capture-67.png)

Qualitative data (also known as categorical data) has two subtypes, which are **nominal** and **ordinal** data. In the figure above, binary data is a special type of nominal data. 

**Nominal data** names categories that have no inherent order, much like a noun (e.g. the 'Marital_Status' feature in this dataset). In my EDA strategy, I always show the value counts of the feature values **sorted by the counts themselves**.

**Ordinal data** consists of values that are ranked, ordered, or attached to a rating scale (e.g. the 'Education_Level' feature in this dataset). In my EDA strategy, I always show the value counts of the feature values **sorted by their ordinal rank**.
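The two sorting rules can be sketched on a small made-up series. The toy values and the `rank_order` list below are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
status = pd.Series(["Married", "Single", "Married", "Divorced", "Married", "Single"])
level = pd.Series(["Graduate", "High School", "Graduate",
                   "Doctorate", "High School", "Graduate"])

# Nominal: no inherent order, so sort by frequency (the value_counts default).
nominal_counts = status.value_counts()

# Ordinal: keep the natural rank, regardless of how frequent each value is.
rank_order = ["High School", "Graduate", "Doctorate"]  # assumed ranking
ordinal_counts = level.value_counts().reindex(rank_order)

print(nominal_counts)
print(ordinal_counts)
```

`reindex` only reorders the counts; any label missing from the data would show up as `NaN`, which is a useful hint that the assumed ranking has a typo.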

---------------------------------------------------------------------------------------------------------------

Quantitative data (also known as numeric data) has two subtypes, which are **discrete** and **continuous** data.

**Discrete data** is numerical data with countable values (e.g. the 'Dependent_count' feature in this dataset). In my EDA strategy, since every discrete feature in this dataset takes only a **finite** set of values, I visualize it the same way as ordinal data.

**Continuous data** is numerical data with uncountable values (e.g. the 'Customer_Age' feature in this dataset). In my EDA strategy, I always **visualize it with a histogram or a boxplot**.

In the detailed EDA of the Credit Card Customer Churning dataset below, I explain why I use these strategies.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.graph_objects as go

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Data Overview

In the data overview phase, you can work out the specific data type (attribute) of each feature.

In [None]:
data = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
data.drop(columns=['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                   'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
          axis=1,inplace=True)
data.head()

In [None]:
data.dtypes

As we can see from this data type summary, features with 'float64' and 'int64' dtype are numeric features while features with 'object' dtype are categorical features.
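As a minimal sketch, the same split can be done programmatically with `select_dtypes`. The toy frame below is invented for illustration and only borrows three column names from the dataset.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "Customer_Age": [45, 49, 51],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
    "Gender": ["M", "F", "M"],
})

# Numeric dtypes (int64, float64, ...) versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

The dtype alone does not separate discrete from continuous features, or nominal from ordinal ones, so that last step still needs a look at the values themselves.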

# EDA 

# Nominal Data

For nominal data, a pie chart is a good choice if there are no more than 5 different feature values. **The legend is sorted by the value counts of the feature values**. When there are more than 5 different feature values, a bar chart is a better choice.
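The rule of thumb above can be written as a tiny helper. The function name `choose_chart` and the `max_slices` parameter are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def choose_chart(series: pd.Series, max_slices: int = 5) -> str:
    """Return 'pie' when the feature has at most `max_slices` distinct values, else 'bar'."""
    # nunique() ignores missing values by default.
    return "pie" if series.nunique() <= max_slices else "bar"

# Toy series, invented for illustration only.
few = pd.Series(["F", "M", "F", "M"])
many = pd.Series(list("abcdefg"))

print(choose_chart(few))
print(choose_chart(many))
```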

In [None]:
af_cnt = data['Attrition_Flag'].value_counts()
trace = go.Pie(labels = af_cnt.index, 
               values = af_cnt.values,
               hoverinfo = 'percent+value+label',
               textinfo = 'percent',
               textposition = 'inside',
               textfont = dict(size=14),
               title = dict(text='Attrition Flag',
                            font=dict(size=15)),  # `titlefont` is deprecated in newer Plotly
               hole = 0.5,
               showlegend = True,
               marker = dict(line=dict(color='black',width=2)))
fig = go.Figure(data=[trace])
fig.update_layout(height=600, width=600)
fig.show()

16.1% of customers churned.

In [None]:
gender_cnt = data['Gender'].value_counts()
trace = go.Pie(labels = gender_cnt.index, 
               values = gender_cnt.values,
               hoverinfo = 'percent+value+label',
               textinfo = 'percent',
               textposition = 'inside',
               textfont = dict(size=14),
               title = dict(text='Gender',
                            font=dict(size=15)),  # `titlefont` is deprecated in newer Plotly
               hole = 0.5,
               showlegend = True,
               marker = dict(line=dict(color='black',width=2)))
fig = go.Figure(data=[trace])
fig.update_layout(height=600, width=600)
fig.show()

The numbers of female and male customers are almost the same.

In [None]:
marital_cnt = data['Marital_Status'].value_counts()
trace = go.Pie(labels = marital_cnt.index, 
               values = marital_cnt.values,
               hoverinfo = 'percent+value+label',
               textinfo = 'percent',
               textposition = 'inside',
               textfont = dict(size=14),
               title = dict(text='Marital Status',
                            font=dict(size=15)),  # `titlefont` is deprecated in newer Plotly
               hole = 0.5,
               showlegend = True,
               marker = dict(line=dict(color='black',width=2)))
fig = go.Figure(data=[trace])
fig.update_layout(height=600, width=600)
fig.show()

In bar charts of nominal data, I always use a single color family (blue in this case) and make the color darker as the value count grows.
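Here is one way to derive such shades directly from the counts, as a sketch on made-up numbers. It scales each count into a Matplotlib colormap, which is a value-proportional variant and not necessarily the exact palette used in the next cell.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Toy counts, invented for illustration only (already sorted by count).
counts = pd.Series({"Married": 50, "Single": 40, "Unknown": 8, "Divorced": 7})

cmap = plt.get_cmap("Blues")
# Scale each count into [0.3, 1.0] so the smallest bar is still visible.
# Assumes counts.max() > counts.min(); equal counts would divide by zero.
shades = 0.3 + 0.7 * (counts - counts.min()) / (counts.max() - counts.min())
colors = [cmap(s) for s in shades]

fig, ax = plt.subplots()
ax.barh(counts.index, counts.values, color=colors)
ax.invert_yaxis()  # largest count at the top
```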

In [None]:
fig,ax = plt.subplots(figsize=(10,8))
ax.barh(marital_cnt.index,marital_cnt.values,
        height=0.5, edgecolor='darkgrey',color=sns.color_palette("Blues_r",4))
ax.set_yticklabels(marital_cnt.index,fontsize=15)
ax.set_xticks([])

for i in ax.patches:
    ax.text(i.get_width()+.4, i.get_y()+.28,
            str(i.get_width()), 
            fontsize=15, color='black')
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='x',linestyle='-',alpha=0.4)
ax.invert_yaxis()
fig.text(0.13, 0.95, 'Marital Status Count of Customers', fontsize=15, fontweight='bold') 

Most of the customers are married or single.

When the number of feature values is small, a pie chart is more informative.

# Ordinal Data

For ordinal data, I prefer to use a **bar chart**. A bar chart can show the value counts without disturbing the original order of the feature values. I always use a distinct color for the most prominent feature value(s), which draws the reader's attention to them.
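Rather than hard-coding the position of the highlighted bar, the color list can be derived from the counts. This is a sketch on made-up values; the `order` list and the counts are assumptions for illustration.

```python
import pandas as pd

order = ["Low", "Medium", "High"]  # assumed ordinal ranking
# Toy counts, invented for illustration only, put back into rank order.
counts = pd.Series({"Medium": 12, "High": 3, "Low": 5}).reindex(order)

base, highlight = (0, 0, 1), (1, 0, 0)  # blue for most bars, red for the largest
top_label = counts.idxmax()
colors = [highlight if label == top_label else base for label in counts.index]

print(colors)
```

The resulting list can be passed as the `color` argument of `ax.barh` or `ax.bar`, in the same order as the bars.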

In [None]:
edu_level = ['Uneducated','High School','College',
             'Graduate','Post-Graduate','Doctorate','Unknown']
edu_cnt = data['Education_Level'].value_counts().reindex(edu_level)
colors = [(0,0,1)]*7
colors[3] = (1,0,0)
fig,ax = plt.subplots(figsize=(10,8))
ax.barh(edu_cnt.index,edu_cnt.values,
        height=0.7, edgecolor='darkgrey',color=colors, alpha=0.7)
ax.set_yticklabels(edu_cnt.index,fontsize=15)
ax.set_xticks([])

for i in ax.patches:
    #print(i.get_width(),i.get_y())
    ax.text(i.get_width()+.3, i.get_y()+.38,
            str(i.get_width()), 
            fontsize=15, color='black')
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='x',linestyle='-',alpha=0.4)
ax.invert_yaxis()
fig.text(0.13, 0.95, 'Education Level of Customers', fontsize=15, fontweight='bold') 

In [None]:
income_order = ['Less than $40K','$40K - $60K',
                '$60K - $80K','$80K - $120K','$120K +','Unknown']
income_cnt = data['Income_Category'].value_counts(sort=False).reindex(income_order)
colors = [(0,0,1)]*6
colors[0] = (1,0,0)
fig,ax = plt.subplots(figsize=(10,8))
ax.barh(income_cnt.index,income_cnt.values,
        height=0.5, edgecolor='darkgrey',color=colors, alpha=0.7)
ax.set_yticklabels(income_cnt.index,fontsize=15)
ax.set_xticks([])

for i in ax.patches:
    ax.text(i.get_width()+.32, i.get_y()+.28,
            str(i.get_width()), 
            fontsize=15, color='black')
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='x',linestyle='-',alpha=0.4)
ax.invert_yaxis()
fig.text(0.13, 0.95, 'Income Category of Customers', fontsize=15, fontweight='bold') 

As we know, rainbow colors can make a bar chart less informative. However, in special circumstances you can use different colors to represent different feature values. For example, the credit card levels of customers are themselves named after colors.

In [None]:
level_order = ['Blue','Silver','Gold','Platinum']
level_cnt = data['Card_Category'].value_counts(sort=False).reindex(level_order)
colors = ['Blue','Silver','Gold','Black']
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(level_order,level_cnt.values,
        width=0.4,edgecolor='darkgrey',color=colors, alpha=0.8)
ax.set_xticklabels(level_order,fontsize=15)
#ax.set_yticks([])

for g in level_cnt.index:
    ax.annotate(f"{level_cnt.loc[g]}",
                xy = (g,level_cnt.loc[g]+200),
                va = 'center',ha='center',fontsize=14)
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='y',linestyle='-',alpha=0.4)
#fig.text(0.14, 0.95, 'Dependent Count of Customers ', fontsize=16, fontweight='bold')
ax.set_title('Card Category of Customers',fontsize=16,fontweight='bold')

Almost all customers hold a Blue card.

# Discrete Data

As I said above, I visualize the discrete features in this dataset the same way as the ordinal features.
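For discrete features, the ordinal rank is simply the numeric order. A minimal sketch on made-up values, which also fills in values that never occur so the x-axis has no gaps:

```python
import pandas as pd

# Toy data, invented for illustration only; note that 1 and 4 never occur.
dependents = pd.Series([3, 0, 2, 3, 5, 2, 3])

full_range = range(int(dependents.min()), int(dependents.max()) + 1)
# Reindex in numeric order and give absent values a count of 0.
counts = dependents.value_counts().reindex(full_range, fill_value=0)

print(counts)
```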

In [None]:
# sort_index() guarantees the counts line up with the '0'..'5' tick labels below
dep_cnt = data['Dependent_count'].value_counts(sort=False).sort_index()
colors = [(0,0,1)]*6
colors[2] = colors[3] = (1,0,0)
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(['0','1','2','3','4','5'],dep_cnt.values,
        width=0.6,edgecolor='darkgrey',color=colors, alpha=0.7)
ax.set_xticklabels(['0','1','2','3','4','5'],fontsize=15)
#ax.set_yticks([])

for g in dep_cnt.index:
    ax.annotate(f"{dep_cnt.loc[g]}",
                xy = (g,dep_cnt.loc[g]+50),
                va = 'center',ha='center',fontsize=14)
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='y',linestyle='-',alpha=0.4)
#fig.text(0.14, 0.95, 'Dependent Count of Customers ', fontsize=16, fontweight='bold')
ax.set_title('Dependent Count of Customers',fontsize=16,fontweight='bold')

In [None]:
# sort_index() guarantees the counts line up with the '1'..'6' tick labels below
relation_cnt = data['Total_Relationship_Count'].value_counts(sort=False).sort_index()
colors = [(0,0,1)]*6
colors[2] = (1,0,0)
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(['1','2','3','4','5','6'],relation_cnt.values,
        width=0.6,edgecolor='darkgrey',color=colors, alpha=0.7)
ax.set_xticklabels(['1','2','3','4','5','6'],fontsize=15)
#ax.set_yticks([])

for g in ['1','2','3','4','5','6']:
    ax.annotate(f"{relation_cnt.loc[int(g)]}",
                xy = (g,relation_cnt.loc[int(g)]+50),
                va = 'center',ha='center',fontsize=14)
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='y',linestyle='-',alpha=0.4)
#fig.text(0.14, 0.95, 'Dependent Count of Customers ', fontsize=16, fontweight='bold')
ax.set_title('Total Relationship Count of Customers',fontsize=16,fontweight='bold')

In [None]:
# sort_index() guarantees the counts line up with the '0'..'6' tick labels below
inactive_cnt = data['Months_Inactive_12_mon'].value_counts(sort=False).sort_index()
colors = [(0,0,1)]*7
colors[2] = colors[3] = (1,0,0)
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(['0','1','2','3','4','5','6'],inactive_cnt.values,
        width=0.6,edgecolor='darkgrey',color=colors, alpha=0.7)
ax.set_xticklabels(['0','1','2','3','4','5','6'],fontsize=15)
#ax.set_yticks([])

for g in ['0','1','2','3','4','5','6']:
    ax.annotate(f"{inactive_cnt.loc[int(g)]}",
                xy = (g,inactive_cnt.loc[int(g)]+50),
                va = 'center',ha='center',fontsize=14)
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='y',linestyle='-',alpha=0.4)
#fig.text(0.14, 0.95, 'Dependent Count of Customers ', fontsize=16, fontweight='bold')
ax.set_title('Inactive Months Count of Customers within 12 Months',fontsize=16,fontweight='bold')

In [None]:
# sort_index() guarantees the counts line up with the '0'..'6' tick labels below
contact_cnt = data['Contacts_Count_12_mon'].value_counts(sort=False).sort_index()
colors = [(0,0,1)]*7
colors[2] = colors[3] = (1,0,0)
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(['0','1','2','3','4','5','6'],contact_cnt.values,
        width=0.6,edgecolor='darkgrey',color=colors, alpha=0.7)
ax.set_xticklabels(['0','1','2','3','4','5','6'],fontsize=15)
#ax.set_yticks([])

for g in ['0','1','2','3','4','5','6']:
    ax.annotate(f"{contact_cnt.loc[int(g)]}",
                xy = (g,contact_cnt.loc[int(g)]+50),
                va = 'center',ha='center',fontsize=14)
    
for pos in ['left','right','bottom','top']:
    ax.spines[pos].set_color(None)
    
ax.grid(axis='y',linestyle='-',alpha=0.4)
#fig.text(0.14, 0.95, 'Dependent Count of Customers ', fontsize=16, fontweight='bold')
ax.set_title('Contacts Count of Customers within 12 Months',fontsize=16,fontweight='bold')

# Continuous Data

For continuous data, a histogram is a good choice because it shows the distribution of the feature values. Meanwhile, a boxplot lets you spot the **outliers** quickly.
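By default, a Matplotlib boxplot draws points beyond 1.5 times the interquartile range from the box as outliers. The same rule can be checked numerically, sketched here on made-up values:

```python
import pandas as pd

# Toy data, invented for illustration only.
ages = pd.Series([26, 38, 41, 44, 46, 48, 52, 55, 73])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are what the boxplot would draw as fliers.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Counting `len(outliers)` per column gives a quick numeric cross-check of what the boxplots show.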

In [None]:
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(20,8))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Customer_Age'],bins=20,kde=True,ax=ax[0])
for pos in ['left','right','top']:
    ax[0].spines[pos].set_color(None)
ax[0].set_yticks([])
ax[0].set_title('Histogram of Customer Age',fontsize=15,fontweight='bold')

ax[1].boxplot(data['Customer_Age'])
for pos in ['left','right','top','bottom']:
    ax[1].spines[pos].set_color(None)
ax[1].grid(axis='y')
ax[1].set_xticks([])
ax[1].set_title('Boxplot of Customer Age',fontsize=15,fontweight='bold')

In [None]:
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(20,8))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Months_on_book'],bins=20,kde=True,ax=ax[0])
for pos in ['left','right','top']:
    ax[0].spines[pos].set_color(None)
ax[0].set_yticks([])
ax[0].set_title('Histogram of Customer Months_on_book',fontsize=15,fontweight='bold')

ax[1].boxplot(data['Months_on_book'])
for pos in ['left','right','top','bottom']:
    ax[1].spines[pos].set_color(None)
ax[1].grid(axis='y')
ax[1].set_xticks([])
ax[1].set_title('Boxplot of Customer Months_on_book',fontsize=15,fontweight='bold')

In [None]:
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(20,8))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Credit_Limit'],bins=20,kde=True,ax=ax[0])
for pos in ['left','right','top']:
    ax[0].spines[pos].set_color(None)
ax[0].set_yticks([])
ax[0].set_title('Histogram of Customer Credit Limit',fontsize=15,fontweight='bold')

ax[1].boxplot(data['Credit_Limit'])
for pos in ['left','right','top','bottom']:
    ax[1].spines[pos].set_color(None)
ax[1].grid(axis='y')
ax[1].set_xticks([])
ax[1].set_title('Boxplot of Customer Credit Limit',fontsize=15,fontweight='bold')

This distribution is very interesting compared with the previous continuous features: the histogram has two peaks, one at each end of the range.

The remaining seven features are all numeric, so I plot them together in a loop.

In [None]:
for col in data.columns[-7:]:
    fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(10,4))
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(data[col],bins=20,kde=True,ax=ax[0])
    for pos in ['left','right','top']:
        ax[0].spines[pos].set_color(None)
    ax[0].set_yticks([])
    ax[0].set_title(col+' Distribution',fontsize=13,fontweight='bold')
    
    ax[1].boxplot(data[col])
    for pos in ['left','right','top','bottom']:
        ax[1].spines[pos].set_color(None)
    ax[1].grid(axis='y')
    ax[1].set_xticks([])
    ax[1].set_title(col+' Boxplot',fontsize=13,fontweight='bold')

It looks like there are no extreme values (outliers).

Reference:
1. https://www.formpl.us/blog/categorical-numerical-data
2. https://www.geeksforgeeks.org/understanding-data-attribute-types-qualitative-and-quantitative/

If you think my notebook is useful, please upvote it :) . Thanks for your time.