### Introduction

#### This competition is about identifying customers who are highly interested in a recommended credit card. A customer's interest is predicted from the customer's account details.

### Import Necessary Packages

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from plotnine import *
import warnings
warnings.filterwarnings('ignore')
import statistics as st

### Import Train and Test Dataset


In [None]:
train=pd.read_csv('../input/jobathon-may-2021-credit-card-lead-prediction/train.csv')
test=pd.read_csv('../input/jobathon-may-2021-credit-card-lead-prediction/test.csv')

### Let's look at the structure of the train and test datasets

In [None]:
train.info()

#### The above information shows that there are nine features for predicting a customer's interest in the recommended credit cards.
#### The train dataset has 245725 observations, 9 features, and 1 target column.

In [None]:
test.info()


#### The test dataset contains 105312 observations and 9 features.

### Let's take a glimpse at the train and test datasets.

In [None]:

train.head()

In [None]:

test.head()


### Let's see a statistical summary of the numerical columns in the train and test dataset.

#### Train data

In [None]:
train.iloc[:,1:10].select_dtypes(include='int').describe()

#### The above summary shows that the average age of customers who are eligible for credit cards is **43**, the minimum age is **23**, and the maximum age is **85**.

#### Vintage is how long the eligible customers have been on the bank's records. The average is **3 years 8 months**, the minimum is **7 months**, and the maximum is **11 years 3 months (135 months)**.
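
#### The month-to-year arithmetic behind these figures can be checked with integer division. This is a minimal sketch; `months_to_years_months` is a hypothetical helper and is not part of the notebook.

```python
def months_to_years_months(months: int) -> str:
    """Format a month count as 'Y years M months' (hypothetical helper)."""
    # divmod returns the whole years and the leftover months in one step
    years, remainder = divmod(months, 12)
    return f"{years} years {remainder} months"


print(months_to_years_months(135))  # 11 years 3 months
print(months_to_years_months(7))    # 0 years 7 months
```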



#### Test data

In [None]:
test.iloc[:,1:10].select_dtypes(include='int').describe()

### Let's check column-wise whether there are any missing values in the train and test datasets.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

#### Both the train and test datasets have missing values in the Credit_Product column.
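
#### Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (`demo` is an assumption, not the competition data) to show how the percentage of missing values per column could be computed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the train frame; only Credit_Product has gaps.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Credit_Product": ["Yes", np.nan, "No", np.nan],
})

# isnull() gives a boolean frame; its column means are the missing fractions.
missing_pct = demo.isnull().mean() * 100
print(missing_pct.round(1))
```

#### The same expression applied to `train` and `test` would give the real percentages.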

### Let's perform Exploratory data analysis.


#### First let's see the target column distribution.

#### The target column **Is_Lead** is binary (0, 1). Let's map **0 to Not_Interested** and **1 to Interested**.

In [None]:
target_encode={0:'Not_Interested',1:'Interested'}
train['Is_Lead']=train['Is_Lead'].map(target_encode)

In [None]:
def combine(counts, percentages):
    fmt = '{} ({:.1f}%)'.format
    return [fmt(c, p) for c, p in zip(counts, percentages)]

(ggplot(train,aes(x='Is_Lead',fill='Is_Lead'))+
geom_bar()+
geom_text(
aes(label=after_stat('combine(count, prop*100)'), group=1),
stat='count',
nudge_y=0.125,
va='bottom')+
labs(x='Customers_Interest',y='Count',title="Customer's Response in the Recommended Credit Cards")+
theme_seaborn(style='ticks')+
theme(figure_size=(8,5),
legend_position='none',      
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text=element_text(style='normal',size=14,weight='bold'),    
axis_ticks=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))



#### The above bar chart shows that the target column is biased towards **one class (0-Not_Interested)**, which clearly indicates a class imbalance.

### Let's see the gender-wise customer response to the recommended credit cards.
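
#### The imbalance can also be expressed as numbers. This is a sketch on made-up labels (`labels` is hypothetical); the same two lines would work on `train['Is_Lead']`.

```python
import pandas as pd

# Hypothetical labels: three uninterested customers for every interested one.
labels = pd.Series(["Not_Interested"] * 3 + ["Interested"] * 1)

# normalize=True returns class shares instead of raw counts.
shares = labels.value_counts(normalize=True)

# Ratio of the majority class to the minority class.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print("Majority-to-minority ratio:", imbalance_ratio)
```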

In [None]:
train['Gender'].value_counts()

In [None]:

(ggplot(train,aes(x='Gender',fill='Is_Lead'))+
geom_bar()+
geom_text(
aes(label=after_stat('count')),
stat='count',position=position_stack(vjust=0.5))+
labs(x='',y='Count',title="Gender Wise Customer's Response in the Recommended Credit Cards")+
theme_seaborn(style='ticks')+
theme(figure_size=(8,5),
legend_position='right',      
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text=element_text(style='normal',size=14,weight='bold'),    
axis_ticks=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
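# count customers per (Gender, Is_Lead) pair and express each count as a share of all customers
gender_lead = train.groupby(['Gender','Is_Lead'])['Is_Lead'].agg(['count'])
gender_lead['Percentage'] = 100 * gender_lead['count'] / gender_lead['count'].sum()
gender_lead.reset_index()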

#### The bar chart shows that more male customers than female customers are interested in the recommended credit cards.
#### The percentage of customers not interested in the recommended credit cards is high in both genders.
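
#### Because the two genders have different group sizes, raw counts can hide the interest rate within each gender. The sketch below uses a made-up frame (`demo` is an assumption) and row-normalises a cross-tabulation so that each gender sums to 100%.

```python
import pandas as pd

# Hypothetical data: one interested customer in each gender, but different group sizes.
demo = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 2,
    "Is_Lead": ["Interested", "Not_Interested", "Not_Interested", "Not_Interested",
                "Interested", "Not_Interested"],
})

# normalize="index" divides each row by its own total, giving within-gender percentages.
rates = pd.crosstab(demo["Gender"], demo["Is_Lead"], normalize="index") * 100
print(rates)
```

#### In this toy example both genders have the same count of interested customers, yet the interest rate differs (25% against 50%), which is why a within-group rate is worth checking alongside the counts.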

### Let's see the Customer's age distribution

#### Function to find Inter-Quartile-Range

In [None]:
def iqr(x: [int,float])->[int,float]:
    """Inter-quartile range.
    
    With the help of NumPy's percentile function we get the 1st and 3rd quartiles,
    then subtract the 1st quartile from the 3rd quartile.
    
    parameters:
    -----------
    input: list or array of numerical values
    return: single numerical value.
    """
    q1_x = np.percentile(x, 25, interpolation='midpoint')
    q3_x = np.percentile(x, 75, interpolation='midpoint')
    return q3_x - q1_x

#### Let's create a function to find an optimal bin width using the Freedman-Diaconis rule.

In [None]:
def bin_w(x: [int,float])->[int,float]:
    """
    With the help of the iqr function above we get the IQR value.
    The Freedman-Diaconis rule then gives the bin width as 2 * IQR / n^(1/3),
    where n is the number of observations.
    
    parameters:
    ----------
    input: array or pandas Series of numerical values (must have a .shape attribute)
    return: single numerical value.
    
    """
    bw=(2 * iqr(x)) / np.power(x.shape[0], 1/3)
    return bw
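
#### A small worked example makes the rule concrete. This is a sketch, not the notebook's own function: `freedman_diaconis_width` is a hypothetical name, and it uses NumPy's default (linear) percentile method instead of the midpoint method used above, so results on real data can differ slightly.

```python
import numpy as np


def freedman_diaconis_width(values) -> float:
    """Bin width = 2 * IQR / n^(1/3) (hypothetical helper for illustration)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return 2 * (q3 - q1) / np.cbrt(values.size)


# Eight evenly spaced points: Q1 = 2.75, Q3 = 6.25, so IQR = 3.5 and n^(1/3) = 2.
sample = np.arange(1, 9)
width = freedman_diaconis_width(sample)
print(width)
```

#### NumPy also exposes this rule directly through `np.histogram_bin_edges(values, bins='fd')`, which can serve as a cross-check.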

In [None]:
age_bw=bin_w(train['Age'])
(ggplot(train)+geom_histogram(aes(x='Age'),fill='seagreen',color='black',
                             binwidth=age_bw)+
scale_x_continuous(breaks=range(20,90,10))+ 
labs(y='',title="Customer's Age Distribution")+
theme_void()+
theme(figure_size=(10,5),
panel_grid=element_blank(),
axis_ticks=element_blank(), 
plot_background=element_rect(fill='pink'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text=element_text(style='normal',size=14,weight='bold'),
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


In [None]:
age_bw=bin_w(train['Age'])
(ggplot(train,aes(x='0',y='Age'))+
geom_boxplot(color='seagreen',fill='none',size=2.5)+
labs(x='',y='',title="Customer's Age Distribution Boxplot")+
theme_void()+
theme(figure_size=(8,5),
panel_grid=element_blank(),
axis_ticks=element_blank(), 
plot_background=element_rect(fill='pink'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_y=element_text(style='normal',size=14,weight='bold'),
axis_text_x=element_blank(),
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
print("The Customer's Age Median is:",train['Age'].median())

In [None]:
print("The Customer's Age Mode is:",st.mode(train['Age']))

In [None]:
print("Customer's Age Summary\n",train['Age'].describe())

#### The above histogram shows a roughly symmetric age distribution with no strong skew. It is bimodal rather than normal, with two peaks around **ages 30 and 50**.

#### The boxplot shows there are **no outliers**.

#### The **mean and median are close**, which is consistent with a **roughly symmetric distribution with skewness near zero**.

#### Most customers who are eligible to get credit cards are in the age range of **24 to 30**.
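
#### Comparing the mean and median is only an indirect check of symmetry. A skewness coefficient gives a direct number. The sketch below uses only the standard library and made-up samples; `sample_skewness` is a hypothetical helper, and on the real data it would be called with `train['Age']`.

```python
import statistics as st


def sample_skewness(values) -> float:
    """Fisher-Pearson skewness: mean cubed deviation divided by the cubed population std."""
    mean = st.fmean(values)
    std = st.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)


symmetric = [1, 2, 3, 4, 5]       # deviations cancel out, skewness is zero
right_skewed = [1, 1, 2, 2, 10]   # one large value pulls the tail to the right

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

#### A value near zero supports the symmetry claim, a positive value indicates a longer right tail, and a negative value a longer left tail.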

### Let's see if there is any difference in customer age distribution and their responses on recommended credit card categories.

In [None]:
age_bw=bin_w(train['Age'])
(ggplot(train,aes(x='Age'))+
       geom_histogram(fill='seagreen',color='black',
                             binwidth=age_bw)+
scale_x_continuous(breaks=range(20,90,5))+ 
labs(y='',title="Customer's Age Distribution")+
facet_wrap('Is_Lead')+ 
theme_dark()+
theme(figure_size=(12,5),
panel_grid=element_blank(),
axis_ticks=element_blank(), 
plot_background=element_rect(fill='pink'), 
panel_background=element_rect(fill='none'),   
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text=element_text(style='normal',size=14,weight='bold'),
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


#### The above histogram shows that most customers who are not interested in the recommended credit cards are in the age range of **27 to 30**.

#### Customers aged **45 to 55** show higher interest in the recommended credit cards than customers in other age ranges.

#### Let's create a categorical group from customer age column 

In [None]:
train['age_bin']=pd.cut(train['Age'],bins=7,labels=['22-31','31-40','40-49','49-58','58-67','67-76','76-85'])
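
#### With `bins=7`, `pd.cut` splits the observed age range into equal-width intervals, so hand-written labels only approximate the real edges. The sketch below shows one way to build the labels from the edges that `pd.cut` actually returns. The `ages` Series is made up and the label format is an assumption.

```python
import pandas as pd

# Hypothetical ages spanning the 23 to 85 range reported in the summary.
ages = pd.Series([23, 30, 41, 52, 60, 70, 85])

# retbins=True also returns the computed bin edges.
_, edges = pd.cut(ages, bins=7, retbins=True)

# Build one label per interval from consecutive edges, rounded to whole years.
labels = [f"{lo:.0f}-{hi:.0f}" for lo, hi in zip(edges[:-1], edges[1:])]

# Re-cut with the explicit edges so labels and intervals are guaranteed to match.
age_bin = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
print(pd.DataFrame({"Age": ages, "age_bin": age_bin}))
```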

### Let's see if there are any differences between gender-wise customer's interest responses and customer's age category.

In [None]:
(ggplot(train,aes(x='age_bin',fill='Is_Lead'))+
geom_bar()+
geom_text(
aes(label=after_stat('count')),
stat='count',position=position_stack(vjust=0.5))+
labs(x='',y='Count')+
facet_wrap('Gender')+
theme_dark()+
theme(figure_size=(12,8),
legend_position='right',  
panel_grid=element_blank(),
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The stacked bar chart shows that both male and female customers in the age categories **40-49 and 49-58** are highly interested in the recommended credit cards.

#### The largest group of the bank's customers is in the age range of **22 to 31**.

### Let's see how many regions are there and check region wise customer responses on recommended credit cards.

In [None]:
(ggplot(train,aes(x='Region_Code',fill='Is_Lead'))+
geom_bar(position='identity')+
geom_text(
aes(label=after_stat('count')),ha='left',size=10,
stat='count',position=position_identity())+
labs(x='',y='Count')+
facet_wrap('Is_Lead')+
theme_dark()+
theme(figure_size=(12,8),
legend_position='none',  
panel_grid=element_blank(),
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold'))+coord_flip())

#### The above chart shows that customers from regions **254, 268, 283, and 284** show higher interest in the recommended credit cards. The same regions also have larger numbers of uninterested customers.

### Let's see Region-wise gender distribution.

In [None]:
(ggplot(train,aes(x='Region_Code',fill='Gender'))+
geom_bar(position='dodge2')+
labs(x='',y='Count')+
#facet_wrap('Is_Lead',ncol=1,scales='free')+
theme_dark()+
theme(figure_size=(14,5),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=90),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


In [None]:
#function to highlight the maximum number
def highlight_max(x):
    return ['background-color: lightgreen' if v == x.max() else ''
                for v in x]


In [None]:
train.groupby(['Region_Code','Gender'])['Gender'].agg({'count'}).reset_index().pivot_table(index='Gender', 
                    columns='Region_Code', 
                    values='count').style.apply(highlight_max)

#### The above chart and table show that regions **252, 256, 264, 267, 271, and 275** have more female customers than male customers.
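
#### `Styler.apply` works column by column by default, so in the pivot table above `highlight_max` marks, for each region, the gender with the larger count. The sketch below reproduces that logic on a made-up table without the styling step. The region names `RG_A` and `RG_B` are hypothetical.

```python
import pandas as pd


def highlight_max(x):
    # Same rule as in the notebook: colour the cell(s) equal to the column maximum.
    return ['background-color: lightgreen' if v == x.max() else ''
            for v in x]


# Hypothetical gender counts for two regions.
counts = pd.DataFrame({"RG_A": [120, 80], "RG_B": [40, 90]},
                      index=["Female", "Male"])

# Apply the function to each column by hand to see the CSS strings it produces.
styles = {col: highlight_max(counts[col]) for col in counts}
print(styles)
```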

### Let's see region-wise age distribution using boxplot

In [None]:
(ggplot(train,aes(x='Region_Code',y='Age',fill='Region_Code'))+
geom_boxplot()+
labs(x='',y='Age')+
#facet_wrap('Is_Lead',ncol=1,scales='free')+
theme_dark()+
theme(figure_size=(14,5),
legend_position='none',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=90),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold')))

#### The above boxplot shows visible differences in the customer age distribution across regions.

#### There are outlier points in some regions.

### Let's see whether customers' responses to the recommended credit cards depend on their occupation.
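
#### A boxplot conventionally draws points beyond 1.5 times the IQR from the quartiles as outliers. Assuming that convention, the sketch below counts such points per region on a made-up frame. `count_outliers`, `demo`, and the region names are hypothetical.

```python
import pandas as pd


def count_outliers(values: pd.Series) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    spread = q3 - q1
    mask = (values < q1 - 1.5 * spread) | (values > q3 + 1.5 * spread)
    return int(mask.sum())


# Hypothetical ages: region RG_A contains one unusually old customer.
demo = pd.DataFrame({
    "Region_Code": ["RG_A"] * 6 + ["RG_B"] * 6,
    "Age": [30, 31, 32, 33, 34, 80, 40, 41, 42, 43, 44, 45],
})

outliers_per_region = demo.groupby("Region_Code")["Age"].apply(count_outliers)
print(outliers_per_region)
```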

In [None]:
(ggplot(train,aes(x='Occupation',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=12,
stat='count',position=position_dodge(width=1))+
labs(x='Occupation',y='Count')+
#facet_wrap('Is_Lead',ncol=1,scales='free')+
theme_dark()+
theme(figure_size=(14,5),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold'),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold')))

#### The above bar chart explains that the customers who are self-employed and the customers who have other occupations are highly interested in recommended credit cards.
#### Most of the eligible customers are self-employed and salaried.

### Let's see if there are any differences in gender-wise occupation and responses to the recommended credit cards.

In [None]:
(ggplot(train,aes(x='Occupation',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=12,
stat='count',position=position_dodge(width=1))+
labs(x='Occupation',y='Count')+
facet_wrap('Gender',ncol=2,scales='free')+
theme_dark()+
theme(figure_size=(14,5),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above dodged bar chart shows that most salaried customers are female and that they are less interested in the recommended credit cards.

#### Self-employed customers and customers with other occupations are highly interested in the recommended credit cards.

### Let's compare the age group and type of occupations and check customer's responses on recommended credit cards.

In [None]:
(ggplot(train,aes(x='age_bin',fill='Is_Lead'))+
geom_bar(position='dodge')+
labs(y='Count')+
facet_grid('Gender~Occupation',scales='free')+
theme_dark()+
theme(figure_size=(14,5),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


In [None]:
train.groupby(['Occupation','Gender','age_bin','Is_Lead',])['Is_Lead'].agg({'count'}).reset_index().pivot_table(index=['Gender','Is_Lead','Occupation'], 
                    columns=['age_bin']).style.apply(highlight_max)

#### The above bar chart and pivot table show that most of the salaried customers are in the age range of **22-31**.

#### Customers in the **40-49 and 49-58** age ranges are mostly **self-employed and interested in the recommended credit cards**.

#### The age range of customers who are entrepreneurs is **21-58**.

#### Customers who are older than **67** have occupations that do not fall under the categories of salaried, self-employed, or entrepreneur.


### Let's see region-wise customer occupations.

In [None]:
(ggplot(train,aes(x='Region_Code',fill='Occupation'))+
geom_bar()+
labs(y='Count')+
facet_wrap('Occupation',ncol=1)+
theme_dark()+
theme(figure_size=(14,8),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
train.groupby(['Region_Code','Occupation'])['Occupation'].agg({'count'}).reset_index().pivot_table(index='Region_Code', 
                    columns=['Occupation'],values='count').style.apply(highlight_max)

#### The highest numbers of entrepreneur customers are from regions 268, 283, and 284.

#### Let's see which channels customers use to interact with the bank, and whether customer responses differ by channel.

In [None]:
(ggplot(train,aes(x='Channel_Code',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=12,
stat='count',position=position_dodge(width=1))+
labs(x='Channel_Code',y='Count')+
#facet_wrap('Occupation',ncol=1)+
theme_dark()+
theme(figure_size=(14,8),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above dodged bar chart shows that most customers have interacted with the bank through channel **X1**.

#### Customers who used channel **X3** to interact with the bank are highly interested in the recommended credit cards.

### Let's see whether the channel used to interact with the bank differs by gender, and whether the customer's interest changes with it.

In [None]:
(ggplot(train,aes(x='Channel_Code',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=12,
stat='count',position=position_dodge(width=1))+
labs(x='Channel_Code',y='Count')+
facet_wrap('Gender',ncol=2,scales='free')+
theme_dark()+
theme(figure_size=(14,8),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above dodged bar chart shows that the largest number of customers use the **X1** channel to interact with the bank. Most of them are female, and they show a higher interest in the recommended credit cards than male customers.

#### More of the interested customers use **X2 and X3** to interact with the bank.

### Let's see whether the channel used to interact with the bank depends on the customer's occupation and gender.

In [None]:
(ggplot(train,aes(x='Channel_Code',fill='Gender'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=8,
stat='count',position=position_dodge(width=1))+
labs(x='Channel_Code',y='Count')+
facet_grid('Is_Lead~Occupation',scales='free')+
theme_dark()+
theme(figure_size=(12,10),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above faceted bar chart shows that salaried customers mostly use the X1 channel to interact with the bank.

#### Self-employed customers mostly use the X2 and X3 channels to interact with the bank.

#### Customers with other occupations mostly use the X1, X2, and X3 channels to interact with the bank.

### Let's see the age groups of customers and the channels they use to communicate with the bank.

In [None]:
(ggplot(train,aes(x='Channel_Code',fill='Channel_Code'))+
geom_bar()+
geom_text(
aes(label=after_stat('count')),va='bottom',size=8,
color='darkblue',fontweight='bold',
stat='count',position=position_stack(vjust=0.5))+
labs(y='Count')+
facet_wrap('age_bin',ncol=2,scales='free')+
theme_dark()+
theme(figure_size=(12,12),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


#### The above bar chart shows that customers in the age range 22 to 31 mostly use channel X1 to interact with the bank.

#### Customers aged 40 and older mostly use the X2 and X3 channels to interact with the bank.

### Let's see which channel of interaction is popular in all the regions.

In [None]:
(ggplot(train,aes(x='Region_Code',fill='Channel_Code'))+
geom_bar()+
facet_wrap('Channel_Code',ncol=1)+
theme_dark()+
theme(figure_size=(12,8),
legend_position='none',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))



#### Customers from regions 253, 268, 283, and 284 mostly use the X1, X2, and X3 channels to interact with the bank.


#### Let's see how long customers have been on the bank's books and how that relates to their response to the recommended credit cards.

#### The vintage is given in months, so let's convert it to years.

In [None]:
#month to year conversion
def mon_to_yr(col):
  """
  convert a duration in months to years, rounded to one decimal place
  (e.g. 7 months is about 0.6 years, 135 months is about 11.2 years)
  """
  return round(col/12,1)


In [None]:
train['vintage_yr']=train['Vintage'].apply(mon_to_yr)
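
#### Since a year is twelve months for every row, the conversion can also be written as one vectorised division instead of a row-by-row `apply`. This is a sketch on made-up month values, not the notebook's own helper.

```python
import pandas as pd

# Hypothetical vintage values in months (shortest, typical, longest).
vintage_months = pd.Series([7, 44, 135])

# Divide the whole column by 12 and keep one decimal place.
vintage_years = (vintage_months / 12).round(1)
print(vintage_years.tolist())
```

#### On the real data the equivalent line would be `train['Vintage'] / 12`, rounded as needed.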

In [None]:
vintage_bin=bin_w(train['vintage_yr'])
(ggplot(train,aes(x='vintage_yr',fill='Is_Lead'))+
geom_histogram(binwidth=vintage_bin,color='black')+
scale_x_continuous(breaks=range(0,15,1))+
facet_wrap('Is_Lead',ncol=1)+
theme_dark()+
theme(figure_size=(12,8),
legend_position='none',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'), 
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


#### The above histogram shows that customers who have been on the bank's books for 1 to 3 years and for 1 to 8 years are highly interested in the recommended credit cards.

#### Most of the eligible customers have been on the bank's books for 1 to 3 years.

#### Let's bin the vintage years using pandas cut.

In [None]:
train['vintage_yr_bin']=pd.cut(train['vintage_yr'],bins=5 ,labels=['0.7-2.8','2.8-4.9','4.9-7.0','7.0-9.1','9.1-11.2'])

#### Let's compare gender-wise customer responses with vintage year category.

In [None]:
(ggplot(train,aes(x='vintage_yr_bin',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=8,
color='darkblue',fontweight='bold',angle=40,
stat='count',position=position_dodge(width=1))+
labs(x='Vintage_Year_Category',y='Count')+
facet_wrap('Gender',ncol=2,scales='free')+
theme_dark()+
theme(figure_size=(10,8),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above chart shows that most eligible customers have been with the bank for between 7 months and 3 years.

#### Male customers who have been with the bank for 7 to 9 years are highly interested in the recommended credit cards.

### Let's compare the gender-wise customer age category with vintage year category.

In [None]:
(ggplot(train,aes(x='vintage_yr_bin',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='center',size=8,
color='darkblue',fontweight='bold',
stat='count',position=position_dodge(width=1))+
labs(x='Vintage_Year_Category',y='Count')+
facet_grid('age_bin~Gender',scales='free')+
theme_dark()+
theme(figure_size=(11,12),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above dodged bar chart shows that customers aged 40 and above who have been on the bank's books for 7 to 9 years show higher interest in the recommended credit cards.

#### Let's see if the customer has any previous loans.

In [None]:
(ggplot(train,aes(x='Credit_Product',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=12,
color='darkblue',fontweight='bold',
stat='count',position=position_dodge(width=1))+
labs(x='Credit_Product',y='Count')+
theme_dark()+
theme(figure_size=(11,6),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))


#### The above dodged bar chart shows that customers who have previous loans are highly interested in the recommended credit cards.

#### Previous loan information is missing for 29325 customers.
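
#### One simple way to keep these rows visible in later charts and tables is to give the missing values their own category. This is only a sketch of that option on made-up data. The `'Unknown'` label is an assumption, and whether it is the right treatment depends on the model used later.

```python
import numpy as np
import pandas as pd

# Hypothetical Credit_Product column with two missing entries.
demo = pd.DataFrame({"Credit_Product": ["Yes", np.nan, "No", np.nan, "No"]})

# Replace NaN with an explicit label so the rows are counted like any other category.
demo["Credit_Product"] = demo["Credit_Product"].fillna("Unknown")

counts = demo["Credit_Product"].value_counts()
print(counts)
```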

### Let's compare customer's previous loan details with their occupations.

In [None]:
(ggplot(train,aes(x='Credit_Product',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=10,
color='darkblue',fontweight='bold',
stat='count',position=position_dodge(width=1))+
facet_wrap('Occupation',ncol=2,scales='free')+
labs(y='Count')+
theme_dark()+
theme(figure_size=(11,8),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))



#### The above chart shows that entrepreneur customers have fewer previous loans than customers with other occupations.

#### Self-employed customers have a high number of previous loans.

### Let's see which region customers have the highest previous loans.

In [None]:
(ggplot(train,aes(x='Region_Code',fill='Credit_Product'))+
geom_bar(position='dodge')+
facet_wrap('Credit_Product',ncol=1)+
labs(y='Count')+
theme_dark()+
theme(figure_size=(11,12),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### Customers from regions 254, 268, and 283 have a higher number of previous loans with the bank.
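
#### With many region codes on the x-axis, the top regions are easier to read from a sorted table than from the bars. A minimal sketch, assuming that `Credit_Product == 'Yes'` marks a previous loan; the `toy` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; region codes and loan flags are made up.
toy = pd.DataFrame({
    'Region_Code': [254, 268, 254, 283, 254, 268, 270, 283],
    'Credit_Product': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No'],
})

# Count customers with a previous loan per region, largest first.
loans_by_region = (
    toy[toy['Credit_Product'] == 'Yes']
    .groupby('Region_Code')
    .size()
    .sort_values(ascending=False)
)

# Keep only the two regions with the most previous loans.
top_regions = loans_by_region.head(2)
print(top_regions)
```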

### Let's see the distribution of customers' average account balance using a histogram.

In [None]:
#optimal bin width for Avg_Account_Balance

avg_bal_bin=bin_w(train['Avg_Account_Balance'])

(ggplot(train,aes(x='Avg_Account_Balance'))+
geom_histogram(binwidth=avg_bal_bin,fill='seagreen',color='black')+
labs(y='Count',title='Avg_Account_Balance Distribution histogram')+
theme_dark()+
theme(figure_size=(11,7),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
(ggplot(train,aes(x='0',y='Avg_Account_Balance'))+
geom_boxplot(color='seagreen',fill='none',size=2.5)+
labs(x='',y='',title="Avg_Account_Balance Distribution Boxplot")+
theme_void()+
theme(figure_size=(8,5),
panel_grid=element_blank(),
axis_ticks=element_blank(), 
plot_background=element_rect(fill='pink'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_y=element_text(style='normal',size=14,weight='bold'),
axis_text_x=element_blank(),
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
print("The Avg_Account_Balance Median is:",train['Avg_Account_Balance'].median())

In [None]:
print("The Avg_Account_Balance Mode is:",st.mode(train['Avg_Account_Balance']))

In [None]:
print('Avg_Account_Balance Summary\n',train['Avg_Account_Balance'].describe())

#### The above histogram shows that the customers' Avg_Account_Balance distribution is right-skewed.

#### The boxplot shows that there are **outliers**.
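
#### Both observations can also be checked numerically. The sketch below computes the sample skewness and counts points outside the usual 1.5 × IQR fences, which is the rule a boxplot typically uses to draw outliers. The `balance` Series is hypothetical toy data, not the real column.

```python
import pandas as pd

# Hypothetical toy balances with one very large value.
balance = pd.Series([100, 120, 130, 150, 160, 180, 200, 1000],
                    name='Avg_Account_Balance')

# Positive skewness indicates a longer right tail.
skewness = balance.skew()

# Boxplot-style fences at 1.5 * IQR beyond the quartiles.
q1, q3 = balance.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers.
outliers = balance[(balance < lower_fence) | (balance > upper_fence)]
print("Skewness:", skewness)
print("Number of outliers:", len(outliers))
```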

### Let's see if the customers' Avg_Account_Balance distribution differs by their response to the recommended credit cards.

In [None]:
(ggplot(train,aes(x='0',y='Avg_Account_Balance'))+
geom_boxplot(color='seagreen',fill='none',size=1.5)+
labs(x='',y='',title="Avg_Account_Balance Distribution Boxplot")+
facet_wrap('Is_Lead')+
theme_dark()+
theme(figure_size=(11,7),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_blank(),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
train.groupby(['Is_Lead'])['Avg_Account_Balance'].agg(['min','mean','median','max','std'])

#### The above boxplot and summary show that there is no significant difference between the two balance distributions.
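
#### Since the balance is skewed, quartiles per group can be a more robust comparison than the mean and standard deviation. A minimal sketch on a hypothetical `toy` DataFrame (values made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the train dataset.
toy = pd.DataFrame({
    'Is_Lead': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'Avg_Account_Balance': [100, 200, 300, 400, 500,
                            150, 250, 350, 450, 550],
})

# One row per Is_Lead group, one column per quartile.
quartiles = (
    toy.groupby('Is_Lead')['Avg_Account_Balance']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```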

### Let's see if there is a difference in customers' average balance based on their type of occupation.

In [None]:
(ggplot(train,aes(x='0',y='Avg_Account_Balance'))+
geom_boxplot(color='seagreen',fill='none',size=1.5)+
labs(x='',y='',title="Avg_Account_Balance Distribution Boxplot")+
facet_wrap('Occupation')+
theme_dark()+
theme(figure_size=(11,7),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_blank(),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
train.groupby(['Occupation'])['Avg_Account_Balance'].agg(['min','mean','median','max','std'])

### Let's see how many customers have been active in the past 3 months.

In [None]:
(ggplot(train,aes(x='Is_Active',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=10,
color='darkblue',fontweight='bold',
stat='count',position=position_dodge(width=1))+
labs(y='Count')+
theme_dark()+
theme(figure_size=(11,8),
legend_position='right',  
panel_grid=element_blank(),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above dodge bar chart explains that the customers who have not been active in the past three months are highly interested in recommended credit cards.
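
#### The bar chart compares raw counts, and the active and inactive groups may differ in size. A lead rate per group is a useful cross-check. The sketch below assumes `Is_Lead` is coded 0/1 and uses a hypothetical `toy` DataFrame, so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the train dataset.
toy = pd.DataFrame({
    'Is_Active': ['No', 'No', 'No', 'No', 'Yes', 'Yes'],
    'Is_Lead':   [1, 0, 0, 0, 1, 0],
})

# Per activity group: number of leads, group size, and share of leads.
summary = toy.groupby('Is_Active')['Is_Lead'].agg(
    leads='sum', customers='count', lead_rate='mean'
)
print(summary)
```

#### In this toy example both groups have the same number of leads, but the rate is higher in the smaller group, which is why counts and rates are worth reading together.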

#### Let's see, by occupation, how many customers have been active in the past 3 months and how they responded to the recommended credit cards.

In [None]:
(ggplot(train,aes(x='Is_Active',fill='Is_Lead'))+
geom_bar(position='dodge')+
geom_text(
aes(label=after_stat('count')),va='baseline',size=10,
color='darkblue',fontweight='bold',
stat='count',position=position_dodge(width=1))+
facet_wrap('Occupation')+
labs(y='Count')+
theme_dark()+
theme(figure_size=(11,8),
legend_position='right',  
panel_grid=element_blank(),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_text(style='normal',size=12,weight='bold',rotation=70),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

#### The above dodge bar chart explains that the customers who are self-employed and have not been active in the past 3 months are highly interested in recommended credit cards.

### Let's compare the customer's past three months' account activity and their average account balance.

In [None]:
(ggplot(train,aes(x='0',y='Avg_Account_Balance'))+
geom_boxplot(color='seagreen',fill='none',size=1.5,outlier_color='red',outlier_alpha=0.1)+
labs(x='',y='')+
facet_wrap('Is_Active')+
theme_dark()+
theme(figure_size=(11,7),
legend_position='right',  
panel_grid=element_blank(),
subplots_adjust={'hspace': 0.4,'wspace': 0.4},
#plot_background=element_rect(fill='lightgrey'),
panel_background=element_rect(fill='none'),    
plot_title=element_text(style='normal',size=16,weight='bold'),      
axis_text_x=element_blank(),
axis_text_y=element_text(style='normal',size=12,weight='bold'),    
axis_ticks=element_blank(),
legend_background=element_blank(),    
axis_title=element_text(style='normal',size=14,weight='bold'),
strip_text=element_text(style='normal',size=14,weight='bold')))

In [None]:
train.groupby(['Is_Active'])['Avg_Account_Balance'].agg(['min','mean','median','max','std'])

#### The above boxplot shows that the average account balance differs between customers who have been active in the past 3 months and those who have not.

### Let's analyze the missing values

In [None]:
import missingno as msno

In [None]:
train.isnull().sum()

#### The Credit_Product column has 29325 missing values.

#### A dendrogram reveals deeper trends than the pairwise heatmap.

In [None]:
msno.dendrogram(train.iloc[:,1:10])

#### The dendrogram shows that the Credit_Product column has the largest distance from the other columns, so its missing values cannot be predicted from the other columns.
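
#### A complementary check is to see whether the share of missing Credit_Product values changes across the categories of another column. The sketch below does this for Occupation on a hypothetical `toy` DataFrame; the values are made up and only illustrate the technique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the train dataset.
toy = pd.DataFrame({
    'Occupation': ['Salaried', 'Salaried', 'Other', 'Other',
                   'Self_Employed', 'Self_Employed'],
    'Credit_Product': ['Yes', 'No', np.nan, np.nan, np.nan, 'Yes'],
})

# True where Credit_Product is missing.
is_missing = toy['Credit_Product'].isnull()

# Mean of a boolean flag = share of missing values per occupation.
missing_share = is_missing.groupby(toy['Occupation']).mean()
print(missing_share)
```

#### Similar shares across groups would suggest that the missingness is not tied to that column, while large differences would suggest that it is.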