## Dataset Description

The dataset comes from a direct marketing campaign conducted by a Portuguese banking institution. The campaign was based on phone calls. Often, more than one contact with the same client was needed to determine whether the client would subscribe to the bank's term deposit ("yes") or not ("no"). The dataset contains 21 attributes and 41,188 customer records.
**The aim is to predict whether a client subscribes to a term deposit, and which kinds of clients do so, using the 'y' variable as the subscription outcome.**

## Input Variables

1. age (numeric)
2. job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
3. marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
4. education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
5. default: has credit in default? (categorical: "no","yes","unknown")
6. housing: has housing loan? (categorical: "no","yes","unknown")
7. loan: has personal loan? (categorical: "no","yes","unknown")
   
    *related with the last contact of the current campaign*:
8. contact: contact communication type (categorical: "cellular","telephone") 
9. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
10. day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
11. duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target 
      (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the 
      call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the 
      intention is to have a realistic predictive model.
  
    *other attributes*:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means 
      client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
  
    *social and economic context attributes*
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric)     
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)     
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
  
     *Output variable (desired target)*
21. y - has the client subscribed a term deposit? (binary: "yes","no")
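
The note on `duration` can be turned into a simple preprocessing step. The sketch below is only an illustration: it uses a tiny made-up sample in the same semicolon-delimited layout, and the names `sample_df`, `realistic_features` and `target` are assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny made-up sample in the same ';'-delimited layout (not real records).
sample = io.StringIO(
    'age;job;duration;campaign;y\n'
    '34;"admin.";210;1;"no"\n'
    '51;"retired";0;2;"no"\n'
    '29;"student";645;1;"yes"\n'
)
sample_df = pd.read_csv(sample, sep=';')

# For a realistic predictive model, drop 'duration' (unknown before the call)
# and separate the target from the inputs.
realistic_features = sample_df.drop(columns=['duration', 'y'])
target = sample_df['y']
print(realistic_features.columns.tolist())
```

Keeping a second feature set that still contains `duration` would serve the benchmark purpose mentioned in the note.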



# 1. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp
import matplotlib.pyplot as plt
import re
import csv

**Loading the file**

In [None]:
%%time
df=pd.read_csv('/kaggle/input/bank-marketing/bank-additional-full.csv', delimiter=';')
df


# 2. Dataset Description

In [None]:
%%time
df.describe(include='all')

### Data Quality

**Checking Data Type**

In [None]:
df.info()

**Checking empty values**

In [None]:
df.isnull().sum()

In [None]:
df.shape

# 3. Data Transformation 

**Renaming columns**

In [None]:
df_v2=df.rename(columns={'marital':'marital_status','education':'educational_attainment',
                         'default':'credit_in_default','housing':'housing_loan','loan':'personal_loan',
                         'contact':'contact_type','day_of_week':'last_contact_day','month':'last_contact_month',
                         'duration':'last_contact_duration','campaign':'current_camp_contact_count',
                         'pdays':'days_after_previous_camp','previous':'previous_camp_contact_count',
                         'poutcome':'previous_outcome','y':'current_outcome','emp.var.rate':'employment_variation_rate',
                         'euribor3m':'euro_interbank_rate','nr.employed':'number_of_employees',
                         'cons.price.idx':'consumer_price_index', 'cons.conf.idx':'consumer_confidence_index',
                         'days_after_previous_camp_bool':'contacted_or_not'})
df_v2.columns

**Checking duplicates**

In [None]:
df_v2.duplicated().value_counts()

In [None]:
df_v2[df_v2.duplicated()]

In [None]:
df_v2=df_v2.drop_duplicates()
df_v2.duplicated().value_counts()
df_v2.info()

**Pdays**

Adding a new column for pdays to show whether the client was contacted or not (to remove 999 encoding)

In [None]:
df_v2['days_after_previous_camp_bool']=np.where(df_v2['days_after_previous_camp']== 999, 'no', 'yes')
df_v2['days_after_previous_camp_bool'].value_counts()
df_v2

In [None]:
df_v2=df_v2.rename(columns={'days_after_previous_camp_bool':'contacted_or_not'})
df_v2
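
Besides the yes/no flag, the 999 sentinel can also distort numeric summaries of the day counts. As a minimal sketch on made-up values (the column names `previously_contacted` and `days_clean` are hypothetical), the sentinel can be turned into a boolean flag and masked out as missing:

```python
import pandas as pd

# Made-up values; 999 is the dataset's "not previously contacted" sentinel.
demo = pd.DataFrame({'days_after_previous_camp': [999, 3, 999, 6]})

# Boolean flag instead of 'yes'/'no' strings.
demo['previously_contacted'] = demo['days_after_previous_camp'] != 999

# Keep real day counts and turn the sentinel into NaN,
# so that means and medians ignore it.
demo['days_clean'] = demo['days_after_previous_camp'].where(demo['previously_contacted'])

print(demo)
print(demo['days_clean'].mean())
```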

**Converting seconds to minutes**

In [None]:
df_v2['last_contact_duration_min']=df_v2['last_contact_duration']/60
df_v2['last_contact_duration_min'].max()

**Converting binary variables to 1 and 0 for visualization of bivariate and multivariate data**

In [None]:
df_v2 = df_v2.replace(to_replace = ['yes','no'],value = ['1','0'])
df_v2


**Creating bins for age**

In [None]:
def age_groups(row):
    if row['age'] < 25:
        return 'under 25'
    elif row['age'] >= 25 and row['age'] < 35:
        return '25-35'
    elif row['age'] >= 35 and row['age'] < 45:
        return '35-45'
    elif row['age'] >= 45 and row['age'] < 55:
        return '45-55'
    elif row['age'] >= 55 and row['age'] < 65:
        return '55-65'
    else:
        return 'above 65'
df_v2['age_groups'] = df_v2.apply(age_groups, axis=1)
df_v2
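
The same binning can be written without a row-wise `apply`. The following is a sketch on made-up ages, assuming left-closed intervals to mirror the `>=` / `<` comparisons in `age_groups`; `ages` and `age_bins` are illustrative names.

```python
import numpy as np
import pandas as pd

ages = pd.Series([17, 25, 34, 35, 54, 65, 88])  # made-up ages

# Left-closed bins [low, high) reproduce the `>=` / `<` logic of age_groups.
bins = [-np.inf, 25, 35, 45, 55, 65, np.inf]
labels = ['under 25', '25-35', '35-45', '45-55', '55-65', 'above 65']
age_bins = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_bins.tolist())
```

`pd.cut` returns an ordered categorical, so the groups also sort in age order in tables and plots.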

**Creating bins for duration**

In [None]:
def duration_groups(row):
    # Bins are left-closed, so e.g. exactly 1 minute falls into '1-2 min'
    minutes = row['last_contact_duration_min']
    if minutes == 0:
        return '0'
    elif minutes < 1:
        return 'under 1 min'
    elif minutes < 2:
        return '1-2 min'
    elif minutes < 4:
        return '2-4 min'
    elif minutes < 6:
        return '4-6 min'
    elif minutes < 8:
        return '6-8 min'
    else:
        return 'above 8'


df_v2['duration_min_groups'] = df_v2.apply(duration_groups, axis=1)
df_v2

**Capping outliers**

In [None]:
df_v2.describe(percentiles=[0.01,0.05,0.10,0.25,0.50,0.75,0.85,0.9,0.99])

In [None]:
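# Cap only the numeric columns; quantiles are not defined for text columns
num_cols=df_v2.select_dtypes(include='number').columns
df_v3=df_v2.copy()
df_v3[num_cols]=df_v2[num_cols].clip(lower=df_v2[num_cols].quantile(0.01),upper=df_v2[num_cols].quantile(0.99), axis=1)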

df_v3.describe(percentiles=[0.01,0.05,0.10,0.25,0.50,0.75,0.85,0.9,0.99])

**Encoding of categorical variables**

In [None]:
df_v2_encoded=pd.get_dummies(df_v2)
df_v2_encoded.columns
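
As a small illustration of what the encoding produces, the sketch below one-hot encodes a single made-up categorical column. Using `drop_first=True` is an assumption here (the notebook keeps all indicator columns); it drops one redundant indicator per variable, which can help with linear models.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'contact_type': ['cellular', 'telephone', 'cellular'],
    'age': [30, 41, 52],
})

# Encode only the listed categorical column; numeric columns pass through.
# drop_first=True keeps k-1 indicator columns for a variable with k categories.
encoded = pd.get_dummies(demo, columns=['contact_type'], drop_first=True)
print(encoded.columns.tolist())
```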

**Results:**
- **the data types are correct;**
- **there are no missing values;**
- **duplicates are removed;**
- **there are outliers, but they are a natural part of the population under study. The dataframe with capped outliers is df_v3 and the dataframe with the original values is df_v2.**

# 4. Univariate Analysis


### Custom Function for Data Plotting

In [None]:
# Function to plot numeric variables
sns.set(rc = {'figure.figsize':(15,8)})
def plot_numeric(field, xlabel, ylabel):   
    # Histogram
    plt.hist(df_v2[field], bins='auto', color='cadetblue', edgecolor='grey', histtype='bar', rwidth=1)
    plt.title(field.capitalize(), fontsize=18)
    plt.xlabel(xlabel.capitalize(), fontsize = 14)
    plt.ylabel(ylabel, fontsize = 14)
    plt.xticks(fontsize = 12, rotation = 75)
    plt.style.use('seaborn-whitegrid')
    plt.grid(color='white')
    plt.rcParams['axes.facecolor'] = 'lavender'
    plt.show()
    
     # Frequency table
    Absolute = df_v2[field].value_counts(ascending=False)
    Percent = round((df_v2[field].value_counts(normalize=True))*100,2)
    field_pd=pd.DataFrame({'counts': Absolute, 'percentage': Percent})
    print(field_pd)
    
    
    #Probability Distribution
    sns.kdeplot(data=df_v2[field], color='cadetblue', fill=True)
    plt.xlabel(xlabel.capitalize(), fontsize = 14)
    plt.grid(color='white')
    plt.rcParams['axes.facecolor'] = 'lavender'
    plt.show()
    
    #Box-plot
    sns.boxenplot(data=df_v2[field],color='cadetblue', orient="h")
    plt.xlabel(xlabel.capitalize(), fontsize = 14)
    plt.ylabel(ylabel, fontsize = 14)
    plt.grid(color='white')
    plt.rcParams['axes.facecolor'] = 'lavender'
    plt.show()
    
       
# Function to plot categorical variables
def plot_object(field, xlabel, ylabel):
    
    #Histogram
    plt.hist(df_v2[field], bins='auto',color='cadetblue', edgecolor='grey', histtype='bar', rwidth=1)
    plt.xlabel(xlabel, fontsize = 14)
    plt.ylabel(ylabel, fontsize = 14)
    plt.title(field.capitalize(), fontsize=18)
    plt.xticks(fontsize = 12, rotation = 75)
    plt.rcParams['axes.facecolor'] = 'lavender'
    plt.grid(color='white')
    plt.show()
    
    #Frequency Table
    Absolute = df_v2[field].value_counts(ascending=False)
    Percent = round((df_v2[field].value_counts(normalize=True))*100,2)
    field_pd=pd.DataFrame({'counts': Absolute, 'percentage': Percent})
    print(field_pd)
    
# Function to plot visuals depending on data type
def plot_field(df_v2, field, xlabel, ylabel):
    if df_v2[field].dtype == 'int64' or df_v2[field].dtype =='float64':
        plot_numeric(field, xlabel, ylabel)
    elif df_v2[field].dtype == 'object':
        plot_object(field, xlabel, ylabel)   


**1. Age**

 - The biggest three age groups that were targeted in the campaign are "25-35", "35-45" and "45-55"
 - The smallest group is "above 65"

In [None]:
field="age"
xlabel="Age"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

In [None]:
field="age_groups"
xlabel="Age"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**2. Job**

Type of job  (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired", "self-employed","services","student","technician","unemployed","unknown")

- The three largest groups are administrative staff, blue-collar workers and technicians. Together these professions make up more than 50% of the dataset.
- The three smallest groups are the unemployed, students and the "unknown" group. A possible explanation is that the marketing team assumed these groups would not have savings to deposit.
- The campaign did not focus on the self-employed and entrepreneurs. This could mean that the department was targeting mainly individuals and not legal entities; it could also mean that deposit offers were expected to interest only individuals and not the self-employed or entrepreneurs.
- Interestingly, the "management", "retired" and "services" groups were not actively approached, even though these groups might have savings to invest in deposits.

In [None]:
field="job"
xlabel="Job Category"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**3. Marital: marital status**

Type categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed

 - Most of the people in this dataset are either married or single

In [None]:
field="marital_status"
xlabel="Marital Status"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**4. Education**

Type categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown"

 - Around 30% of clients have higher education
 - Around 23% of clients have only a high school diploma
 - Most of the remaining clients have only 4 to 9 years of basic education or professional courses



In [None]:
field="educational_attainment"
xlabel="Education Category"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**5. Default: has credit in default?** 

Type categorical: "no","yes","unknown"

 - Around 80% of clients have no credit in default, i.e. they have not broken the terms of a credit agreement
 - The default status of almost all remaining clients is unknown
 - All in all, it seems that sales agents avoided people with known financial troubles and otherwise reached out to people whose default status was unknown
     

In [None]:
field="credit_in_default"
xlabel="Has credit in default?"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**6. Housing: has housing loan?** 

Type categorical: "no","yes","unknown"

 - Around 50% of people have housing loans and around 45% do not
 - It seems that sales agents did not treat a housing loan as a sign of likely failure, since they did not target only people without housing loans. This contrasts with default status, where sales agents targeted only people without credit in default.


In [None]:
field="housing_loan"
xlabel="Has housing loan?"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**7. loan: has personal loan?** 

Type categorical: "no","yes","unknown"

 - Having a personal loan seems to be a negative factor for sales agents, as they primarily targeted people without one.
 - The logic of the sales agents is becoming clearer: having a personal loan or credit in default makes people less suitable candidates for opening a deposit, whereas having a housing loan is not a barrier.

In [None]:
field="personal_loan"
xlabel="Has personal loan?"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

###   Variables related with the last contact of the current campaign:

**8. contact: contact communication type**

Type categorical: "cellular","telephone"

 - Around 64% of people were reached on their cell phones and the remaining 36% on landline phones


In [None]:
field="contact_type"
xlabel="Contact communication type"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**9. month: last contact month of year** 
    
Type categorical: "jan", "feb", "mar", ..., "nov", "dec"

 - The number of calls starts to rise in **spring**: around 3000 calls in April and a spike of around 14000 in May. Through the end of **summer** it stays between 5000 and 7000 calls per month.
 - No calls were made in January or February
 - Fewer than 1000 calls were made in each of September, October and December. Only November had around 4000 calls, the highest number in the **autumn-winter** season.
 - My assumption is that sales agents bother clients less before and after the winter holidays in December and January, and probably start planning the campaign in February


In [None]:
df_v2['last_contact_month_num'] = df_v2['last_contact_month'].replace(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul','aug','sep','oct','nov','dec'], ['1-jan', '2-feb', '3-mar', '4-apr', '5-may', '6-jun', '7-jul','8-aug','9-sep','10-oct','11-nov','12-dec'])
df_v2

In [None]:
field="last_contact_month_num"
xlabel="Last contact month of year"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**10. Day of week: last contact day of the week**
    
Type categorical: "mon","tue","wed","thu","fri"

- Sales agents made around 8000 calls on all working days

In [None]:
field="last_contact_day"
xlabel="Last contact day of the week"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**11. Duration: last contact duration, in seconds (numeric)**. 

*Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). 
Yet, the duration is not known before a call is performed. 
Also, after the end of the call y is obviously known. 
Thus, this input should only be included for benchmark purposes and 
should be discarded if the intention is to have a realistic predictive model.*

 - 75% of calls lasted less than 319 seconds (about 5 minutes); only 25% of calls lasted longer
 - The median is much lower than the mean, which means the distribution is right-skewed: most calls are shorter than the mean and a few very long calls pull the mean up
 - The maximum duration is just under 82 minutes (about 1 hour and 22 minutes)
 - The largest duration groups are "2-4 min" with 30% of all calls and "1-2 min" with 21%
 - Around 10% of calls lasted less than 1 minute
 - Around 14% of calls lasted more than 8 minutes
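
The relation between the mean and the median noted above is a quick skewness check. The sketch below uses made-up durations, not values from the dataset, to show how a few long calls pull the mean above the median:

```python
import pandas as pd

# Made-up call durations in seconds: many short calls and one very long one.
durations = pd.Series([60, 90, 120, 180, 200, 300, 4000])

print(durations.mean(), durations.median())
# A positive skew value indicates a long right tail.
print(durations.skew() > 0)
```

For such right-skewed variables, the median and percentiles usually describe a "typical" call better than the mean.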

In [None]:
field="last_contact_duration_min"
xlabel="Last contact duration"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

In [None]:
field="duration_min_groups"
xlabel="Last contact duration"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

In [None]:
df_v2['last_contact_duration_min'].describe()

In [None]:
df_v2['last_contact_duration'].max()

### Other attributes

**12. Campaign**

Number of contacts performed during this campaign and for this client (numeric, includes last contact)

 - 43% of clients were called once, 25% of people - twice and 13% - 3 times
 - Someone was called 56 times!



In [None]:
field="current_camp_contact_count"
xlabel="Number of contacts during campaign"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**13. Contacted or not**

 - 96% of people were contacted for the first time

In [None]:
field="contacted_or_not"
xlabel="Was the client contacted during a previous campaign?"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**14. Previous**

Number of contacts performed before this campaign and for this client (numeric)

 - 86% of clients were not contacted previously. This does not match the previous variable, according to which 96% of people were contacted for the first time
 - During the previous campaign, clients were contacted at most 7 times
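
One way to investigate the mismatch between the two variables is to cross-tabulate them. The sketch below runs on made-up rows that reuse the renamed column names; the rows and the variable names `never_by_pdays`, `never_by_previous` and `mismatch` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; column names follow the renamed dataframe.
demo = pd.DataFrame({
    'days_after_previous_camp': [999, 999, 999, 3, 6, 999],
    'previous_camp_contact_count': [0, 0, 1, 2, 1, 0],
})

never_by_pdays = demo['days_after_previous_camp'] == 999
never_by_previous = demo['previous_camp_contact_count'] == 0

# Rows where pdays says "never contacted" but previous > 0 are the mismatch.
print(pd.crosstab(never_by_pdays, never_by_previous))
mismatch = demo[never_by_pdays & ~never_by_previous]
print(len(mismatch))
```

Running the same cross-tabulation on `df_v2` would show how many clients fall into the inconsistent cell.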

In [None]:
field="previous_camp_contact_count"
xlabel="Number of contacts performed before this campaign and for this client "
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**15. Previous outcome**

Outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")

 - There is no information for 86% of clients, which matches the previous indicator showing that 86% of clients had not been contacted before
 - Only 3.3% of clients subscribed to deposits in the previous campaign

In [None]:
field="previous_outcome"
xlabel="Outcome of the previous marketing campaign"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

### Social and economic context attributes

**16. Employment variation rate**

Quarterly indicator (numeric)

**Emp.var.rate definition**: *Cyclical employment variation is essentially the variation in how many people are being hired or fired due to shifts in the conditions of the economy. When the economy is in a recession or depression, people should be more conservative with their money and how they spend it, because their financial future is less clear due to cyclical unemployment. When the economy is at its peak, individuals can be more open to risky investments because their employment options are greater. (That means that the employment rate itself is not given, but rather its variation.)*

The employment variation rate ranges from -3.4 to 1.4. The mean is significantly lower than the median, which means there are many negative values that pull the mean down.

- 39% of people were contacted when the rate was positive 1.4%;
- 22% of people were contacted when the rate was negative 1.8%;
- 19% of people were contacted when the rate was positive 1.1%

It seems logical that sales agents reached out when the employment variation rate was positive; however, it is not clear why they called when the rate was negative 1.8%


In [None]:
df_v2['employment_variation_rate'].describe()

In [None]:
field="employment_variation_rate"
xlabel="Employment variation rate - quarterly indicator "
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**17. Consumer price index**

Monthly indicator (numeric)  

**Consumer Price Index Definition**: *Based on the formula, an index below 100 means that the cost of the market basket has decreased compared with the base year, which increases the purchasing power of the currency; an index above 100 means that the cost of the market basket has increased, which decreases the purchasing power of the currency. Inflation is a general increase in the prices of goods and services. The Consumer Price Index is a measure of inflation as experienced by people in their day-to-day lives. The CPI is only one of several measures related to inflation, alongside the GDP deflator, cost-of-living indices, producer price indices (PPIs), commodity price indices and core price indices.*

- The consumer price index, which tracks monthly changes in the cost of the market basket, ranges from 92.2 to 94.7. The mean and the median are quite close, which means there are no outliers that significantly affect the mean.
- There are no values above 100, i.e. the cost of the market basket never rose above its base-year level, so there was no loss of purchasing power relative to the base year
- Sales agents mostly called clients when the index was between 93 and 95.
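
The definition above refers to "the formula" without stating it. A common form, assumed here, expresses the current cost of the market basket as a percentage of its base-year cost. The function name `price_index` and the basket costs are hypothetical.

```python
def price_index(basket_cost_now, basket_cost_base):
    """Return the basket cost as a percentage of its base-year cost."""
    return 100 * basket_cost_now / basket_cost_base

# Hypothetical basket costs, chosen only to illustrate how the index is read.
print(price_index(94.0, 100.0))   # below 100: basket is cheaper than in the base year
print(price_index(103.0, 100.0))  # above 100: basket is more expensive
```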

In [None]:
df_v2['consumer_price_index'].describe()

In [None]:
field="consumer_price_index"
xlabel="Consumer price index - monthly indicator (numeric)  "
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**18. Consumer confidence index**

 Monthly indicator (numeric)  
 
 

**Consumer Confidence Index Definition**:  *Consumer confidence indicator provides an indication of future developments of households’ consumption and saving, based upon answers regarding their expected financial situation, their sentiment about the general economic situation, unemployment and capability of savings. An indicator above 100 signals a boost in the consumers’ confidence towards the future economic situation, as a consequence of which they are less prone to save, and more inclined to spend money on major purchases in the next 12 months. Values below 100 indicate a pessimistic attitude towards future developments in the economy, possibly resulting in a tendency to save more and consume less*. 

- This index ranges from -50 to -26. The mean and the median are almost the same, i.e. there are no outliers that significantly affect the mean;
- The Portuguese economy was significantly hit by the crisis of 2008, which likely explains such negative values.

In [None]:
df_v2['consumer_confidence_index'].describe()

In [None]:
field="consumer_confidence_index"
xlabel="Consumer confidence index - monthly indicator (numeric)  "
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**19. European Interbank Rate**

3 months rate - daily indicator (numeric)


**Euribor 3 months rate Definition**: *Euribor is short for Euro Interbank Offered Rate. The Euribor rates are based on the interest rates at which a panel of European banks borrow funds from one another. In the calculation, the highest and lowest 15% of all the quotes collected are eliminated. The remaining rates will be averaged and rounded to three decimal places. Euribor is determined and published at about 11:00 am each day, Central European Time. Why is Euribor important? The Euribor rates are important because these rates provide the basis for the price or interest rate of all kinds of financial products, like interest rate swaps, interest rate futures, **saving accounts** and mortgages*. 

- Euribor3m ranges from 0.6 to 5. The mean (3.6) is significantly lower than the median (4.9), i.e. there are some very low rates that pull the mean below the median.
- High interest rates can affect deposit rates, so sales agents should reach out when interest rates are higher, because saving becomes more attractive. Looking at the data, however, there seems to have been no special strategy related to the Euribor rate
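
The calculation described in the definition (discard the highest and lowest 15% of quotes, average the rest, round to three decimals) is a trimmed mean. The sketch below illustrates the idea with hypothetical quotes; the function name and the rule for rounding the number of discarded quotes are assumptions, not the official procedure.

```python
import numpy as np

def trimmed_mean_rate(quotes, trim=0.15, decimals=3):
    """Drop the lowest and highest `trim` share of quotes, then average."""
    ordered = np.sort(np.asarray(quotes, dtype=float))
    # Number of quotes removed at each end (rounding rule is an assumption).
    k = int(round(len(ordered) * trim))
    kept = ordered[k:len(ordered) - k]
    return round(float(kept.mean()), decimals)

# Hypothetical panel of 20 quotes in percent, with one extreme value.
quotes = [0.60 + 0.01 * i for i in range(19)] + [5.0]
print(trimmed_mean_rate(quotes))
```

Because the extreme quote is discarded before averaging, a single outlying bank has no effect on the published rate.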

In [None]:
df_v2['euro_interbank_rate'].describe()

In [None]:
field="euro_interbank_rate"
xlabel="Euribor 3 month rate"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**20. Number of employees**

 Quarterly indicator (numeric)
 
 - The number of employees fluctuates around 5000
 

In [None]:
field="number_of_employees"
xlabel="Number of employees"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

**Output Variable: Current outcome**

Has the client subscribed to a term deposit?

(binary: "yes","no")

- Only 11.27% of customers subscribed to deposits

In [None]:
field="current_outcome"
xlabel="Has the client subscribed a term deposit?"
ylabel="Count"

plot_field(df_v2, field, xlabel, ylabel)

# 5. Bivariate Analysis

### Numeric vs numeric analysis

**Correlation Matrix**

In [None]:
corr = df_v2.corr(numeric_only=True).round(1)  # text columns have no correlation
print(corr)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df_v2.corr(numeric_only=True),cmap="RdBu",annot=True, linewidths=.5)


- All macroeconomic indicators 'Employment Variation Rate', 'Consumer Confidence Index' and 'Consumer Price Index' are highly correlated. **This can impact the model because of multicollinearity and it will be required to remove some of them;**
- 'Previous Campaign Contact Count' has a strong negative correlations with all three macroeconomic indicators and 'Number of employees'.
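
One simple way to act on the multicollinearity remark is to list columns that are almost duplicates of an earlier column. The sketch below uses synthetic data; the 0.9 threshold and the names `demo`, `upper` and `to_drop` are assumptions, and other approaches (for example variance inflation factors) are equally possible.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.05, size=200),  # nearly a copy of 'a'
    'c': rng.normal(size=200),                      # independent of 'a'
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Which member of a correlated pair to keep is a modelling decision; this sketch simply keeps the column that appears first.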

**Significance of correlation**

In [None]:
from scipy.stats import pearsonr
import numpy as np
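# Correlations and p-values are computed on the numeric columns only
num = df_v2.select_dtypes(include='number')
corr = num.corr()
pval = num.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*corr.shape)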
p = pval.applymap(lambda x: ''.join(['*' for t in [0.01,0.05,0.1] if x<=t]))
corr.round(2).astype(str) + p

**Pairplot**

Observations:
- 'Last Contact Duration' and 'Age': Clients who subscribed tend to have longer calls than clients who did not -> selling deposits requires longer conversations;
- 'Current Campaign Contact Count' and 'Age': Calling a single customer frequently, i.e. more than 10 times, does not result in subscriptions;
- 'Employment Variation Rate' and 'Age': Clients of heterogeneous age groups subscribed during more turbulent labor market conditions, when the employment variation rate was negative (from -1 to -3);
- 'Euro Interbank Rate' and 'Age': More clients below 50 years old subscribed to deposits during beneficial times of higher interbank rates, while heterogeneous age groups subscribed when the rate was between 1 and 2;
- 'Current Campaign Contact Count' and 'Last Contact Duration': There are two distinct populations of clients. The first was contacted frequently (more than 10 times) and had shorter calls; the second, which subscribed, was called fewer than 10 times and had longer conversations -> **sales agents should not contact clients more than 10 times**;

- 'Employment Variation Rate' Density Plot: more clients subscribed when the rate was negative and the employment options narrowed which motivated the clients to save more;
- 'Consumer Confidence Index' and 'Consumer Price Index' did not show any interesting correlations.

In [None]:
%%time
sns.pairplot(df_v2, hue = 'current_outcome',palette=["#b01c5a","#1cadb0"])
sns.set(rc = {'figure.figsize':(60,60)})
plt.show()

### Numerical vs categorical analysis

Observations:
- There are no particular insights except that older people subscribe more frequently and that call duration for those who subscribed is obviously longer.

In [None]:
g=sns.boxplot(x='educational_attainment',y='age', data=df_v2,hue='current_outcome', palette=["#b01c5a","#1cadb0"])
sns.set(rc = {'figure.figsize':(15,8)})
g.set_xlabel('Educational Attainment')
g.set_ylabel('Age')
g.set_title('Age Distribution by Education', fontsize=18)
plt.show(g);

In [None]:
g=sns.boxplot(x='job',y='age', data=df_v2,hue='current_outcome', palette=["#b01c5a","#1cadb0"])
sns.set(rc = {'figure.figsize':(15,8)})
g.set_xlabel('Job')
g.set_ylabel('Age')
g.set_title('Age Distribution by Jobs',fontsize=18)

plt.show(g);

In [None]:
g=sns.boxplot(x='educational_attainment',y='last_contact_duration_min', data=df_v2,hue='current_outcome', palette=["#b01c5a","#1cadb0"])
sns.set(rc = {'figure.figsize':(15,8)})
g.set_xlabel('Educational Attainment')
g.set_ylabel('Last Contact Duration in Min')
g.set_title('Last Contact Duration in Min Distribution by Education')

plt.show(g);

In [None]:
g=sns.boxplot(x='job',y='last_contact_duration_min', data=df_v2,hue='current_outcome', palette=["#b01c5a","#1cadb0"])
sns.set(rc = {'figure.figsize':(15,8)})
g.set_xlabel('Job')
g.set_ylabel('Last Contact Duration in Min')
g.set_title('Last Contact Duration in Min Distribution by Jobs')

plt.show(g);

### Categorical vs Categorical

In [None]:
def plot_bivariate(field, xlabel, ylabel):
    #countplot
    ax=sns.countplot(y=df_v2[field], hue='current_outcome', data=df_v2, order=df_v2[field].value_counts().index, palette=["#b01c5a","#1cadb0"])
    sns.set(rc = {'figure.figsize':(20,10)})
    sns.move_legend(ax, "lower right")
    total = len(df_v2[field])  # use the de-duplicated dataframe that is being plotted
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))
           
   
    plt.title(field.capitalize(), fontsize=16)
    plt.xlabel(xlabel.capitalize(), fontsize = 12)
    plt.ylabel(ylabel, fontsize = 11)
    plt.style.use('seaborn-whitegrid')
    plt.grid(color='white')
    plt.rcParams['axes.facecolor'] = 'lavender'
    plt.show()
    
    #crosstab
    data_crosstab = pd.crosstab(df_v2[field], df_v2['current_outcome'], margins =True)
    print(data_crosstab)

**Job Categories by Subscription**

- **Admin positions, blue-collars and technicians** subscribed the most but they were also contacted and targeted the most;
- Students and retired clients subscribed proportionally more often despite not being contacted as much - **students and retired clients should get special attention in the next campaign**
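
To compare groups of different sizes, the raw counts can be turned into conversion rates per group. The sketch below runs on made-up rows, and the names `demo`, `rates` and `conversion` are illustrative; the `'1'` / `'0'` strings mirror the encoding applied to `current_outcome` earlier.

```python
import pandas as pd

# Made-up rows; '1' / '0' mirror the string encoding of current_outcome.
demo = pd.DataFrame({
    'job': ['admin.'] * 6 + ['student'] * 2,
    'current_outcome': ['1', '0', '0', '0', '0', '0', '1', '0'],
})

# normalize='index' turns each row of the crosstab into within-group shares.
rates = pd.crosstab(demo['job'], demo['current_outcome'], normalize='index')
conversion = rates['1'].sort_values(ascending=False)
print(conversion)
```

In this toy sample the smaller group has the higher conversion rate even though both groups have the same number of subscribers, which is the pattern described for students and retired clients.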

In [None]:
field="job"
xlabel="Job"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Marital Status by Subscription**

There are no striking differences

In [None]:
field="marital_status"
xlabel="Marital Status"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Educational Attainment by Subscription**

- Clients with **university degree, high school diploma and professional courses** subscribed the most;

In [None]:
field="educational_attainment"
xlabel="Educational Attainment"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Age Groups**

- **25-35, 35-45, 45-55** subscribed the most as they were targeted the most;
- Conversion of **Under 25, Above 65** is high despite these groups not being widely targeted.

In [None]:
field="age_groups"
xlabel="Age Groups"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Duration in Minutes in Groups/Bins**

- **Above 8 minutes** is the call duration that converted the most customers
- **4-6 minutes** is second by conversion

-> **sales agents should have conversation plans and prompts sufficient to hold a conversation for 6-8 minutes or longer. As calls may get lengthy, it is better to book a time slot in advance instead of cold calling**

In [None]:
field="duration_min_groups"
xlabel="Duration in Minutes in Groups/Bins"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Credit in Default by Subscription**

Sales agents did not contact clients with defaulted credits

In [None]:
field="credit_in_default"
xlabel="Credit in default"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Housing Loan by Subscription**

 Having a housing loan is a neutral indicator


In [None]:
field="housing_loan"
xlabel="Housing loan"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Personal Loan by Subscription**

- Clients that have a personal loan are not likely to subscribe, **it is more efficient to target clients without loans**;


In [None]:
field="personal_loan"
xlabel="Personal Loan"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Contact Type by Subscription**

People who were contacted on a cellular phone subscribed more frequently, but that does not imply that agents should call only cellular phones

In [None]:
field="contact_type"
xlabel="Contact Type"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Last Contact Month by Subscription**

- Sales agents were the most active in May and made 34% of calls in that month;
- The trend seems to increase in spring, hold high in summer and steadily drop in autumn;
- **Sales agents should start the campaign in March and wrap up towards November**.

In [None]:
field="last_contact_month"
xlabel="Last Contact Month"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Last Contact Day by Subscription**

There are no interesting insights


In [None]:
field="last_contact_day"
xlabel="Last Contact Day"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

**Previous Outcome by Subscription**
 
 Most values are "nonexistent", so this variable can be discarded


In [None]:
field="previous_outcome"
xlabel="Previous Outcome"
ylabel="Count"

plot_bivariate(field, xlabel, ylabel)

# 6. Recommendations

1. Credit in default and personal loan status can be meaningful variables for future sales campaigns, and **sales agents can save time by targeting clients without defaults and personal loans**;
2. **Sales agents should start the campaign in March, continue until the end of summer and wrap up towards November, before the holidays**. It should also be investigated internally why so many calls were made in May.
3. All macroeconomic indicators 'Employment Variation Rate', 'Consumer Confidence Index' and 'Consumer Price Index' are highly correlated. **This can impact the model because of multicollinearity and it will be required to remove some of them;**
4. Among macroeconomic variables only Employment Variation Rate and Euribor Rate have enough variation and showed interesting patterns in visuals, other variables did not show any interesting correlations and  can be removed.
5. 'Last Contact Duration' and 'Age': Clients who subscribed tend to have longer calls than clients who did not -> **selling deposits requires longer conversations**;
6. 'Employment Variation Rate' and 'Age': Clients of heterogeneous age groups subscribed during more turbulent labor market conditions, when the employment variation rate was negative (from -1 to -3) -> **sales agents should engage more actively in campaigns when the rate is negative or falling: as employment options narrow, clients are more motivated to save**;
7. 'Euro Interbank Rate' and 'Age': More clients below 50 years old subscribed to deposits during beneficial times of higher interbank rates, while heterogeneous age groups subscribed when the rate was between 1 and 2 -> **sales agents should engage more actively in campaigns when the rate is high**;
8. 'Current Campaign Contact Count' and 'Last Contact Duration': There are two distinct populations of clients. The first was contacted frequently (more than 10 times) and had shorter calls; the second, which subscribed, was called fewer than 10 times and had longer conversations -> **sales agents should not contact clients more than 10 times**;
9. Job Categories by Subscription: Admin positions, blue-collar workers and technicians subscribed the most, but they were also contacted and targeted the most; students and retired clients subscribed proportionally more often despite not being contacted as much -> **students and retired clients should get special attention in the next campaign**
10. Clients with university degree, high school diploma and professional courses subscribed the most;
11. **The 25-35, 35-45 and 45-55 age groups subscribed the most**, as they were targeted the most; **conversion of the Under 25 and Above 65 groups is high** despite these groups not being widely targeted;
12. Calls longer than 8 minutes converted the most customers; calls of 4-6 minutes are second by conversion -> **sales agents should have conversation plans and prompts sufficient to hold a conversation for 6-8 minutes or longer. As calls may get lengthy, it is better to book a time slot in advance instead of cold calling**
