![](https://csbcorrespondent.com/sites/default/files/styles/blog_feature_full/public/blog/BANK%20MARKETING%20ANALYTICS.jpg?itok=SwPf4x34)

### Introduction
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

### Input variables:

### Bank client data
1. **age** (numeric)
2. **job** : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. **salary** : amount of salary (numeric)
4. **marital** : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
5. **education** (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
6. **targeted** : has been targeted for subscription of term deposit? (categorical: 'no','yes')
7. **default** : has credit in default? (categorical: 'no','yes','unknown')
8. **balance** : balance in the account (numeric)
9. **housing** : has housing loan? (categorical: 'no','yes','unknown')
10. **loan** : has personal loan? (categorical: 'no','yes','unknown')

### Related to the last contact of the current campaign
11. **contact** : contact communication type (categorical: 'cellular','telephone')
12. **day** : last contact day of the week (categorical: '1:mon','2:tue','3:wed','4:thu','5:fri')
13. **month** : last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
14. **duration**: last contact duration, in seconds (numeric)

Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
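A minimal sketch of that advice, assuming the data sits in a pandas DataFrame with a `duration` column. The toy rows and the variable names `benchmark_features` and `realistic_features` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the bank data (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [0, 210, 95],
    "response": ["no", "yes", "no"],
})

# Benchmark setting: keep every input column
benchmark_features = toy.drop(columns=["response"])

# Realistic setting: also drop `duration`, which is only known after the call ends
realistic_features = toy.drop(columns=["response", "duration"])

print(list(benchmark_features.columns))
print(list(realistic_features.columns))
```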

### Other attributes
15. **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)
16. **pdays**: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
17. **previous**: number of contacts performed before this campaign and for this client (numeric)
18. **poutcome**: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')


### Output variable (desired target)
19. **response** - has the client subscribed to a term deposit? (binary: 'yes','no')

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcdefaults()
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Importing Dataset

In [None]:
df = pd.read_csv('/kaggle/input/banckingmarket/bank.csv')
X1 = df[['job','balance']]

# Describe the pdays column, make note of the mean, median and minimum values. Anything fishy in the values?


In [None]:
df.head()

In [None]:
# NOT CONSIDERING -1 VALUES IN pdays COLUMN
values = df.loc[df['pdays'] > -1, 'pdays'].tolist()

In [None]:
quartile = [0.25,0.50,0.75]
quartiles = []
index = ['count','mean','std','min','25%','50%','75%','max']
for i in quartile:
    quartiles.append(np.quantile(values,i))
summary = [len(values),
           np.mean(values),
           np.std(values),
           np.min(values),
           quartiles[0],
           quartiles[1],
           quartiles[2],
           np.max(values)
]
for i,j in zip(summary,index):
    print(f'{j}   {i}')

In [None]:
pdays_mean = np.mean(values)
pdays_median = np.median(values)

Excluding the -1 entries changes the mean of pdays by 184.38.
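The same summary and mean comparison can be written directly in pandas. This is a minimal sketch on made-up values (the toy series is an assumption, not the dataset). Note that `describe()` reports the sample standard deviation, while `np.std` defaults to the population one, so the two `std` figures will differ slightly.

```python
import pandas as pd

# Toy pdays values; -1 stands for "not previously contacted" (illustrative data)
pdays = pd.Series([-1, -1, -1, 10, 200, 350])

# Same filter as used for `values` above
contacted = pdays[pdays > -1]
print(contacted.describe())   # count, mean, std, min, quartiles, max

# How much the mean shifts once the -1 entries are excluded
mean_gap = contacted.mean() - pdays.mean()
print(f"Difference in mean: {mean_gap:.2f}")
```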

# Plot a horizontal bar graph with the median values of balance for each education level value. Which group has the highest median?

In [None]:
# Replacing 'unknown' with the lowest education level ('primary') in the education column
df.loc[(df['education']=='unknown'),'education'] = 'primary'
print(df['education'].value_counts())

### Creating different datasets for each education level

In [None]:
df1 = df[df['education'] == 'secondary']
df2 = df[df['education'] == 'tertiary']
df3 = df[df['education'] == 'primary']

### Calculating median of column 'balance' for each education level

In [None]:
med1 = np.median(df1['balance'])  # secondary: 392.0
med2 = np.median(df2['balance'])  # tertiary: 577.0
med3 = np.median(df3['balance'])  # primary: 432.0
compare = [med1,med2,med3]
# Labels must follow the same order as the medians above (df1, df2, df3)
edu_list = ['secondary','tertiary','primary']
print(f'Tertiary education has the highest median that is {med2}')

In [None]:
plt.figure(figsize = (5,3))
plt.barh(edu_list, compare, align='center', alpha=0.5)
for index, value in enumerate(compare):
    plt.text(value, index, str(value))
print('Horizontal bar graph displaying the median values of column "balance" for the different levels of education')

# Make a box plot for pdays. Do you see any outliers?

In [None]:
fig, ax = plt.subplots()

my_data = [df['pdays'],values]
ax.boxplot(my_data)
plt.show()
print('Described pdays column using boxplot. \n 1. Considering all the values of pdays column including "-1"\n 2. Considering only the non-negative values of pdays column')

# First, perform bi-variate analysis to identify the features that are directly associated with the target variable

### Plotting the categorical variables

In [None]:
for i in ['marital','education','targeted','default','housing','loan','month','poutcome']:
    df[i].unique()
    fig, ax = plt.subplots()
    fig.set_size_inches(6,3)
    sns.countplot(x = i, data = df)
    ax.set_xlabel(i)
    ax.set_ylabel('Count')
    ax.set_xticklabels(ax.get_xticklabels(),rotation= 45)
    sns.despine()

### ViolinPlot for jobs and balance column

In [None]:
plt.figure(figsize=(13,8))
sns.countplot(x='job', data=X1)  # pass the column by keyword; recent seaborn versions require it
plt.show()
plt.figure(figsize=(13,8))
sns.violinplot(
    x='job',
    y='balance',
    data=X1
)
plt.show()
print('In the graphs above, we have displayed the distribution of jobs among the customers through a countplot graph and the balance the customers of different professions have using violin plot')

Through the countplot we can see that most of the customers are in blue-collar profession.
Through the violin plot, we can infer that a few customers working in management have the highest balance as compared to other jobs

In [None]:
plt.figure(figsize=(16,10))
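# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include='number').corr(), annot = True)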

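Here, we have drawn a heatmap of the pairwise correlations between the numeric features. The 'response' variable is still a text column at this point, so it does not appear in the heatmap until it is encoded.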

# Convert the response variable to a convenient form


In [None]:
# Encoding the 'response' variable with 1 and 0
# (applied to the full dataframe df, not the education subset df1)
df.loc[(df['response']=='yes'),'response'] = 1
df.loc[(df['response']=='no'),'response'] = 0
df['response']=df['response'].astype('int64')

# Make suitable plots for associations with numerical features and categorical features

In [None]:
# Replacing unknown by mode of the job column
df.loc[(df['job']=='unknown'),'job'] = df['job'].mode().get(0)

# Replacing 'unknown' with the lowest education level ('primary') in the education column
df.loc[(df['education']=='unknown'),'education'] = 'primary'

sns.pairplot(data=df,x_vars = ['job','marital','education','targeted','default','housing','loan','day','previous'],
             y_vars = ['age','salary','balance','month','duration','campaign','pdays','previous'])

Here, we have plotted the categorical variables against the numerical variables.

# Are the features about the previous campaign data useful?


In [None]:
plt.subplot(2,1,1)
df['previous'].value_counts().nlargest(5).plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Previous Contact')

plt.subplot(2,1,2)
df['poutcome'].value_counts().nlargest(5).plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Previous Outcome')

These graphs show that clients who were not previously contacted have their previous outcome recorded as unknown.

# Are pdays and poutcome associated with the target?

In [None]:
# poutcome is still a text column at this point, so filter on its labels
df1=df[df['poutcome']=='success']
df2=df[df['poutcome']=='failure']

print('Response of people who are marked success in the previous campaign\n',df1['response'].value_counts(),'\n')
print('Response of people who are marked failure in the previous campaign\n',df2['response'].value_counts())


As we can see, clients who responded positively in the previous campaign still respond at a high rate in the current campaign, while those who responded negatively mostly keep the same opinion.

This shows that the results of the previous campaign still affect the current campaign.

In [None]:
df['pdays'].value_counts().nlargest(5).plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Previous Contact Days')

From the graphs of 'previous' and 'poutcome', it is clear that people who were not contacted in previous campaigns are marked '-1'.

# Before the predictive modeling part, make sure to perform –


##  The necessary transformations for the categorical variables and the numeric variables

In [None]:
for i in df:
    if df[i].dtypes == object and i != 'contact':
        df[i] = df[i].astype('category').cat.codes

Here we have converted categorical columns into numerical columns
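To make the effect of `astype('category').cat.codes` concrete, here is a small sketch on a made-up column (the values are illustrative, not read from the dataset). For text data the categories are sorted alphabetically, so the integer codes do not carry any real-world ordering.

```python
import pandas as pd

# Illustrative column, not taken from the dataset
marital = pd.Series(["married", "single", "divorced", "married"])

as_category = marital.astype("category")
codes = as_category.cat.codes

# Recover which label each integer code stands for
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)
print(codes.tolist())
```

Keeping a mapping like this around is one way to translate the encoded values back to labels when reading plots or feature rankings later.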

##  Handle variables corresponding to the previous campaign


Only one feature is doubtful: pdays, because of its -1 values. The description says that clients with no previous contact are marked with 999, but the data contains no record of '999'.

According to the 'previous' and 'poutcome' columns, the records that should be marked '999' are marked '-1' instead. There is no need to change -1 to 999, as no other value collides with '-1', so -1 can be kept as it is.
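A check along these lines could back up that reasoning. This is a sketch on toy rows (made up for illustration); on the real data the same two expressions would be applied to `df`.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "pdays": [-1, -1, 92, 180, -1],
    "previous": [0, 0, 2, 1, 0],
})

# Does pdays == -1 coincide exactly with previous == 0?
never_contacted = toy["pdays"] == -1
consistent = (never_contacted == (toy["previous"] == 0)).all()

# Does the documented sentinel 999 appear at all?
has_999 = (toy["pdays"] == 999).any()

print(consistent, has_999)
```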

# CLEANING THE DATASET

In [None]:
per = df['contact'].value_counts()['unknown']
total = df['contact'].count()
print(f'Null values percentage = {per/total*100}')
df.drop(['contact'],axis=1,inplace = True)

Almost 30% of the values in the 'contact' column are 'unknown' (treated here as null); therefore, we drop the 'contact' column.
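The same check can be run across several columns at once. The sketch below uses toy data, and the 25% cut-off is an arbitrary assumption for illustration, not a value used in this notebook.

```python
import pandas as pd

# Toy data for illustration; 'unknown' plays the role of a missing value
toy = pd.DataFrame({
    "contact": ["cellular", "unknown", "unknown", "telephone"],
    "job": ["admin.", "unknown", "services", "student"],
})

# Share of 'unknown' entries per column, in percent
unknown_pct = (toy == "unknown").mean() * 100
print(unknown_pct)

# Drop the columns whose share exceeds the chosen threshold
threshold = 25
to_drop = unknown_pct[unknown_pct > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)
print(to_drop)
```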

## Train test split

In [None]:
from sklearn.model_selection import train_test_split
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 0)

# LOGISTIC REGRESSION

### Scaling the features

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### RFE TO REMOVE UNNECESSARY FEATURES

In [None]:
# Splitting df into two dataframes X and y
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Extracting the columns of X and storing them in 'cols' list
cols = list(X.columns)

# Importing the necessary libraries
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator = LogisticRegression())
rfe.fit(X,y)
X = rfe.transform(X)

temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index

X=df[selected_features_rfe]
X.head()

### Calculating VIF

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
def calc_VIF(X):
    vif['variables'] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return(vif)

calc_VIF(df)
important_features=[]
for i,row in vif.iterrows():
    if row["VIF"] < 2.5 and row["variables"] != "response":
        print(f'{row["variables"]} ----> {row["VIF"]}')
        important_features.append(row["variables"])

The features shown above have a VIF below 2.5 and are kept as the best features according to VIF. The variance inflation factor measures the amount of multicollinearity in a set of regression variables: a high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
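For intuition, the VIF of a feature equals 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data; the helper name `vif` and the generated columns are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np

def vif(data, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = data[:, i]
    others = np.delete(data, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r2 = 1 - residuals.var() / target.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent of the others
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of `a`
data = np.column_stack([a, b, c])

vifs = [vif(data, i) for i in range(data.shape[1])]
print(np.round(vifs, 2))
```

The two nearly identical columns get a large VIF, while the independent one stays close to 1.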

### Calculating p-value

In [None]:
import statsmodels.api as sm
# Exclude the target from the predictors; otherwise 'response' is regressed on itself
X2 = sm.add_constant(df.drop(columns=['response']))
est = sm.OLS(df['response'],X2)
est2 = est.fit()
print(est2.summary())

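For each feature, the null hypothesis is that its coefficient is zero, i.e. that the feature has no effect on the response. A small p-value (conventionally below 0.05) is evidence against that hypothesis, so the feature is likely informative. A large p-value means its contribution cannot be distinguished from zero.

In the table above, previous, housing, education and marital were picked out for their large p-values, which by this reading makes them weak candidates rather than the best features.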

# In this model, we will follow the features provided by VIF 

In [None]:
X = df[important_features]
features = X.columns
X.head()

#### Converting X, y dataframe into arrays

In [None]:
X=X.values
y = df['response'].values

In [None]:
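from sklearn.metrics import confusion_matrix, accuracy_score
# Re-split and re-scale so the model is trained on the VIF-selected features in X
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the model
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train,y_train)

# Testing the model
y_pred=classifier.predict(X_test)

# Printing the accuracy score
print(accuracy_score(y_test,y_pred))

# Printing the confusion matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)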

### k-Fold Cross Validation

In [None]:
from sklearn.model_selection import KFold
import numpy
from sklearn.model_selection import cross_val_score
cv = KFold(n_splits = 10, random_state = 1, shuffle = True)
scores = cross_val_score(classifier,X,y,scoring = 'accuracy', cv = cv, n_jobs = -1)
# report performance
print('Accuracy: %.3f (%.3f)' % (numpy.mean(scores), numpy.std(scores)))

### Precision, Accuracy and Recall of our model

In [None]:
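# In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes,
# so for the positive class (1 = 'yes'): TP = cm[1][1], FN = cm[1][0], FP = cm[0][1]
recall = cm[1][1]/(cm[1][1] + cm[1][0])
precision = cm[1][1]/(cm[1][1] + cm[0][1])
print(f'Recall is -> {recall}\nPrecision is -> {precision}\nAccuracy is -> {numpy.mean(scores)}')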

### Important Features

In [None]:
features


# RANDOM FOREST CLASSIFICATION

In [None]:
df = pd.read_csv('/kaggle/input/banckingmarket/bank.csv')
df.drop(['contact'],axis = 1, inplace = True)

for i in df:
    if df[i].dtypes == object:
        df[i] = df[i].astype('category').cat.codes

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

accuracy=[]
estimators_count=[]
for i in range(1,50,2):
    rf = RandomForestClassifier(n_estimators = i, criterion = 'entropy', random_state = 0)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy.append(accuracy_score(y_test,y_pred))
    estimators_count.append(i)
    print(f'{i} {round(accuracy_score(y_test,y_pred), 4)}')
    
plt.plot(estimators_count,accuracy)
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('No_of_Estimators VS Accuracy')
plt.grid()  # the 'b' keyword was removed from recent matplotlib; a bare call toggles the grid the same way
plt.show()

In [None]:
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
 
y_pred = classifier.predict(X_test)

### Accuracy using  k-fold Cross Validation

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
cv = KFold(n_splits = 10, random_state = 1, shuffle = True)
scores = cross_val_score(classifier,X,y,scoring = 'accuracy', cv = cv, n_jobs = -1)
# report performance
print('Accuracy: %.4f (%.4f)' % (numpy.mean(scores), numpy.std(scores)))

### Creating the confusion matrix and printing the recall, precision and accuracy of the model

In [None]:
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)

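# Rows are true classes and columns are predicted classes,
# so for the positive class (1 = 'yes'): TP = cm[1][1], FN = cm[1][0], FP = cm[0][1]
recall = cm[1][1]/(cm[1][1] + cm[1][0])
precision = cm[1][1]/(cm[1][1] + cm[0][1])
print(f'Recall is -> {recall.round(4)}\nPrecision is -> {precision.round(4)}\nAccuracy is -> {numpy.mean(scores).round(4)}')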

### Displaying the feature ranking

In [None]:
importances = classifier.feature_importances_
feature_names = df.iloc[:,:-1].columns
indices = np.argsort(importances)[::-1]

print("Feature ranking\n")
for i in range(X_train.shape[1]):
    print("%d.   %s   = %f" % (i + 1, df.columns[indices[i]], importances[indices[i]]))

# Compare the performance of the Random Forest and the logistic model 

The evaluation is done above, right after each prediction. The k-fold cross-validation accuracies for RF and logistic regression are 0.9002 and 0.888 respectively, which shows that RF performs better than logistic regression; on the other hand, RF takes more time to train.
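For reference, precision and recall for the 'yes' class can be read off scikit-learn's confusion matrix as sketched below. The labels are made up for illustration, and the variable names are assumptions chosen so they do not clash with the notebook's own.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration: 1 = 'yes', 0 = 'no'
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted 'yes', how many were right
recall = tp / (tp + fn)      # of the actual 'yes', how many were found

print(precision, recall)
# The library helpers give the same numbers
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```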


The response variable has far more 'no' than 'yes' values, so the dataset is not balanced. Because of this imbalance, accuracy on a single train/test split can be misleading: a model that always predicts 'no' would already score well. k-fold cross-validation makes the estimate less dependent on one particular split, and the precision and recall reported above should be read alongside it, since cross-validated accuracy is still affected by the imbalance.
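The effect of imbalance on accuracy can be seen with a trivial baseline. The sketch below uses a made-up 90/10 split (the proportions are an assumption, not the dataset's actual class ratio) and a "model" that always answers 'no'.

```python
import numpy as np

# Toy imbalanced target: 90 'no' (0) and 10 'yes' (1); proportions are illustrative
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts 'no'
y_hat = np.zeros_like(y_true)

accuracy = (y_hat == y_true).mean()
true_pos = ((y_hat == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The baseline reaches high accuracy while finding none of the 'yes' clients, which is why accuracy alone is not enough here.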

RF has better performance than the logistic model, as confirmed by the k-fold cross-validation results.

For logistic regression, we selected features using VIF; for RF, we used its built-in feature_importances_ attribute to check which features each model uses for training and prediction.

With VIF, the important features are default, balance, loan, duration, campaign and previous, while for RF they are duration, balance, age, day, month and pdays. Therefore, only two features (balance and duration) are common to the two models.
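The overlap can be checked with a set intersection. The two lists below are copied from the text; the variable names are illustrative.

```python
# Feature lists as stated in the text above
vif_features = ["default", "balance", "loan", "duration", "campaign", "previous"]
rf_features = ["duration", "balance", "age", "day", "month", "pdays"]

# Features that both selection methods agree on
common = sorted(set(vif_features) & set(rf_features))
print(common)   # ['balance', 'duration']
```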

Also, according to the EDA, previous and pdays carry much the same information about prior contact, only on different scales, so they can also be considered common.