# Defining the problem

Our goal in this notebook is to classify clients by whether or not they are willing to subscribe to a term deposit, based on the given dataset (target column 'y').

Here is the plan:
1. Data Engineering: check data correctness, fill unknown data cells, modify and convert data properties for calculation
2. Exploratory Data Analysis: analyze the data to find its main patterns and characteristics
3. Training Models
4. Evaluation

# 1. Data Engineering

Importing libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

Read the dataset

In [None]:
data = pd.read_csv("../input/banking-dataset-marketing-targets/train.csv", sep =";")
data.sample(10)

In [None]:
data.info()

Two things can be seen here:

1. The dataset contains two datatypes: int64 (numerical) and object (stored as strings). Knowing the datatype helps us apply the correct operations to each column later on.
2. There are no null values (every column has 45211 non-null values).

Check if there are any outliers (for example age value > 100).

In [None]:
data.describe(include='all')

Judging by the min and max values, all numerical columns contain reasonable values.

It can be concluded that the numerical columns are complete: there are no null values and no obvious outliers.
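
The min/max inspection above can also be written as an explicit range check. The following is a minimal sketch on a toy frame; the bounds and the values are illustrative assumptions, not figures taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for two numerical columns (values are made up).
toy = pd.DataFrame({"age": [25, 41, 130, 58], "duration": [120, 0, 300, 4000]})

# Hypothetical plausibility bounds; adjust them to domain knowledge.
bounds = {"age": (18, 100), "duration": (0, 5000)}

# Count how many values fall outside each column's allowed range.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

A non-zero count points to rows worth inspecting by hand before deciding whether they are errors.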

We will now check the validity of the string values.

In [None]:
stringdata = data.select_dtypes(include="object")
for column in stringdata:
    print(stringdata[column].value_counts())
    print ("-" * 20)

The columns job, education, contact and poutcome contain "unknown" values, which have to be handled.

Job column: drop the 288 rows with an unknown job. First, there is no obvious replacement value (even the most frequent categories, "blue-collar" and "management", each make up only about 25% of the rows). Second, 288 rows is a small amount of data compared to the dataset size.

Poutcome column: drop this column, since most of its values are unknown (36,959 out of 45,211).

Education and Contact columns: replace "unknown" with the most frequent value of the column.
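
The decisions above depend on how much of each column is "unknown". A minimal sketch of measuring that share, using a made-up toy frame (the rows below are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame with the placeholder string "unknown" (values are made up).
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "management"],
    "poutcome": ["unknown", "unknown", "success", "unknown"],
})

# Fraction of rows equal to "unknown" in each column.
unknown_share = (toy == "unknown").mean()
print(unknown_share)
```

A column with a small share is a candidate for dropping rows or imputing, while a column that is mostly "unknown" carries little information and may be dropped entirely.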

In [None]:
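# Drop the rows whose job is unknown; copy so later assignments do not act on a view
data = data[data['job'] != 'unknown'].copy()

# Drop poutcome, which is mostly "unknown"
data = data.drop('poutcome', axis = 1)

# Replace "unknown" with the most frequent value; assign back instead of using inplace on a column
data['education'] = data['education'].replace("unknown", data['education'].mode()[0])
data['contact'] = data['contact'].replace("unknown", data['contact'].mode()[0])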

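Two versions of the data are created:
1. data_visual keeps the original values, with the target y mapped to 0/1, and is used for exploratory data analysis
2. data_calc holds binned and label-encoded features and is used for model training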

In [None]:
data_visual = data.copy(deep = True)
# Map the target to 0/1; assigning back avoids relying on inplace replace on a column
data_visual['y'] = data_visual['y'].map({"no": 0, "yes": 1})
data_visual

In [None]:
# Replace every integer column with a version cut into 8 equal-width bins
intdata = data.select_dtypes(include="int64")
for column in intdata:
    data[column + "_bin"] = pd.cut(data[column], 8)
    data = data.drop(column, axis = 1)

# Encode every column (string categories and bins alike) as integer labels
label = LabelEncoder()
data_calc = pd.DataFrame()
for column in data:
    data_calc[column] = label.fit_transform(data[column])

data_calc
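
To make the binning step above more concrete, here is a small sketch of what `pd.cut` does with equal-width bins. The series and the choice of 4 bins are illustrative assumptions; `labels=False` is used here only to show the bin index directly, whereas the notebook keeps the interval labels and encodes them afterwards.

```python
import pandas as pd

# Made-up values spanning 0..80.
values = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80])

# Split the value range into 4 equal-width, right-inclusive bins
# and return the index of the bin each value falls into.
codes = pd.cut(values, 4, labels=False)
print(codes.tolist())
```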

# 2. Exploratory Data Analysis

In this part we will analyze the data by going through several graphs.

In [None]:
plt.figure(figsize = (17, 5))
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
sns.kdeplot(data_visual.loc[data_visual.y == 0, 'age'], label = "Not Subscribed")
sns.kdeplot(data_visual.loc[data_visual.y == 1, 'age'], label = "Subscribed")
plt.title("Age Distribution by Subscription")
plt.legend()

plt.figure(figsize = (17, 5))
sns.kdeplot(data_visual.loc[data_visual.y == 0, 'duration'], label = "Not Subscribed")
sns.kdeplot(data_visual.loc[data_visual.y == 1, 'duration'], label = "Subscribed")
plt.title("Duration of Last Contact Distribution by Subscription")
plt.legend()

The plots suggest that clients aged 60 and older, young clients around 20 years old, and clients whose last contact lasted longer than 500 seconds tend to subscribe to a term deposit.

In [None]:
plt.figure( figsize = (20, 5))
sns.barplot(data = data_visual, x = 'job', y = 'y')
plt.xlabel("Job", fontsize = 14)
plt.ylabel("Probability", fontsize = 14)
plt.title("Subscribe Probability by Job", fontsize = 14)

plt.figure( figsize = (20, 5))
plt.subplot(121)
sns.barplot(data = data_visual, x = 'marital', y = 'y')
plt.xlabel("Marital Situation", fontsize = 14)
plt.ylabel("Probability", fontsize = 14)
plt.title("Subscribe Probability by Marital Situation", fontsize = 14)

plt.subplot(122)
sns.barplot(data = data_visual, x = 'education', y = 'y')
plt.xlabel("Education", fontsize = 14)
plt.ylabel("Probability", fontsize = 14)
plt.title("Subscribe Probability by Education", fontsize = 14)

The plots suggest that students, retired people, single clients and clients with higher education tend to subscribe to the term deposit.

In [None]:
plt.figure( figsize = (20, 8))
sns.violinplot(x = 'job', y = 'age', hue = 'y', data = data_visual, split = True)
plt.xlabel("Job", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.title("Age Distribution by Job, Divided by Subscription (0) for Not Subscribed, (1) for Subscribed", fontsize = 15)

In every job category, older clients tend to accept the term deposit subscription, most visibly among housemaids.

In [None]:
plt.figure( figsize = (20, 8))

plt.subplot(121)
sns.boxenplot(x = 'housing', y = 'age', hue = 'y', data = data_visual)
plt.xlabel("Housing", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.title("Age Distribution by Housing Loan", fontsize = 15)

plt.subplot(122)
sns.boxenplot(x = 'loan', y = 'age', hue = 'y', data = data_visual)
plt.xlabel("Loan", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.title("Age Distribution by Personal Loan", fontsize = 15)

In [None]:
color = sns.diverging_palette(250, 6, as_cmap = True)

plt.figure(figsize = (14, 10))
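# Restrict to numeric columns; recent pandas versions raise an error if corr() meets string columns
sns.heatmap(data_visual.select_dtypes(include = "number").corr(), cmap = color, annot = True)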
plt.title("Features Correlation", fontsize = 15)

From this heatmap we can draw several conclusions, for example:
1. "duration" has the strongest positive correlation with the target y
2. "campaign" has the strongest negative correlation with the target y
3. "previous" and "pdays" are positively correlated with each other
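
The ranking read off the heatmap can also be computed directly by sorting each feature's correlation with the target. This is a minimal sketch on made-up numbers (the toy frame is an assumption and only reuses the column names for readability):

```python
import pandas as pd

# Toy numeric frame; the values are illustrative, not dataset rows.
toy = pd.DataFrame({
    "duration": [100, 200, 300, 400, 500, 600],
    "campaign": [6, 5, 5, 2, 2, 1],
    "y": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target, strongest positive first.
target_corr = toy.corr()["y"].drop("y").sort_values(ascending=False)
print(target_corr)
```

Keep in mind that these correlations only capture linear relationships between numeric columns.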


# 3. Training Models
The plan is to divide our dataset into train data and test data, and to train models with several different machine learning algorithms.

In [None]:
trainx, testx, trainy, testy = model_selection.train_test_split(data_calc.loc[:, data_calc.columns != 'y'], data_calc['y'], random_state = 0)

In [None]:
MLA = [
       ensemble.AdaBoostClassifier(),
       ensemble.BaggingClassifier(),
       ensemble.GradientBoostingClassifier(),
       ensemble.RandomForestClassifier(),
       linear_model.LogisticRegressionCV(),  
       linear_model.SGDClassifier(),
       naive_bayes.GaussianNB(),
       neighbors.KNeighborsClassifier(),
       tree.DecisionTreeClassifier(),
       tree.ExtraTreeClassifier(),
]

name = []
testscore = []
for alg in MLA:
    name.append(alg.__class__.__name__)
    alg.fit(trainx, trainy)
    testscore.append(alg.score(testx, testy))
    
comparison = pd.DataFrame({"name": name, "testscore": testscore})

# 4. Evaluation

Now we compare the models by their accuracy on the test data, sorted from best to worst.

In [None]:
comparison = comparison.sort_values(by = "testscore", ascending = False)
comparison

Gradient Boosting Classifier is the best-performing model, with an accuracy of 88.79%.
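
Accuracy alone can be hard to interpret when one class is much more frequent than the other, so it may help to compare against a trivial baseline. The sketch below uses made-up labels, not results from this notebook, and assumes scikit-learn's `DummyClassifier`, `accuracy_score` and `recall_score`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 9 of 10 clients did not subscribe (assumed imbalance).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
X = np.zeros((len(y_true), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y_true, y_pred)
subscriber_recall = recall_score(y_true, y_pred)
print(baseline_accuracy, subscriber_recall)
```

In this toy case the baseline reaches high accuracy while finding no subscribers at all, which is why a per-class metric such as recall is a useful companion to the accuracy score.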