# Practice 1

We will use a marketing data set of a Portuguese banking institution. The classification goal is to predict whether the client will subscribes a term deposit. (vadeli mevduat)

Data set: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

# Attribute Information:

**Input variables:**
## bank client data:
1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

## related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

## other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

## social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

**Output variable (desired target):**
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bank-additional-full.csv", delimiter=";")
df.head(10)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
5,45,services,married,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
6,59,admin.,married,professional.course,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
7,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
8,24,technician,single,professional.course,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
9,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [2]:
# Pre-processing and feature selection
# We will build a model considering variables related to the customer itself. 
# Therefore the accuracy of the model shows the explanatoriness of the chosen features.

df = df[["age", "job", "marital", "education", "default", "housing", "loan", "y"]]
df.tail(10)

Unnamed: 0,age,job,marital,education,default,housing,loan,y
41178,62,retired,married,university.degree,no,no,no,yes
41179,64,retired,divorced,professional.course,no,yes,no,no
41180,36,admin.,married,university.degree,no,no,no,no
41181,37,admin.,married,university.degree,no,yes,no,yes
41182,29,unemployed,single,basic.4y,no,yes,no,no
41183,73,retired,married,professional.course,no,yes,no,yes
41184,46,blue-collar,married,professional.course,no,no,no,no
41185,56,retired,married,university.degree,no,yes,no,no
41186,44,technician,married,professional.course,no,no,no,yes
41187,74,retired,married,professional.course,no,yes,no,no


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 8 columns):
age          41188 non-null int64
job          41188 non-null object
marital      41188 non-null object
education    41188 non-null object
default      41188 non-null object
housing      41188 non-null object
loan         41188 non-null object
y            41188 non-null object
dtypes: int64(1), object(7)
memory usage: 2.5+ MB


In [4]:
df["y"].value_counts()

no     36548
yes     4640
Name: y, dtype: int64

In [5]:
# Getting number of uniq values for each col

for col in df.columns[1:-1]:
    print(col.upper() + "\n", df[col].value_counts(), "\n")

JOB
 admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: job, dtype: int64 

MARITAL
 married     24928
single      11568
divorced     4612
unknown        80
Name: marital, dtype: int64 

EDUCATION
 university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: education, dtype: int64 

DEFAULT
 no         32588
unknown     8597
yes            3
Name: default, dtype: int64 

HOUSING
 yes        21576
no         18622
unknown      990
Name: housing, dtype: int64 

LOAN
 no         33950
yes         6248
unknown      990
Name: loan, dtype: int64 



In [6]:
# Seeing that most of the customers have either "no" or "unknown" default, we can adjust our feature selection
# Then we'll get rid of unknown values since they are very rare

df = df[["age", "job", "marital", "education", "housing", "loan", "y"]]
df.replace("unknown", value=np.NaN, inplace=True)
df = df.dropna()

df.head()

Unnamed: 0,age,job,marital,education,housing,loan,y
0,56,housemaid,married,basic.4y,no,no,no
1,57,services,married,high.school,no,no,no
2,37,services,married,high.school,yes,no,no
3,40,admin.,married,basic.6y,no,no,no
4,56,services,married,high.school,no,yes,no


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38245 entries, 0 to 41187
Data columns (total 7 columns):
age          38245 non-null int64
job          38245 non-null object
marital      38245 non-null object
education    38245 non-null object
housing      38245 non-null object
loan         38245 non-null object
y            38245 non-null object
dtypes: int64(1), object(6)
memory usage: 2.3+ MB


In [8]:
# Now let's convert the categorical data
from sklearn.preprocessing import LabelEncoder

# Label encoder order is alphabetical
le = LabelEncoder()

# Dealing with categorical data
for col in df.columns[1:]:
    df[col] = le.fit_transform(df[col])
    
# Categorizing age
df.loc[df['age'] <= 32, 'age'] = 1
df.loc[(df['age'] > 32) & (df['age'] <= 47), 'age'] = 2
df.loc[(df['age'] > 47) & (df['age'] <= 70), 'age'] = 3
df.loc[(df['age'] > 70) & (df['age'] <= 98), 'age'] = 4

df.head()

Unnamed: 0,age,job,marital,education,housing,loan,y
0,3,3,1,0,0,0,0
1,3,7,1,3,0,0,0
2,2,7,1,3,1,0,0
3,2,0,1,1,0,0,0
4,3,7,1,3,0,1,0


In [9]:
# Converting the data into np arrays
X = np.asanyarray(df[["age", "job", "marital", "education", "housing", "loan"]])
y = np.asanyarray(df["y"])

In [10]:
# Train-test split: 80-20
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=19)

In [13]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

for i in range(1, 50):
    dtree = DecisionTreeClassifier(criterion="entropy", max_depth=i)
    dtree.fit(X_train, y_train)
    pred = dtree.predict(X_test)
    
    print("Max depth: %d -- Accuracy: %f" % (i, accuracy_score(y_test, pred)))
    print(confusion_matrix(y_test, pred))

Max depth: 1 -- Accuracy: 0.894235
[[6840    0]
 [ 809    0]]
Max depth: 2 -- Accuracy: 0.894235
[[6840    0]
 [ 809    0]]
Max depth: 3 -- Accuracy: 0.894235
[[6836    4]
 [ 805    4]]
Max depth: 4 -- Accuracy: 0.893842
[[6835    5]
 [ 807    2]]
Max depth: 5 -- Accuracy: 0.894235
[[6835    5]
 [ 804    5]]
Max depth: 6 -- Accuracy: 0.894235
[[6833    7]
 [ 802    7]]
Max depth: 7 -- Accuracy: 0.894365
[[6822   18]
 [ 790   19]]
Max depth: 8 -- Accuracy: 0.893973
[[6818   22]
 [ 789   20]]
Max depth: 9 -- Accuracy: 0.893450
[[6813   27]
 [ 788   21]]
Max depth: 10 -- Accuracy: 0.893189
[[6808   32]
 [ 785   24]]


In [14]:
# Scaling the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [21]:
# Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, classification_report

logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)

log_pred = logreg.predict(X_test)
log_proba = logreg.predict_proba(X_test)

# Accuracy score
print(accuracy_score(y_test, log_pred))

# Confusion matrix
print(confusion_matrix(y_test, log_pred))

# Classification report
print(classification_report(y_test, log_pred))

# Log loss
print(log_loss(y_test, log_proba))

0.8942345404628056
[[6840    0]
 [ 809    0]]
              precision    recall  f1-score   support

           0       0.89      1.00      0.94      6840
           1       0.00      0.00      0.00       809

    accuracy                           0.89      7649
   macro avg       0.45      0.50      0.47      7649
weighted avg       0.80      0.89      0.84      7649

0.33563291070367046


  'precision', 'predicted', average, warn_for)
