# AI Saturdays Training Exercise - Bank Classifier
The bank's marketing campaigns depend on customer data. The volume of this data is so large that a data analyst cannot realistically extract useful information from it by hand to support decision-making.

Machine learning models can help improve the performance of these campaigns.

## Dataset

This dataset is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same customer was required in order to assess whether the product would be subscribed ('yes') or not ('no').

The objective is to predict whether the customer will subscribe (yes / no) to a term deposit, building a classification model using decision trees.

## Summary of data
### Categorical Variables :
job : admin., technician, services, management, retired, blue-collar, unemployed, entrepreneur, housemaid, unknown, self-employed, student

marital : married, single, divorced

education: secondary, tertiary, primary, unknown

default : yes, no

housing : yes, no

loan : yes, no

deposit : yes, no (Dependent Variable)

contact : unknown, cellular, telephone

month : jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec

poutcome: unknown, other, failure, success


### Numerical Variables:
age, balance, day, duration, campaign, pdays, previous

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn import datasets
from io import StringIO
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
%matplotlib inline

In [4]:
# Create dataframe from .csv
df = pd.read_csv('bank.csv')

# Show number of rows and columns of the dataframe
print("Rows " + str(df.shape[0]) + " Cols: " + str(df.shape[1]))

# Show the first 10 rows (TO-DO)

Rows 11162 Cols: 17


In [5]:
# Find number of unique values in each column


In [6]:
# Check for null values in the dataset


In [7]:
# Show general dataframe information


In [14]:
# Basic analytical description of the dataframe
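One possible way to fill in the inspection cells above (first rows, unique values, nulls, general information, description). This is only a sketch: it runs on a small made-up frame instead of `bank.csv`, and the values are illustrative assumptions. Only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for bank.csv (hypothetical values, real column names)
df = pd.DataFrame({
    "age": [59, 56, 41, 55, 54],
    "job": ["admin.", "admin.", "technician", "services", "admin."],
    "balance": [2343, 45, 1270, 2476, 184],
    "deposit": ["yes", "yes", "yes", "no", "no"],
})

# First rows (on the full dataset this shows the first 10)
print(df.head(10))

# Number of unique values in each column
unique_counts = df.nunique()
print(unique_counts)

# Number of null values in each column
null_counts = df.isnull().sum()
print(null_counts)

# Column dtypes, non-null counts, and memory usage
df.info()

# Summary statistics for the numerical columns
summary = df.describe()
print(summary)
```

In the notebook, each call would normally go in its own cell so that its output is displayed separately.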


In [16]:
# Age distribution
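A histogram is one reasonable way to look at the age distribution. The sketch below uses made-up ages (an assumption, not values from `bank.csv`), and the bin count of 5 is an arbitrary choice worth experimenting with.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the 'age' column (hypothetical values)
df = pd.DataFrame({"age": [25, 31, 34, 38, 41, 45, 52, 59, 63, 71]})

# Histogram of customer ages
ax = df["age"].plot(kind="hist", bins=5, edgecolor="black", title="Age distribution")
ax.set_xlabel("age")
```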


###  Convert categorical data 

In [None]:
# Make a copy first !!
bank_data = df.copy()


#### Job

In [None]:
# Count the people who made a deposit, per job category
jobs = ['management','blue-collar','technician','admin.','services','retired','self-employed','student',\
        'unemployed','entrepreneur','housemaid','unknown']

for j in jobs:
    print("{:15} : {:5}".format(j, len(bank_data[(bank_data.deposit == "yes") & (bank_data.job == j)])))

In [17]:
# Different types of job categories and their counts
bank_data.job.value_counts()


In [None]:
# Combine similar jobs into categories
bank_data['job'] = bank_data['job'].replace(['management', 'admin.'], 'white-collar')
bank_data['job'] = bank_data['job'].replace(['services','housemaid'], 'pink-collar')
bank_data['job'] = bank_data['job'].replace(['retired', 'student', 'unemployed', 'unknown'], 'other')

In [None]:
# New value counts
bank_data.job.value_counts()

#### poutcome

In [None]:
bank_data.poutcome.value_counts()

In [None]:
# Merge "other" into "unknown", since "other" does not correspond to either "success" or "failure"
bank_data['poutcome'] = bank_data['poutcome'].replace(['other'] , 'unknown')
bank_data.poutcome.value_counts()

#### contact

In [None]:
# Drop 'contact' 
bank_data.drop('contact', axis=1, inplace=True)

#### default

In [None]:
# values for "default" : yes/no, encoded as 1/0 in a new column; the original is then dropped
bank_data['default_cat'] = bank_data['default'].map({'yes': 1, 'no': 0})
bank_data.drop('default', axis=1, inplace=True)

#### housing, loan, deposit

In [None]:
# values for "housing" : yes/no


In [None]:
# values for "loan" : yes/no


In [None]:
# values for "deposit" : yes/no
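The three cells above can follow the same pattern as the `default` cell. A minimal sketch on a made-up frame: the name `deposit_cat` is the one later cells refer to, while `housing_cat` and `loan_cat` are assumed by analogy with `default_cat`.

```python
import pandas as pd

# Toy stand-in for bank_data (hypothetical values)
bank_data = pd.DataFrame({
    "housing": ["yes", "no", "no"],
    "loan": ["no", "no", "yes"],
    "deposit": ["yes", "no", "yes"],
})

# Encode each yes/no column as 1/0 in a new '<name>_cat' column,
# then drop the original text column
for col in ["housing", "loan", "deposit"]:
    bank_data[col + "_cat"] = bank_data[col].map({"yes": 1, "no": 0})
    bank_data.drop(col, axis=1, inplace=True)

print(bank_data)
```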


In [18]:
# pdays: number of days that passed after the client was last contacted from a previous campaign
# -1 means that the client was not previously contacted

print("Customers that have not been contacted before:", len(bank_data[bank_data.pdays==-1]))
print("Maximum value of pdays     :", bank_data['pdays'].max())

In [None]:
# Map pdays == -1 to a large value (10000 is used) to indicate that the contact is so far in the past that it has no effect
bank_data.loc[bank_data['pdays'] == -1, 'pdays'] = 10000


In [None]:
# Create a new column: recent_pdays (the reciprocal of pdays, so recent contacts get larger values)
bank_data['recent_pdays'] = 1 / bank_data['pdays']

# Drop 'pdays'
bank_data.drop('pdays', axis=1, inplace = True)
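To see what the `pdays` transformation does, here is a small worked example on made-up values (the frame `toy` is hypothetical). Customers who were never contacted end up close to 0, and recently contacted customers get the largest values.

```python
import pandas as pd

# Hypothetical pdays values; -1 means the customer was never contacted
toy = pd.DataFrame({"pdays": [-1, 2, 50, 400]})

# Same two steps as above: replace -1 with 10000, then take the reciprocal
toy.loc[toy["pdays"] == -1, "pdays"] = 10000
toy["recent_pdays"] = 1 / toy["pdays"]

print(toy)
```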

### Convert to dummy values

In [None]:
# Convert categorical variables to dummies
bank_with_dummies = pd.get_dummies(data=bank_data, columns = ['job', 'marital', 'education', 'poutcome'], \
                                   prefix = ['job', 'marital', 'education', 'poutcome'])
bank_with_dummies.head()

In [None]:
bank_with_dummies.shape


In [None]:
bank_with_dummies.describe()


In [None]:
# Scatterplot showing age and balance
bank_with_dummies.plot(kind='scatter', x='age', y='balance');

# How do you interpret this plot?

In [None]:
bank_with_dummies.plot(kind='hist', x='poutcome_success', y='duration');


In [None]:
# People who subscribed to a term deposit
bank_with_dummies[bank_data.deposit_cat == 1].describe()


In [None]:
# Bar chart of job Vs deposit
plt.figure(figsize = (10,6))
sns.barplot(x='job', y = 'deposit_cat', data = bank_data)

### Establish relationships between features


In [9]:
# Show variable correlation matrix
# Hint: explore plt.matshow and the corr() method of a dataframe
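A sketch of the hint above, using randomly generated data as a stand-in for `bank_with_dummies` (the frame `toy` and its values are assumptions). `select_dtypes` keeps only numeric columns, which matters on the full frame if any text columns remain.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for bank_with_dummies (random, hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=50),
    "balance": rng.normal(1500, 800, size=50),
    "duration": rng.integers(10, 1000, size=50),
})
toy["deposit_cat"] = (toy["duration"] > 400).astype(int)

# Pairwise Pearson correlations between the numeric columns
corr = toy.select_dtypes(include="number").corr()

# Draw the matrix as a colored grid with labeled axes
fig, ax = plt.subplots(figsize=(6, 6))
im = ax.matshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(im)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
```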

In [10]:
# Show pairwise relationships between variables as a matrix of scatter plots
# useful for spotting linear relationships

# Hint: explore pd.plotting.scatter_matrix
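A sketch of the scatter-matrix hint on randomly generated stand-in data (the frame `toy` is an assumption). On the real data it is usually worth passing only a handful of columns, since the grid grows quadratically with the number of variables.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import numpy as np
import pandas as pd

# Toy stand-in for a few numeric columns (random, hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=50),
    "balance": rng.normal(1500, 800, size=50),
    "duration": rng.integers(10, 1000, size=50),
})

# One scatter plot per pair of columns, with histograms on the diagonal
axes = pd.plotting.scatter_matrix(toy, figsize=(8, 8), diagonal="hist", alpha=0.5)
```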

In [11]:
# Split the data into train and test sets in a certain proportion (experiment!)
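One way to do the split with `train_test_split`, sketched on random stand-in data. The 80/20 proportion and the `random_state` are arbitrary choices, and the variable names `X`, `y`, `X_train`, and so on are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for bank_with_dummies (random, hypothetical values)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "duration": rng.integers(10, 1000, size=n),
})
toy["deposit_cat"] = (toy["duration"] > 400).astype(int)

# Features are every column except the target
X = toy.drop("deposit_cat", axis=1)
y = toy["deposit_cat"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```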


In [13]:
# Define a classifier


# Train the classifier with the train dataset

# Predict the target values for the test features

# Calculate accuracy
# Hint: explore sklearn.metrics.accuracy_score
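A minimal sketch of the four steps above with a decision tree, again on random stand-in data rather than the bank dataset. The names `clf`, `y_pred`, and `accuracy` are assumptions, and the accuracy on this toy data says nothing about the real problem.

```python
import numpy as np
import pandas as pd
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared data (random, hypothetical values)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "duration": rng.integers(10, 1000, size=n),
})
y = (X["duration"] > 400).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Define a classifier
clf = tree.DecisionTreeClassifier(random_state=1)

# Train the classifier with the train dataset
clf.fit(X_train, y_train)

# Predict the target for the test features
y_pred = clf.predict(X_test)

# Calculate accuracy: fraction of test rows predicted correctly
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```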


#### Compare Training and Testing scores for various tree depths used


In [None]:
#print('{:10} {:20} {:20}'.format('depth', 'Training score','Testing score'))
#print('{:10} {:20} {:20}'.format('-----', '--------------','-------------'))
#print('{:1} {:>25} {:>20}'.format(2, dt2_score_train, dt2_score_test))
#print('{:1} {:>23} {:>20}'.format("max", dt1_score_train, dt1_score_test))
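The commented-out prints above expect `dt1_*` and `dt2_*` scores. One way these could be produced, assuming `dt2` is a tree limited to depth 2 and `dt1` is a tree with no depth limit (the toy data, with 15% of labels flipped as noise, is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Toy stand-in with noisy labels so that tree depth makes a difference
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "balance": rng.normal(1500, 800, size=n),
    "duration": rng.integers(10, 1000, size=n),
})
noise = rng.random(n) < 0.15
y = ((X["duration"] > 400) ^ noise).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Shallow tree: at most 2 levels of splits
dt2 = tree.DecisionTreeClassifier(random_state=1, max_depth=2)
dt2.fit(X_train, y_train)
dt2_score_train = dt2.score(X_train, y_train)
dt2_score_test = dt2.score(X_test, y_test)

# Unrestricted tree: grows until the training leaves are pure
dt1 = tree.DecisionTreeClassifier(random_state=1)
dt1.fit(X_train, y_train)
dt1_score_train = dt1.score(X_train, y_train)
dt1_score_test = dt1.score(X_test, y_test)

print('{:10} {:20} {:20}'.format('depth', 'Training score', 'Testing score'))
print('{:10} {:20} {:20}'.format('-----', '--------------', '-------------'))
print('{:1} {:>25} {:>20}'.format(2, dt2_score_train, dt2_score_test))
print('{:1} {:>23} {:>20}'.format("max", dt1_score_train, dt1_score_test))
```

A large gap between the training and testing score of the unrestricted tree is the usual sign of overfitting, which is what this comparison is meant to expose.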

In [None]:
# Uncomment below to generate the digraph Tree.
#tree.export_graphviz(dt2, out_file='tree_depth_2.dot', feature_names=features)
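The export call above needs a fitted `dt2` and a `features` list. A sketch under the assumption that `features` holds the feature column names, on random stand-in data; passing `out_file=None` returns the DOT source as a string instead of writing a file.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Toy stand-in for the prepared data (random, hypothetical values)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "duration": rng.integers(10, 1000, size=n),
})
y = (X["duration"] > 400).astype(int)

# Assumed meaning of `features`: the names of the feature columns
features = list(X.columns)

# Depth-2 tree, as in the comparison above
dt2 = tree.DecisionTreeClassifier(random_state=1, max_depth=2)
dt2.fit(X, y)

# out_file=None returns the Graphviz DOT text instead of writing it to disk
dot_text = tree.export_graphviz(dt2, out_file=None, feature_names=features)
print(dot_text)
```

With `out_file='tree_depth_2.dot'` as in the cell above, the same text is written to disk and can be rendered with the Graphviz `dot` tool if it is installed.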

### Best result achieved by us -> Accuracy: 0.8943918426802622

