# <center>Portuguese Bank Marketing Analytics</center>

# Table of Contents
1. **[Project Background](#1)**
2. **[Load the data and Examine the data](#2)**
3. **[Data Cleaning](#3)**
4. **[Exploratory Data Analysis](#4)**
5. **[Data Visualization](#5)**
6. **[Machine Learning: Classification](#6)**
7. **[Conclusion & Recommendations](#7)**

### 1. Project Background

A Portuguese bank saw its revenue decline and wanted to understand why, so that it could take corrective action. The investigation found that the main cause was that clients were no longer making deposits as often as before. Term deposits matter to a bank because they let it hold a client's money for a fixed period, which it can invest in higher-yield financial products to earn a profit. Clients with term deposits are also easier to persuade to buy other products, such as funds or insurance, which raises revenue further. Consequently, the bank would like to identify the existing clients who are most likely to subscribe to a term deposit and focus its marketing efforts on them.

###### This Jupyter Notebook loads, explores, and visualizes the Bank Marketing dataset. It then builds and tests several predictive models and uses the best one to predict whether a client in the test data will subscribe to a term deposit.

In [1]:
# Current workspace
!pwd

/Users/suroor/Desktop/Springboard/Bank-Marketing-Client-Subscription-


In [2]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from sklearn.utils import shuffle

%matplotlib inline

###  2. Load the data and Examine the data

In [3]:
#Load csv file to pd dataframe
bank_data = pd.read_csv("bank-additional-full.csv",sep=';')

FileNotFoundError: [Errno 2] File b'bank-additional-full.csv' does not exist: b'bank-additional-full.csv'
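
The `FileNotFoundError` above indicates that the CSV was not in the working directory when the cell ran. A minimal sketch of a guarded load follows. `load_bank_data` is a hypothetical helper that is not part of the original notebook, and it assumes the file uses `;` as its separator, as in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_bank_data(csv_path):
    """Read the semicolon-separated CSV, failing early with a clear message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Report the working directory so a wrong relative path is easy to spot.
        raise FileNotFoundError(
            f"{path} not found; current directory is {Path.cwd()}"
        )
    return pd.read_csv(path, sep=";")
```

Calling `load_bank_data("bank-additional-full.csv")` would then either return the frame or name the directory that was searched.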

#### <font color=green>*Examine the data*</font>

In [None]:
# Columns information
bank_data.columns

In [None]:
#change column names
bank_data.rename(columns={'default': 'has_credit','housing':'housing_loan','loan':'personal_loan','y':'subscribed'}, inplace=True)
bank_data.columns

In [None]:
# print first ten rows of bank_data
bank_data.head(10)

In [None]:
# display total number of rows and columns
bank_data.shape

In [None]:
# show column types and non-null counts of bank_data
bank_data.info()

In [None]:
# Describe numeric bank_data
bank_data.describe()

In [None]:
# check the occurrence of each job in bank_data
bank_data['job'].value_counts()

### 3. Data Cleaning

<img src='https://www.officedepot.com/resource/blob/60908/0f84e67038ac7cd8966cacaaab7f92fb/know-what--and-when--to-purge-data.gif' width='500'>

#### *Check for Duplicates*

In [None]:
bank_data.duplicated().sum()

#### <font color=green>*Check invalid or corrupt data and remove it*</font>

In [None]:
def clean_data(data):
    '''Remove duplicated rows.'''
    deduplicated = data.drop_duplicates()
    return deduplicated

In [None]:
clean_bank_data = clean_data(bank_data)
clean_bank_data.shape
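
To make the duplicate check and removal concrete, here is a small sketch on a toy frame. The values are made up for illustration and only stand in for `bank_data`.

```python
import pandas as pd

# Toy frame standing in for bank_data (values are invented for illustration).
toy = pd.DataFrame({
    "age": [30, 30, 41, 52],
    "job": ["admin.", "admin.", "technician", "retired"],
    "subscribed": ["no", "no", "yes", "no"],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset the index so it stays contiguous.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated.shape)
```

Comparing the row count before and after is a quick sanity check that only the flagged rows were dropped.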

In [None]:
clean_bank_data.info()

### 4. Exploratory Data Analysis

<img src='http://piesiecreativity.com/wp-content/uploads/2018/02/Online-online.gif' width='500'>

In [None]:
'''Divide variables into categorical and numerical'''
categorical_vars = [col for col in clean_bank_data.columns if clean_bank_data[col].dtype == 'object' and col != 'subscribed']
numeric_vars = ['age', 'duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
target_var = ['subscribed']

In [None]:
# separate feature and target variables
X = clean_bank_data.iloc[:,:-1]
y = clean_bank_data.iloc[:,-1]
X_list = list(X.columns)
X_list

In [None]:
# class counts of the target variable
y.value_counts()
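
The class counts matter when reading accuracy later on: if one class dominates, a model that always predicts it already scores high. A minimal sketch of that baseline follows, using made-up labels rather than the real `y`.

```python
import pandas as pd

# Invented labels for illustration; the real proportions come from `y`.
labels = pd.Series(["no"] * 8 + ["yes"] * 2, name="subscribed")

# Share of each class instead of raw counts.
class_share = labels.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(majority_baseline)
```

Any classifier in section 6 should be judged against this baseline, not against 50%.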

In [None]:
X['marital'].value_counts()

#### *Summarize numeric variable*

In [None]:
X.describe(include=[np.number])

#### *Summarize categorical variable*

In [None]:
X.describe(include = ['O'])

#### *Find Correlation of variables*

In [None]:
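def get_correlation(data):
    # Pearson correlation is only defined for numeric columns, so drop the object columns first
    return data.select_dtypes(include=[np.number]).corr()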

corr_data = get_correlation(X)
corr_data
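
A full correlation matrix is hard to scan by eye. The sketch below lists the most strongly correlated column pairs instead. `top_correlated_pairs` is a hypothetical helper and the toy values are invented.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=3):
    """Return the n numeric column pairs with the largest absolute Pearson correlation.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    corr = frame.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, 1, -1],
    "job": ["admin.", "retired", "admin.", "services"],  # ignored: not numeric
})
print(top_correlated_pairs(toy))
```

Highly correlated pairs are candidates for dropping one of the two columns before modelling.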

### 5. Data Visualization

<img src='https://media0.giphy.com/media/3oKIPEqDGUULpEU0aQ/giphy.gif' width='500'>

In [None]:
plt.figure(figsize = (14, 8))
plt.subplot(2,2,1)
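# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.boxplot(x=X.age, color='orange')
plt.subplot(2,2,2)
sns.histplot(X.duration, kde=True, color='green')
plt.subplot(2,2,3)
sns.histplot(X['nr.employed'], kde=True, color='blue')
plt.subplot(2,2,4)
sns.histplot(X.campaign, kde=True, color='red')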
plt.show()

In [None]:
# Visualize all numerical variables
sns.set(style="darkgrid")
fig, ax = plt.subplots(2, 5, figsize=(35, 15))
for variable, subplot in zip(numeric_vars, ax.flatten()):
    sns.histplot(X[variable], ax=subplot, kde=False, color='purple')
    for label in subplot.get_xticklabels():
        label.set_rotation(90)
plt.subplots_adjust(wspace=0.45, hspace=0.8)

In [None]:
# Visualize all categorical variables
sns.set(style="darkgrid")
fig, ax = plt.subplots(2, 5, figsize=(35, 15))
for variable, subplot in zip(categorical_vars, ax.flatten()):
    sns.countplot(x=X[variable], ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)
plt.subplots_adjust(wspace=0.45, hspace=0.8)


In [None]:
fig, ax = plt.subplots(2, 5, figsize=(35, 15))
for variable, subplot in zip(categorical_vars, ax.flatten()):
    sns.countplot(x=variable,hue='subscribed',data=clean_bank_data,ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)
plt.subplots_adjust(wspace=0.45, hspace=0.8)

In [None]:
# Visualize all numeric variables
#sns.pairplot(clean_bank_data, hue='subscribed')
X.head()

### 6. Machine Learning: Classification

<img src='https://cdn-images-1.medium.com/max/1600/0*NbQlrmQFOsjPFB-f.gif' width='500'>

In [None]:
# import LabelEncoder and instantiate object
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# apply LabelEncoder to every column; note that this also re-codes the numeric columns as integer codes
X = X.apply(le.fit_transform)
X.head(10)

In [None]:
#import OneHotEncoder and instantiate object
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(X)
X = ohe.transform(X).toarray()
X.shape
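
The two cells above encode every column, numeric ones included, so each distinct age or duration becomes its own one-hot column. One alternative, sketched here under the assumption that only the object columns should be one-hot encoded, is scikit-learn's `ColumnTransformer`. The toy frame and the names `toy_categorical` and `toy_numeric` are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric and two categorical columns (invented values).
toy = pd.DataFrame({
    "age": [30, 41, 52, 28],
    "duration": [120, 300, 90, 45],
    "job": ["admin.", "technician", "retired", "admin."],
    "marital": ["single", "married", "married", "single"],
})
toy_categorical = ["job", "marital"]
toy_numeric = ["age", "duration"]

preprocess = ColumnTransformer(
    [
        # One-hot encode only the categorical columns.
        ("cat", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
        # Scale the numeric columns and keep them as single columns.
        ("num", StandardScaler(), toy_numeric),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
encoded = preprocess.fit_transform(toy)
print(encoded.shape)
```

With this layout the width of the encoded matrix grows with the number of categories, not with the number of distinct numeric values.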

In [None]:
X

In [None]:
y, unique = pd.factorize(y)
y

In [None]:
 #onehotlabels1 = pd.get_dummies(X)
 #onehotlabels1.shape

In [None]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 10)
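
If the target classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. A minimal sketch of a stratified split follows, using synthetic data in place of `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 10% positives.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
labels = np.array([1] * 20 + [0] * 180)

# stratify=labels keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.30, random_state=10, stratify=labels
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to passing `stratify=y` to the existing `train_test_split` call.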

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
print('Training Features Shape:', X_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Labels Shape:', y_test.shape)

In [None]:
# from sklearn.linear_model import LogisticRegression
# #from sklearn.naive_bayes import GaussiaNB
# from sklearn.naive_bayes import GaussianNB
# from sklearn.svm import SVC
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier


# models = [
#     ('LR',LogisticRegression()),
#     ('NB',GaussianNB()),
#     ('SVM', SVC()),
#     ('KNN',KNeighborsClassifier()),
#     ('DT',DecisionTreeClassifier()),
#     ('RF',RandomForestClassifier())
#     ]

# for name, model in models:
#     clf = model
#     clf.fit(X_train,y_train)
#     accuracy = clf.score(X_test,y_test)
#     print(name,': ',accuracy)

In [None]:
# Import the model we are using
from sklearn.ensemble import RandomForestClassifier
# Instantiate model with 50 decision trees
rf = RandomForestClassifier(n_estimators = 50, random_state = 42)

# Train the model on training data
rf.fit(X_train, y_train);

# Use the forest's predict method on the test data
y_predict = rf.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
# Model Accuracy, how often is the classifier correct?
accuracy = accuracy_score(y_test, y_predict)
print("Accuracy:", accuracy)

# Calculate the absolute errors (0 for a correct prediction, 1 for a wrong one)
errors = abs(y_predict - y_test)

# For 0/1 labels the mean absolute error is the misclassification rate (1 - accuracy)
print('Misclassification rate:', round(np.mean(errors), 2))

#acc_random_forest = round(rf.score(X_train, y_train) * 100, 2)
#print(round(acc_random_forest,2,), "%")

print("Confusion Matrix:")
print(confusion_matrix(y_test,y_predict))

print("Classification Report:")
print(classification_report(y_test, y_predict))
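
To show how the numbers in the confusion matrix and classification report relate to each other, here is a small worked sketch on invented predictions. The arrays `toy_true` and `toy_pred` are made up and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels: 6 negatives and 4 positives.
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

toy_accuracy = accuracy_score(toy_true, toy_pred)      # (TP + TN) / total
toy_error_rate = np.mean(toy_true != toy_pred)         # 1 - accuracy
toy_precision = precision_score(toy_true, toy_pred)    # TP / (TP + FP)
toy_recall = recall_score(toy_true, toy_pred)          # TP / (TP + FN)

print(tn, fp, fn, tp)
print(toy_accuracy, toy_error_rate, toy_precision, toy_recall)
```

Here accuracy looks moderate while recall on the positive class is low, which is why the per-class report is worth reading alongside the single accuracy figure.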

### Cross Validation (K-fold)

In [None]:
from sklearn.model_selection import KFold, cross_val_score
K_fold = KFold (n_splits=10, shuffle=True,random_state=0)

### Random Forest

In [None]:
clf = RandomForestClassifier(n_estimators = 10)
# Cross-validate on the training data with the K_fold splitter defined above
score = cross_val_score(clf,X_train,y_train,cv=K_fold,n_jobs=1,scoring='accuracy')
score

In [None]:
round(np.mean(score)*100,2)
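
When preprocessing such as scaling is fitted before cross-validation, information from the validation folds can leak into training. One way to avoid this, sketched below on synthetic data, is to wrap the scaler and the classifier in a pipeline so that both are refitted inside every fold. The names `X_demo`, `y_demo`, `model`, and `folds` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the bank features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# The scaler is fitted on the training part of each fold only.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

# Stratified folds keep the class ratio similar across folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
print(round(cv_scores.mean() * 100, 2))
```

Tree ensembles do not need scaling, so the scaler is here only to show where a preprocessing step belongs.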

In [None]:
#from pprint import pprint
## Look at parameters used by our current forest
#print('Parameters currently in use:\n')
#pprint(rf.get_params())

In [None]:
#from sklearn.model_selection import RandomizedSearchCV
## Number of trees in random forest
#n_estimators = [int(x) for x in np.linspace(start = 20, stop = 500, num = 10)]
## Number of features to consider at every split ('auto' is no longer accepted by recent scikit-learn)
#max_features = ['sqrt', 'log2']
## Maximum number of levels in tree
#max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
#max_depth.append(None)
## Minimum number of samples required to split a node
#min_samples_split = [2, 5, 10]
## Minimum number of samples required at each leaf node
#min_samples_leaf = [1, 2, 4]
## Method of selecting samples for training each tree
#bootstrap = [True, False]
## Create the random grid
#random_grid = {'n_estimators': n_estimators,
#               'max_features': max_features,
#               'max_depth': max_depth,
#               'min_samples_split': min_samples_split,
#               'min_samples_leaf': min_samples_leaf,
#               'bootstrap': bootstrap}
#pprint(random_grid)
#

In [None]:
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.model_selection import RandomizedSearchCV
## Use the random grid to search for best hyperparameters
## First create the base model to tune
#rf1 = RandomForestClassifier()
## Random search of parameters, using 3 fold cross validation, 
## search across 100 different combinations, and use all available cores
#rf_random = RandomizedSearchCV(estimator = rf1, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
## Fit the random search model
#rf_random.fit(X_train, y_train)

In [None]:
#rf_random.best_params_

In [None]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.ensemble import RandomForestClassifier
# # Create the parameter grid based on the results of random search 
# param_grid = {
#     'bootstrap': [True],
#     'max_depth': [80, 90, 100, 110],
#     'max_features': [2, 3],
#     'min_samples_leaf': [3, 4, 5],
#     'min_samples_split': [8, 10, 12],
#     'n_estimators': [100, 150, 180, 200]
# }
# # Create a base model (a classifier, since the target is a class label)
# rf1 = RandomForestClassifier()
# # Instantiate the grid search model
# grid_search = GridSearchCV(estimator = rf1, param_grid = param_grid, 
#                           cv = 3, n_jobs = -1, verbose = 2)

In [None]:
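# grid_search is defined in the commented-out cell above; uncomment that cell before running these lines
# grid_search.fit(X_train, y_train)
# grid_search.best_params_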
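
For reference, a grid search over a random forest classifier can be sketched on a much smaller grid so that it runs in seconds. The data here is synthetic and the grid values in `small_grid` are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Deliberately tiny grid: 2 x 2 = 4 candidates, each evaluated with 3-fold CV.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is already refitted on all the data passed to `fit` and can be used directly for prediction.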

In [None]:
# best_grid = grid_search.best_estimator_
# print(best_grid)
# grid_accuracy = accuracy_score(y_test, best_grid.predict(X_test))
# print(grid_accuracy)
# print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - accuracy) / accuracy))
# 

In [None]:
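# X is a NumPy array at this point, so take the column names from the fitted encoder
# (get_feature_names_out requires scikit-learn >= 1.0)
feature_names = ohe.get_feature_names_out(X_list)
for name, importance in zip(feature_names, rf.feature_importances_):
    print(name, "=", importance)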
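
Because one-hot encoding spreads each original column over many encoded columns, the per-column importances are easier to read once summed back to their source column. A minimal sketch follows. `importance_by_column` is a hypothetical helper, and it assumes encoded names of the form `<column>_<category>` where the category part contains no underscore.

```python
import pandas as pd


def importance_by_column(encoded_names, importances, sep="_"):
    """Sum one-hot importances back to the original column.

    Hypothetical helper; assumes names look like '<column><sep><category>'.
    """
    # Split on the last separator only, so 'day_of_week_3' maps to 'day_of_week'.
    source = [name.rsplit(sep, 1)[0] for name in encoded_names]
    totals = pd.Series(importances, index=source).groupby(level=0).sum()
    return totals.sort_values(ascending=False)


# Invented names and importances for illustration.
demo_names = ["job_0", "job_1", "age_3", "age_7", "day_of_week_2"]
demo_importances = [0.10, 0.15, 0.30, 0.40, 0.05]
print(importance_by_column(demo_names, demo_importances))
```

The summed values still add up to the same total as the raw importances, so they can be compared across columns.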


### 7. Conclusion & Recommendations

<img src='https://www.usefultechtips.com/wp-content/uploads/2018/01/how-to-improve-website-speed.jpg' width='600'>