# Logistic Regression

This is my first project on Kaggle. Since I am new to this domain, I am sure that I am making a lot of mistakes. I am open to any constructive criticism.

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#about_dataset">About the dataset</a></li>
        <li><a href="#business_problem">Business Problem</a></li>
        <li><a href="#data_exploration">Data Exploration</a></li>
        <li><a href="#visualization_analysis">Data Visualization and Analysis</a></li>
        <li><a href="#classification">Classification</a></li>
    </ol>
</div>
<hr>

## I. About the dataset
 
The dataset comes from the UCI Machine Learning repository and is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

The dataset consists of 45,211 customer records from these campaigns, with the following variables:
+ Client: age, job, marital, education, default status, housing, and loan
+ Campaign: last contact type, last contact month of year, last contact day of the week, and last contact duration
+ Others: number of contacts performed in the current campaign, number of days that passed after the client was last contacted, number of contacts performed before this campaign, outcome of the previous campaign, and whether the client has subscribed to a term deposit.

The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y).
                
#### 1. Title: Bank Marketing

#### 2. Sources

The dataset is publicly available for research. The details are described in [Moro et al., 2011].

[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 

In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

Available at: [pdf] http://hdl.handle.net/1822/14838

              [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
              

#### 3. Number of Instances: 45211 for bank-full.csv

#### 4. Number of Attributes: 16 input attributes + 1 output attribute.

#### 5. Attribute information:

Input variables

### Bank client data

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (categorical: 'no','yes','unknown')

8 - loan: has personal loan? (categorical: 'no','yes','unknown')

### Related with the last contact of the current campaign

9 - contact: contact communication type (categorical: 'cellular','telephone')

10 - day: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

12 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

### Other attributes

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')


### Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: 'yes','no')

#### 6. Missing Attribute Values: None

## II. Business Problem

The Portuguese bank has seen a decline in revenue and would like to know what actions to take. After investigation, they found that the root cause is that their clients are not depositing as frequently as before. Term deposits allow a bank to hold onto a deposit for a specific amount of time, so the bank can invest the money in higher-yield financial products to make a profit. Banks also have a better chance of persuading term deposit clients to buy other products, such as funds or insurance, to further increase revenue. As a result, the bank would like to identify existing clients who are more likely to subscribe to a term deposit and focus its marketing effort on them.

To resolve the problem, we suggest a classification approach to predict which clients are more likely to subscribe to term deposits.

## III. Data Exploration

Let's load required libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sys
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from collections import Counter

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
%config Completer.use_jedi = False

sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

np.set_printoptions(threshold=sys.maxsize)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
%matplotlib inline

In [None]:
df = pd.read_csv('../input/portuguese-bank-marketing-data-set/bank-full.csv',sep=';')
df.head()

In [None]:
df.shape

In [None]:
#  Find missing values of each feature in the data set.
df.info()

In [None]:
df.describe().astype(np.int64)

### To get a feel for the type of data we are dealing with, we visualize distributions of numerical features with histograms

In [None]:
%matplotlib inline
df[['age','duration','campaign','previous']].hist(bins=30, figsize=(20,15))
plt.savefig("attribute_histogram_plots")
plt.show()

In [None]:
# Visualize feature correlations
fig, ax = plt.subplots(figsize=(10,10))  
sns.heatmap(df.select_dtypes(include='number').corr(),
            square=True, cmap='RdBu_r', linewidths=.5,
            annot=True, fmt='.2f').figure.tight_layout()
plt.show()

Most of our features are categorical; hence, this heatmap does not help much. We can see that duration is a good indicator; however, its value is only known once the call is done.

### Print unique values for each column

In [None]:
category_features = df.select_dtypes(include=['object', 'bool']).columns.values

for col in category_features:
    print(col, "(", len(df[col].unique()) , "values):\n", np.sort(df[col].unique()))

I am not sure why we don't have data for the month of January and February.

In [None]:
for col in category_features:
    print(f"\033[1m\033[94m{col} \n{20 * '-'}\033[0m")    
    print(df[col].value_counts(), "\n")
    
# number of unique values per column
print(df.nunique())

## IV. Data Visualization and Analysis

### Category Data Distribution

We start the exploratory analysis of the categorical features by using the seaborn package to plot a bar chart of the value counts for each feature.

In [None]:
for col in category_features:
    plt.figure(figsize=(10,5))    
    counts = df[col].value_counts()
    # recent seaborn versions require x and y to be passed by keyword
    sns.barplot(x=counts.values, y=counts.index)
    plt.title(col)    
    plt.tight_layout()


Our observations:
1. Job: The campaigns mostly reach administrators, blue-collar workers, and technicians.
2. Marital status: Most clients are married; there are about twice as many married clients as single ones.
3. Education: Most clients have a university-level education, while very few are illiterate.
4. Default: Most clients have no credit in default.
5. Housing: Most clients have no housing loan.
6. Loan: Most clients have no personal loan.
7. Contact: Cellular is the most common means of communication.
8. Month: May is the busiest month and December the least busy (probably because of the holiday season).
9. Day of week: Thursday is the busiest day, while Friday is the least busy.

###  Subscription to the term deposit

In [None]:
# Pie chart
labels = ["Not \nsubscribed", "Subscribed"]
explode = (0, 0.1)  # only "explode" the second slice (i.e. 'Subscribed')

# depicting the visualization 
fig = plt.figure() 
ax = fig.add_axes([0,0,1,1]) 

ax.pie(df['y'].value_counts(), 
       labels = labels,
       explode = explode,
       autopct ='%1.2f%%',
       frame = True,
       textprops = dict(color ="black", size=12)) 

ax.axis('equal') 
plt.title('Subscription to the term deposit\n% of Total Clients',
     loc='left',
     color = 'black', 
     fontsize = '18')

plt.show()

11.27% of customers subscribed to the term deposit, so our classes are imbalanced: the positive class (subscribed) makes up only 11.27% of the data. We will balance the classes later, in the Classification section.
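
The class shares can also be read directly from `value_counts(normalize=True)` instead of the pie chart. The sketch below uses a small made-up Series (`y_toy`, a hypothetical stand-in for `df['y']`), so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df['y']: 8 'no' and 2 'yes' answers
y_toy = pd.Series(['no'] * 8 + ['yes'] * 2, name='y')

# Absolute counts and relative shares (in percent) per class
counts = y_toy.value_counts()
shares = y_toy.value_counts(normalize=True).mul(100).round(2)

print(counts)
print(shares)
```

On the real data, replacing `y_toy` with `df['y']` should give the same percentages as the chart.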


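### Top 5 most common contact counts

Note that `campaign` holds the number of contacts made to a client during this campaign, not a campaign identifier. Here we print the five most common contact counts and how many clients fall under each.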

In [None]:
# We will groupby then count
df.groupby(['campaign'])['y'].count().reset_index().sort_values(by='y', ascending=False).iloc[:5]

### What is the target audience?
### Which customers were more likely to subscribe to the term deposit?

In [None]:
table = pd.crosstab(df.job, df.y)
table.columns = ['Not subscribed', 'Subscribed']
table.plot(kind='bar')

plt.grid(True)

plt.title('Purchase Frequency for Job Title')
plt.xlabel('Job')
plt.ylabel('Frequency of Purchase')

In [None]:
table = pd.crosstab(df.job, df.y)
table = round(table.div(table.sum(axis=1), axis=0).mul(100), 2)
table.columns = ['not subscribed', 'subscribed']
table.sort_values(by=['subscribed'], ascending=False).loc[:, 'subscribed']

The campaigns mostly target admins, blue-collar workers, and technicians, but students and retired people subscribe to the term deposit at noticeably higher rates (28.68% for students and 22.79% for retired people).

### Role of marital status in subscription behaviour

In [None]:
table = pd.crosstab(df.marital,df.y)
table = table.div(table.sum(1).astype(float), axis=0)
table.columns = ['Not subscribed', 'Subscribed']
# Ordering stacked bars and plot the chart
table[['Subscribed', 'Not subscribed']].plot(kind='bar', stacked=True)
plt.title('Frequency of Marital Status vs Purchase')
plt.xlabel('Marital Status')
plt.ylabel('Proportion of Customers')

#### There is no significant impact of marital status on subscription behaviour of customers.

## V. Classification

Although the "duration" feature highly affects the output target, its value is not known before a call is performed. Hence, this feature should be discarded from the list of predictors.

In [None]:
df = df.drop(['duration'], axis=1)

Here are steps we follow to preprocess our data:
1. Dealing with missing values
2. Splitting of data (80 : 20 split)
3. Handling Categorical Variable
4. Oversampling using SMOTE
5. Recursive Feature Elimination – RFE
6. Logistic Regression Model Fitting

### 1. Missing values
Luckily, our dataset does not contain missing data. Hence, we can skip this step.
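
A quick way to confirm this is to count NaN entries per column. Note that the attribute list above uses the string 'unknown' as a category, which `isna()` does not treat as missing, so it can be worth counting those separately. The sketch below runs on a tiny made-up frame (`toy`, with hypothetical values), not on the real data.

```python
import pandas as pd

# Tiny hypothetical frame mimicking a few columns of the dataset
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'job': ['admin.', 'unknown', 'technician'],
    'education': ['unknown', 'unknown', 'high.school'],
})

# True missing values (NaN) per column
nan_counts = toy.isna().sum()

# Placeholder 'unknown' strings per column, which isna() does not catch
unknown_counts = (toy == 'unknown').sum()

print(nan_counts)
print(unknown_counts)
```

Whether 'unknown' should be treated as a missing value or kept as its own category is a modelling choice; this notebook keeps it as a category.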

### 2. Splitting of data (80 : 20 split)

Here we split the data into training and test set so that we can fit and evaluate a learning model. We will use the train_test_split() function from scikit-learn and use 80 percent of the data for training and 20 percent for testing.

In [None]:
# load X and y
X = df.drop(columns=['y'])
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
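
Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the share of subscribers the same in the training and test sets. This is not what the cell above does; the sketch below uses synthetic labels (an assumption for illustration) together with the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positives (illustrative only)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy keeps the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```

Without `stratify`, the positive share in a small test set can drift away from the share in the full data purely by chance.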

### 3. Handling Categorical Variable

In this project, I will use one-hot encoding to convert ordinal and categorical variables to numerical values.

First, we classify features into two groups: numerical and categorical features:

In [None]:
numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns.values
numeric_features = numeric_features[numeric_features != 'y']

category_features = X_train.select_dtypes(include=['object', 'bool']).columns.values

print(numeric_features)
print(category_features)
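
To make the encoding step concrete, here is a minimal sketch of what one-hot encoding does to a single categorical column. The toy frames and the unseen value 'fax' are made up for illustration; `handle_unknown='ignore'` is the same option passed to `OneHotEncoder` in this notebook's preprocessing pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test columns (hypothetical values)
train_toy = pd.DataFrame({'contact': ['cellular', 'telephone', 'cellular']})
test_toy = pd.DataFrame({'contact': ['telephone', 'fax']})  # 'fax' is unseen

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_toy)

# Each known category becomes its own 0/1 column;
# an unseen category is encoded as a row of zeros
encoded = encoder.transform(test_toy).toarray()

print(encoder.categories_[0])
print(encoded)
```

Fitting the encoder on the training set only, as done here, avoids leaking information from the test set into the preprocessing step.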

In [None]:
def dummify(ohe, x, columns):
    transformed_array = ohe.transform(x)
    # ColumnTransformer may return a sparse matrix; convert it to a dense array
    if hasattr(transformed_array, 'toarray'):
        transformed_array = transformed_array.toarray()

    # names of the one-hot encoded columns
    # (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
    enc = ohe.named_transformers_['cat'].named_steps['onehot']
    cat_colnames = enc.get_feature_names_out(list(columns)).tolist()
    all_colnames = numeric_features.tolist() + cat_colnames

    # convert numpy array to dataframe
    df = pd.DataFrame(transformed_array, index=x.index, columns=all_colnames)

    return transformed_array, df

In [None]:
# impute missing numerical values with a median value, then scale the values
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# impute missing categorical values using the 'missing' and one hot encode the categories
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Each transformer is a three-element tuple that defines 
#                                 the name of the transformer, 
#                                 the transform to apply, 
#                                 and the column features to apply it to
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, category_features)])

ohe = preprocessor.fit(X_train)

X_train_t = ohe.transform(X_train)
X_test_t = ohe.transform(X_test)

In [None]:
# transform training and test set and then convert it to dataframe
X_train_t_array, X_train_t = dummify(ohe, X_train, category_features)
X_test_t_array, X_test_t = dummify(ohe, X_test, category_features)

X_train_t.head()

In [None]:
X_train_columns = X_train_t.columns
print(X_train_columns)

### 4. Oversampling using SMOTE

Input values:    
* Dataframe: X_train_t, y_train, X_test_t, y_test
    
* Array: X_train_t_array, X_test_t_array

As mentioned above, our data is imbalanced: the positive samples (the minority class, people who subscribed to the term deposit) make up only 11.27% of all samples, far fewer than the negative samples (the majority class). Accuracy is therefore not a good measure of performance, because simply predicting the negative class for every example already achieves 88.73% accuracy. We need a method to overcome the class imbalance problem, and in this section we use SMOTE to balance the training set.
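
The baseline argument can be checked with a few lines of arithmetic. The sketch below builds a hypothetical label list of 10,000 clients with the same 11.27% positive share and scores a "classifier" that always answers 'no'.

```python
# Hypothetical labels: 1,127 subscribers out of 10,000 clients (11.27%)
y_true = ['yes'] * 1127 + ['no'] * 8873

# A "classifier" that always predicts the majority class
y_pred = ['no'] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: share of real subscribers that were found
true_positives = sum(t == 'yes' and p == 'yes' for t, p in zip(y_true, y_pred))
recall = true_positives / y_true.count('yes')

print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

The accuracy looks high, yet the recall is zero: this baseline never identifies a single subscriber, which is exactly the group the bank cares about.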

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that generates synthetic samples for the minority class. It helps to reduce the overfitting that random oversampling can cause by duplicating existing samples. It works in feature space, creating new instances by interpolating between positive instances that lie close together.
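
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration of how one synthetic point is formed, not the imbalanced-learn implementation; the points, the use of only the single nearest neighbour, and the helper name `make_synthetic_point` are assumptions made for this example.

```python
import numpy as np

def make_synthetic_point(minority, index, rng):
    """Create one synthetic sample between minority[index] and its nearest minority neighbour."""
    base = minority[index]
    # Euclidean distance from the base point to every minority point
    distances = np.linalg.norm(minority - base, axis=1)
    distances[index] = np.inf  # exclude the point itself
    neighbour = minority[np.argmin(distances)]
    # Random position on the segment between base and neighbour
    lam = rng.uniform(0.0, 1.0)
    return base + lam * (neighbour - base)

# Three hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 9.0]])
rng = np.random.default_rng(0)

new_point = make_synthetic_point(minority, 0, rng)
print(new_point)
```

The new point always lies on the line segment between the two existing minority points, which is why SMOTE adds variety instead of exact duplicates. The real algorithm picks the neighbour at random from the k nearest minority neighbours instead of always using the closest one.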

 We define a SMOTE instance with default parameters that will balance the minority class and then fit and apply it in one step to create a transformed version of our dataset.

In [None]:
from imblearn.over_sampling import SMOTE

# summarize class distribution
counter = Counter(y_train)
print(counter)

# transform the dataset
oversample = SMOTE()
X_train_smote, y_train = oversample.fit_resample(X_train_t, y_train)

# summarize the new class distribution
counter = Counter(y_train)
print(counter)

### 5. Recursive Feature Elimination – RFE

RFE is a popular feature selection algorithm. It is easy to configure and pretty effective at selecting features in a training dataset. There are two important configuration options when using RFE: 

    1. The number of features to select.
    
    2. The choice of algorithm used to help choose features.
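
Both options map directly onto the `RFE` constructor as `n_features_to_select` and `estimator`. The sketch below runs RFE on synthetic data from `make_classification`; the data and the choice of 3 features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 10 features, of which 3 are informative
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=3)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; larger numbers were dropped earlier
```

With a linear model such as logistic regression, "weakest" means the feature with the smallest absolute coefficient, so the inputs should be on comparable scales before RFE is applied.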

In [None]:
final_X_train = pd.DataFrame(data=X_train_smote, columns=X_train_columns)
final_y_train = pd.DataFrame(data=y_train, columns=['y'])

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe_model = RFE(LogisticRegression(solver='lbfgs', max_iter=1000), n_features_to_select=25)
# pass y as a 1-D Series to avoid a DataConversionWarning
rfe_model = rfe_model.fit(final_X_train, final_y_train['y'])

# feature selection
print(rfe_model.support_)
print(rfe_model.ranking_)

In [None]:
selected_columns = X_train_columns[rfe_model.support_]
print(selected_columns.tolist())

In [None]:
X_train_final = final_X_train[selected_columns.tolist()]
y_train_final = final_y_train['y']
X_test_final = X_test_t[selected_columns.tolist()]
y_test_final = y_test

X_test_final.head()

###  6. Logistic Regression Model Fitting

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logreg = LogisticRegression()
logreg.fit(X_train_final, y_train_final)

In [None]:
y_pred = logreg.predict(X_test_final)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test_final, y_test_final)))
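
Since accuracy alone is misleading on imbalanced data, it may help to also look at the confusion matrix, precision, and recall for the 'yes' class. The sketch below uses small made-up label lists (`y_true_toy` and `y_pred_toy` are hypothetical); on the real data one would pass `y_test_final` and `y_pred` instead.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (10 clients, 2 real subscribers)
y_true_toy = ['no'] * 8 + ['yes'] * 2
y_pred_toy = ['no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no']

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=['no', 'yes'])

# Precision: of the predicted subscribers, how many really subscribed
precision = precision_score(y_true_toy, y_pred_toy, pos_label='yes')
# Recall: of the real subscribers, how many the model found
recall = recall_score(y_true_toy, y_pred_toy, pos_label='yes')

print(cm)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy example the accuracy is 80%, yet only half of the real subscribers are found, which is the kind of gap that accuracy hides.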
