# Problem statement

###### Given the bank customer data, predict whether the customer will subscribe to a term deposit (yes/no). This is a binary classification problem, and we will use logistic regression to make the prediction.

# Importing the libraries

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import preprocessing 

# Loading the dataset

In [None]:
bank = pd.read_csv("../input/bank-marketing-dataset/bank.csv")

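`read_csv` splits fields on commas by default, which is what the call above relies on. If the file were separated with semicolons instead (as in the original UCI release of this dataset), we would need to pass `sep=';'`.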
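
A quick way to check which separator a file needs is to compare the parsed shape with and without `sep`. The sketch below uses a made-up in-memory sample (the `sample` string is hypothetical, not the real file):

```python
import io
import pandas as pd

# Hypothetical two-row sample written with semicolons, standing in for a real file.
sample = "age;job;deposit\n59;admin.;yes\n41;technician;no\n"

# With the default comma separator the whole line is read as a single column.
wrong = pd.read_csv(io.StringIO(sample))
# Passing sep=";" splits the fields as intended.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)
print(right.shape)
```

A frame with a single column whose name contains the separator character is the usual sign that `sep` is wrong.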

In [None]:
bank.head()

We can peek at a few rows using the pandas `head` function. By default, `head` returns the first 5 rows.

# Data Insights

In [None]:
bank.shape

In [None]:
bank.columns

In [None]:
bank.info()

### Observations :- 

# Summary Statistics

In [None]:
bank.describe()

### Observations :- 

1. Comparing the 75th percentile with the maximum for columns such as balance, duration, campaign, pdays and previous shows a huge gap, so these columns probably contain outliers. We will check this further using data visualization techniques.
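
The gap between the 75th percentile and the maximum can also be turned into a number with the common 1.5 × IQR rule. This is only a sketch on made-up values (the `balance` series below is hypothetical), and the 1.5 multiplier is a convention rather than something this notebook prescribes:

```python
import pandas as pd

# Made-up balances for illustration; not taken from the dataset.
balance = pd.Series([120, 250, 300, 410, 520, 640, 700, 15000])

# Quartiles and interquartile range.
q1, q3 = balance.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = balance[(balance < lower) | (balance > upper)]

print(outliers.tolist())
```

These are the same fences a boxplot uses for its whiskers by default, so the flagged points correspond to the dots drawn beyond the whiskers.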


# Understanding the target variable

##### We need to predict whether the customer will subscribe to the term deposit. The target variable is the `deposit` column, which takes the values yes/no.

In [None]:
bank['deposit'].value_counts()

##### There are somewhat more no values (customers who did not subscribe to a term deposit) than yes values, but the two counts are close, so the classes are roughly balanced.
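
Class balance is easier to judge from proportions than from raw counts. The sketch below rebuilds a stand-in column from the class counts implied by the confusion matrix reported later in this notebook (5873 no, 5289 yes); treating those counts as the class totals is an assumption:

```python
import pandas as pd

# Class counts implied by the confusion matrix later in the notebook
# (4679 + 1194 actual "no", 1372 + 3917 actual "yes"); an assumption, not re-read from the file.
deposit = pd.Series(["no"] * 5873 + ["yes"] * 5289, name="deposit")

# normalize=True returns shares instead of counts.
shares = deposit.value_counts(normalize=True)
print(shares.round(3))
```

On the real data the equivalent call would be `bank['deposit'].value_counts(normalize=True)`.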

# Data visualization

Performing univariate analysis using boxplots. Boxplots are an intuitive way to spot outliers in the data.

In [None]:
sn.boxplot(bank['age'])

### Observations :- 

1. The age values range from about 20 to 70.
2. Most of the age values lie between 30 and 50 (25th to 75th percentile).
3. There are a few outlier values above 70, which is expected because average life expectancy is around 70 years.

In [None]:
sn.boxplot(bank['balance'])

### Observations :- 

1. The boxplot of balance looks very different from a typical boxplot. It shows a large number of outliers, mainly because each person maintains a very different balance.
2. Most people maintain a very low balance.
3. Only a few people (5-6) maintain a balance of more than 60,000.

In [None]:
sn.boxplot(bank['duration'])

### Observations :- 

1. There are outlier values in the duration column.
2. Most of the points lie between 0 and about 1000 (duration is a call length in seconds, so it cannot be negative).

In [None]:
sn.boxplot(bank['campaign'])

### Observations :- 

1. Most of the campaign values are less than 10.
2. There are a few outliers, which means a large number of contacts were made for these clients.

In [None]:
sn.boxplot(bank['pdays'])

### Observations :- 

1. The boxplot shows that most of the data points have the value -1, which means that most of the clients had not been contacted in a previous campaign.
2. pdays counts the days since the client was last contacted in a previous campaign, not the number of contacts. For some clients this gap is large, and for a few it exceeds 800 days.

In [None]:
sn.boxplot(bank['previous'])

### Observations :- 

1. The boxplot of previous is very similar to that of pdays, but there are fewer outlier values.
2. Most of the values are 0, which means that most clients were not contacted before this campaign.
3. One client was contacted more than 250 times before the campaign. All the other values are below 100, so this value stands far apart from the rest and may be a data entry error.

### Dist plots

Dist plots are used to check the distribution of the data, its peak (the value with the highest frequency), and any skewness.

In [None]:
sn.distplot(bank['age'])

In [None]:
sn.distplot(bank['balance'])

In [None]:
sn.distplot(bank['duration'])

In [None]:
sn.distplot(bank['campaign'])

In [None]:
sn.distplot(bank['pdays'])

In [None]:
sn.distplot(bank['previous'])

### Observations :- 

##### We can see high positive skewness in all the above dist plots 
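
Skewness read off a plot can be backed up with a number. The sketch below uses the pandas `Series.skew` method on two made-up samples (both series are hypothetical); a clearly positive value indicates a long right tail:

```python
import pandas as pd

# Made-up right-skewed sample: many small values and one large one.
campaign = pd.Series([1, 1, 1, 2, 2, 3, 4, 20])
# Made-up symmetric sample for comparison.
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

right_tail_skew = campaign.skew()
symmetric_skew = symmetric.skew()

print(right_tail_skew)   # positive for a long right tail
print(symmetric_skew)    # zero for a symmetric sample
```

On the real data, `bank[['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']].skew()` would give one value per column.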

## Heatmap

##### A heatmap is an effective way to check for missing values in the dataset and to see whether the features are correlated with each other.

In [None]:
sn.heatmap(bank.isnull())

### Observations :-  

1. We don't have any missing values in our dataset. If any were present, they would show up as cells of a different colour against the uniform background.

We can check missing values by using isna() method as well.

In [None]:
bank.isna().sum()

## Correlation

The Correlation matrix is an important data analysis metric that is computed to summarize data to understand the relationship (correlation) between various variables and make decisions accordingly.


##### Correlation only works on numerical variables, and we have a few categorical variables in our dataset. We need to convert them into numerical values using encoding techniques.

The categorical variables in our data are job, marital, education, default, housing, loan, contact, month and poutcome, plus the output variable deposit.

In [None]:
bank['job'].describe()

##### The job column has 12 unique values. We will use label encoding here, because one-hot encoding would add 12 more columns. Keep in mind that job is not ordinal, so the integer codes impose an arbitrary order on the categories.

In [None]:
label_encoder = preprocessing.LabelEncoder() 
 
bank['job']= label_encoder.fit_transform(bank['job']) 
  
bank['job'].unique() 

In [None]:
bank['marital'].describe()

##### The marital column has 3 unique values, so either one-hot encoding or label encoding could be used. The marital values are not ordinal, so we use one-hot encoding.

In [None]:
bank = pd.get_dummies(bank, columns=['marital'])
bank.head()

In [None]:
bank['education'].describe()

##### The education column has 3 unique values, so either one-hot encoding or label encoding could be used. The values are ordinal (primary, secondary, etc.), so we use label encoding. `LabelEncoder` numbers the classes alphabetically, which for primary, secondary and tertiary coincides with their natural order. One-hot encoding would also work, but keeping every dummy column creates a multicollinearity problem, because each dummy can be predicted exactly from the others. This holds for any categorical variable, not only ordinal ones.

In [None]:
bank['education']= label_encoder.fit_transform(bank['education']) 
  
bank['education'].unique() 
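
The two encodings used in this notebook can be compared side by side on a toy column. This is a sketch only: the `toy` frame is made up, and `drop_first=True` is shown as one common way to avoid perfectly collinear dummy columns, not something the cells above do:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame for illustration only.
toy = pd.DataFrame({"education": ["tertiary", "primary", "secondary", "primary"]})

# LabelEncoder sorts the class names, so the codes follow alphabetical order.
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["education"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# One-hot encoding; drop_first=True removes one redundant dummy column.
dummies = pd.get_dummies(toy, columns=["education"], drop_first=True)
print(list(dummies.columns))
```

With `drop_first=True` the dropped category becomes the baseline: a row with all remaining dummies equal to zero belongs to it.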

In [None]:
bank['default'].describe()

##### The default column has only 2 unique values, so either one-hot encoding or label encoding could be used. The values are not ordinal, so we use one-hot encoding.

In [None]:
bank = pd.get_dummies(bank, columns=['default'])
bank.head()

In [None]:
bank['housing'].describe()

##### The housing column has only 2 unique values, so either one-hot encoding or label encoding could be used. The values are not ordinal, so we use one-hot encoding.

In [None]:
bank = pd.get_dummies(bank, columns=['housing'])
bank.head()

##### The loan column also has only 2 unique values (yes/no), so, as with housing, we use one-hot encoding.

In [None]:
bank = pd.get_dummies(bank, columns=['loan'])
bank.head()

In [None]:
bank['contact'].describe()

##### The contact column has only 3 unique values, so either one-hot encoding or label encoding could be used. The values are not ordinal, so we use one-hot encoding.

In [None]:
bank = pd.get_dummies(bank, columns=['contact'])
bank.head()

In [None]:
bank['month'].describe()

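##### The month column has 12 unique values, so we will use label encoding. Months are ordinal, but note that `LabelEncoder` assigns codes in alphabetical order (apr = 0, aug = 1, ...), so the codes produced below do not follow calendar order.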

In [None]:
bank['month']= label_encoder.fit_transform(bank['month']) 
  
bank['month'].unique() 
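
If calendar order matters, an explicit mapping is one alternative to `LabelEncoder`. The sketch below assumes the column holds lower-case three-letter abbreviations; the `months` series and the `month_to_number` name are made up for illustration:

```python
import pandas as pd

# Toy column for illustration; assumes lower-case three-letter month labels.
months = pd.Series(["may", "jan", "dec", "apr"])

# Explicit mapping so that the integer codes follow calendar order.
month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
month_to_number = {name: i + 1 for i, name in enumerate(month_order)}

encoded = months.map(month_to_number)
print(encoded.tolist())
```

Any label missing from the mapping becomes `NaN` after `map`, which makes unexpected spellings easy to spot.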

In [None]:
bank['poutcome'].describe()

##### The poutcome column has only 4 unique values, so either one-hot encoding or label encoding could be used. The values in the poutcome column are not ordinal, so we use one-hot encoding.

In [None]:
bank = pd.get_dummies(bank, columns=['poutcome'])
bank.head()

In [None]:
bank['deposit'].describe()

In [None]:
bank['deposit'].value_counts()

##### The deposit column has only 2 unique values (yes/no). It is our outcome variable: we need to predict deposit, i.e. whether a client will subscribe to a term deposit or not.

##### The frequency count of no and yes values is almost same in our data. We can say our data is balanced. 

In [None]:
bank['deposit']= label_encoder.fit_transform(bank['deposit']) 
  
bank['deposit'].unique() 

In [None]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

In [None]:
bank.head()

In [None]:
plt.figure(figsize=(12,5))
sn.heatmap(bank.corr(),annot = True)

In [None]:
bank.corr()

###### We can now separate the target variable from the input variables.

In [None]:
Y = bank['deposit']

In [None]:
Y.head()

In [None]:
X = bank.drop('deposit',axis=1)

In [None]:
X.head()

##### Variable X has our input variables and Y has our output variable

# Fitting a Logistic Regression Model

In [None]:
classifier = LogisticRegression()
classifier.fit(X,Y)

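##### We will now predict Y for the same X values the model was trained on, so the results below describe the fit on the training data rather than performance on unseen data.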

In [None]:
y_pred = classifier.predict(X)

In [None]:
# predict() returns class labels (0/1), not probabilities, so name the column accordingly
y_pred_df = pd.DataFrame({'actual': Y,
                          'predicted': y_pred})

In [None]:
y_pred_df

In the above dataframe, we are comparing our actual vs predicted values
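
Because the predictions here are made on the rows the model was fitted on, they say little about unseen customers. A common alternative is to hold out a test set. The sketch below runs on synthetic data from `make_classification` as a stand-in for `X` and `Y`; the 25% test size, the stratified split and `max_iter=1000` are illustrative choices, not settings used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and Y so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the rows, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Fit on the training part only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The test accuracy is the figure to report as an estimate of performance on new data.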

# Checking model accuracy

There are multiple ways of checking the accuracy of a classification model. We will use the 2 methods below:

1. Confusion matrix report
2. ROC curve

### Confusion Matrix for the model accuracy

In [None]:
from sklearn.metrics import confusion_matrix
# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(Y, y_pred)
print(cm)

In [None]:
# Accuracy = correct predictions (diagonal of the confusion matrix) / all predictions, in percent
((4679+3917)/(4679+1194+1372+3917))*100

* ##### The overall accuracy of the model on the training data is 77%
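
The same figure can be computed from the matrix itself rather than from hand-typed numbers. The sketch below copies the four counts from the calculation above; the layout (rows = actual, columns = predicted, class order no then yes) is assumed from scikit-learn's convention:

```python
import numpy as np

# Counts copied from the accuracy calculation above; the layout
# (rows = actual, columns = predicted, classes ordered no, yes) is assumed.
cm = np.array([[4679, 1194],
               [1372, 3917]])

# Accuracy: correct predictions on the diagonal over all predictions.
accuracy = np.trace(cm) / cm.sum()
# Recall for the yes class: correctly predicted yes over all actual yes.
recall_yes = cm[1, 1] / cm[1].sum()
# Precision for the yes class: correctly predicted yes over all predicted yes.
precision_yes = cm[1, 1] / cm[:, 1].sum()

print(round(accuracy * 100, 2))
print(round(recall_yes, 3), round(precision_yes, 3))
```

`sklearn.metrics.accuracy_score(Y, y_pred)` returns the same accuracy directly from the label vectors.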

### Classification report 

In [None]:
from sklearn.metrics import classification_report
print(classification_report(Y,y_pred))

### ROC Curve

In [None]:
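from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Probability of the positive class (deposit = 1) for every row
y_prob = classifier.predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(Y, y_prob)

# Score the probabilities, not the hard labels, so that the reported
# area matches the curve being plotted
auc = roc_auc_score(Y, y_prob)

plt.plot(fpr, tpr, color='red', label='logit model (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.legend()

In [None]:
auc

The value printed above is the area under the plotted ROC curve on the training data. Scoring the hard labels in `y_pred` instead gives roughly 0.77, which is just the average of the true positive and true negative rates from the confusion matrix.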
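
`roc_auc_score` gives different answers depending on whether it receives hard class labels or predicted probabilities. With hard labels the ROC curve has a single interior point, and the area reduces to the mean of the true positive and true negative rates (balanced accuracy). The sketch below illustrates this on synthetic data; `X_demo`, `y_demo` and the model settings are stand-ins, not this notebook's data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = model.predict(X_demo)               # hard 0/1 predictions
scores = model.predict_proba(X_demo)[:, 1]   # probability of class 1

auc_from_labels = roc_auc_score(y_demo, labels)
auc_from_scores = roc_auc_score(y_demo, scores)
balanced_acc = balanced_accuracy_score(y_demo, labels)

print(auc_from_labels, balanced_acc)  # these two coincide
print(auc_from_scores)                # area under the full ROC curve
```

The probability-based value is the one that corresponds to a curve drawn with `roc_curve` on `predict_proba` output.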