# Assignment - 06 (Logistic Regression)

Output variable -> y
y -> whether the client has subscribed to a term deposit
Binomial ("yes" or "no")

Attribute information for the bank dataset

### Input variables:
###### Bank Client Data:
1 - age (numeric)

2 - job : type of job (categorical:
"admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)
   
7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

###### Related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

###### Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

### Output variable (desired target):
17 - y - has the client subscribed to a term deposit? (binary: "yes","no")

Missing attribute values: none


In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("/content/bank-full.csv (1).crdownload",sep = ";")
df.head()

#### EDA

In [None]:
# Dropping unwanted columns
df.drop(["marital","month","poutcome"],axis=1,inplace=True)

In [None]:
df.columns

In [None]:
# Datatypes and null values in columns
df.info()

In [None]:
df.head()

In [None]:
# unique values in columns
df['job'].unique()

In [None]:
from sklearn.preprocessing import LabelEncoder
LE= LabelEncoder()
df['job'] = LE.fit_transform(df['job'])
df['education'] = LE.fit_transform(df['education'])
df['default'] = LE.fit_transform(df['default'])
df['housing'] = LE.fit_transform(df['housing'])
df['loan'] = LE.fit_transform(df['loan'])
df['contact'] = LE.fit_transform(df['contact'])
df['y'] = LE.fit_transform(df['y'])
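
To make the effect of the encoding above concrete, here is a minimal sketch on a made-up column. The frame `toy` and its values are hypothetical (the labels are borrowed from the `education` attribute list). `LabelEncoder` assigns codes in sorted order of the labels, so the integers do not express any real ranking. `pd.get_dummies` is shown only as one common alternative, not as the method used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the labels come from the 'education' attribute list.
toy = pd.DataFrame({"education": ["tertiary", "primary", "unknown", "secondary", "primary"]})

# A dedicated encoder per column keeps its mapping available for later decoding.
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["education"])

# classes_ holds the labels in sorted order; each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Alternative (an assumption, not used above): one indicator column per category.
one_hot = pd.get_dummies(toy["education"], prefix="education")
print(one_hot.columns.tolist())
```

Keeping one encoder object per column, rather than reusing a single one, also allows `inverse_transform` to recover the original labels later.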

In [None]:
# Alternative to LabelEncoder for the binary columns: map "yes"/"no" to 1/0 by hand (kept for reference)
#df["default"]=df["default"].replace({'yes':1, 'no':0})
#df["housing"]=df["housing"].replace({'yes':1, 'no':0})
#df["loan"]=df["loan"].replace({'yes':1, 'no':0})
#df["y"]=df["y"].replace({'yes':1, 'no':0})

In [None]:
df.describe()

In [None]:
# rows and columns in dataset
df.shape

In [None]:
# To find the correlation
df.corr()

In [None]:
df[df.duplicated()]

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
plt.show()

#### Model Building

In [None]:
# Dividing our data into input and output variables
X = df.iloc[:,:-1]
Y = df.iloc[:,-1]
#print(X)
#print(Y)
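
The input/output split above keeps every row for fitting, so any later score is measured on data the model has already seen. A minimal sketch of a hold-out split follows. It uses synthetic data from `make_classification` as a stand-in for `X` and `Y`, and the names `X_demo`, `y_demo`, the 25% test size and the random seed are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and Y (assumption: 13 features, roughly 12% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=13, weights=[0.88, 0.12], random_state=0
)

# stratify keeps the share of positives similar in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
print(X_train.shape, X_test.shape)
```

With such a split, the model would be fitted on `X_train` and scored on `X_test` only.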

In [None]:
# Python provides pickle modules for Serialization and de-Serialization of python objects like lists, dictionaries, tuples, etc.
# Pickling is also called marshaling or flattening in other languages.
# Pickling is used to store python objects
from sklearn.linear_model import LogisticRegression
import pickle

In [None]:
X

In [None]:
#Logistic regression and fit the model
classifier = LogisticRegression()
classifier.fit(X,Y)
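
`LogisticRegression()` uses the `lbfgs` solver with `max_iter=100` by default, which may stop before converging when features sit on very different scales (for example `balance` versus `default`). Because warnings are suppressed at the top of the notebook, this would go unnoticed. The sketch below shows one possible remedy, standardizing the inputs inside a pipeline. It runs on synthetic data, and `X_demo`, `y_demo` and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded bank data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=0)
# Inflate one column to imitate a feature on a much larger scale than the rest.
X_demo[:, 0] *= 10_000

# The scaler is fitted together with the model, so both use the same statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)
print(round(model.score(X_demo, y_demo), 3))
```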

In [None]:
# Save the model to disk; the with-block closes the file afterwards
filename = "finalized_model.sav"
with open(filename, 'wb') as f:
    pickle.dump(classifier, f)

In [None]:
# Load the saved model back from disk and predict for the X dataset
with open(filename, 'rb') as f:
    classifier = pickle.load(f)
y_pred = classifier.predict(X)

In [None]:
# predict() returns class labels (0/1), not probabilities, so the column is named accordingly
y_pred_df = pd.DataFrame({'actual':Y, 'predicted':y_pred})

In [None]:
y_pred_df

#### Confusion Matrix for the model accuracy

In [None]:
# A confusion matrix is a table used in classification problems to show where the model made errors.
# The rows are the actual classes; the columns are the predicted classes.
# The off-diagonal cells are the wrong predictions.
from sklearn.metrics import confusion_matrix
# Compare actual and predicted labels; a separate name avoids shadowing the imported function
cm = confusion_matrix(Y,y_pred)
print(cm)

In [None]:
# Accuracy = correct predictions (the diagonal of the matrix) / all predictions
(39176+962)/(39176+746+4327+962)
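
The accuracy above can also be derived from the matrix itself, together with precision and recall for the "yes" class. The sketch below reuses the four counts from the cell above and assumes scikit-learn's layout `[[TN, FP], [FN, TP]]` for labels 0 and 1. The name `cm` and that layout are assumptions, since the printed matrix is not shown here.

```python
import numpy as np

# Counts copied from the accuracy cell above; layout assumed to be
# scikit-learn's [[TN, FP], [FN, TP]] for labels 0 ("no") and 1 ("yes").
cm = np.array([[39176, 746],
               [4327, 962]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted "yes" that are truly "yes"
recall = tp / (tp + fn)           # share of actual "yes" that the model finds

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under that layout, accuracy is high mainly because most clients did not subscribe, so recall for the "yes" class is worth reading alongside it.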

#### Classification Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(Y,y_pred))

#### ROC Curve

In [None]:
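from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# Predicted probability of the positive class ("yes")
y_prob = classifier.predict_proba(X)[:,1]
fpr,tpr,thresholds = roc_curve(Y, y_prob)
# AUC is computed from the same probabilities as the curve, not from hard 0/1 predictions
auc = roc_auc_score(Y, y_prob)

import matplotlib.pyplot as plt
plt.plot(fpr,tpr,color = 'red', label='logit model (area = %0.2f)' % auc)
# Diagonal reference line: a classifier that guesses at random
plt.plot([0,1],[0,1], 'k--')
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()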

In [None]:
auc
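
`roc_auc_score` expects scores or probabilities. If it is given hard 0/1 predictions, the curve collapses to a single threshold and the result generally differs from the area under the plotted curve. The sketch below compares both calls on synthetic data. `X_demo`, `y_demo` and `clf` are hypothetical stand-ins, not the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for the bank data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=13, weights=[0.88, 0.12], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Area under the full ROC curve, using the probability of class 1.
auc_from_scores = roc_auc_score(y_demo, clf.predict_proba(X_demo)[:, 1])
# Area under a curve with a single operating point (the 0.5 threshold).
auc_from_labels = roc_auc_score(y_demo, clf.predict(X_demo))

print(auc_from_scores, auc_from_labels)
```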

### By accuracy (about 88.8% on the data it was trained on), the above model is good at predicting whether the client has subscribed to a term deposit.