# Testing the model

Using your solution so far, test the model on new data.

The new data is located in ‘Bank-data-testing.csv’.

Good luck!

## Import the relevant libraries

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Apply a fix to the statsmodels library: older versions call stats.chisqprob, which newer SciPy releases no longer provide
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

## Load the data

Load the ‘Bank-data.csv’ dataset.

In [3]:
bank_data_all_raw = pd.read_csv('Bank-data.csv')
bank_data_all_raw.head()

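| | Unnamed: 0 | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.334 | 0.0 | 1.0 | 0.0 | 0.0 | 117.0 | no |
| 1 | 1 | 0.767 | 0.0 | 0.0 | 2.0 | 1.0 | 274.0 | yes |
| 2 | 2 | 4.858 | 0.0 | 1.0 | 0.0 | 0.0 | 167.0 | no |
| 3 | 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 686.0 | yes |
| 4 | 4 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 157.0 | no |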


In [4]:
# Work on a copy of the data so that the original data we loaded stays unchanged.
data = bank_data_all_raw.copy()
# Remove the index column that comes with the data
data = data.drop(['Unnamed: 0'], axis = 1)
# Use the map function to change 'yes' values to 1 and 'no' values to 0.
data['y'] = data['y'].map({'yes':1, 'no':0})
data.head()

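| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| 0 | 1.334 | 0.0 | 1.0 | 0.0 | 0.0 | 117.0 | 0 |
| 1 | 0.767 | 0.0 | 0.0 | 2.0 | 1.0 | 274.0 | 1 |
| 2 | 4.858 | 0.0 | 1.0 | 0.0 | 0.0 | 167.0 | 0 |
| 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 686.0 | 1 |
| 4 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 157.0 | 0 |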
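
One detail of `map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below uses a small made-up Series (the labels, including the stray `'unknown'`, are illustrative and not taken from the bank data) to show how such values could be counted after the mapping.

```python
import pandas as pd

# Illustrative labels only; 'unknown' is a hypothetical stray value
labels = pd.Series(['yes', 'no', 'yes', 'unknown'])

mapped = labels.map({'yes': 1, 'no': 0})

# Values that are missing from the dictionary are mapped to NaN
n_unmapped = int(mapped.isna().sum())

print(mapped.tolist())
print(n_unmapped)
```

Running a check like `data['y'].isna().sum()` after the mapping is one way to confirm that every row was either 'yes' or 'no'.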


### Declare the dependent and independent variables

Use 'duration' as the independent variable.

In [5]:
y1 = data['y']
x1 = data[['duration']]

### Simple Logistic Regression

Run the regression and graph the scatter plot.

In [10]:
x = sm.add_constant(x1)
reg_logit1 = sm.Logit(y1,x)
results_logit1 = reg_logit1.fit()
# Get the regression summary
results_logit1.summary()

Optimization terminated successfully.
         Current function value: 0.546118
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 518 |
| Model: | Logit | Df Residuals: | 516 |
| Method: | MLE | Df Model: | 1 |
| Date: | Tue, 18 Dec 2018 | Pseudo R-squ.: | 0.2121 |
| Time: | 22:34:41 | Log-Likelihood: | -282.89 |
| converged: | True | LL-Null: | -359.05 |
| | | LLR p-value: | 5.387e-35 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.7001 | 0.192 | -8.863 | 0.000 | -2.076 | -1.324 |
| duration | 0.0051 | 0.001 | 9.159 | 0.000 | 0.004 | 0.006 |
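
The coefficients in this summary are on the log-odds scale, so they are easier to read once converted. The sketch below is only an illustration: it plugs the rounded `const` and `duration` coefficients from the summary into the logistic function, so the results will differ slightly from what `results_logit1.predict` would return. The helper name `predicted_probability` is introduced here for the example.

```python
import numpy as np

# Rounded coefficients copied from the regression summary
const, b_duration = -1.7001, 0.0051

def predicted_probability(duration):
    """Logistic function applied to const + b_duration * duration."""
    log_odds = const + b_duration * duration
    return 1 / (1 + np.exp(-log_odds))

# Multiplicative change in the odds for one extra unit of duration
odds_ratio = np.exp(b_duration)

# Duration at which the log-odds are 0, i.e. the predicted probability is 0.5
threshold_duration = -const / b_duration

print(odds_ratio)
print(threshold_duration)
print(predicted_probability(threshold_duration))
```

With a positive coefficient, longer durations push the predicted probability of `y = 1` upward, and observations beyond `threshold_duration` would be classified as 1 under a 0.5 cut-off.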


## Expand the model

We may be omitting many causal factors in our simple logistic model, so we switch to a multivariate logistic regression model instead. Add the ‘interest_rate’, ‘march’, ‘credit’ and ‘previous’ estimators to the model, alongside ‘duration’, and run the regression again.

### Declare the independent variable(s)

In [11]:
estimators = ['interest_rate','march','credit','previous','duration']

x2 = data[estimators]
y2 = data['y']

In [12]:
# Note: x2 is passed as-is (without sm.add_constant), so this model is fit without an intercept
reg_logit2 = sm.Logit(y2,x2)
results_logit2 = reg_logit2.fit()
# Get the regression summary
results_logit2.summary2()

Optimization terminated successfully.
         Current function value: 0.336668
         Iterations 7


| | | | |
|---|---|---|---|
| Model: | Logit | Pseudo R-squared: | 0.514 |
| Dependent Variable: | y | AIC: | 358.7881 |
| Date: | 2018-12-18 22:34 | BIC: | 380.038 |
| No. Observations: | 518 | Log-Likelihood: | -174.39 |
| Df Model: | 4 | LL-Null: | -359.05 |
| Df Residuals: | 513 | LLR p-value: | 1.1845e-78 |
| Converged: | 1.0000 | Scale: | 1.0 |
| No. Iterations: | 7.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| interest_rate | -0.8030 | 0.0787 | -10.2013 | 0.0000 | -0.9573 | -0.6487 |
| march | -1.8387 | 0.3153 | -5.8307 | 0.0000 | -2.4567 | -1.2206 |
| credit | 2.3459 | 1.0712 | 2.1899 | 0.0285 | 0.2463 | 4.4455 |
| previous | 1.5262 | 0.4785 | 3.1899 | 0.0014 | 0.5885 | 2.4640 |
| duration | 0.0069 | 0.0007 | 10.3646 | 0.0000 | 0.0056 | 0.0082 |
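
The LLR p-value in this summary can be roughly reproduced by hand, which also shows what the `stats.chisqprob` fix at the top of the notebook is for: it aliases `stats.chi2.sf`, the upper-tail probability of the chi-square distribution. The sketch below uses the rounded log-likelihoods and the reported `Df Model` from the summary, so it should land close to, but not exactly on, the reported p-value.

```python
from scipy import stats

# Rounded values copied from the regression summary
ll_model = -174.39   # Log-Likelihood
ll_null = -359.05    # LL-Null
df_model = 4         # Df Model, as reported in the summary

# Likelihood-ratio statistic: twice the gain in log-likelihood over the null model
llr = 2 * (ll_model - ll_null)

# Upper-tail chi-square probability of observing a statistic at least this large
p_value = stats.chi2.sf(llr, df_model)

print(llr)
print(p_value)
```

A p-value this small indicates that the model with these estimators fits far better than the null model it is compared against.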


### Confusion Matrix

Find the confusion matrix of the model and estimate its accuracy. 

<i> For convenience we have already provided you with a function that finds the confusion matrix and the model accuracy.</i>

In [13]:
def confusion_matrix(data, actual_values, model):
    """Return the confusion matrix and the accuracy of a fitted Logit model.

    Parameters
    ----------
    data : data frame or array
        A data frame formatted in the same way as your input data
        (without the actual values), e.g. const, var1, var2, etc.
        Order is very important!
    actual_values : data frame or array
        The actual values from the test data. In the case of a logistic
        regression, it should be a single column with 0s and 1s.
    model : a LogitResults object
        The variable where you have the fitted model,
        e.g. results_log in this course.
    """
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bin edges
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: values between 0 and 0.5 are considered 0,
    # values between 0.5 and 1 are considered 1.
    # Rows correspond to the actual values, columns to the predicted values.
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
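
To see why a 2D histogram with bin edges `[0, 0.5, 1]` yields a confusion matrix, it can help to run the same steps on a handful of values. The labels and probabilities below are made up purely for illustration; only the `np.histogram2d` call mirrors the function above.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
actual = np.array([0, 0, 1, 1, 1, 0])
predicted = np.array([0.1, 0.7, 0.8, 0.3, 0.9, 0.2])

bins = np.array([0, 0.5, 1])

# Rows index the actual class, columns index the predicted class
cm = np.histogram2d(actual, predicted, bins=bins)[0]

# Correct predictions sit on the diagonal
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print(cm)
print(accuracy)
```

Here two actual 0s and two actual 1s are classified correctly, one actual 0 is predicted as 1 (0.7) and one actual 1 is predicted as 0 (0.3), so `cm` is `[[2, 1], [1, 2]]` and the accuracy is 4/6.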

## Test the model

Load the test data from the ‘Bank-data-testing.csv’ file provided. (Remember to convert the outcome variable ‘y’ into 1s and 0s, as was done for the training data.)

### Load new data 

In [14]:
bank_data_all_test = pd.read_csv('Bank-data-testing.csv')
bank_data_all_test.head()

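| | Unnamed: 0 | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.313 | 0.0 | 1.0 | 0.0 | 0.0 | 487.0 | no |
| 1 | 1 | 4.961 | 0.0 | 0.0 | 0.0 | 0.0 | 132.0 | no |
| 2 | 2 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 92.0 | no |
| 3 | 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1468.0 | yes |
| 4 | 4 | 4.963 | 0.0 | 0.0 | 0.0 | 0.0 | 36.0 | no |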


In [16]:
# Work on a copy of the data so that the original data we loaded stays unchanged.
data3 = bank_data_all_test.copy()
# Remove the index column that comes with the data
data3 = data3.drop(['Unnamed: 0'], axis = 1)
# Use the map function to change 'yes' values to 1 and 'no' values to 0.
data3['y'] = data3['y'].map({'yes':1, 'no':0})
data3.head()

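| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| 0 | 1.313 | 0.0 | 1.0 | 0.0 | 0.0 | 487.0 | 0 |
| 1 | 4.961 | 0.0 | 0.0 | 0.0 | 0.0 | 132.0 | 0 |
| 2 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 92.0 | 0 |
| 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1468.0 | 1 |
| 4 | 4.963 | 0.0 | 0.0 | 0.0 | 0.0 | 36.0 | 0 |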


### Declare the dependent and the independent variables

In [17]:
estimators = ['interest_rate','march','credit','previous','duration']

x3 = data3[estimators]
y3 = data3['y']

Determine the test confusion matrix and the test accuracy and compare them with the train confusion matrix and the train accuracy.

In [18]:
confusion_matrix(x3, y3, results_logit2)

(array([[93., 18.],
        [13., 98.]]), 0.8603603603603603)
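
The test confusion matrix can be unpacked into a few more figures than the single accuracy number. The sketch below assumes that rows are the actual classes and columns the predicted classes, which follows from the argument order in `np.histogram2d(actual_values, pred_values, ...)`. The names `sensitivity` and `specificity` are introduced here for the example.

```python
import numpy as np

# Test confusion matrix copied from the output above
cm = np.array([[93., 18.],
               [13., 98.]])

# Row 0 = actual 0, row 1 = actual 1; column 0 = predicted 0, column 1 = predicted 1
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all test observations classified correctly
sensitivity = tp / (tp + fn)      # share of actual 1s predicted as 1
specificity = tn / (tn + fp)      # share of actual 0s predicted as 0

print(accuracy, sensitivity, specificity)
```

The model classifies 191 of the 222 test observations correctly, which matches the reported accuracy of about 86%. For the train-side comparison, the same helper could be called on the training inputs, e.g. `confusion_matrix(x2, y2, results_logit2)`; that output is not shown in this notebook.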