# Testing the model

Using your solution so far, test the model on new data.

The new data is in the ‘Bank-data-testing.csv’ file.

Good luck!

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()


## Load the data

Load the ‘Bank_data.csv’ dataset.

In [14]:
raw_data = pd.read_csv('Bank-data-testing.csv')

In [36]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222 entries, 0 to 221
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   interest_rate  222 non-null    float64
 1   credit         222 non-null    float64
 2   march          222 non-null    float64
 3   may            222 non-null    float64
 4   previous       222 non-null    float64
 5   duration       222 non-null    float64
 6   y              222 non-null    int64  
dtypes: float64(6), int64(1)
memory usage: 12.3 KB


In [21]:
raw_data.drop(['Unnamed: 0'],axis = 1,inplace=True)
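
An 'Unnamed: 0' column usually appears when a CSV was saved together with its row index. If that is the case here (an assumption, since the file itself is not shown), the index can be consumed at load time instead of being dropped afterwards. The sketch below uses a small in-memory CSV as a stand-in for the real file:

```python
import io
import pandas as pd

# Stand-in for the real file (hypothetical rows): the first column has an
# empty header, which pandas names 'Unnamed: 0' by default.
csv_text = ",interest_rate,duration,y\n0,1.313,487.0,no\n1,4.961,132.0,no\n"

default_load = pd.read_csv(io.StringIO(csv_text))
indexed_load = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_load.columns.tolist())  # includes 'Unnamed: 0'
print(indexed_load.columns.tolist())  # only the data columns
```

With `index_col=0` the separate `drop` cell is no longer needed, so re-running the notebook from the top cannot fail on a column that has already been removed.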

In [37]:
raw_data.head()

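| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| 0 | 1.313 | 0.0 | 1.0 | 0.0 | 0.0 | 487.0 | 0 |
| 1 | 4.961 | 0.0 | 0.0 | 0.0 | 0.0 | 132.0 | 0 |
| 2 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 92.0 | 0 |
| 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1468.0 | 1 |
| 4 | 4.963 | 0.0 | 0.0 | 0.0 | 0.0 | 36.0 | 0 |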


In [30]:
raw_data['y'] = raw_data['y'].map({'yes':1,'no':0})
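
The mapping above assumes that 'y' still holds the strings 'yes' and 'no'. If the cell is run a second time, the column is already numeric and `map` turns every value into NaN. A minimal sketch of a guard against that, using a hypothetical helper named `encode_outcome`:

```python
import pandas as pd

def encode_outcome(series):
    """Map 'yes'/'no' to 1/0, leaving an already numeric column untouched."""
    if pd.api.types.is_numeric_dtype(series):
        # Already converted (e.g. the cell was re-run): keep the values.
        return series.astype(int)
    return series.map({'yes': 1, 'no': 0})

raw_labels = pd.Series(['yes', 'no', 'no'])       # hypothetical values
encoded_once = encode_outcome(raw_labels)
encoded_twice = encode_outcome(encoded_once)      # safe to repeat

print(encoded_twice.tolist())
```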

In [31]:
raw_data.describe()

| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| count | 222.0 | 222.0 | 222.0 | 222.0 | 222.0 | 222.0 | 222.0 |
| mean | 2.922095 | 0.031532 | 0.274775 | 0.346847 | 0.099099 | 398.86036 | 0.5 |
| std | 1.891766 | 0.175144 | 0.44741 | 0.75595 | 0.29947 | 410.565798 | 0.50113 |
| min | 0.639 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 |
| 25% | 1.04925 | 0.0 | 0.0 | 0.0 | 0.0 | 144.75 | 0.0 |
| 50% | 1.714 | 0.0 | 0.0 | 0.0 | 0.0 | 255.5 | 0.5 |
| 75% | 4.96 | 0.0 | 1.0 | 0.0 | 0.0 | 525.25 | 1.0 |
| max | 4.968 | 1.0 | 1.0 | 4.0 | 1.0 | 3643.0 | 1.0 |


### Declare the dependent and independent variables

Use 'duration' as the independent variable.

In [32]:
data = raw_data.copy()  # work on a copy so raw_data stays untouched

In [27]:
data.columns.values

array(['interest_rate', 'credit', 'march', 'may', 'previous', 'duration',
       'y'], dtype=object)

In [45]:
x1 = data[['interest_rate', 'duration']]

In [34]:
y = data['y']

In [35]:
y

0      0
1      0
2      0
3      1
4      0
      ..
217    1
218    1
219    0
220    0
221    1
Name: y, Length: 222, dtype: int64

### Simple Logistic Regression

Run the regression and graph the scatter plot.
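
The scatter plot asked for above is not drawn in the cells that follow. A minimal sketch of one way to draw it for a single predictor is shown here. The five points reuse the 'duration' and 'y' values from the `head()` output, while the intercept and slope of the curve are hypothetical placeholders, not fitted values:

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic_curve(x_values, intercept, slope):
    """Logistic function evaluated on x_values."""
    return 1.0 / (1.0 + np.exp(-(intercept + slope * x_values)))

# Durations and outcomes of the first five rows shown by head()
duration = np.array([487.0, 132.0, 92.0, 1468.0, 36.0])
outcome = np.array([0, 0, 0, 1, 0])

# Hypothetical coefficients, chosen only to make the curve visible
intercept, slope = -1.7, 0.005

grid = np.linspace(duration.min(), duration.max(), 200)
fig, ax = plt.subplots()
ax.scatter(duration, outcome, color='C0', label='observations')
ax.plot(grid, logistic_curve(grid, intercept, slope), color='C1', label='logistic curve')
ax.set_xlabel('duration')
ax.set_ylabel('y')
ax.legend()
```

In the notebook, the placeholder coefficients would be replaced by the parameters of a model fitted on 'duration' alone.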

In [38]:
import statsmodels.api as sm

In [46]:
x = sm.add_constant(x1)

In [47]:
x

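| | const | interest_rate | duration |
|---|---|---|---|
| 0 | 1.0 | 1.313 | 487.0 |
| 1 | 1.0 | 4.961 | 132.0 |
| 2 | 1.0 | 4.856 | 92.0 |
| 3 | 1.0 | 4.120 | 1468.0 |
| 4 | 1.0 | 4.963 | 36.0 |
| ... | ... | ... | ... |
| 217 | 1.0 | 4.963 | 458.0 |
| 218 | 1.0 | 1.264 | 397.0 |
| 219 | 1.0 | 1.281 | 34.0 |
| 220 | 1.0 | 0.739 | 233.0 |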


In [48]:
results = sm.Logit(y,x)

In [51]:
final = results.fit()

Optimization terminated successfully.
         Current function value: 0.380814
         Iterations 7


In [52]:
final.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 222 |
| Model: | Logit | Df Residuals: | 219 |
| Method: | MLE | Df Model: | 2 |
| Date: | Tue, 20 Apr 2021 | Pseudo R-squ.: | 0.4506 |
| Time: | 17:57:14 | Log-Likelihood: | -84.541 |
| converged: | True | LL-Null: | -153.88 |
| Covariance Type: | nonrobust | LLR p-value: | 7.707e-31 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0141 | 0.345 | 0.041 | 0.967 | -0.662 | 0.691 |
| interest_rate | -0.9047 | 0.138 | -6.545 | 0.000 | -1.176 | -0.634 |
| duration | 0.0071 | 0.001 | 6.495 | 0.000 | 0.005 | 0.009 |
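
One common way to read the coefficients in the summary is to exponentiate them into odds ratios. The sketch below copies the rounded coefficients from the summary, so the results inherit that rounding:

```python
import numpy as np

# Coefficients as printed in the summary (rounded to four decimals)
coefficients = {'const': 0.0141, 'interest_rate': -0.9047, 'duration': 0.0071}

# exp(coef) is the multiplicative change in the odds of y = 1
# for a one-unit increase in the variable, holding the others fixed
odds_ratios = {name: float(np.exp(value)) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.4f}")
```

An odds ratio below 1 (interest_rate) means the odds of y = 1 fall as the variable rises; a ratio above 1 (duration) means they rise.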


## Expand the model

Our simple logistic model may omit many causal factors, so we switch to a multivariate logistic regression model instead. Add the ‘interest_rate’, ‘march’, ‘credit’ and ‘previous’ estimators to the model and run the regression again.

### Declare the independent variable(s)

### Confusion Matrix

Find the confusion matrix of the model and estimate its accuracy. 

<i> For convenience we have already provided you with a function that finds the confusion matrix and the model accuracy.</i>

In [1]:
def confusion_matrix(data, actual_values, model):
    """Return the confusion matrix and the accuracy of a fitted Logit model.

    Parameters
    ----------
    data : data frame or array
        Inputs formatted in the same way as your input data (without the
        actual values), e.g. const, var1, var2, etc. Order is very important!
    actual_values : data frame or array
        The actual values from the test data. In the case of a logistic
        regression, it should be a single column with 0s and 1s.
    model : LogitResults object
        The variable holding the fitted model, e.g. results_log in this course.
    """
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: values between 0 and 0.5 are considered 0,
    # values between 0.5 and 1 are considered 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
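
To see how the binning in this function produces a confusion matrix, the following sketch applies the same `np.histogram2d` call to five made-up observations. Rows correspond to actual values and columns to predicted classes:

```python
import numpy as np

# Hypothetical actual outcomes and predicted probabilities
actual = np.array([0, 0, 1, 1, 1])
predicted = np.array([0.1, 0.7, 0.4, 0.8, 0.9])

# Same edges as in confusion_matrix: [0, 0.5) counts as 0, [0.5, 1] as 1
bins = np.array([0, 0.5, 1])
cm = np.histogram2d(actual, predicted, bins=bins)[0]
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print(cm)        # [[1. 1.]
                 #  [1. 2.]]
print(accuracy)  # 0.6
```

One true 0 and two true 1s fall on the diagonal, so 3 of the 5 predictions are correct.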

## Test the model

Load the test data from the ‘Bank-data-testing.csv’ file provided. (Remember to convert the outcome variable ‘y’ into Boolean values, 1 and 0.)

### Load new data 

### Declare the dependent and the independent variables

Determine the test confusion matrix and the test accuracy and compare them with the train confusion matrix and the train accuracy.