In [1]:
import numpy as np
import statsmodels.api as sm
from ISLP import load_data, confusion_table
from ISLP.models import (ModelSpec as MS, summarize, poly)
from sklearn.model_selection import train_test_split

from functools import partial
from sklearn.model_selection import (cross_validate, KFold, ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm

#### Q5
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

In [2]:
Default = load_data('Default')
print("The number of data points are {0}".format(Default.shape[0]))
Default.head()

The number of data points are 10000


Unnamed: 0,default,student,balance,income
0,No,No,729.526495,44361.625074
1,No,Yes,817.180407,12106.1347
2,No,No,1073.549164,31767.138947
3,No,No,529.250605,35704.493935
4,No,No,785.655883,38463.495879


(a) Fit a logistic regression model that uses income and balance to predict default.

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set.

ii. Fit a multiple logistic regression model using only the training observations.

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior  probability of  default for that individual, and classifying the individual to  the default category if the posterior probability is greater than 0.5.

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

(c) Repeat the process in (b) three times, using three different split of the observations into a training set and a validation set. Comment on the results obtained.

In [3]:
# Defining a function that performs steps ii to iv for different random states
# A multiple logistic regression model is fit using income, balance to predict default 

def validation_set(seed, predictors):
    # Splitting the data for different seed values 
    Default_train, Default_test = train_test_split(Default, test_size=5000, random_state=seed) 
    predictors = MS(predictors)

    
    # Fitting the multiple regression model 
    X_train = predictors.fit_transform(Default_train)
    y_train = Default_train.default == 'Yes' # Logistic regression predicts probability of one event
    X_test = predictors.fit_transform(Default_test)
    y_test = Default_test.default
    
    glm = sm.Logit(y_train,X_train) 
    results = glm.fit() 
    
    # Making predictions and finding validation error  
    probs = results.predict(X_test)
    labels = np.array(['No']*5000) 
    labels[probs>0.5] = 'Yes'
    error = (np.mean(labels != y_test))*100
    print("The validation set error for random seed {0} is {1}%".format(seed, round(error,3)))

for i in [42,3,15]:
    validation_set(i,['balance','income'])

Optimization terminated successfully.
         Current function value: 0.078493
         Iterations 10
The validation set error for random seed 42 is 3.64%
Optimization terminated successfully.
         Current function value: 0.079927
         Iterations 10
The validation set error for random seed 3 is 3.66%
Optimization terminated successfully.
         Current function value: 0.075339
         Iterations 10
The validation set error for random seed 15 is 3.94%


> The errors vary (slightly) with different random splits. This is expected from the validation set approach since the method has high variance.

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

In [4]:
for i in [42,3,15]:
    validation_set(i,['balance','income','student'])

Optimization terminated successfully.
         Current function value: 0.077900
         Iterations 10
The validation set error for random seed 42 is 3.66%
Optimization terminated successfully.
         Current function value: 0.079409
         Iterations 10
The validation set error for random seed 3 is 3.7%
Optimization terminated successfully.
         Current function value: 0.075292
         Iterations 10
The validation set error for random seed 15 is 3.94%


> Since the validation errors are very similar to the previous model, there doesn't seem to be an improvement using the student variable.

#### Q7
In Sections 5.1.2 and 5.1.3, we saw that the cross_validate() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just sm.GLM() and the predict() method of the fitted model within a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).

In [5]:
Weekly = load_data('Weekly')
print("The number of data points are {0}".format(Weekly.shape[0]))
Weekly.head()

The number of data points are 1089


Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,1990,0.816,1.572,-3.936,-0.229,-3.484,0.154976,-0.27,Down
1,1990,-0.27,0.816,1.572,-3.936,-0.229,0.148574,-2.576,Down
2,1990,-2.576,-0.27,0.816,1.572,-3.936,0.159837,3.514,Up
3,1990,3.514,-2.576,-0.27,0.816,1.572,0.16163,0.712,Up
4,1990,0.712,3.514,-2.576,-0.27,0.816,0.153728,1.178,Up


(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.

In [6]:
# Splitting the data
Weekly_train = Weekly.drop([0])
Weekly_test = Weekly.iloc[0].to_frame().T
y = Weekly.Direction == 'Up'
y_train = y.drop([0])
y_test = y[0]

# Fitting the model 
predictors = MS(['Lag1','Lag2'])
X_train = predictors.fit_transform(Weekly_train)
X_test = predictors.fit_transform(Weekly_test)

glm = sm.GLM(y_train, X_train, family=sm.families.Binomial())
results = glm.fit()
summarize(results)

Unnamed: 0,coef,std err,z,P>|z|
intercept,0.2232,0.061,3.63,0.0
Lag1,-0.0384,0.026,-1.466,0.143
Lag2,0.0608,0.027,2.291,0.022


(b) Use the model from (a) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P ( Direction = "Up"| Lag1, Lag2) > 0.5. Was this observation correctly classified?

In [7]:
prob = results.predict(X_test.values.astype(float))
prob

array([0.57139232])

(c) Use the model from (a) to predict the direction of the first observation. You can do this by predicting that the
first observation will go up if P ( Direction = "Up"| Lag1, Lag2) > 0.5. Was this observation correctly classified?

In [8]:
prediction = 'Up' if prob>0.5 else 'Down'
print("correct" if prediction==y_test else "incorrect")

incorrect


(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.

ii. Compute the posterior probability of the market moving up for the ith observation.

iii. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.

iv. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

(e) Take the average of the n numbers obtained in (d) iv in order to obtain the LOOCV estimate for the test error. Comment on the results.

In [9]:
# Defining a function that performs LOOCV for given data 

def LOOCV():
    n = Weekly.shape[0]
    misclassification_count = 0
    y = Weekly.Direction == 'Up'
    for i in range(n):
        Weekly_train = Weekly.drop([i])
        Weekly_test = Weekly.iloc[i].to_frame().T
        y_train = y.drop([i])
        y_test = y[i]

        # Fitting the model 
        predictors = MS(['Lag1','Lag2'])
        X_train = predictors.fit_transform(Weekly_train)
        X_test = predictors.fit_transform(Weekly_test)

        glm = sm.GLM(y_train, X_train, family=sm.families.Binomial())
        results = glm.fit()
        
        # Making predictions and counting errors
        prob = results.predict(X_test.values.astype(float))
        prediction = True if prob>0.5 else False
        if prediction!=y_test:
            misclassification_count+=1
    print("{0} out of 1089 examples are misclassified".format(misclassification_count))
    print("and the error rate is {0}%".format(round(misclassification_count/n,2)*100))

LOOCV()

490 out of 1089 examples are misclassified
and the error rate is 45.0%


> We see that the validation error is 45%, which is only slightly better than flipping a coin. However, there is large positive correlation in the training examples. Hence this error is most likely an overestimate of the actual test error. 