# Chapter 5

## Question 7

This question implements leave-one-out cross-validation (LOOCV) by hand to estimate the test error of a logistic regression model on the `Weekly` data set.
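
For reference, the quantity computed in the parts below is the usual LOOCV estimate for a classifier: the average of the $n$ leave-one-out misclassification indicators. The notation here is the standard textbook one and is not taken from the notebook itself.

$$
\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i, \qquad \mathrm{Err}_i = I(y_i \neq \hat{y}_i)
$$

Here $\hat{y}_i$ is the prediction for observation $i$ from the model fitted on all observations except the $i$th, and $I(\cdot)$ is 1 when its argument is true and 0 otherwise.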

In [1]:
import sklearn.linear_model
import sklearn.metrics
import numpy as np
import statsmodels.api as sm

In [2]:
stocks = sm.datasets.get_rdataset("Weekly", "ISLR").data
stocks["Direction_Binary"] = stocks["Direction"] == "Up"
stocks.head()

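|   | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction | Direction_Binary |
|---|------|------|------|------|------|------|--------|-------|-----------|------------------|
| 0 | 1990 | 0.816 | 1.572 | -3.936 | -0.229 | -3.484 | 0.154976 | -0.27 | Down | False |
| 1 | 1990 | -0.27 | 0.816 | 1.572 | -3.936 | -0.229 | 0.148574 | -2.576 | Down | False |
| 2 | 1990 | -2.576 | -0.27 | 0.816 | 1.572 | -3.936 | 0.159837 | 3.514 | Up | True |
| 3 | 1990 | 3.514 | -2.576 | -0.27 | 0.816 | 1.572 | 0.16163 | 0.712 | Up | True |
| 4 | 1990 | 0.712 | 3.514 | -2.576 | -0.27 | 0.816 | 0.153728 | 1.178 | Up | True |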


### (a) Fit a logistic regression model that predicts `Direction` using `Lag1` and `Lag2`

In [3]:
# Use the full data set for both fitting and evaluation
X_train = stocks[["Lag1", "Lag2"]]
y_train = stocks["Direction_Binary"]
logistic_model = sklearn.linear_model.LogisticRegression(solver="lbfgs", random_state=10)
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_train)

# Training error: fraction of observations the model misclassifies
fraction_misclassified = sklearn.metrics.zero_one_loss(y_train, y_pred)
print(fraction_misclassified)


0.44536271808999084


### (b) Fit a logistic regression model that predicts `Direction` from `Lag1` and `Lag2`, trained on all but the first observation.

### (c) Use the model from (b) to predict the first observation.

In [4]:
# Separate the predictors and the response
X = stocks[["Lag1", "Lag2"]]
y = stocks["Direction_Binary"]


def leaveOneOut(index, *arrays):
    """
    For each pandas object in `arrays` (DataFrame or Series), return the row
    at position `index` followed by the remaining rows.
    """
    objects = []
    for array in arrays:
        row = array.iloc[index]
        objects.append(row)
        # Look up the label at this position before dropping, so the function
        # also works when the index is not the default 0..n-1 range
        remainder = array.drop(array.index[index])
        objects.append(remainder)
    return objects


# Hold out the first observation; train on the rest
X_test, X_train, y_test, y_train = leaveOneOut(0, X, y)

# Fit a model using only the training data
logistic_model = sklearn.linear_model.LogisticRegression(solver="lbfgs", random_state=10)
logistic_model.fit(X_train, y_train)

# Predict the held-out observation (as a one-row DataFrame, so the
# feature names match those seen during fitting)
y_pred = logistic_model.predict(X_test.to_frame().T)[0]

print(f"Actual value: {y_test}, predicted: {y_pred}")


Actual value: False, predicted: True


### (d) Loop over all rows, recording whether the ith observation is correctly predicted or not when the model is trained on everything except the ith row

In [5]:
# Despite the name, this list stores errors:
# 1 if the held-out observation is misclassified, 0 if it is predicted correctly
predictions = []
for i in range(len(X)):
    X_test, X_train, y_test, y_train = leaveOneOut(i, X, y)

    # Fit a model on every observation except the ith
    logistic_model = sklearn.linear_model.LogisticRegression(solver="lbfgs", random_state=10)
    logistic_model.fit(X_train, y_train)

    # Predict the held-out observation (as a one-row DataFrame, so the
    # feature names match those seen during fitting)
    y_pred = logistic_model.predict(X_test.to_frame().T)[0]

    predictions.append(0 if y_test == y_pred else 1)


In [6]:
print(np.mean(predictions))

0.44995408631772266
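
The manual loop above can also be expressed with scikit-learn's built-in `LeaveOneOut` splitter and `cross_val_score`. The following is a minimal sketch, not the notebook's own method: it uses hypothetical synthetic data (`X_demo`, `y_demo`) as a stand-in for the `Weekly` data so that it runs without a download, so its printed value will not match the figure above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical stand-in data (an assumption): two features, boolean response
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=60)) > 0

# One fold per observation, so each fold's accuracy is either 0.0 or 1.0
scores = cross_val_score(
    LogisticRegression(solver="lbfgs"),
    X_demo,
    y_demo,
    cv=LeaveOneOut(),
    scoring="accuracy",
)

# LOOCV error = 1 - mean accuracy over the n folds
loocv_error = 1 - scores.mean()
print(loocv_error)
```

With the real data, passing `X` and `y` in place of `X_demo` and `y_demo` should give the same kind of estimate as the loop, at the cost of less visibility into each individual fit.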


In [7]:
print(sum(y)/len(y))

0.5555555555555556


The LOOCV estimate of the test error is about 0.45, i.e. the model is wrong roughly 45% of the time. Since about 55.6% of the weeks are "Up", predicting "Up" every time would be wrong about 44.4% of the time, so the model does no better than this naive baseline. This suggests that a logistic regression on the previous two weeks' returns cannot predict the direction of the stock market very well.
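
The "always predict Up" baseline can be evaluated under the same leave-one-out scheme. The sketch below is an illustration rather than part of the original solution: it uses scikit-learn's `DummyClassifier` on hypothetical synthetic labels (`y_baseline_demo`, 50 "Up" and 40 "Down", which is the same 5/9 share of "Up" as printed above) so that it runs without downloading the data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical labels (an assumption): 50 True ("Up"), 40 False ("Down")
y_baseline_demo = np.array([True] * 50 + [False] * 40)
# The dummy classifier ignores the features, so placeholders are enough
X_baseline_demo = np.zeros((len(y_baseline_demo), 2))

# Always predicts the most frequent class in the training fold
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(
    baseline, X_baseline_demo, y_baseline_demo, cv=LeaveOneOut()
)

baseline_error = 1 - baseline_scores.mean()
print(round(baseline_error, 4))  # 0.4444
```

Because "Up" stays the majority class in every training fold, the baseline is wrong exactly on the 40 "Down" observations, giving an error of 40/90.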