# Investigating K-fold Cross Validation

# TOC
___
[Sect.1: Introduction](#Sect.1)

[Sect.2: Linear Regression Model](#Sect.2)

[Sect.3: K-fold Cross Validation](#Sect.3)

[Sect.4: Comparison with Leave One Out Validation](#Sect.4)

# Sect.1
# Introduction
[Back to top](#TOC)

# Sect.2
# Linear Regression Model
[Back to top](#TOC)

In this section we will:
- Import Census population data
- Set up functions that fit a linear regression model from training inputs and their expected target values

In [5]:
#Import
import pandas as pd
import numpy as np

In [2]:
# Import Census data from CSV
census = pd.read_csv("census.csv")
census

| Unnamed: 0 | SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780127 | 4785161 | ... | 0.002295 | -0.193196 | 0.381066 | 0.582002 | -0.467369 | 1.030015 | 0.826644 | 1.383282 | 1.724718 | 0.712594 |
| 1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
| 2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243286 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
| 3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
| 4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3188 | 50 | 4 | 8 | 56 | 37 | Wyoming | Sweetwater County | 43806 | 43806 | 43593 | ... | 1.072643 | 16.243199 | -5.339774 | -14.252889 | -14.248864 | 1.255221 | 16.243199 | -5.295460 | -14.075283 | -14.070195 |
| 3189 | 50 | 4 | 8 | 56 | 39 | Wyoming | Teton County | 21294 | 21294 | 21297 | ... | -1.589565 | 0.972695 | 19.525929 | 14.143021 | -0.564849 | 0.654527 | 2.408578 | 21.160658 | 16.308671 | 1.520747 |
| 3190 | 50 | 4 | 8 | 56 | 41 | Wyoming | Uinta County | 21118 | 21118 | 21102 | ... | -17.755986 | -4.916350 | -6.902954 | -14.215862 | -12.127022 | -18.136812 | -5.536861 | -7.521840 | -14.740608 | -12.606351 |
| 3191 | 50 | 4 | 8 | 56 | 43 | Wyoming | Washakie County | 8533 | 8533 | 8545 | ... | -11.637475 | -0.827815 | -2.013502 | -17.781491 | 1.682288 | -11.990126 | -1.182592 | -2.250385 | -18.020168 | 1.441961 |


### Least squares estimator

$$Y = X\beta + \epsilon $$

Least Squares Estimator (Best Linear Unbiased Estimator) of model parameters (Wasserman p217):

$$ \hat{\beta} = (X'X)^{-1}(X'Y)$$
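
As a brief sketch of where this estimator comes from (assuming the model is written in matrix form as $Y = X\beta + \epsilon$ and that $X'X$ is invertible), $\hat{\beta}$ is the value of $\beta$ that minimizes the residual sum of squares:

$$ RSS(\beta) = (Y - X\beta)'(Y - X\beta) $$

Setting the gradient with respect to $\beta$ to zero gives the normal equations, which rearrange to the estimator above:

$$ \frac{\partial RSS}{\partial \beta} = -2X'(Y - X\beta) = 0 \quad\Rightarrow\quad X'X\hat{\beta} = X'Y \quad\Rightarrow\quad \hat{\beta} = (X'X)^{-1}(X'Y) $$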

In [371]:
def linear_regression(x_train, y_train):
    '''
    input: x_train (arraylike, size: n x p)
           y_train (arraylike, size: n x 1)

    Uses the least squares estimator to fit the model coefficients (B_hat),
    which are returned.

    output: B_hat (arraylike, size: p x 1)
    '''
    x_prime = x_train.T
    # Debug output: show the training data and the transposed inputs
    print("X train: \n", x_train)
    print(" Y train: \n", y_train)
    print("x_prime: \n", x_prime)

    # Calculate the least squares estimates of the model coefficients:
    # B_hat = (X'X)^-1 X'Y
    first_term = np.linalg.inv( np.matmul(x_prime, x_train) )
    second_term = np.matmul(first_term , x_prime)
    B_hat = np.matmul( second_term ,  y_train)
    
    return B_hat
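
One way to sanity-check the closed-form estimator is to compare it with NumPy's built-in least squares solver, `np.linalg.lstsq`. The sketch below is illustrative only: `least_squares_closed_form` is a hypothetical stand-in for `linear_regression` without the debug printing, and the toy data and `true_beta` are assumptions chosen so that the exact coefficients are known.

```python
import numpy as np

def least_squares_closed_form(X, y):
    """Closed-form estimate (X'X)^-1 X'y (hypothetical helper for this check)."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Assumed toy data: a column of ones (intercept) plus two predictors
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 5.0, 5.0],
              [1.0, 6.0, 3.0],
              [1.0, 3.0, 11.0],
              [1.0, 2.0, 6.0]])
true_beta = np.array([5.0, 3.0, 2.0])
y = X @ true_beta  # noise-free targets, so the fit should recover true_beta

b_closed = least_squares_closed_form(X, y)
# lstsq returns (solution, residuals, rank, singular values); keep the solution
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form:", b_closed)
print("lstsq:      ", b_lstsq)
print("agree:", np.allclose(b_closed, b_lstsq))
```

Explicitly inverting `X'X` is fine for small, well-conditioned problems like this one; for larger or nearly collinear inputs, a solver such as `np.linalg.lstsq` is generally the more numerically stable choice.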

To test the linear regression model, we will use a linear dataset and try to find the original parameters of the generating model:

### Make a dataset to test the linear regression

In [416]:
def prepare_inputs(y_train, *x_trains):
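    '''
    Takes in a np.array y_train and one or more np.arrays x_trains.

    Checks that y_train and every x_train have the same length, joins the
    x_trains into a matrix with a leading column of ones (for the intercept),
    and transposes y_train into a column vector.

    output: input_vars (matrix, size: n x (p+1))
            y_train (matrix, size: n x 1)
    '''
    first_col = x_trains[0]
    # Every input column must have as many observations as the target
    assert len(y_train) == len(first_col)
    for col in x_trains:
        assert len(first_col) == len(col)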
    
    # Join x_train columns
    input_vars = np.array(x_trains)
    
    # Transpose
    input_vars = np.transpose(input_vars)
    
    # Add extra column of ones
    ones_col = np.transpose(np.matrix(np.ones(first_col.shape[0])))
    input_vars = np.append(ones_col,input_vars,  axis=1)
    
    # Transpose y train
    y_train = np.transpose(np.matrix(y_train))
    
    return input_vars, y_train
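
The same design matrix can also be built from plain `ndarray`s with `np.column_stack`, which avoids `np.matrix` (a class the NumPy documentation no longer recommends). This is a minimal sketch, not a replacement for `prepare_inputs`; the name `make_design_matrix` is hypothetical.

```python
import numpy as np

def make_design_matrix(*columns):
    """Stack 1-D predictor arrays into an (n, p+1) array with a leading column of ones."""
    columns = [np.asarray(c, dtype=float) for c in columns]
    n = len(columns[0])
    if any(len(c) != n for c in columns):
        raise ValueError("all predictor columns must have the same length")
    # The column of ones lets the first coefficient act as the intercept
    return np.column_stack([np.ones(n), *columns])

x = np.array([1, 5, 6, 3, 2])
z = np.array([2, 5, 3, 11, 6])
X = make_design_matrix(x, z)
print(X.shape)  # (5, 3)
print(X)
```

With plain arrays, the target can stay one-dimensional, and `X @ b_hat` then returns a one-dimensional vector of predictions.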

 

In [417]:
# Suppose we have a generating function which takes 2 input variables (x, z)
# to make a target variable y:
# y = 3x + 2z + 5 + E (noise)

x = np.array([1,5,6,3,2])
z = np.array([2,5,3,11,6])
y = 3*x + 2*z + 5 + np.array([np.random.normal() for _ in range(len(x))])

input_vars, y_train = prepare_inputs(y, x, z)

input_vars, y_train

(matrix([[ 1.,  1.,  2.],
         [ 1.,  5.,  5.],
         [ 1.,  6.,  3.],
         [ 1.,  3., 11.],
         [ 1.,  2.,  6.]]),
 matrix([[11.90381961],
         [29.75337551],
         [28.36513501],
         [37.36594527],
         [23.6843491 ]]))

### Create the linear regression model

In [418]:
B_hat = linear_regression(input_vars, y_train)

print(f"Estimate of model parameters (B_hat):\n {B_hat}")

X train: 
 [[ 1.  1.  2.]
 [ 1.  5.  5.]
 [ 1.  6.  3.]
 [ 1.  3. 11.]
 [ 1.  2.  6.]]
 Y train: 
 [[11.90381961]
 [29.75337551]
 [28.36513501]
 [37.36594527]
 [23.6843491 ]]
x_prime: 
 [[ 1.  1.  1.  1.  1.]
 [ 1.  5.  6.  3.  2.]
 [ 2.  5.  3. 11.  6.]]
Estimate of model parameters (B_hat):
 [[4.73590944]
 [2.82767217]
 [2.1971352 ]]


### Predict using the linear regression model

In [419]:
def predict(model, x_pred):
    '''
    input: model (arraylike, size: p x 1)
            x_pred (arraylike, size m x p)
            
    Uses the least squares estimator model (b_hat) from the linear_regression function to 
    make a prediction for what the output set (y_pred) would be given a certain input set (x_pred)
    
    [y_pred] = [x_pred] * [b_hat] 
    asserts that the num. columns in x_pred = the rows of the model (p)
    
    output: y_pred (arraylike, size m x 1)
    '''
    print("Model shape: ", model.shape)
    print("Input shape: " , x_pred.shape)
    assert(model.shape[0] == x_pred.shape[1])
    y_pred = np.matmul(x_pred, model)
    return np.matrix(y_pred)

In [420]:
# Test what the model predicts for new inputs:
x_pred = np.array([1,2,3,6,4,7])
z_pred = np.array([5,6,7,8,9,10])
y_pred = 3*x_pred + 2*z_pred + 5 + np.array([np.random.normal() for _ in range(len(x_pred))])

# Note: input_vars and y_train are reused here to hold the test inputs and targets
input_vars, y_train = prepare_inputs(y_pred, x_pred, z_pred)

input_vars, y_train

(matrix([[ 1.,  1.,  5.],
         [ 1.,  2.,  6.],
         [ 1.,  3.,  7.],
         [ 1.,  6.,  8.],
         [ 1.,  4.,  9.],
         [ 1.,  7., 10.]]),
 matrix([[16.06803437],
         [23.08617605],
         [28.97497997],
         [40.34706207],
         [35.67398509],
         [45.87923343]]))

In [421]:
# Try to predict:
y_prediction = predict(B_hat , input_vars)
y_prediction_matrix = zip(y_prediction, y_train)

print(f"predicted\t actual \t\t diff")
MSE = 0
for y_p , y_t in y_prediction_matrix:
    error = y_p[0] -y_t[0]
    MSE +=  error**2
    print(f"{y_p[0]}\t{y_t[0]}\t{error}")

print("MSE is :", MSE/len(y_prediction))

Model shape:  (3, 1)
Input shape:  (6, 3)
predicted	 actual 		 diff
[[18.54925762]]	[[16.06803437]]	[[2.48122324]]
[[23.57406498]]	[[23.08617605]]	[[0.48788893]]
[[28.59887235]]	[[28.97497997]]	[[-0.37610761]]
[[39.27902406]]	[[40.34706207]]	[[-1.06803801]]
[[35.82081493]]	[[35.67398509]]	[[0.14682984]]
[[46.50096663]]	[[45.87923343]]	[[0.6217332]]
MSE is : [[1.34746295]]


### Calculating MSPE of prediction

In [424]:
def MSPE(y_predicted, y_train):
    '''
    Mean squared prediction error:

    input: y_predicted (arraylike, size: m x 1)
           y_train (arraylike, size: m x 1)

    Finds the mean of the squared differences between the predicted y values
    and the observed y_train values.

    output: errors (arraylike, size: m x 1)
            mean_error (arraylike, size: 1 x 1)
    '''
    print("prediction shape: ", y_predicted.shape)
    print("type:" , type( y_predicted))
    
    print("y_train shape: " , y_train.shape)
    print("type:" , type( y_train))

  
    assert(len(y_predicted) == len(y_train))
    
    errors = np.subtract(y_predicted , y_train)
    
    print("Y pred:\n", y_predicted)
    print("Y train:\n", y_train)
    
    # MSPE = sum((predicted - actual)**2) / n
    mean_error = np.matmul( errors.T, errors) / len(y_train)
    print("Mean squared errors: ", mean_error)
    return errors,mean_error[0]
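
For comparison, the same quantity can be computed with element-wise operations on plain arrays, which returns a single float rather than a 1 x 1 matrix. This is a minimal sketch; the function name `mspe` and the sample values are hypothetical.

```python
import numpy as np

def mspe(y_predicted, y_observed):
    """Mean squared prediction error: the mean of (predicted - observed)^2."""
    # Flatten so that (m,), (m, 1) and np.matrix inputs are all handled alike
    y_predicted = np.asarray(y_predicted, dtype=float).ravel()
    y_observed = np.asarray(y_observed, dtype=float).ravel()
    if y_predicted.shape != y_observed.shape:
        raise ValueError("predicted and observed values must have the same length")
    return float(np.mean((y_predicted - y_observed) ** 2))

# Assumed sample values: squared errors are 1, 0 and 4, so the mean is 5/3
y_hat = np.array([2.0, 4.0, 6.0])
y_obs = np.array([1.0, 4.0, 8.0])
print(mspe(y_hat, y_obs))
```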

In [434]:
# Find MSPE of the estimates:

MSPE_errors, error = MSPE(y_prediction, y_train )
print("Mean Squared Prediction Error: ", error[0,0])
MSPE_errors, error

prediction shape:  (6, 1)
type: <class 'numpy.matrix'>
y_train shape:  (6, 1)
type: <class 'numpy.matrix'>
Y pred:
 [[18.54925762]
 [23.57406498]
 [28.59887235]
 [39.27902406]
 [35.82081493]
 [46.50096663]]
Y train:
 [[16.06803437]
 [23.08617605]
 [28.97497997]
 [40.34706207]
 [35.67398509]
 [45.87923343]]
Mean squared errors:  [[1.34746295]]
Mean Squared Prediction Error:  1.3474629504361089


(matrix([[ 2.48122324],
         [ 0.48788893],
         [-0.37610761],
         [-1.06803801],
         [ 0.14682984],
         [ 0.6217332 ]]),
 matrix([[1.34746295]]))


# Sect.3
#  K-fold Cross Validation
[Back to top](#TOC)

# Sect.4
# Comparison with Leave One Out Validation
[Back to top](#TOC)