# Investigating K-fold Cross Validation

# TOC
___
[Sect.1: Introduction](#Sect.1)

[Sect.2: Linear Regression Model](#Sect.2)

[Sect.3: K-fold Cross Validation](#Sect.3)

[Sect.4: Comparison with Leave One Out Validation](#Sect.4)

# Sect.1
# Introduction
[Back to top](#TOC)

# Sect.2
# Linear Regression Model
[Back to top](#TOC)

In this section we will:
- Import Census population data
- Define a function that fits a linear regression model to training inputs and their expected target values

In [5]:
# Imports
import pandas as pd
import numpy as np

In [2]:
# Import Census data from CSV
census = pd.read_csv("census.csv")
census

| Unnamed: 0 | SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780127 | 4785161 | ... | 0.002295 | -0.193196 | 0.381066 | 0.582002 | -0.467369 | 1.030015 | 0.826644 | 1.383282 | 1.724718 | 0.712594 |
| 1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
| 2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243286 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
| 3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
| 4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3188 | 50 | 4 | 8 | 56 | 37 | Wyoming | Sweetwater County | 43806 | 43806 | 43593 | ... | 1.072643 | 16.243199 | -5.339774 | -14.252889 | -14.248864 | 1.255221 | 16.243199 | -5.295460 | -14.075283 | -14.070195 |
| 3189 | 50 | 4 | 8 | 56 | 39 | Wyoming | Teton County | 21294 | 21294 | 21297 | ... | -1.589565 | 0.972695 | 19.525929 | 14.143021 | -0.564849 | 0.654527 | 2.408578 | 21.160658 | 16.308671 | 1.520747 |
| 3190 | 50 | 4 | 8 | 56 | 41 | Wyoming | Uinta County | 21118 | 21118 | 21102 | ... | -17.755986 | -4.916350 | -6.902954 | -14.215862 | -12.127022 | -18.136812 | -5.536861 | -7.521840 | -14.740608 | -12.606351 |
| 3191 | 50 | 4 | 8 | 56 | 43 | Wyoming | Washakie County | 8533 | 8533 | 8545 | ... | -11.637475 | -0.827815 | -2.013502 | -17.781491 | 1.682288 | -11.990126 | -1.182592 | -2.250385 | -18.020168 | 1.441961 |


### Least squares estimator

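$$ Y = X\beta + \epsilon $$

Here $X$ is the design matrix, with a leading column of ones for the intercept. The least squares estimator of the model parameters, which is the best linear unbiased estimator under the Gauss-Markov assumptions (Wasserman p217):

$$ \hat{\beta} = (X'X)^{-1}X'Y $$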
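
The estimator follows from minimizing the residual sum of squares. As a brief sketch, writing the model in matrix form as $Y = X\beta + \epsilon$ and assuming $X'X$ is invertible:

$$
\begin{aligned}
RSS(\beta) &= (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta \\
\frac{\partial RSS}{\partial \beta} &= -2X'Y + 2X'X\beta = 0 \\
X'X\hat{\beta} &= X'Y \\
\hat{\beta} &= (X'X)^{-1}X'Y
\end{aligned}
$$

The third line is the set of normal equations. Solving them for $\hat{\beta}$ gives the closed form used in the code.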

In [134]:
def linear_regression(x_train, y_train):
    '''
    Fit a linear model by ordinary least squares.

    input: x_train (arraylike, size: n x p)
           y_train (arraylike, size: n x 1)

    Prepends a column of ones (intercept) to x_train and returns the
    least squares estimate b_hat = (X'X)^-1 X'Y.

    output: b_hat (arraylike, size: (p + 1) x 1, intercept first)
    '''
    # Debug output: show the raw training data
    print("X train: \n", x_train)
    print(" Y train: \n", y_train)

    # Prepend a column of ones so the model includes an intercept
    ones_col = np.transpose(np.matrix(np.ones(x_train.shape[0])))
    x_train = np.append(ones_col, x_train, axis=1)
    x_prime = np.transpose(x_train)

    # Debug output: show the design matrix and its transpose
    print("x train plus one: \n", x_train)
    print("x_prime: \n", x_prime)

    # Calculate the least squares estimates of the model coefficients:
    # b_hat = (X'X)^-1 X'Y
    first_term = np.linalg.inv(np.matmul(x_prime, x_train))
    second_term = np.matmul(first_term, x_prime)
    b_hat = np.matmul(second_term, y_train)

    return b_hat
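
Forming $(X'X)^{-1}$ explicitly can be numerically fragile when the columns of $X$ are nearly collinear. As a hedged alternative, the sketch below computes the same estimate with `np.linalg.lstsq`, which solves the least squares problem without an explicit inverse. The function name `linear_regression_lstsq` and the demo variables are hypothetical and not part of the notebook.

```python
import numpy as np

def linear_regression_lstsq(x_train, y_train):
    """Hypothetical variant: least squares fit without an explicit inverse."""
    # Convert inputs to plain float arrays; treat 1-D x as a single column
    x = np.asarray(x_train, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    y = np.asarray(y_train, dtype=float).reshape(-1, 1)

    # Design matrix with a leading column of ones for the intercept
    design = np.column_stack([np.ones(x.shape[0]), x])

    # lstsq returns (solution, residuals, rank, singular values)
    b_hat, _residuals, _rank, _sv = np.linalg.lstsq(design, y, rcond=None)
    return b_hat  # shape (p + 1, 1), intercept first

# Noise-free demo data generated from y = 16x + 12
x_demo = np.arange(10)
y_demo = 16 * x_demo + 12
b_demo = linear_regression_lstsq(x_demo, y_demo)
print(b_demo)
```

On this noise-free data the two coefficients should come out close to 12 (intercept) and 16 (slope).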

To test the linear regression function, we generate data from a known linear model and check that the fit recovers the generating parameters:

In [162]:
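# Draw 10 samples from a standard normal distribution
x = [np.random.normal() for _ in range(10)]

np.mean(x), np.std(x)  # not displayed: only the last expression in a cell is shown
x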

[-0.8984658488261471,
 -1.7341917756996592,
 -0.9697475110328132,
 0.3462384130918392,
 0.8714687012412616,
 0.1391733798984588,
 -0.1315822034739071,
 0.9869453279630414,
 2.3796306709701667,
 1.0974225400687614]

In [163]:
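# Eventual goal (currently disabled): a generating function which takes
# 2 input variables (x, z) to make a target variable y:
#   y = 3x + 2z + 5
# For now, test with a single input variable: y = 16x + 12 + noise.
# np.random.normal() with no size argument returns one scalar, so the
# same offset is added to every point.
x = np.arange(10)
y = 16*x + 12 + np.random.normal()

# Two-variable version, disabled:
#z = np.flip(np.arange(10))
#y = 3*x + 2*z + 5
#x,z,y

x,y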

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([ 11.41050212,  27.41050212,  43.41050212,  59.41050212,
         75.41050212,  91.41050212, 107.41050212, 123.41050212,
        139.41050212, 155.41050212]))
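
Every `y` value above ends in the same decimals because `np.random.normal()` with no `size` argument returns a single draw, so all ten points share one offset and still lie exactly on a line. A minimal sketch with independent noise per observation is given below. It assumes NumPy's `default_rng` generator with an arbitrarily chosen seed, and uses `np.polyfit` only as an independent check on the fit. The variable names are hypothetical.

```python
import numpy as np

# Seeded generator so the example is reproducible (seed chosen arbitrarily)
rng = np.random.default_rng(0)

x_noisy = np.arange(10)

# One independent standard-normal draw per observation
noise = rng.normal(size=x_noisy.shape)
y_noisy = 16 * x_noisy + 12 + noise

# Degree-1 polynomial fit returns coefficients highest power first
slope, intercept = np.polyfit(x_noisy, y_noisy, deg=1)
print(slope, intercept)
```

With per-point noise the estimates should land near, but not exactly on, the generating values of 16 and 12.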

In [164]:
# Multiple-variable input (disabled)
#input_vars = np.array([x,z])
#input_vars = np.transpose(input_vars)

# Single-variable input: column vector of shape (10, 1)
input_vars = np.transpose(np.matrix(x))

# Targets as a column vector of shape (10, 1)
y = np.transpose(np.matrix(y))

linear_regression(input_vars, y)

X train: 
 [[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
 Y train: 
 [[ 11.41050212]
 [ 27.41050212]
 [ 43.41050212]
 [ 59.41050212]
 [ 75.41050212]
 [ 91.41050212]
 [107.41050212]
 [123.41050212]
 [139.41050212]
 [155.41050212]]
x train plus one: 
 [[1. 0.]
 [1. 1.]
 [1. 2.]
 [1. 3.]
 [1. 4.]
 [1. 5.]
 [1. 6.]
 [1. 7.]
 [1. 8.]
 [1. 9.]]
x_prime: 
 [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]


matrix([[11.41050212],
        [16.        ]])

In [165]:
# Scratch test: append a column of ones to a random integer matrix.
# np.random.randint(0, 10, ...) replaces the deprecated random_integers(0, 9, ...)
my_data = np.random.randint(0, 10, 12).reshape(3, 4)
new_col = np.transpose(np.matrix(np.ones(my_data.shape[0])))
res = np.append(my_data, new_col, axis=1)

res


matrix([[8., 7., 8., 7., 1.],
        [4., 7., 4., 3., 1.],
        [2., 2., 8., 1., 1.]])

# Sect.3
# K-fold Cross Validation
[Back to top](#TOC)

# Sect.4
# Comparison with Leave One Out Validation
[Back to top](#TOC)