# k-fold cross validation

### Demonstrating k-fold cross validation using a linear regression estimator

In [1]:
import numpy as np
from linear_regressor import LinearRegressor
from lr_metrics import r_squared

from kfold import k_fold_cv

### Generate dataset

In [2]:
size = 1000

coefficients = [6.2,-1.4,2.1,-3,11,-8]
X = np.ones((size,len(coefficients)))
for i in range(1,len(coefficients)):
    X[:,i]=np.random.rand(size)
y = (X*coefficients).sum(axis=1) + 10*np.random.normal(size=size)

### Estimator

**Note: the estimator does not need to be fitted before it is passed to k-fold CV**

In [3]:
reg = LinearRegressor()

**Note: user can provide any estimator that implements the required methods, not just the estimators from this repository**

See `linear_regressor.py` for an example
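As an illustration of what "the correct methods" could look like, the sketch below defines a trivial estimator. It assumes that `k_fold_cv` needs only `fit(X, y)`, `predict(X)` and a default `score(X, y)`. That interface is an assumption and is not confirmed by this notebook, so check the repository's example estimator for the actual requirements. `MeanRegressor` is a hypothetical name.

```python
import numpy as np

class MeanRegressor:
    """Hypothetical baseline estimator: always predicts the training mean."""

    def fit(self, X, y, **kwargs):
        # Assumed signature; extra keyword arguments are accepted and ignored
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # One prediction per row of X, all equal to the stored mean
        return np.full(len(X), self.mean_)

    def score(self, X, y):
        # Assumed default score: mean squared error on (X, y)
        return float(np.mean((np.asarray(y) - self.predict(X)) ** 2))

baseline = MeanRegressor().fit(np.zeros((4, 2)), np.array([1.0, 2.0, 3.0, 4.0]))
baseline_predictions = baseline.predict(np.zeros((2, 2)))
```

A baseline like this is also a useful reference point: a regressor that cannot beat the mean prediction is not learning anything from the features.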

## k-fold CV

In [4]:
k_fold_cv(reg,X,y,5)

101.04875046268883

By default, the score is computed with the estimator's own default scoring function.
<br>
For the linear regressor, this is the mean squared error (MSE).
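For readers who want to see the general procedure, here is a minimal sketch of k-fold cross-validation with an MSE score. It is not the repository's `k_fold_cv`: the function name `k_fold_mse_sketch`, the shuffling with a fixed seed, and the use of `np.linalg.lstsq` in place of the estimator's fit step are all assumptions made for illustration.

```python
import numpy as np

def k_fold_mse_sketch(X, y, k, seed=0):
    """Hypothetical stand-in for k-fold CV: average held-out MSE over k folds."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle the sample indices once
    folds = np.array_split(indices, k)     # k folds of near-equal size
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, which is held out for testing
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Ordinary least squares stands in for the estimator's fit step
        weights, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X[test_idx] @ weights
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))

# Same data-generating process as in this notebook, regenerated so the block runs alone
rng = np.random.default_rng(0)
coefficients = np.array([6.2, -1.4, 2.1, -3, 11, -8])
X_demo = np.ones((1000, len(coefficients)))
X_demo[:, 1:] = rng.random((1000, len(coefficients) - 1))
y_demo = X_demo @ coefficients + 10 * rng.normal(size=1000)
mse_demo = k_fold_mse_sketch(X_demo, y_demo, 5)
```

Because the noise added to `y` has a standard deviation of 10, a held-out MSE close to 100 is roughly the best a correctly specified linear model can achieve on this dataset.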

### CV input parameters

**User can change k** (with real data, the choice of k may noticeably affect the estimated score)

In [5]:
results = np.zeros((18,))
# k must be >1
for i in range(2,20):
    results[i-2]=(round(k_fold_cv(reg,X,y,i),2))
results

array([101.49, 101.08, 101.24, 101.05, 101.15, 101.1 , 101.18, 101.34,
       101.33, 101.16, 101.52, 101.32, 101.23, 101.2 , 101.11, 101.1 ,
       101.26, 101.16])
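To make the effect of k concrete, the sketch below shows how 1000 samples would be divided for a few values of k. It assumes the folds are made as equal in size as possible, as `np.array_split` does; how `k_fold_cv` actually partitions the data is not shown in this notebook.

```python
import numpy as np

n_samples = 1000
fold_summary = {}
for k in (2, 5, 10, 19):
    # Split the sample indices into k near-equal folds
    fold_sizes = [len(fold) for fold in np.array_split(np.arange(n_samples), k)]
    # Record the smallest and largest held-out fold for this k
    fold_summary[k] = (min(fold_sizes), max(fold_sizes))
    print(f"k={k}: test fold size {min(fold_sizes)}-{max(fold_sizes)}, "
          f"training size about {n_samples - max(fold_sizes)}")
```

A larger k means each model is trained on more data but evaluated on a smaller held-out fold, and the estimator is fitted k times, so the run time grows with k.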

**User can provide any scoring function**

In [6]:
k_fold_cv(reg,X,y,5,scoring=r_squared)

0.16457148025905666

Note: `r_squared` is a function from the `lr_metrics` module
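For reference, the coefficient of determination is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that textbook definition under the hypothetical name `r_squared_sketch`; the implementation in `lr_metrics` is not shown here and may differ in detail.

```python
import numpy as np

def r_squared_sketch(true_y, predictions):
    """Textbook R^2: 1 - SS_res / SS_tot (hypothetical helper, not lr_metrics)."""
    true_y = np.asarray(true_y, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((true_y - predictions) ** 2)      # unexplained variation
    ss_tot = np.sum((true_y - true_y.mean()) ** 2)    # total variation around the mean
    return 1.0 - ss_res / ss_tot

perfect = r_squared_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])     # exact predictions
mean_only = r_squared_sketch([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])   # always predicts the mean
```

A low value is expected for this dataset: the five uniform features contribute a variance of roughly 16.7 to `y`, while the noise contributes 100, so even the true coefficients would give an R² of only about 0.14 in expectation.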

**User can write any function and pass it as a scoring function** (it must take `true_y` and `predictions` as inputs)

In [7]:
def dummy_score(true_y,predictions):
    return (np.sign(true_y - predictions).sum())

In [8]:
k_fold_cv(reg,X,y,5,scoring=dummy_score)

-8.8

**Note: this function (the number of under-predictions minus the number of over-predictions) is not a meaningful metric and is only used for demonstration**
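A more meaningful custom scorer follows the same two-argument pattern. The sketch below defines a mean absolute error; the name `mean_absolute_error` is chosen here for illustration and is not part of this repository.

```python
import numpy as np

def mean_absolute_error(true_y, predictions):
    """Average absolute difference between targets and predictions."""
    true_y = np.asarray(true_y, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean(np.abs(true_y - predictions)))

# Quick check on three hand-made values: errors are 1, 2 and 0
mae_demo = mean_absolute_error([3.0, -1.0, 2.0], [2.0, 1.0, 2.0])
```

Following the calls shown in this notebook, it would be passed as `k_fold_cv(reg,X,y,5,scoring=mean_absolute_error)`.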

<br>

### Estimator input parameters

User can pass parameters to the estimator using a dictionary

Note: the dictionary keys must match the estimator's argument names exactly

**Let's put the CV to the test:**

In [9]:
batch = {'method':'batch','epochs':100,'learning_rate':0.01}
sgc = {'method':'stochastic','epochs':100,'learning_rate':0.01,'bin_size':1}
normal = {'method':'normal'}

Note: pass the dictionary itself as an argument (do not unpack it with `**dict`)
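One way such an interface could work is for the function to receive the dictionary whole and unpack it internally when calling the estimator. The sketch below illustrates that pattern only; the body of `k_fold_cv` is not shown in this notebook, and `run_with_params` and `toy_fit` are hypothetical names.

```python
def run_with_params(fit_function, params=None):
    """Hypothetical helper: takes the dictionary whole and unpacks it itself."""
    params = {} if params is None else params
    # The unpacking happens here, so the caller passes the plain dictionary
    return fit_function(**params)

def toy_fit(method='normal', epochs=1, learning_rate=0.1):
    """Toy stand-in for an estimator's fit step; just reports its arguments."""
    return f"{method}:{epochs}:{learning_rate}"

result = run_with_params(toy_fit, {'method': 'batch', 'epochs': 100, 'learning_rate': 0.01})
```

With this pattern, a misspelled key such as `'epoch'` raises a `TypeError` at the unpacking step, which is why the keys have to match the argument names.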

In [10]:
k_fold_cv(reg,X,y,5,batch)

101.04253840387358

In [11]:
k_fold_cv(reg,X,y,5,sgc)

101.7308768760407

In [12]:
k_fold_cv(reg,X,y,5,normal)

101.04875046268883

**With these parameters, stochastic gradient descent (`sgc`) scores only marginally worse than the other two methods (MSE of about 101.73 versus about 101.04)**