# Cross-validation and bias in hyperparameter selection

Many classification algorithms use the cross-validation error to choose hyperparameters.
If the optimal hyperparameters depend strongly on the dataset size, comparing cross-validation errors can lead to incorrect parameter choices when the fold count $k$ is small, because each model is then trained on only a $(k-1)/k$ fraction of the data.
On the other hand, the leave-one-out cross-validation scheme is not very useful for very stable classification methods, such as support vector machines, for which the classification does not change if we drop a single point.
This creates a trade-off. In the following, we explore this phenomenon in practice.
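
To make the size effect concrete, the sketch below computes how many points each model is trained on during $k$-fold cross-validation. The helper name `cv_training_size` and the dataset size of 3000 are hypothetical choices made for illustration only.

```python
def cv_training_size(n_samples: int, k: int) -> int:
    """Smallest training set used in k-fold cross-validation of n_samples points."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # The largest validation fold holds ceil(n_samples / k) points.
    largest_fold = -(-n_samples // k)
    return n_samples - largest_fold


n = 3000  # hypothetical dataset size
for k in (2, 3, 5, 10, n):  # k = n is leave-one-out
    m = cv_training_size(n, k)
    print(f"k={k:>4}: each model is trained on {m} points ({m / n:.1%} of the data)")
```

With $k=2$ the hyperparameters are tuned for a training set half the size of the final one, whereas leave-one-out tunes them for an almost full-sized training set.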

# Homework 

## 8.1 Bias in hyperparameter estimation* (<font color='red'>3p</font>)

Use a large dataset, such as [UCI: Diabetes 130-US hospitals for years 1999-2008 Data Set](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#), to test this issue:

* Bias detection (<font color='red'>1p</font>)
  * Split the dataset into training and test set.
  * Use an [SVM with RBF kernel](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) to predict the target value.
  * Use cross-validation to find the best kernel width $\gamma$ and regularisation parameter $C$.
  * Use the test set instead to find the best kernel width $\gamma$ and regularisation parameter $C$.
  * Compare the results. Is the optimal value of the kernel width $\gamma$ different? (<font color='red'>1p</font>)

* Measure the effect of trade-offs (<font color='red'>1p</font>)
  * Use the same setup but try different numbers of folds.
  * Tabulate the discrepancies in optimal hyperparameter values for different fold counts.
  * Tabulate the difference in the accuracy.
  * Interpret results. What seems to be the best cross-validation scheme?
  
* Measure the effect of training set size (<font color='red'>1p</font>)
  * Choose your favorite fold count $k\leq 10$ and vary the size of the training set.
  * Tabulate the discrepancies in optimal hyperparameter values for different training sets.
  * Tabulate the differences in the accuracy.
  * Interpret results. Does this effect diminish when the training set increases?
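
The three experiments above share one loop: for each training set size and fold count, pick the hyperparameters by cross-validation, pick them again by test accuracy, and record the discrepancy. What follows is a minimal sketch of that loop, not a reference solution. It assumes scikit-learn is available and uses a synthetic dataset from `make_classification` as a stand-in for the real data. The grid values, the training set sizes, and the helper names `make_model`, `select_by_cv`, and `select_by_test` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (an assumption of this sketch).
X, y = make_classification(n_samples=1600, n_features=10, n_informative=5,
                           random_state=0)
# Half of the data is held out so that the test error estimate is reliable.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Illustrative grid; a real experiment would use a finer logarithmic grid.
PARAM_GRID = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1, 1]}


def make_model():
    """Scale the features, then fit an RBF-kernel SVM."""
    return make_pipeline(StandardScaler(), SVC(kernel="rbf"))


def select_by_cv(X_tr, y_tr, k):
    """Pick hyperparameters by k-fold CV; return them with the test accuracy."""
    search = GridSearchCV(make_model(), PARAM_GRID, cv=k, scoring="accuracy")
    search.fit(X_tr, y_tr)  # refits the best model on all of X_tr
    return search.best_params_, search.score(X_test, y_test)


def select_by_test(X_tr, y_tr):
    """Pick the hyperparameters with the highest accuracy on the test set."""
    scored = []
    for params in ParameterGrid(PARAM_GRID):
        model = make_model().set_params(**params).fit(X_tr, y_tr)
        scored.append((model.score(X_test, y_test), params))
    accuracy, params = max(scored, key=lambda pair: pair[0])
    return params, accuracy


rows = []
for n_train in (200, 400, 800):
    # train_test_split shuffled the pool, so a prefix is a random subset.
    X_tr, y_tr = X_pool[:n_train], y_pool[:n_train]
    ref_params, ref_acc = select_by_test(X_tr, y_tr)
    for k in (2, 5, 10):
        cv_params, cv_acc = select_by_cv(X_tr, y_tr, k)
        rows.append((n_train, k, cv_params["svc__gamma"],
                     ref_params["svc__gamma"], ref_acc - cv_acc))

print("n_train   k  gamma_cv  gamma_test  accuracy_loss")
for n_train, k, g_cv, g_test, loss in rows:
    print(f"{n_train:7d} {k:3d} {g_cv:9.3f} {g_test:11.3f} {loss:14.3f}")
```

The `accuracy_loss` column is the test accuracy given up by trusting cross-validation instead of the test-set choice. It cannot be negative, because the test-set choice maximises test accuracy over the same grid.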


### Remarks
* Any dataset on which an SVM with RBF kernel works and which contains more than 3000 data points is suitable. The hold-out set must be large enough to get a reliable test error estimate.
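* The RBF kernel width is known to depend on the sample density. If the density is low, the width must be large, or otherwise the SVM collapses to a nearest neighbour classifier. The density increases if we add back a large fraction of the data points, and hence the optimal kernel width should decrease. Note that scikit-learn's `gamma` parameter is an inverse squared width (the kernel is $\exp(-\gamma\|x-x'\|^2)$), so in that parametrisation a narrower kernel corresponds to a larger `gamma`.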
* As cross-validation can be computationally quite demanding, you can cut corners by using a simple hold-out split for hyperparameter tuning.
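
The hold-out shortcut from the last remark can be expressed as a cross-validation scheme with a single split. The sketch below is one possible way to do it, assuming scikit-learn. The synthetic data, the grid values, and the 25% validation fraction are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

# Synthetic stand-in for the training set (an assumption of this sketch).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# A single stratified split: 75% for fitting, 25% for validation.
holdout = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},  # illustrative grid
    cv=holdout,
    scoring="accuracy",
)
search.fit(X, y)  # one fit per grid point, plus one refit on all of X
print(search.best_params_, f"validation accuracy = {search.best_score_:.3f}")
```

Each grid point is fitted once instead of $k$ times, at the cost of a noisier validation estimate.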