Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below.

Rename this problem sheet as follows:

    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}
    
for example
    
    ps8_blja_problem2

Submit your homework by Thursday, December 17, 2020, 9 am.

In [1]:
NAME = "Ahmad Modabber"
EMAIL = "amod@tu-chemnitz.de"
USERNAME = "amod"

---

# Introduction to Data Science
## Lab 8: Cross-validation methods provided by Scikit-Learn

### Part A: Generation of a toy data set

We want to experiment with the methods `sklearn` provides to us.

**Task (1 point)**: Generate a *toy* dataset `X`.
It should be a 1-dimensional `numpy.ndarray` containing only the numbers from 1 to 10.

In [2]:
import numpy as np
X = np.arange(1, 11)  # integers 1..10; the upper bound 11 is exclusive
X

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [3]:
assert type(X) == np.ndarray
assert X.shape == (10,)
assert X.mean() == 5.5
assert X.var() == 8.25

### Part B: Leave-One-Out Cross-Validation

`LeaveOneOut` is a simple cross-validation method.
Each training set consists of all samples except one, and the corresponding test set consists of the single remaining sample.
Thus, for `n` samples, we have `n` different training sets and `n` different test sets.
Since a model has to be fitted once per training set, i.e. `n` times, leave-one-out cross-validation (LOOCV) can be computationally expensive for large datasets.

You can import the class `LeaveOneOut` by

    from sklearn.model_selection import LeaveOneOut
    
The documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut).
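
As a rough illustration of why the number of samples drives the cost of LOOCV, the following sketch counts the splits for two made-up arrays. The names `small` and `large` are assumptions chosen for this example; `get_n_splits` is the scikit-learn method that reports the number of train/test pairs.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
small = np.arange(10)      # hypothetical data with 10 samples
large = np.arange(1000)    # hypothetical data with 1000 samples

# One train/test pair per sample, hence one model fit per sample
n_small = loo.get_n_splits(small)
n_large = loo.get_n_splits(large)

print(n_small, n_large)
```

The two counts equal the number of samples, so the number of model fits grows linearly with the size of the data set.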

The command

    S = LeaveOneOut().split(X)

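generates a leave-one-out cross-validation iterator `S` across the set/list/array `X`.
Each item of `S` is a pair of `numpy` arrays holding the *indices* (not the values) of the training and the test samples.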
An *iterator* is an object that can be iterated upon, meaning that you can traverse through all its values.
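
To make the notion of an iterator more concrete, here is a minimal sketch in plain Python, without scikit-learn. The names `values` and `it` are purely illustrative.

```python
# A plain-Python sketch of iterator behaviour
values = [10, 20, 30]   # hypothetical example data
it = iter(values)       # create an iterator over the list

first = next(it)        # fetch the first value
rest = list(it)         # consume all remaining values
again = list(it)        # nothing is left: the iterator is exhausted

print(first, rest, again)
```

An iterator can be traversed only once. The same holds for the result of `split`, so call `split` again if you need a second pass over the training and test sets.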

Once you've set up an iterator, you would typically access its training and test sets within a loop and do the data science stuff that you want.

**Task (2 points)**: Set up a leave-one-out cross-validation iterator for the data set `X`.
Afterwards, set up a loop which prints the training and test data set in each iteration.
Your output should look similar to this:

    Training set: [ 2  3  4  5  6  7  8  9 10]	 Test set: [1]
    Training set: [ 1  3  4  5  6  7  8  9 10]	 Test set: [2]
    Training set: [ 1  2  4  5  6  7  8  9 10]	 Test set: [3]
    Training set: [ 1  2  3  5  6  7  8  9 10]	 Test set: [4]
    Training set: [ 1  2  3  4  6  7  8  9 10]	 Test set: [5]
    Training set: [ 1  2  3  4  5  7  8  9 10]	 Test set: [6]
    Training set: [ 1  2  3  4  5  6  8  9 10]	 Test set: [7]
    Training set: [ 1  2  3  4  5  6  7  9 10]	 Test set: [8]
    Training set: [ 1  2  3  4  5  6  7  8 10]	 Test set: [9]
    Training set: [1 2 3 4 5 6 7 8 9]	         Test set: [10]

In [4]:
from sklearn.model_selection import LeaveOneOut
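# YOUR CODE HERE
# split yields one (train_index, test_index) pair of index arrays per sample
S = LeaveOneOut().split(X)

for train_index, test_index in S:
    # index into X to print the actual values instead of the indices
    print(f"Training set: {X[train_index]} \t Test set: {X[test_index]}")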

Training set: [ 2  3  4  5  6  7  8  9 10] 	 Test set: [1]
Training set: [ 1  3  4  5  6  7  8  9 10] 	 Test set: [2]
Training set: [ 1  2  4  5  6  7  8  9 10] 	 Test set: [3]
Training set: [ 1  2  3  5  6  7  8  9 10] 	 Test set: [4]
Training set: [ 1  2  3  4  6  7  8  9 10] 	 Test set: [5]
Training set: [ 1  2  3  4  5  7  8  9 10] 	 Test set: [6]
Training set: [ 1  2  3  4  5  6  8  9 10] 	 Test set: [7]
Training set: [ 1  2  3  4  5  6  7  9 10] 	 Test set: [8]
Training set: [ 1  2  3  4  5  6  7  8 10] 	 Test set: [9]
Training set: [1 2 3 4 5 6 7 8 9] 	 Test set: [10]


### Part C: K-Fold Cross-Validation

`KFold` divides all the samples into `k` groups of samples, called folds, of equal size (if possible).
For $k=n$, this is equivalent to the leave-one-out strategy.
The prediction function is learned using `k-1` folds, and the omitted fold is used for testing.

You can import the class `KFold` by

    from sklearn.model_selection import KFold

Check out the documentation of the function [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).
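
The two remarks above, folds of equal size only "if possible" and the equivalence with leave-one-out for $k=n$, can be checked with a short sketch. The array `data` is a hypothetical stand-in for the toy data; the other names are illustrative as well.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

data = np.arange(1, 11)   # hypothetical toy data with n = 10 samples

# k = 3 does not divide n = 10, so the folds cannot all have the same size
test_sizes = [len(test) for _, test in KFold(n_splits=3).split(data)]
print(test_sizes)

# k = n: every fold holds exactly one sample, as in leave-one-out
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(data)).split(data)]
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(data)]
print(kfold_tests == loo_tests)
```

Without shuffling, the first `n % k` folds receive one extra sample, so the test folds here have sizes 4, 3 and 3, and the comparison with `LeaveOneOut` prints `True`.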

**Task (2 points)**: As for LOOCV, use the data set `X` and create a test example that shows the behaviour of the function.
For `n_splits=2`, you should obtain the following output (the arrays shown are indices into `X`, not its values)

    Training set: [5 6 7 8 9]	 Test set: [0 1 2 3 4]
    Training set: [0 1 2 3 4]	 Test set: [5 6 7 8 9]

In [5]:
# YOUR CODE HERE
from sklearn.model_selection import KFold

# two consecutive folds, no shuffling
kfold = KFold(n_splits=2, shuffle=False)

for train_index, test_index in kfold.split(X):
    # print the index arrays themselves, as in the expected output
    print(f"Training set: {train_index} \t Test set:{test_index}")

Training set: [5 6 7 8 9] 	 Test set:[0 1 2 3 4]
Training set: [0 1 2 3 4] 	 Test set:[5 6 7 8 9]
