# Scikit-Learn Cross-Validation Splits

This notebook shows how to generate K-fold splits with `scikit-learn` so that machine learning models can be evaluated on out-of-sample data.

It uses an OpenML dataset with 10108 observations and 69 columns to predict who pays for internet access.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [scikit-learn](https://scikit-learn.org/stable/)
    * [sklearn.datasets](https://scikit-learn.org/stable/datasets.html)
    * [sklearn.model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)

In [1]:
from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import KFold

## Reading the data

The data comes from [OpenML](https://www.openml.org/d/981) and is loaded with `fetch_openml` from `sklearn.datasets`.

In [2]:
data = fetch_openml(name='kdd_internet_usage', as_frame=True)
df = data.frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype   
---  ------                                    --------------  -----   
 0   Actual_Time                               10108 non-null  category
 1   Age                                       10108 non-null  category
 2   Community_Building                        10108 non-null  category
 3   Community_Membership_Family               10108 non-null  category
 4   Community_Membership_Hobbies              10108 non-null  category
 5   Community_Membership_None                 10108 non-null  category
 6   Community_Membership_Other                10108 non-null  category
 7   Community_Membership_Political            10108 non-null  category
 8   Community_Membership_Professional         10108 non-null  category
 9   Community_Membership_Religious            10108 non-null  category
 10  Community_Membership_S

Split the data into target and features.

Drop the other `Who_Pays_for_Access_*` columns from the features: they describe the other payment options and would leak the target.

In [3]:
target = 'Who_Pays_for_Access_Work'
y = df[target]
X = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
       'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
       'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])
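
As a minimal sketch of the same idea, the leakage columns can also be selected by their shared prefix instead of being listed by hand. The toy DataFrame and the names `toy`, `prefix`, `leakage_cols`, `y_small`, and `X_small` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame that mimics the column naming of the real dataset.
toy = pd.DataFrame({
    "Age": [25, 40, 31],
    "Who_Pays_for_Access_Work": [1, 0, 0],
    "Who_Pays_for_Access_Self": [0, 1, 1],
    "Who_Pays_for_Access_School": [0, 0, 0],
})

target_col = "Who_Pays_for_Access_Work"
prefix = "Who_Pays_for_Access_"

# Every column sharing the prefix, except the target itself, is treated as leakage.
leakage_cols = [c for c in toy.columns if c.startswith(prefix) and c != target_col]

y_small = toy[target_col]
# Remove both the leakage columns and the target from the feature matrix.
X_small = toy.drop(columns=leakage_cols + [target_col])

print(leakage_cols)
print(list(X_small.columns))
```

Selecting by prefix avoids missing a column when the list is long, but it assumes every column with that prefix really is a sibling of the target.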

## Cross-validation splitting

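Scikit-learn's `KFold` splits the data into `n_splits` folds (default of 5). Each fold serves once as the test set while the remaining folds form the training set. By default the folds are consecutive blocks of rows; passing `shuffle=True` shuffles the rows before splitting, and `random_state` makes that shuffle reproducible.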
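
To make the role of `shuffle` concrete, here is a small sketch on a toy array of 10 rows. The names `X_toy`, `ordered`, and `shuffled` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature

# Without shuffling, each test fold is a consecutive block of rows.
ordered = [test.tolist() for _, test in KFold(n_splits=5).split(X_toy)]
print(ordered)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# With shuffling, rows are permuted before being divided into folds.
kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = [test.tolist() for _, test in kf_toy.split(X_toy)]
print(shuffled)
```

In both cases every row appears in exactly one test fold; only the assignment of rows to folds changes.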

In [4]:
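# 10 shuffled folds; random_state makes the split reproducible
kf = KFold(n_splits=10, random_state=1066, shuffle=True)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Test:", test_index)
    # kf.split yields positional indices, so select rows with .iloc
    X_train = X.iloc[train_index, :]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y.iloc[test_index]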

Train: [    0     1     2 ... 10105 10106 10107] Test: [    9    52    80 ... 10092 10102 10103]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   16    20    21 ... 10069 10079 10101]
Train: [    0     1     2 ... 10105 10106 10107] Test: [    4    12    22 ... 10066 10074 10076]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   13    25    34 ... 10073 10075 10100]
Train: [    0     1     2 ... 10105 10106 10107] Test: [    3     6     7 ... 10093 10095 10104]
Train: [    1     3     4 ... 10104 10105 10107] Test: [    0     2    18 ... 10045 10096 10106]
Train: [    0     1     2 ... 10105 10106 10107] Test: [    8    11    14 ... 10067 10084 10086]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   10    30    31 ... 10083 10085 10098]
Train: [    0     2     3 ... 10103 10104 10106] Test: [    1     5    19 ... 10097 10105 10107]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   15    32    39 ... 10081 10094 10099]
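
The loop above only builds the train and test subsets. As a hedged sketch of how such folds might be used for evaluation, the example below fits a model on each training fold and scores it on the matching test fold. It uses synthetic data from `make_classification` and a `LogisticRegression` model, both of which are assumptions chosen for illustration rather than anything used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 rows, 8 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1066)
scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    # Fit a fresh model on the training fold only.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Accuracy on the held-out fold is the out-of-sample estimate.
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

mean_accuracy = sum(scores) / len(scores)
print("Fold accuracies:", scores)
print(f"Mean accuracy: {mean_accuracy:.3f}")
```

Averaging the per-fold scores gives a single estimate of out-of-sample performance. The real dataset would need its categorical columns encoded before a model like this could be fitted.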
