[![Open in Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justmarkham/scikit-learn-tips/master?filepath=notebooks%2F31_kfold_shuffle.ipynb)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justmarkham/scikit-learn-tips/blob/master/notebooks/31_kfold_shuffle.ipynb)

# 🤖⚡ scikit-learn tip #31 ([video](https://www.youtube.com/watch?v=Ld8-_WP0G90&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=31))

If you use cross-validation and your samples are NOT in an arbitrary order, shuffling may be required to get meaningful results.

Passing an integer to cv (such as cv=5) does not shuffle the samples. To shuffle, pass a KFold or StratifiedKFold object created with shuffle=True!
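
As a rough illustration of what "no shuffling" means, the sketch below prints the test indices that KFold produces with and without shuffle=True. The toy array `X_toy` and the variable names are made up for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten toy samples whose row order is meaningful (0, 1, ..., 9)
X_toy = np.arange(10).reshape(-1, 1)

# default KFold: each test fold is a contiguous block of rows
plain_folds = [test.tolist() for _, test in KFold(5).split(X_toy)]
print(plain_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True: rows are assigned to folds at random,
# and random_state makes that assignment reproducible
shuffled_kf = KFold(5, shuffle=True, random_state=1)
shuffled_folds = [test.tolist() for _, test in shuffled_kf.split(X_toy)]
print(shuffled_folds)
```

If the rows are sorted (by date, by target, by class), those contiguous blocks mean each test fold can look very different from its training data.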

See example 👇

In [1]:
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
# set up the regression problem
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression()

In [3]:
# set up the classification problem
df = pd.read_csv('http://bit.ly/kaggletrain')
X_clf = df[['Pclass', 'Fare', 'SibSp']]
y_clf = df['Survived']
clf = LogisticRegression()

In [4]:
from sklearn.model_selection import KFold, StratifiedKFold

### Use KFold with regression problems:

In [5]:
# shuffle the samples before splitting into 5 folds (random_state makes it reproducible)
kf = KFold(5, shuffle=True, random_state=1)
cross_val_score(reg, X_reg, y_reg, cv=kf, scoring='r2')

array([0.43843604, 0.38982527, 0.52792606, 0.47359858, 0.57449343])
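
To see why ordering matters, here is a minimal sketch that deliberately sorts the diabetes data by its target and then compares unshuffled and shuffled cross-validation. The sorting step is an artificial worst case added for illustration (the dataset as loaded is not claimed to be ordered this way), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# artificial worst case (an assumption for this demo): rows sorted by target value
order = np.argsort(y)
X_sorted, y_sorted = X[order], y[order]

model = LinearRegression()

# cv=5 with a regressor uses KFold(5) without shuffling,
# so every test fold covers only a narrow slice of target values
scores_unshuffled = cross_val_score(model, X_sorted, y_sorted, cv=5, scoring='r2')

# same data, but the samples are shuffled before being split into folds
kf_demo = KFold(5, shuffle=True, random_state=1)
scores_shuffled = cross_val_score(model, X_sorted, y_sorted, cv=kf_demo, scoring='r2')

print(scores_unshuffled.mean(), scores_shuffled.mean())
```

With sorted rows, the unshuffled scores should come out far lower than the shuffled ones, because each model is evaluated on a range of target values that is badly represented by its test fold's own mean.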

### Use StratifiedKFold with classification problems:

In [6]:
# shuffle the samples within each class, keeping class proportions similar in every fold
skf = StratifiedKFold(5, shuffle=True, random_state=1)
cross_val_score(clf, X_clf, y_clf, cv=skf, scoring='accuracy')

array([0.65363128, 0.7247191 , 0.66853933, 0.68539326, 0.65730337])
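
The reason to prefer StratifiedKFold for classification can be sketched with toy labels that are sorted by class. The arrays and the helper `positive_rate_per_fold` below are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy labels sorted by class: 12 zeros followed by 8 ones (40% positive)
y_toy = np.array([0] * 12 + [1] * 8)
X_toy = np.zeros((20, 1))  # feature values do not affect how rows are split

def positive_rate_per_fold(splitter):
    """Return the share of class 1 in each test fold."""
    return [float(y_toy[test].mean()) for _, test in splitter.split(X_toy, y_toy)]

# unshuffled KFold: test folds are contiguous, so most contain a single class
kfold_rates = positive_rate_per_fold(KFold(4))
print(kfold_rates)

# StratifiedKFold: every test fold keeps the overall 40% positive rate
strat_rates = positive_rate_per_fold(StratifiedKFold(4, shuffle=True, random_state=1))
print(strat_rates)
```

Here the class counts divide evenly across the folds, so stratification reproduces the class balance exactly; with real data the per-fold proportions are only approximately equal.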

### Want more tips? [View all tips on GitHub](https://github.com/justmarkham/scikit-learn-tips) or [Sign up to receive 2 tips by email every week](https://scikit-learn.tips) 💌

© 2020 [Data School](https://www.dataschool.io). All rights reserved.