<a href="https://colab.research.google.com/github/tkeldenich/Scikit-Learn_Cross-Validation/blob/main/Scikit_Learn_Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cross Validation – THE Tutorial How To Use it – sklearn**

In this tutorial we will see how to simply use Cross Validation with [Scikit-Learn](https://scikit-learn.org/stable/index.html) and how to use it for prediction.

> Cross Validation is a way to check, on several different splits of the data, that our Machine Learning model is at its best.

**There are only 4 steps to perform a Cross Validation**:

- create 5 subgroups of our dataset
- train a model on 4 subgroups
- evaluate the model on the last subgroup
- repeat steps 2 and 3 so that all subgroups are evaluated
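
The four steps above can be sketched in plain Python. This is only an illustration of how the subgroups rotate between training and evaluation: the helper `k_fold_indices` is a hypothetical name, not a Scikit-Learn function, and it works on row indices instead of a real model.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds and yield (train, test) index lists."""
    # Fold sizes differ by at most one sample when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold is the test set once; the other k-1 folds form the training set.
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={train_idx} test={test_idx}")
```

Every row index ends up in exactly one test subgroup, which is what step 4 asks for.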

**Here, the Cross Validation will give us, at the end of the workflow, 5 different Machine Learning models.**

This multiplicity of models will allow us to have a diversity in the final predictions.

In effect, Cross Validation gives us the opinion of 5 experts (5 models) instead of only one.

*You can choose the number of subgroups created during Cross Validation, be it 2, 3, 5 or 40. The only constraint is to have enough data in each subgroup to get a robust model.*

Once we have all these opinions, we’ll have to decide which expert to follow. This is what we will see in this article.

Let’s start by loading our data! 🔥

## **Data**

This tutorial follows [our detailed article on learning Machine Learning.](https://inside-machinelearning.com/en/scikit-learn-project-start-ml/)

But of course, you can follow this tutorial without having followed the previous one. You only have to download the dataset [winequality-white.csv](https://github.com/tkeldenich/First_Project_with_Scikit-Learn_MachineLearning/blob/main/winequality-white.csv) from [this Github address.](https://github.com/tkeldenich/First_Project_with_Scikit-Learn_MachineLearning/blob/main/winequality-white.csv)

Our dataset ranks wines according to their quality. The objective is to predict the quality level of wines from their features (acidity, alcohol level, pH, etc).

Once the dataset is loaded in your working environment, open it with the Pandas library:


In [None]:
import pandas as pd

df = pd.read_csv("winequality-white.csv", sep=";")
df.head(3)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |


Cross Validation (CV) divides our dataset into subgroups.

To make sure that these subgroups have a fair distribution, we first shuffle the dataset with the `sample(frac=1)` method:

In [None]:
df = df.sample(frac=1).reset_index(drop=True)

`reset_index(drop=True)` resets the index of each row after the shuffling.
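
As an alternative to shuffling the DataFrame by hand, the splitter itself can shuffle the rows. When `cv` is an integer and the model is a classifier, Scikit-Learn uses `StratifiedKFold` without shuffling, so passing a `StratifiedKFold(shuffle=True)` object is one way to get shuffled subgroups. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the wine data; the names `X_demo`, `y_demo`, `splitter` and `demo_scores` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# shuffle=True mixes the rows before splitting; random_state makes the split reproducible.
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
demo_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=splitter
)
print(demo_scores.shape)
```

Fixing `random_state` also means the same subgroups are produced on every run, which makes scores easier to compare between experiments.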

Next, we prepare our features (`X`) and label (`y`) for the Cross Validation:

In [None]:
X = df.drop(columns='quality')
y = df['quality']

*Note: Here we don’t need train and test data. Indeed in Cross Validation, each subgroup is used once for testing and N-1 times for training. It is, therefore, not necessary to indicate train and test set because all subgroups go through these stages.*

## **Cross Validation Score**

Let’s load the best performing model from [our article for learning Machine Learning](https://inside-machinelearning.com/en/scikit-learn-project-start-ml/): Decision Tree.

In [None]:
from sklearn import tree

decisionTree = tree.DecisionTreeClassifier()

With this model, we obtained an accuracy of 60% in the previous article.

**Can we do better?**

We can check this directly with sklearn's `cross_val_score` function:

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(decisionTree, X, y, cv=10)

For this evaluation we’ve chosen to perform a Cross Validation on 10 subgroups by indicating `cv=10`.

This allows us to train 10 different Decision Tree models.

Let’s display the result of these 10 models:

In [None]:
scores



array([0.63265306, 0.57959184, 0.64693878, 0.6122449 , 0.65510204,
       0.62040816, 0.59183673, 0.63265306, 0.63599182, 0.58282209])

**Most of the models have an accuracy above 60%. This is a very good signal!**

Let’s calculate the mean to know the real potential of this Cross Validation:

In [None]:
scores.mean()

0.6190242477359041

An accuracy of 61.9%: that’s 1.9 percentage points more than the score obtained in the first tutorial.
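
Besides the mean, it can be useful to look at how much the accuracy varies from one subgroup to another. The sketch below reuses the 10 scores displayed above (rounded to 4 decimals) and only relies on Python's standard `statistics` module; the variable name `fold_scores` is made up for this example.

```python
from statistics import mean, stdev

# The 10 fold accuracies printed above, rounded to 4 decimals.
fold_scores = [0.6327, 0.5796, 0.6469, 0.6122, 0.6551,
               0.6204, 0.5918, 0.6327, 0.6360, 0.5828]

print(f"mean accuracy: {mean(fold_scores):.3f}")
print(f"standard deviation across folds: {stdev(fold_scores):.3f}")
```

Reading the spread alongside the mean helps put a difference of a point or two between runs into perspective.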

The problem is that `cross_val_score` does not return the trained models.

**This function only tests Cross Validation on our dataset and our model.**

Actually, `cross_val_score` enables Data Scientists and Machine Learning Engineers to know if it is worth implementing Cross Validation.



## **Training models with Cross Validation**

Now that we know Cross Validation will improve our model, we can get down to business!

First, I suggest dividing our dataset in two:

- Data for Cross Validation, which we will call `train_test`
- Data for testing the final models, which we will call `val` (our global test set)

To separate our dataset we use the `train_test_split` function (`val` will be composed of 10% of our dataset):

In [None]:
from sklearn.model_selection import train_test_split

X_train_test, X_val, y_train_test, y_val = train_test_split(X, y, test_size=0.10)

Then let’s initialize our classifier:

In [None]:
from sklearn import tree

decisionTree = tree.DecisionTreeClassifier()

And now we can implement the REAL Cross Validation.

For this, it’s simple, we use the `cross_validate` function.

This function returns a dictionary with several pieces of information:

- `fit_time` – training time for the N models
- `test_score` – accuracy of the N models
- `score_time` – scoring time for the N models
- `estimator` (when `return_estimator=True`) – the N trained models

We run Cross Validation with 10 subgroups:

In [None]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(decisionTree, X_train_test, y_train_test, cv=10, return_estimator=True)



We can now display the score for each of the 10 trained models:

In [None]:
cv_results['test_score']

array([0.6031746 , 0.59183673, 0.61904762, 0.59183673, 0.60997732,
       0.63038549, 0.59637188, 0.58276644, 0.61136364, 0.65227273])

And calculate the overall average:

In [None]:
cv_results['test_score'].mean()

0.608903318903319

We’ve gained about 0.9 percentage points on the test subgroups. It’s not much but it’s an acceptable score.

What about the global test data that the model has never seen?

To measure our Cross Validation, we will go through each of our models (stored in `cv_results['estimator']`) and calculate the score on `X_val` and `y_val`:

In [None]:
val_score = []
for i in range(len(cv_results['estimator'])):
  val_score.append(cv_results['estimator'][i].score(X_val, y_val))

Here is the final score of the Cross Validation:

In [None]:
sum(val_score) / len(val_score)

0.6181632653061225

We gain 1.8 percentage points of accuracy compared to our basic model! This is huge! 🎉

A 1.8-point improvement in accuracy may not seem like much to a novice in Machine Learning, but any expert knows it is a huge improvement!

**Indeed, Machine Learning competitions are sometimes played with only 0.001% difference in accuracy.**

## **Predicting with Cross Validation**

How do we use CV models for predictions?

There are different approaches depending on the practitioner:

- **Take the best of the N models and use it directly**
- **Take the best of the N models and re-train it on the whole data set**
- **Keep the N models and rely on the opinion of the majority**
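
The first two options can be sketched as follows. This is a minimal illustration on a synthetic dataset (a stand-in for `X_train_test` and `y_train_test`), and the names `demo_results`, `best_index`, `best_model` and `retrained_model` are made up for this example. `clone` from `sklearn.base` creates an unfitted copy of a model with the same hyperparameters.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train_test / y_train_test (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

demo_results = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5, return_estimator=True
)

# Option 1: keep the fitted model from the fold with the highest test score.
best_index = int(np.argmax(demo_results["test_score"]))
best_model = demo_results["estimator"][best_index]

# Option 2: copy its hyperparameters into a fresh model and fit it on all the data.
retrained_model = clone(best_model).fit(X_demo, y_demo)
print(best_index, retrained_model.score(X_demo, y_demo))
```

Note that the last line scores the retrained model on data it has already seen, so that number only confirms the fit worked; a held-out set is still needed to judge it.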

*I want to make it clear that there is no wrong way to do this. Each method is worthwhile and will be useful depending on your particular context. The best option is to test by yourself!*

After reading [our article to learn Machine Learning](https://inside-machinelearning.com/en/scikit-learn-project-start-ml/), you should be able to use the first two options.

**I propose to detail the 3rd option which is the most complex, especially since it is divided into two techniques.**

In the following parts, we’ll predict the result for the first wine of `X_val`.

### **Compute prediction for raw results**

Scikit-Learn offers two options to perform prediction:

- `predict()` – the raw results, in our case the quality of wine: 3, 4, 5, 6, 7, 8 or 9
- `predict_proba()` – the results as probabilities

In this part, we use the `predict()` option.

We predict, for each of the 10 models, the quality of the first wine of our `X_val` data:


In [None]:
result = []
for i in range(len(cv_results['estimator'])):
  # predict() returns a one-element array; take that element before converting to int
  result.append(int(cv_results['estimator'][i].predict(X_val.iloc[:1])[0]))

Each of these results is stored in a list, which can be displayed:

In [None]:
result

[5, 5, 5, 5, 5, 6, 6, 6, 5, 5]

The objective now is to take the prediction that has appeared the most often.

Here, we see that most of our models conclude that the wine is of quality 5, while three of them predicted 6.

We extract the most frequently predicted value…

In [None]:
max(set(result), key=result.count)

5

… which we can compare with the real value:

In [None]:
y_val.iloc[0]

5

Here the real value is well predicted! The majority was right!
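
The majority vote can also be written with `collections.Counter` from the standard library, which has the advantage of showing how many models voted for each quality. The sketch below reuses the 10 predictions displayed above; the names `votes`, `vote_counts`, `majority_label` and `majority_count` are made up for this example.

```python
from collections import Counter

# Predictions of the 10 models for the first wine, as displayed above.
votes = [5, 5, 5, 5, 5, 6, 6, 6, 5, 5]

vote_counts = Counter(votes)
# most_common(1) returns a list with the single (label, count) pair that has the most votes.
majority_label, majority_count = vote_counts.most_common(1)[0]
print(vote_counts)
print(majority_label, majority_count)
```

If two qualities receive the same number of votes, `most_common` keeps the one that was encountered first in the list, so it is worth deciding how ties should be handled in your own context.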

### **Compute prediction for probabilities**

Finally, I’d like to use the `predict_proba()` option which is the most complex of all.

For our Machine Learning model, 7 levels of wine quality are possible: 3, 4, 5, 6, 7, 8 or 9.

**With `predict_proba()` we get the probability that our wine is of each quality. For example: 20% that the wine is of quality 3, 8% for quality 4, 58% for quality 5, etc.**

With our Cross Validation, we’ll obtain 10 lists of probabilities.

**To calculate the prediction of the Cross Validation, we’ll sum all these probabilities together and divide the result by the number of subgroups, 10.**

Actually, we average all our probabilities to determine the quality with the highest overall probability.

First, we sum the probabilities together:

In [None]:
import numpy as np

# Start from the first model's probabilities, then add those of the other models
result_proba = cv_results['estimator'][0].predict_proba(X_val.iloc[:1])
for i in range(1, len(cv_results['estimator'])):
  result_proba += cv_results['estimator'][i].predict_proba(X_val.iloc[:1])

Then we calculate the average:

In [None]:
result_proba = result_proba/10
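
The sum followed by a division can also be written as a single mean over the models. The sketch below uses three hypothetical probability lists (illustrative values, not results from our models), with one column per quality in the order 3 to 9; the names `fold_probas` and `mean_proba` are made up for this example.

```python
import numpy as np

# Hypothetical probabilities from 3 models for one wine (illustrative values only).
# Columns follow the quality order [3, 4, 5, 6, 7, 8, 9].
fold_probas = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
])

# Averaging over axis 0 is the same as summing the rows and dividing by the number of models.
mean_proba = fold_probas.mean(axis=0)
print(mean_proba)
print(np.argmax(mean_proba))
```

The values are deliberately 0 or 1: a fully grown Decision Tree usually ends in pure leaves, so its `predict_proba()` output tends to look like this, and the average then behaves much like a vote count.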

We extract the index with the highest probability:

In [None]:
np.argmax(result_proba)

2

Here, the index is 2, which indicates a quality of 5.

Indeed, if we take our list of possible results `[3, 4, 5, 6, 7, 8, 9]`, index 0 corresponds to quality 3, index 1 to quality 4 and index 2 to quality 5:

In [None]:
wine_quality = [3, 4, 5, 6, 7, 8, 9]
wine_quality[np.argmax(result_proba)]

5
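
Instead of writing the list of qualities by hand, we can read it from the model: a fitted Scikit-Learn classifier stores its labels in the `classes_` attribute, in the same order as the columns of `predict_proba()`. The sketch below uses a tiny made-up dataset with only three qualities, so the names `X_tiny`, `y_tiny`, `model` and `predicted_quality` are assumptions for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset (assumption): one feature, three quality labels.
X_tiny = np.array([[8.0], [8.5], [10.0], [10.5], [12.0], [12.5]])
y_tiny = np.array([3, 3, 5, 5, 6, 6])

model = DecisionTreeClassifier(random_state=0).fit(X_tiny, y_tiny)

# classes_ holds the labels in the same order as the columns of predict_proba().
proba = model.predict_proba(np.array([[10.2]]))
predicted_quality = model.classes_[np.argmax(proba)]
print(model.classes_, predicted_quality)
```

This matters when a model has not seen every quality during training: its `predict_proba()` output then has fewer columns, and a hard-coded list would point to the wrong label.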

Here, the end result for the raw prediction and the probabilistic prediction remains the same, but keep in mind that this is not always the case.

## **Conclusion**

**In this article, we learned how to improve the accuracy of our Machine Learning model by 1.8 percentage points and how to use Cross Validation for prediction.**

Other methods exist to improve a Machine Learning model like:

- Normalizing the data
- Changing the hyperparameters of the models
- Data Augmentation
- Ensemble methods

**One last thing: Cross Validation is not to be taken lightly. It is a technique used in 2022 by the best experts to push Machine Learning models to their maximum performance.**

Cross Validation is even used for Deep Learning!

Soon, an article will be published on the subject.

**In the meantime, if you want to stay informed, don’t hesitate to subscribe to our newsletter** 😉