# Unit 1 Cross-Validation in Machine Learning

## Lesson Introduction

Hi there! Today, we're diving into an important concept in machine learning: cross-validation. Imagine you're baking a cake. You wouldn't just taste one slice, right? You'd taste slices from different parts to check that the whole cake is consistently good. That's what cross-validation does for machine learning models: it checks that a model works well on different sections of the data.

By the end of this lesson, you'll understand cross-validation, perform it using Scikit-Learn, and interpret the results. Let's get started!

## Introduction to Cross-Validation

### What is cross-validation?

Cross-validation evaluates a machine learning model by splitting the data in multiple ways. Instead of just one split into training and testing sets, we split it multiple times, each time in a different way, and train and test the model on these splits. This gives a more reliable performance estimate.

Think of it like trying different slices of your cake to ensure it's consistently good.

In cross-validation, a **fold** is one of the roughly equal-sized parts the dataset is divided into. For example, in 5-fold cross-validation, the dataset is divided into 5 folds. Each fold takes a turn being the validation set while the remaining 4 folds together form the training set. This process repeats 5 times, so every sample is used for validation exactly once.
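To make the idea of folds concrete, here is a minimal sketch using Scikit-Learn's `KFold` on ten toy samples. The array `X_toy` and the list `validation_folds` are illustrative names, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified only by their index (illustrative data)
X_toy = np.arange(10).reshape(-1, 1)

# Split into 5 folds without shuffling so the pattern is easy to follow
kf = KFold(n_splits=5)

validation_folds = []
for round_number, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    validation_folds.append(val_idx.tolist())
    print(f"Round {round_number}: train on {train_idx.tolist()}, validate on {val_idx.tolist()}")
```

In the first round, samples 0 and 1 form the validation fold and samples 2 through 9 form the training set; the validation fold then moves along until every sample has been validated once.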

## Example of Cross-Validation

Let's see how to do this in Python.

First, we need a real-world dataset. We'll use the "wine dataset" from Scikit-Learn.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the wine dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
```

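Here, `X` contains the features (input data), and `y` contains the target (output labels). Note that we scale the features, which helps models such as logistic regression converge. Decision trees don't depend on feature scale, but scaling keeps the setup consistent with the later exercises. For simplicity, the scaler here is fitted on the full dataset; strictly speaking, it should be fitted on the training data only so that no information from the test set leaks into training.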

Next, we'll split the data into training and testing sets. Even when using cross-validation, it's essential to hold back a portion of the data for final testing. Cross-validation will be performed only on the training data.

```python
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
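The scaler in this lesson is fitted on the full dataset before the split. If you want the scaler to see only training data, one possible approach is to split first and put the scaler and the model in a pipeline. The sketch below assumes that setup; `leak_free_model` and `pipeline_scores` are illustrative names, and this is an alternative arrangement rather than the lesson's own code.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the raw (unscaled) features and split first
X_raw, y = load_wine(return_X_y=True)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42
)

# The pipeline refits the scaler on the training part of every fold,
# so the validation part never influences the scaling statistics
leak_free_model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))

pipeline_scores = cross_val_score(leak_free_model, X_train_raw, y_train, cv=5)
print(f"Cross-validation scores: {pipeline_scores}")
print(f"Mean cross-validation score: {pipeline_scores.mean():.2f}")
```

For a decision tree the scores are unlikely to change much, since trees don't depend on feature scale, but the same pattern matters more for scale-sensitive models.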

And we'll use `DecisionTreeClassifier`, a simple, easy-to-interpret model that performs reasonably well on this dataset.

```python
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=42)
```

## Performing Cross-Validation

Now, let's perform cross-validation on the training data.

```python
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the training data
scores = cross_val_score(decision_tree, X_train, y_train, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {scores.mean():.2f}")
# Cross-validation scores: [0.93103448 0.93103448 0.89285714 0.92857143 0.89285714]
# Mean cross-validation score: 0.92
```

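The `cross_val_score` function splits the training data into 5 parts (using `cv=5`), trains a fresh copy of the model on 4 parts, and tests it on the remaining part. This is repeated 5 times, so each part is used for testing once. For classifiers, an integer `cv` produces stratified folds, which keep the class proportions roughly the same in every part.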

Each time, we get a score that shows the model's performance. Finally, we print these individual scores and their average.
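To see what `cross_val_score` does behind the scenes, here is a rough manual equivalent. It is a sketch that assumes the same data preparation as above and uses `StratifiedKFold` and `clone` to mirror the library's behavior for classifiers; `manual_scores` and `library_scores` are illustrative names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same preparation as in the lesson
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

decision_tree = DecisionTreeClassifier(random_state=42)

# Manual loop: one fresh, unfitted copy of the model per fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    fold_model = clone(decision_tree)
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(fold_model.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

# The one-line version from the lesson
library_scores = cross_val_score(decision_tree, X_train, y_train, cv=5)

print(f"Manual scores:  {manual_scores}")
print(f"Library scores: {library_scores}")
```

Both approaches should print the same five scores, which shows that `cross_val_score` is a convenient wrapper around this loop.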

### Scoring Parameter

The `scoring` parameter in `cross_val_score` allows you to specify the metric used to evaluate the model's performance. By default, it uses the estimator's own `score` method (accuracy for classifiers, R² for regressors). You can request other metrics by name. Note that plain `'f1'`, `'precision'`, and `'recall'` apply only to binary targets; for a multiclass problem such as the wine dataset, use an averaged variant like `'f1_weighted'`, `'precision_macro'`, or `'recall_macro'`.

For example, to use the weighted F1-score, you can modify the `cross_val_score` function call as follows:

```python
scores = cross_val_score(decision_tree, X_train, y_train, cv=5, scoring='f1_weighted')
```

This flexibility allows you to choose the metric that best aligns with your model's performance goals.
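If you want several metrics from the same folds, Scikit-Learn also provides `cross_validate`, which accepts a list of scorer names. The sketch below assumes the same wine data and decision tree as above; `results` is an illustrative name.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same preparation as in the lesson
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

decision_tree = DecisionTreeClassifier(random_state=42)

# Evaluate two metrics on the same 5 folds
results = cross_validate(
    decision_tree, X_train, y_train, cv=5, scoring=["accuracy", "f1_weighted"]
)

# Each metric is stored under a "test_<name>" key
print(f"Accuracy per fold:    {results['test_accuracy']}")
print(f"Weighted F1 per fold: {results['test_f1_weighted']}")
```

The returned dictionary also contains `fit_time` and `score_time` entries, which can be handy when comparing slower models.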

## Evaluating on the Test Set

After performing cross-validation, it's crucial to evaluate the model on the test set to get an unbiased estimate of its final performance.

```python
# Train the model on the entire training data
decision_tree.fit(X_train, y_train)

# Evaluate the model on the test data
test_score = decision_tree.score(X_test, y_test)
print(f"Test score: {test_score:.2f}")
# Test score: 0.94
```

Here, we fit the model on the entire training data and then evaluate it on the test data to see how well it generalizes to unseen data.

## Interpreting Cross-Validation Results

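Let's look at the output from the code above:

```
Cross-validation scores: [0.93103448 0.93103448 0.89285714 0.92857143 0.89285714]
Mean cross-validation score: 0.92
Test score: 0.94
```

The individual scores show the model's performance on different parts of the data. It's like tasting various slices of the cake.

The mean score gives an overall performance measure. It's like averaging the taste scores from different slices.

A mean cross-validation score of 0.92 means our Decision Tree model predicts the correct class about 92% of the time on average across the five validation folds. The test score of 0.94 is slightly higher, but the two numbers are close: the fold scores themselves range from about 0.89 to 0.93, and the test set holds only 36 samples, so a gap this small is within normal variation. What matters is that the test score agrees with the cross-validation estimate, which suggests the model generalizes to unseen data about as well as cross-validation predicted.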
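Alongside the mean, the spread of the fold scores tells you how stable the estimate is. The short sketch below reuses the five fold scores printed by the `cross_val_score` example earlier in this lesson; `fold_scores` is an illustrative name.

```python
import numpy as np

# Fold scores copied from the cross-validation output shown earlier in the lesson
fold_scores = np.array([0.93103448, 0.93103448, 0.89285714, 0.92857143, 0.89285714])

mean_score = fold_scores.mean()
std_score = fold_scores.std()

print(f"Mean: {mean_score:.2f}")
print(f"Standard deviation: {std_score:.2f}")
print(f"Lowest fold: {fold_scores.min():.2f}, highest fold: {fold_scores.max():.2f}")
```

A small standard deviation relative to the mean suggests the model behaves similarly across folds, while a large one would be a sign that performance depends heavily on which samples end up in the validation set.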

## Lesson Summary

Great job! Today, we've covered:

* **What is Cross-Validation?**
    * A method to check how well our machine learning model performs on different parts of the data.
    * Folds: the parts the data is divided into, each used once as the validation set.
* **How to Perform Cross-Validation Using Scikit-Learn**
    * We used Python to load a dataset, split it into training and testing sets, create a Decision Tree model, and perform cross-validation only on the training set.
    * We saw how the `scoring` parameter selects the metric used during cross-validation.
* **Evaluating the Model on the Test Set**
    * We trained the model on the entire training data and evaluated its performance on the test data to ensure it generalizes well.
* **Interpreting the Results**
    * We learned how to read individual scores and calculate the mean score to understand the model's performance.

Now it's your turn! You'll get hands-on experience with cross-validation in the upcoming practice. You'll use different models and datasets to see how cross-validation helps ensure your machine learning models are reliable and performant. Ready to give it a try? Let's get started!

## Using F1 Score for Cross-Validation

Space Explorer, let's tweak our cross-validation a bit.

Change the code to use the F1 score as the scoring metric instead of the default one. The `scoring` parameter in `cross_val_score` will help you achieve this. Note that because the wine dataset has three target classes, you should use `'f1_weighted'`.

Give it a try!

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Step 1: Getting the dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Create a Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

# Perform 5-fold cross-validation
scores = cross_val_score(log_reg, X, y, cv=5)

print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {scores.mean():.2f}")
```

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Step 1: Getting the dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Create a Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

# Perform 5-fold cross-validation using 'f1_weighted' as the scoring metric
scores = cross_val_score(log_reg, X, y, cv=5, scoring='f1_weighted')

print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {scores.mean():.2f}")
```

## Complete the Cross-Validation Process

Hey there, Stellar Navigator!

Let's spice things up a bit. Can you complete the missing parts of the code to perform cross-validation and print the results?

Best of luck!

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Getting the dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Create a Logistic Regression model
model = LogisticRegression(max_iter=2000)

# TODO: Perform 3-fold cross-validation and print the scores and mean score
```

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Getting the dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Create a Logistic Regression model
model = LogisticRegression(max_iter=2000)

# Perform 3-fold cross-validation and print the scores and mean score
scores = cross_val_score(model, X, y, cv=3)
print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {scores.mean():.2f}")
```

## Comparing Models Using Cross-Validation

Hey there, Space Explorer! It's time to put your cross-validation skills to the test. Your mission is to compare a logistic regression model and a decision tree model on the wine dataset using cross-validation. Follow the steps and fill in the code to complete the task. Ready to show your Space Wizardry? Let's go!

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# TODO: Load the wine dataset and scale features

# TODO: Create a Logistic Regression model and a Decision Tree model

# TODO: Perform 5-fold cross-validation for both models

# TODO: Print the average cross-validation scores for both models
```

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Load the wine dataset and scale features
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Create a Logistic Regression model and a Decision Tree model
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
decision_tree_model = DecisionTreeClassifier(random_state=42)

# Perform 5-fold cross-validation for both models
logistic_scores = cross_val_score(logistic_model, X, y, cv=5)
decision_tree_scores = cross_val_score(decision_tree_model, X, y, cv=5)

# Print the average cross-validation scores for both models
print(f"Logistic Regression Mean Cross-validation Score: {logistic_scores.mean():.2f}")
print(f"Decision Tree Mean Cross-validation Score: {decision_tree_scores.mean():.2f}")
```
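When comparing models, it can help to make the splits explicit so that both models are clearly judged on identical folds, and to report the spread next to the mean. The sketch below is one way to do that, assuming a shuffled `StratifiedKFold` with a fixed seed; `shared_cv`, `candidates`, and `results` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the wine dataset and scale features, as in the exercise
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# One shared splitter: the fixed seed means every model gets the same folds
shared_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=shared_cv)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")
```

If the gap between the two means is smaller than the fold-to-fold spread, the comparison is probably too close to call on this dataset alone.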

## Exploring Ensemble Models with Cross-Validation

Ready for the final challenge, Space Voyager? Let's explore the wine dataset by comparing cross-validation scores for different ensemble models. Follow the TODO instructions to complete the code for this comparison. Go for it!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# TODO: Load the wine dataset

# TODO: Split the dataset into training and testing sets
# X_train, X_test, y_train, y_test = ...

# TODO: Create two ensemble models with different configurations
# model1 = ...
# model2 = ...

# TODO: Perform 5-fold cross-validation for each model
# scores1 = ...
# scores2 = ...

# TODO: Print mean cross-validation scores for each model
```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler # Import StandardScaler for feature scaling

# Load the wine dataset
X, y = load_wine(return_X_y=True)

# Scale features (optional for tree-based ensembles, kept for consistency with earlier examples)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create two ensemble models with different configurations
# RandomForestClassifier with default parameters
model1 = RandomForestClassifier(random_state=42)
# GradientBoostingClassifier with n_estimators=100 and learning_rate=0.1
model2 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Perform 5-fold cross-validation for each model on the training data
scores1 = cross_val_score(model1, X_train, y_train, cv=5)
scores2 = cross_val_score(model2, X_train, y_train, cv=5)

# Print mean cross-validation scores for each model
print(f"RandomForestClassifier Mean Cross-validation Score: {scores1.mean():.2f}")
print(f"GradientBoostingClassifier Mean Cross-validation Score: {scores2.mean():.2f}")
```