# Introduction 3: Model Evaluation with Cross-Validation
This notebook explains how to evaluate models and select hyperparameters using cross-validation with GridSearchCV. You will learn how to build a pipeline and how to use GridSearchCV, which splits the data into training and validation folds for you, to find the best model settings.

In [None]:
# Import necessary libraries for loading data, building models, and evaluating performance
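# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2
from sklearn.datasets import load_boston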
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# import matplotlib.pylab as plt
import pandas as pd


In [None]:
# Load the Boston housing dataset (features and target values)
X, y = load_boston(return_X_y=True)

# Create a pipeline that first scales the data, then applies the KNeighborsRegressor model
pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)


In [None]:
# pipe.get_params()  # This would show all the parameters you can set in the pipeline
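
The parameter names that `param_grid` expects follow the `<step name>__<parameter name>` convention. The short sketch below rebuilds the same pipeline and lists the names that belong to the `"model"` step; the variable names `param_names` and `knn_params` are illustrative only.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)

# Nested parameters are addressed as "<step name>__<parameter name>"
param_names = sorted(pipe.get_params().keys())
knn_params = [name for name in param_names if name.startswith("model__")]
print(knn_params)

# set_params accepts the same names that GridSearchCV uses in param_grid
pipe.set_params(model__n_neighbors=5)
print(pipe.get_params()["model__n_neighbors"])
```

Any name printed here can be used as a key in `param_grid`.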


In [None]:
# Create a new model using GridSearchCV to find the best settings and evaluate performance with cross-validation
mod = GridSearchCV(
    # Pass the pipeline as the estimator (must have .fit() and .predict() methods)
    estimator=pipe,
    # param_grid defines the hyperparameters and values to search over in the pipeline
    # Use get_params() on any scikit-learn estimator to see available parameter names
    # Here, we search over different values for n_neighbors in the KNeighborsRegressor step
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    # Set the number of cross-validation folds
    cv=3,
)


## What does GridSearchCV actually do?
GridSearchCV automates the process of tuning hyperparameters and evaluating model performance. When you call `mod.fit(X, y)`, it tries every combination of the parameter values listed in `param_grid` and uses cross-validation to estimate performance for each one. With 10 candidate values for `n_neighbors` and `cv=3`, that is 30 model fits, followed by one final refit of the best setting on all the data (the default `refit=True`). You don't have to split your data manually or loop over parameter values yourself; GridSearchCV handles both.
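
To make the description above concrete, here is a rough sketch of the loop that GridSearchCV replaces, compared against GridSearchCV itself. It uses synthetic data from `make_regression` as a stand-in (an assumption, so the scores will not match the Boston results), and the helper `make_pipe` and the names `X_demo`, `y_demo`, and `candidates` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)


def make_pipe(k):
    """Hypothetical helper: the notebook's pipeline with n_neighbors=k."""
    return Pipeline(
        [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=k))]
    )


candidates = [1, 3, 5, 7]

# Manual version: one 3-fold cross-validation run per candidate value
manual_scores = {}
for k in candidates:
    fold_scores = cross_val_score(make_pipe(k), X_demo, y_demo, cv=3)
    manual_scores[k] = fold_scores.mean()

best_k = max(manual_scores, key=manual_scores.get)

# GridSearchCV version: the same search in a single object
grid = GridSearchCV(
    estimator=make_pipe(1),
    param_grid={"model__n_neighbors": candidates},
    cv=3,
)
grid.fit(X_demo, y_demo)

print(manual_scores)
print(grid.best_params_)
```

Both versions use the same unshuffled 3-fold split, so the mean scores should agree; GridSearchCV additionally records timings, per-split scores, and ranks in `cv_results_`.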

In [None]:
# Fit (train) the GridSearchCV model on the data
mod.fit(X, y)
# After training, GridSearchCV stores results for each parameter setting and cross-validation split
# mod.cv_results_

# You can turn the results into a pandas DataFrame for easier viewing
pd.DataFrame(mod.cv_results_)

# This DataFrame shows how well each parameter setting performed in each cross-validation split


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001194 | 0.000511 | 0.001173 | 0.000272 | 1 | {'model__n_neighbors': 1} | 0.226933 | 0.432998 | 0.127635 | 0.262522 | 0.127179 | 10 |
| 1 | 0.000666 | 1.8e-05 | 0.00109 | 0.000102 | 2 | {'model__n_neighbors': 2} | 0.358216 | 0.409229 | 0.172294 | 0.313246 | 0.101821 | 9 |
| 2 | 0.000637 | 4.2e-05 | 0.001368 | 0.000378 | 3 | {'model__n_neighbors': 3} | 0.413515 | 0.476651 | 0.318534 | 0.4029 | 0.064986 | 1 |
| 3 | 0.003408 | 0.001966 | 0.008284 | 0.00948 | 4 | {'model__n_neighbors': 4} | 0.475349 | 0.402495 | 0.273014 | 0.383619 | 0.083675 | 7 |
| 4 | 0.003678 | 0.002695 | 0.002074 | 0.000619 | 5 | {'model__n_neighbors': 5} | 0.512318 | 0.347951 | 0.26259 | 0.374286 | 0.103638 | 8 |
| 5 | 0.000623 | 2.3e-05 | 0.003463 | 0.003001 | 6 | {'model__n_neighbors': 6} | 0.533611 | 0.389504 | 0.248482 | 0.390532 | 0.116406 | 6 |
| 6 | 0.00087 | 6.1e-05 | 0.001448 | 0.000203 | 7 | {'model__n_neighbors': 7} | 0.544782 | 0.385199 | 0.243668 | 0.391216 | 0.123003 | 5 |
| 7 | 0.00059 | 8e-06 | 0.001236 | 0.000187 | 8 | {'model__n_neighbors': 8} | 0.589644 | 0.39465 | 0.209714 | 0.398003 | 0.155124 | 2 |
| 8 | 0.000655 | 6.7e-05 | 0.001238 | 0.000116 | 9 | {'model__n_neighbors': 9} | 0.590352 | 0.407556 | 0.185253 | 0.394387 | 0.165643 | 3 |
| 9 | 0.000682 | 8.5e-05 | 0.001395 | 0.000125 | 10 | {'model__n_neighbors': 10} | 0.61651 | 0.395077 | 0.164023 | 0.39187 | 0.184741 | 4 |
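
Because no `scoring` argument was passed, the test scores in this table come from the estimator's own `score` method, which for a regressor is the R² coefficient of determination. A results table like this is usually easier to read when trimmed to a few columns and sorted by rank. The sketch below shows one way to do that, again on synthetic `make_regression` data as a stand-in (an assumption, so its numbers differ from the table above); the names `grid`, `results`, and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)
grid = GridSearchCV(
    estimator=pipe,
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5]},
    cv=3,
)
grid.fit(X_demo, y_demo)

results = pd.DataFrame(grid.cv_results_)

# Keep the columns that matter for model selection and put the best row first
summary = results[
    [
        "param_model__n_neighbors",
        "mean_test_score",
        "std_test_score",
        "rank_test_score",
    ]
].sort_values("rank_test_score")
print(summary)

# The winning setting and its mean cross-validated score are also stored directly
print(grid.best_params_)
print(grid.best_score_)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two settings is larger than the variation across folds.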


With just a few lines of code, you now have a workflow that scales the features, fits the model, and scores every candidate setting with cross-validation.

If you plan to use scikit-learn regularly, this pattern is essential:

```python
# Requires scikit-learn < 1.2 (load_boston was removed in 1.2)
X, y = load_boston(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=1)),
])

mod = GridSearchCV(
    estimator=pipe,
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    cv=3,
)
mod.fit(X, y)
```

Try to stick to this fit-predict-pipeline pattern whenever you use scikit-learn. The ability to chain preprocessing and modeling steps, and to tune parameters with cross-validation, is a powerful feature of the library.
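
The "predict" half of the pattern works on the fitted search object directly. With the default `refit=True`, GridSearchCV retrains the best pipeline on all the data and exposes it as `best_estimator_`, and `predict` delegates to it. A minimal sketch, assuming synthetic `make_regression` data as a stand-in for a real dataset (the names `X_demo`, `y_demo`, `grid`, and `preds` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)
grid = GridSearchCV(
    estimator=pipe,
    param_grid={"model__n_neighbors": [1, 3, 5, 7]},
    cv=3,
)
grid.fit(X_demo, y_demo)

# predict() runs the refitted best pipeline: scaling first, then the KNN model
preds = grid.predict(X_demo[:5])
print(preds)

# The same pipeline is available explicitly as best_estimator_
same_preds = grid.best_estimator_.predict(X_demo[:5])
```

Since scaling lives inside the pipeline, new data can be passed in unscaled; the scaler fitted during training is applied automatically.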