# 3. Model with Train/Validation Split
In this notebook, we'll use GridSearchCV to properly evaluate our model and select the best hyperparameters.

Previously, we trained and tested on the same data, which hides overfitting and gives overly optimistic results. GridSearchCV addresses this with cross-validation: it splits the data into training and validation folds, so that every score is computed on data the model did not see during fitting.
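To make the fold idea concrete, here is a minimal sketch using scikit-learn's `KFold` on nine toy rows. The toy array `X_toy` is an assumption for illustration only. For a regressor and an integer `cv`, GridSearchCV uses this kind of unshuffled K-fold split internally.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine toy samples, indexed 0..8, stand in for the rows of a dataset.
X_toy = np.arange(9).reshape(-1, 1)

# Three unshuffled folds: each fold holds out a different third of the rows.
folds = list(KFold(n_splits=3).split(X_toy))

for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} validation={valid_idx.tolist()}")
```

Every row lands in the validation set exactly once, and a row is never in the training and validation set of the same fold.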

We'll define a pipeline (e.g., scaling + model), pass it to GridSearchCV, and specify hyperparameters (like n_neighbors) to search over. This process gives a reliable estimate of model performance and automatically selects the best parameters.

In [1]:
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# import matplotlib.pylab as plt
import pandas as pd


In [2]:
X, y = load_boston(return_X_y=True)

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)


In [3]:
# pipe.get_params()
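As a small sketch of how `get_params()` helps here: parameters of a pipeline step are exposed under the name `<step>__<parameter>`, so filtering on the `model__` prefix lists everything that can be tuned on the KNeighborsRegressor step. The variable name `model_params` is only for illustration.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)

# Keep only the nested parameters that belong to the "model" step.
model_params = sorted(name for name in pipe.get_params() if name.startswith("model__"))
print(model_params)
```

The list includes `model__n_neighbors`, which is the key used in `param_grid` in the next cell.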


In [4]:
# Create a new model using GridSearchCV to tune hyperparameters and perform cross-validation
mod = GridSearchCV(
    # Pass the pipeline as the estimator (must have .fit() and .predict() methods)
    estimator=pipe,
    # param_grid defines the hyperparameters and values to search over in the pipeline
    # Use get_params() on any scikit-learn estimator to see available parameter names
    # Here, we search over different values for n_neighbors in the KNeighborsRegressor step
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    # Set the number of cross-validation folds
    cv=3,
)


## What does GridSearchCV actually do?
GridSearchCV automates the process of tuning hyperparameters and evaluating model performance. When you call `mod.fit(X, y)`, it tries every combination of the parameters you specify and uses cross-validation to estimate performance for each setting. This way, you don't have to split your data or loop over parameter values by hand; GridSearchCV handles both for you.
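The loop that GridSearchCV replaces can be sketched roughly as follows. This is an illustration, not the library's internal implementation. It uses a synthetic dataset from `make_regression` as a stand-in (an assumption, so it runs without the Boston data), and the names `X_demo`, `y_demo`, `neighbor_values`, and `search` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the Boston dataset).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())])
neighbor_values = [1, 2, 3, 4, 5]

# Manual version: one cross-validated mean score per candidate value.
manual_scores = []
for k in neighbor_values:
    candidate = clone(pipe).set_params(model__n_neighbors=k)
    fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=3)
    manual_scores.append(fold_scores.mean())

# GridSearchCV version of the same search.
search = GridSearchCV(pipe, param_grid={"model__n_neighbors": neighbor_values}, cv=3)
search.fit(X_demo, y_demo)

# Both approaches should agree on the mean validation score of each setting.
print(np.allclose(manual_scores, search.cv_results_["mean_test_score"]))
```

Both versions score each candidate with the estimator's default `score` method, which for a regressor is R².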

In [5]:
mod.fit(X, y)
# Once the search has been fitted, the results are stored in the "cv_results_" attribute.
# For every parameter setting and every cross-validation split, it keeps track of timings and test scores.
# mod.cv_results_

# You can turn it into a pandas DataFrame
pd.DataFrame(mod.cv_results_)

# Each row is one parameter setting. The split*_test_score columns show how it scored on each fold,
# and rank_test_score shows which setting did best.


|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000975 | 0.000412 | 0.001133 | 9.7e-05 | 1 | {'model__n_neighbors': 1} | 0.226933 | 0.432998 | 0.127635 | 0.262522 | 0.127179 | 10 |
| 1 | 0.000661 | 2.2e-05 | 0.001193 | 0.000123 | 2 | {'model__n_neighbors': 2} | 0.358216 | 0.409229 | 0.172294 | 0.313246 | 0.101821 | 9 |
| 2 | 0.000666 | 0.00011 | 0.001651 | 0.000905 | 3 | {'model__n_neighbors': 3} | 0.413515 | 0.476651 | 0.318534 | 0.4029 | 0.064986 | 1 |
| 3 | 0.000746 | 0.000134 | 0.001154 | 9.5e-05 | 4 | {'model__n_neighbors': 4} | 0.475349 | 0.402495 | 0.273014 | 0.383619 | 0.083675 | 7 |
| 4 | 0.000766 | 2.8e-05 | 0.00125 | 4.5e-05 | 5 | {'model__n_neighbors': 5} | 0.512318 | 0.347951 | 0.26259 | 0.374286 | 0.103638 | 8 |
| 5 | 0.000722 | 0.000102 | 0.001315 | 0.000129 | 6 | {'model__n_neighbors': 6} | 0.533611 | 0.389504 | 0.248482 | 0.390532 | 0.116406 | 6 |
| 6 | 0.000641 | 5.1e-05 | 0.001164 | 8.5e-05 | 7 | {'model__n_neighbors': 7} | 0.544782 | 0.385199 | 0.243668 | 0.391216 | 0.123003 | 5 |
| 7 | 0.000672 | 7.4e-05 | 0.001342 | 4.3e-05 | 8 | {'model__n_neighbors': 8} | 0.589644 | 0.39465 | 0.209714 | 0.398003 | 0.155124 | 2 |
| 8 | 0.000593 | 4e-06 | 0.001247 | 0.000208 | 9 | {'model__n_neighbors': 9} | 0.590352 | 0.407556 | 0.185253 | 0.394387 | 0.165643 | 3 |
| 9 | 0.000633 | 6.3e-05 | 0.001234 | 0.000111 | 10 | {'model__n_neighbors': 10} | 0.61651 | 0.395077 | 0.164023 | 0.39187 | 0.184741 | 4 |
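In the results above, `n_neighbors=3` has rank 1 with a mean test score of about 0.40. Rather than scanning the full table, one option is to keep a few columns and sort by rank, or to read the winner from the `best_params_` and `best_score_` attributes. The sketch below shows this on synthetic stand-in data (an assumption), so its numbers will differ from the table; `search`, `results`, and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the Boston dataset).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())])
search = GridSearchCV(pipe, param_grid={"model__n_neighbors": list(range(1, 11))}, cv=3)
search.fit(X_demo, y_demo)

# Keep the columns that matter for model selection and put the best setting first.
results = pd.DataFrame(search.cv_results_)
columns = ["param_model__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary.head(3))

# The winning setting and its mean validation score are also stored directly.
print(search.best_params_, search.best_score_)
```

Looking at `std_test_score` next to `mean_test_score` is worthwhile: in the table above, several settings have means within about 0.01 of each other while their fold-to-fold spread is much larger than that.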


With just a few lines of code, you now have a robust machine learning workflow!

If you plan to use scikit-learn regularly, this pattern is essential:

```python
X, y = load_boston(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=1))
])

mod = GridSearchCV(
    estimator=pipe,
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    cv=3,
)

# Run the cross-validated search
mod.fit(X, y)
```

Try to stick to this fit-predict-pipeline pattern whenever you use scikit-learn. The ability to chain preprocessing and modeling steps, and to tune parameters with cross-validation, is a powerful feature of the library.
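The predict half of the pattern can be sketched as follows. With the default `refit=True`, GridSearchCV refits the best setting on all the data passed to `fit`, and `predict` delegates to that refitted pipeline. The data here is a synthetic stand-in (an assumption), and the names `search` and `predictions` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the Boston dataset).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())])
search = GridSearchCV(pipe, param_grid={"model__n_neighbors": [1, 3, 5, 7]}, cv=3)
search.fit(X_demo, y_demo)

# predict() runs scaling and the tuned model in one call.
predictions = search.predict(X_demo[:3])
print(predictions)

# The refitted pipeline is also available directly.
print(search.best_estimator_)
```

Because the scaler lives inside the pipeline, new data passed to `predict` is scaled with the statistics learned during fitting, with no separate preprocessing step to remember.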