# 3. Model with train/validation split

We will now use GridSearchCV to evaluate the model correctly.

In earlier examples, we trained and tested on the same data, which hides overfitting and gives overly optimistic results.

GridSearchCV solves this by performing cross-validation: it splits the data into training and validation folds, ensuring that every validation score is computed on data the model hasn't seen during fitting.

We define a pipeline (e.g., scaling + model), pass it to GridSearchCV, and specify hyperparameters (like n_neighbors) to search over.

This process gives a more reliable estimate of model performance and automatically selects the best parameters among those in the grid.
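
To make the fold idea concrete, here is a minimal sketch of the kind of split that `cv=3` produces. It uses ten toy samples (the `X_demo` array is an assumption for illustration, not the Boston data) and relies on the fact that, for a regressor, an integer `cv` means an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, just to make the fold indices easy to read (not the Boston data).
X_demo = np.arange(20).reshape(10, 2)

# For a regressor, cv=3 corresponds to an unshuffled 3-fold KFold split.
folds = list(KFold(n_splits=3).split(X_demo))

for i, (train_idx, val_idx) in enumerate(folds):
    # No sample is ever in both the training part and the validation part of a fold.
    overlap = set(train_idx) & set(val_idx)
    print(f"fold {i}: train={train_idx.tolist()} validation={val_idx.tolist()} overlap={len(overlap)}")
```

Each sample lands in the validation part of exactly one fold, so every sample is scored once by a model that never saw it during fitting.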


In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# import matplotlib.pylab as plt
import pandas as pd


In [4]:
X, y = load_boston(return_X_y=True)

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)


In [None]:
# pipe.get_params()
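
As a small sketch of what `get_params()` is useful for here: the snippet below rebuilds the same pipeline and lists its nested parameter names, which follow the `<step name>__<parameter name>` pattern that the `param_grid` keys use. The variable name `tunable` is only for illustration.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)

# Nested parameters are named "<step name>__<parameter name>".
tunable = sorted(name for name in pipe.get_params() if "__" in name)
for name in tunable:
    print(name)

# set_params accepts the same names, so a single setting can be changed by hand.
pipe.set_params(model__n_neighbors=5)
print(pipe.get_params()["model__n_neighbors"])
```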


In [5]:
# New model
mod = GridSearchCV(
    # We need to pass an estimator: an object with .fit() and .predict() methods.
    estimator=pipe,
    # The param grid lists every setting we want to try in the pipeline.
    # Each key is a parameter name; the easiest way to find those names is the
    # get_params() method that every scikit-learn estimator has.
    # Each value is the list of candidates to check.
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    # cv=3 scores every candidate with 3-fold cross-validation.
    cv=3,
)


OK, so what is GridSearchCV *actually doing*? From our side, we just call `mod.fit(X, y)`; behind that call, every setting in the grid is fitted and cross-validated on our behalf, so we don't have to do it ourselves.
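
To see what is being done on our behalf, here is a rough hand-written equivalent: loop over the candidates, cross-validate each one, and keep the best mean score. This is a sketch under assumptions, not the library's internals; it runs on synthetic data from `make_regression` (a stand-in, not the Boston data), and the names `X_demo`, `y_demo`, `manual_scores`, and `search` are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the Boston data).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())])
candidates = [1, 2, 3, 4, 5]

# Hand-written version: one mean cross-validated score per candidate setting.
manual_scores = {}
for k in candidates:
    candidate = clone(pipe).set_params(model__n_neighbors=k)
    manual_scores[k] = cross_val_score(candidate, X_demo, y_demo, cv=3).mean()
manual_best = max(manual_scores, key=manual_scores.get)

# GridSearchCV does the same bookkeeping, then refits the best setting on all the data.
search = GridSearchCV(pipe, param_grid={"model__n_neighbors": candidates}, cv=3)
search.fit(X_demo, y_demo)

print(manual_best, search.best_params_["model__n_neighbors"])
```

Both approaches use the same folds and the same default score (R² for a regressor), so they should agree on the winning setting.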

In [6]:
mod.fit(X, y)
# Once this has trained, there is a really interesting attribute called "cv_results_"
# For every setting and cross-validation split, it's keeping track of a couple of numbers
# mod.cv_results_

# You can turn it into a pandas DataFrame
pd.DataFrame(mod.cv_results_)

# Now for every parameter that we have and for every cross-validation split that we have made, we can see how well it did on a certain score and see which one was the best


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00217 | 0.001931 | 0.001425 | 0.000344 | 1 | {'model__n_neighbors': 1} | 0.226933 | 0.432998 | 0.127635 | 0.262522 | 0.127179 | 10 |
| 1 | 0.000834 | 0.000187 | 0.001255 | 0.000197 | 2 | {'model__n_neighbors': 2} | 0.358216 | 0.409229 | 0.172294 | 0.313246 | 0.101821 | 9 |
| 2 | 0.000687 | 3.1e-05 | 0.001208 | 0.000122 | 3 | {'model__n_neighbors': 3} | 0.413515 | 0.476651 | 0.318534 | 0.4029 | 0.064986 | 1 |
| 3 | 0.00077 | 0.000144 | 0.001219 | 0.000158 | 4 | {'model__n_neighbors': 4} | 0.475349 | 0.402495 | 0.273014 | 0.383619 | 0.083675 | 7 |
| 4 | 0.000677 | 4.8e-05 | 0.00122 | 0.000103 | 5 | {'model__n_neighbors': 5} | 0.512318 | 0.347951 | 0.26259 | 0.374286 | 0.103638 | 8 |
| 5 | 0.000632 | 2.7e-05 | 0.001169 | 6.6e-05 | 6 | {'model__n_neighbors': 6} | 0.533611 | 0.389504 | 0.248482 | 0.390532 | 0.116406 | 6 |
| 6 | 0.000584 | 1e-06 | 0.001172 | 0.000123 | 7 | {'model__n_neighbors': 7} | 0.544782 | 0.385199 | 0.243668 | 0.391216 | 0.123003 | 5 |
| 7 | 0.000599 | 9e-06 | 0.001173 | 0.000106 | 8 | {'model__n_neighbors': 8} | 0.589644 | 0.39465 | 0.209714 | 0.398003 | 0.155124 | 2 |
| 8 | 0.000637 | 3.7e-05 | 0.001374 | 0.000135 | 9 | {'model__n_neighbors': 9} | 0.590352 | 0.407556 | 0.185253 | 0.394387 | 0.165643 | 3 |
| 9 | 0.000604 | 2.6e-05 | 0.001323 | 0.000265 | 10 | {'model__n_neighbors': 10} | 0.61651 | 0.395077 | 0.164023 | 0.39187 | 0.184741 | 4 |
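
A results table like this is easier to read once it is trimmed and sorted. The sketch below shows one way to do that and to pull out the winning setting directly. It runs on synthetic data from `make_regression` as a stand-in (an assumption, so the numbers will differ from the table above), and `search`, `results`, and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the Boston data).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())])
search = GridSearchCV(pipe, param_grid={"model__n_neighbors": [1, 2, 3, 4, 5]}, cv=3)
search.fit(X_demo, y_demo)

# Keep only the columns needed to compare settings, best first.
results = pd.DataFrame(search.cv_results_)
columns = ["param_model__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary.head(3))

# The row ranked 1 matches the attributes GridSearchCV exposes directly.
print(search.best_params_, round(search.best_score_, 3))

# With the default refit=True, the search object predicts with the best pipeline,
# refit on all the data.
print(search.predict(X_demo[:3]))
```

Looking at `std_test_score` next to `mean_test_score` is worthwhile: when the spread across folds is large compared with the gap between settings, the ranking is less certain than the rank column alone suggests.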


The point is that with very little code, we now have a fairly mature pipeline!

If you're going to use scikit-learn a lot, this:

```python
X, y = load_boston(return_X_y=True)

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=1))]
)

mod = GridSearchCV(
    estimator=pipe,
    param_grid={"model__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    cv=3,
)
```
...is what you're eventually going to be aiming for.

Try to stick to this pattern whenever you're using scikit-learn: the fit-predict system that scikit-learn offers, and the way it lets you construct pipelines, are something to appreciate.