# Introduction

This is a study of flight take-off data from John F. Kennedy International Airport.
Our goal is to predict the taxi-out time of a flight.
You can download the dataset [from Kaggle](https://www.kaggle.com/deepankurk/flight-take-off-data-jfk-airport).

### Loading the dataset

Before we do anything, we're going to update `scikit-learn` to a newer version.


In [None]:
! pip install --upgrade scikit-learn

Let's start by taking a look at the raw data.

In [None]:
import pandas as pd

data = pd.read_csv("../input/flight-take-off-data-jfk-airport/M1_final.csv")

In [None]:
data.shape

In [None]:
data.head()

### Description of the columns

There are $28820$ observations of $23$ variables.
Each observation is an individual flight.
- `MONTH`, `DAY_OF_MONTH`, `DAY_OF_WEEK` contain information about the date of the flight
- `OP_UNIQUE_CARRIER` contains the ID of the airline (e.g. `AA` stands for American Airlines)
- `TAIL_NUM` is the tail number of the plane
- `DEST` is the destination airport code
- `DEP_DELAY` is the departure delay of the flight
- `CRS_ELAPSED_TIME` is the expected duration of the flight
- `DISTANCE` is the distance between airports
- `CRS_DEP_M` is scheduled departure time (in minutes after midnight)
- `DEP_TIME_M` is actual departure time (gate checkout)
- `CRS_ARR_M` is scheduled arrival time
- `Temperature`, `Dew Point`, `Humidity`, `Wind Speed`, `Wind Gust` and `Pressure` are the numeric characteristics of the weather
- `Wind` is the direction of the wind (`CALM` if calm, `VAR` if wind blows from various directions)
- `Condition` contains natural language description of the weather
- `sch_dep` is the number of flights scheduled for departure
- `sch_arr` is the number of flights scheduled for arrival 
- `TAXI_OUT` is the time between the actual pushback and wheels-off.

There are five categorical variables: `OP_UNIQUE_CARRIER`, `TAIL_NUM`, `DEST`, `Wind` and `Condition`.
The rest of the variables are numerical.

### Train/test split

Before we go any further, we need to split the dataset into a training and a test part.
Our target variable is `TAXI_OUT`.


In [None]:
y = data["TAXI_OUT"]
X = data.drop("TAXI_OUT", axis=1)

#reproducibility
seed = 1001

from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

# Exploratory Analysis

### Are there any NaN values?

In [None]:
X_train.info()

We see that there are $23056$ non-null values in each column of the training data, except for `Wind`, which has $23055$.

The test set should not contain missing values either.
However, it also has one missing value, again in the `Wind` variable.
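
A more direct check than reading the `info()` summary is to count missing values per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the flight data) to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train / X_test.
toy = pd.DataFrame({
    "Wind": ["N", np.nan, "CALM", "W"],
    "Humidity": [54, 60, 71, 48],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

The same two lines applied to `X_train` and `X_test` should point at the `Wind` column only.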

In [None]:
X_test.info()

### Basic statistics

In [None]:
X_train.describe()

### Plotting the distributions

We're going to plot the distribution of each variable.
Let's start with the numerical ones.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_num_var_dist(var, kde=True, kde_plot=False, discrete=False):
  if kde_plot:
    sns.kdeplot(data=X_train, x=var)
  else:
    sns.histplot(data=X_train, x=var, kde=kde, discrete=discrete)

In [None]:
plot_num_var_dist("MONTH", discrete=True)

We have data about departures from November, December, and January, distributed almost evenly.

In [None]:
plot_num_var_dist("DAY_OF_MONTH", discrete=True)

In [None]:
plot_num_var_dist("DAY_OF_WEEK", kde=False, discrete=True)

There is an equal number of flights on Monday, Tuesday, Wednesday, Thursday and Sunday (about 3200 each).
There are more flights on Friday (about 3600) and fewer on Saturday (about 2800).

In [None]:
plot_num_var_dist("DEP_DELAY", kde_plot=True)

As can be expected, there is a spike around $0$.

In [None]:
plot_num_var_dist("CRS_ELAPSED_TIME")

In [None]:
plot_num_var_dist("DISTANCE")

Notice that the distributions of `DISTANCE` and `CRS_ELAPSED_TIME` are roughly the same.
This of course makes a lot of sense.

In [None]:
plot_num_var_dist("CRS_DEP_M")

In [None]:
plot_num_var_dist("DEP_TIME_M")

These are also pretty similar. 

In [None]:
plot_num_var_dist("CRS_ARR_M")

In [None]:
plot_num_var_dist("Temperature", kde_plot=True)

This looks more or less like a normal distribution.

In [None]:
X_train = X_train.astype({"Dew Point": "int"})

plot_num_var_dist("Dew Point", kde_plot=True)

In [None]:
plot_num_var_dist("Humidity", kde_plot=True)

In [None]:
plot_num_var_dist("Wind Speed", kde_plot=True)

In [None]:
plot_num_var_dist("Wind Gust")

This is another variable with a spike at $0$.

In [None]:
plot_num_var_dist("Pressure", kde_plot=True)

In [None]:
plot_num_var_dist("sch_dep", discrete=True)

In [None]:
plot_num_var_dist("sch_arr", discrete=True)

Now, let's create countplots for categorical variables.

In [None]:
def plot_cat_var_dist(var, n=20, figsize=(10,5)):
  plt.figure(figsize=figsize)
  sns.countplot(y=X_train[var], order=X_train[var].value_counts().iloc[:n].index, palette="crest")

In [None]:
plot_cat_var_dist("OP_UNIQUE_CARRIER")

There are a lot of unique values in the `TAIL_NUM` column, so we can't really create a column for each one.
Instead, let's consider the twenty most common tail numbers.

In [None]:
plot_cat_var_dist("TAIL_NUM", n=20)

Below, we plot the twenty most common destinations.

In [None]:
plot_cat_var_dist("DEST", n=20)

In [None]:
plot_cat_var_dist("Wind")

In [None]:
plot_cat_var_dist("Condition", n=25, figsize=(10,8))

Finally, let's plot the target variable.

In [None]:
sns.histplot(x=y_train, kde=True, discrete=True)

### Violin plots

Let's create a violin plot for each categorical variable.
Most of them have too many unique values, so we're going to consider only the most common ones.

In [None]:
import numpy as np

def violin(var, figsize=(10,10), n=10):
  # np.in1d is deprecated in NumPy 2.0; Series.isin gives the same boolean mask
  mask = X_train[var].isin(X_train[var].value_counts().iloc[:n].index)
  plt.figure(figsize=figsize)
  sns.violinplot(data=X_train.loc[mask], x=var, y=y_train, palette="crest")

violin("MONTH", figsize=(7.5,7.5))

In [None]:
violin("OP_UNIQUE_CARRIER", figsize=(20,7.5))

In [None]:
violin("TAIL_NUM",  figsize=(20,7.5), n=10)

In [None]:
violin("DEST", figsize=(20,7.5), n=10)

In [None]:
violin("Wind", figsize=(30,7.5), n=20)

In [None]:
violin("Condition", figsize=(10,7.5), n=5)

### PCA Visualization

We're going to use numerical columns to visualize the data in a 2D and 3D projection on principal components.
Before we do that, we're going to temporarily standardize the data.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline      import Pipeline

categorical = ["OP_UNIQUE_CARRIER", "TAIL_NUM", "DEST", "Wind", "Condition"]
numerical = list(set(X_train.columns) - set(categorical))

steps2d = [("scaler", StandardScaler()), ("PCA", PCA(n_components=2))]
pca2d = Pipeline(steps2d)
pca2d_dt = pca2d.fit_transform(X_train[numerical])
pca2d_dt = pd.DataFrame(pca2d_dt)

plt.figure(figsize=(10,10))
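# y_train keeps the original row index while pca2d_dt has a fresh one, so pass raw values to avoid index misalignment
sns.scatterplot(x=pca2d_dt[0], y=pca2d_dt[1], hue=y_train.to_numpy(), palette="crest")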

In [None]:
steps3d = [("scaler", StandardScaler()), ("PCA", PCA(n_components=3))]
pca3d = Pipeline(steps3d)
pca3d_dt = pca3d.fit_transform(X_train[numerical])
pca3d_dt = pd.DataFrame(pca3d_dt)

fig = plt.figure()
ax = plt.axes(projection="3d")
ax.scatter3D(pca3d_dt[0], pca3d_dt[1], pca3d_dt[2])

On the 2D plot we see two clusters that could almost be separated by a line.
There is also a line pattern of outliers on the right side of the plot.
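
When reading such projections, it helps to know how much of the total variance the retained components capture. The following is a minimal sketch on synthetic data (the `features` array is made up, not the flight data), reusing the same scaler-plus-PCA pipeline layout and step names as above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train[numerical]: 200 rows, 5 features,
# with the second feature made strongly dependent on the first.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
features[:, 1] = 2.0 * features[:, 0] + 0.1 * rng.normal(size=200)

pipeline = Pipeline([("scaler", StandardScaler()), ("PCA", PCA(n_components=2))])
pipeline.fit(features)

# Fraction of the total variance captured by each retained component.
explained = pipeline.named_steps["PCA"].explained_variance_ratio_
print(explained, explained.sum())
```

A low cumulative value would suggest that the 2D picture hides a good part of the structure in the data.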

### Plotting the target variable

We're going to plot the target variable against every other numerical variable on a scatterplot.

In [None]:
def scatter(var):
  sns.scatterplot(y=y_train, x=X_train[var])

scatter("MONTH")

In [None]:
scatter("DAY_OF_MONTH")

In [None]:
scatter("DAY_OF_WEEK")

The distributions above are pretty even.

In [None]:
scatter("DEP_DELAY")

In [None]:
scatter("CRS_ELAPSED_TIME")

In [None]:
scatter("DISTANCE")

Here, we see a bunch of outliers, but the target variable's values don't seem to be affected by them.

In [None]:
scatter("CRS_DEP_M")

In [None]:
scatter("DEP_TIME_M")

In [None]:
scatter("CRS_ARR_M")

Here, we see two clusters: a big one, and a smaller one.

In [None]:
scatter("Temperature")

In [None]:
scatter("Dew Point")

In [None]:
scatter("Humidity")

The line on the left seems very bizarre, as if the measurement was incorrect a couple of times.

In [None]:
scatter("Wind Speed")

In [None]:
scatter("Wind Gust")

In [None]:
scatter("Pressure")

In [None]:
scatter("sch_dep")

In [None]:
scatter("sch_arr")

Again, we see traces of the second, smaller cluster.

### Correlation heatmap

Below, we see a correlation heatmap of numerical features.

In [None]:
corr = X_train[numerical].corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, vmin=-1, vmax=1, center=0, square=True)

There are two pairs of highly correlated features:
- `DISTANCE` and `CRS_ELAPSED_TIME`: this is pretty obvious (the more distant the destination, the longer the journey is going to take) and we noticed it before. I believe we can safely remove the `DISTANCE` column.
- `CRS_DEP_M` and `DEP_TIME_M`: this is also not surprising, as the first column contains the scheduled departure time and the other the actual departure time. Their difference is contained in the `DEP_DELAY` variable, so here we can most likely remove one of the features as well.
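
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. This is a sketch on synthetic data; the frame `toy` and the `threshold` of 0.9 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train[numerical].
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 3.0 * a + 0.05 * rng.normal(size=500),  # nearly a linear function of "a"
    "c": rng.normal(size=500),                   # independent of the others
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, chosen for illustration only
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```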

# Baseline performance

To create a baseline against which we're going to be testing more complex models, we're going to use `sklearn`'s `DummyRegressor`.

In [None]:
from sklearn.dummy import DummyRegressor

dummy_clf = DummyRegressor(strategy="mean")
dummy_clf.fit(X_train, y_train)
dummy_pred = dummy_clf.predict(X_test)

from sklearn.metrics import mean_squared_error

baseline_MSE = mean_squared_error(dummy_pred, y_test)
baseline_MSE

Predicting the mean value gives us a baseline mean squared error of about $47$.
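
The mean baseline has a simple interpretation: evaluated on the same data it was fitted on, its MSE equals the (population) variance of the target. On a held-out set the value additionally includes the squared difference between the training and test means. The sketch below checks the first statement on a synthetic target (`y_toy` and `X_toy` are made up).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic target; the features are ignored by DummyRegressor.
rng = np.random.default_rng(0)
y_toy = rng.normal(loc=20.0, scale=7.0, size=1000)
X_toy = np.zeros((1000, 1))

dummy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
mse_on_same_data = mean_squared_error(y_toy, dummy.predict(X_toy))

# np.var uses the population variance (ddof=0) by default.
print(mse_on_same_data, np.var(y_toy))
```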

Let's also create a simple linear regression model using the numerical columns and see what MSE it'll achieve.

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train[numerical], y_train)
lr_preds = lr.predict(X_test[numerical])

baseline_MSE_lr = mean_squared_error(lr_preds, y_test)
baseline_MSE_lr

# Feature engineering

### Spikes at $0$

We cannot really do anything about the `Wind Gust` variable.
It just has a lot of $0$ values.
On the other hand, take a look at the distribution of `DEP_DELAY`.
One idea is to check whether it is normally distributed using the Shapiro-Wilk test.
If so, we could transform it using the inverse CDF function.

In [None]:
from scipy.stats import shapiro

shapiro(X_train["DEP_DELAY"])

Apparently not.
The `scipy` library gives us a warning about possible inaccuracy of the p-value, so let's confirm the Shapiro-Wilk result with the Kolmogorov-Smirnov test.

In [None]:
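from scipy.stats import kstest

# Compare against a normal distribution with the sample's mean and standard deviation;
# passing norm.cdf alone would test against the standard normal N(0, 1).
dep_delay = X_train["DEP_DELAY"]
kstest(dep_delay, "norm", args=(dep_delay.mean(), dep_delay.std()))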

Another idea is to use a logarithm.
The variable has negative values, so first we'll shift it to the right.

In [None]:
def spike_transform(var, test=False):
  if test:
    return np.log(X_test[var] + 100)
  else:
    return np.log(X_train[var] + 100)

sns.kdeplot(spike_transform("DEP_DELAY"))

This looks a bit better, the spike is not as sharp as it was before.

In [None]:
X_train = X_train.assign(DEP_DELAY = lambda x: spike_transform("DEP_DELAY"))

We also have to transform the `DEP_DELAY` column in the test set. 

In [None]:
X_test = X_test.assign(DEP_DELAY = lambda x: spike_transform("DEP_DELAY", test=True))

As to `Wind Gust`, we'll leave it as it is.
Using a logarithm won't benefit us, because the values are not centered at zero; there are just a lot of $0$s in the column.

### Deleting highly correlated features

We've already decided to remove `DISTANCE` and `CRS_DEP_M` features.
Let's do that now.

In [None]:
X_train = X_train.drop(["DISTANCE", "CRS_DEP_M"], axis=1)
X_test = X_test.drop(["DISTANCE", "CRS_DEP_M"], axis=1)

### Discretization of `Humidity` (and similar)

Recall the distribution of the `Humidity` variable.

In [None]:
plot_num_var_dist("Humidity", kde_plot=True)

It seems a good idea to create a new variable, `Binary_Humidity`.
It should be equal to zero if `Humidity` is smaller than $20$ and equal to one otherwise.

In [None]:
def binary20(x):
  if x < 20: return 0
  else:      return 1

X_train = X_train.assign(Binary_Humidity = lambda x:  X_train["Humidity"].apply(binary20))
X_test = X_test.assign(Binary_Humidity = lambda x: X_test["Humidity"].apply(binary20))

Notice that there are other columns like this.

In [None]:
plot_num_var_dist("CRS_ELAPSED_TIME")

In [None]:
plot_num_var_dist("sch_arr", discrete=True)

In [None]:
plot_num_var_dist("CRS_ARR_M")

The `CRS_ELAPSED_TIME` needs $3$ values, rather than $2$.
For `sch_arr` and `CRS_ARR_M` we see that $2$ are enough.
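
This kind of thresholding can also be written without per-value helper functions. The sketch below is a vectorized alternative on made-up values; the cut points 20 and 100 are assumptions for illustration, not the thresholds used in this notebook.

```python
import pandas as pd

# Hypothetical values standing in for a numerical column.
values = pd.Series([5, 19, 20, 35, 120])

# Two classes split at 20: same rule as "0 if x < 20 else 1".
binary = (values >= 20).astype(int)

# Three classes with assumed cut points 20 and 100 (illustrative only).
# right=False makes each bin closed on the left: [-inf, 20), [20, 100), [100, inf).
classes = pd.cut(values, bins=[-float("inf"), 20, 100, float("inf")],
                 labels=False, right=False)

print(binary.tolist())
print(classes.tolist())
```

With `labels=False`, `pd.cut` returns the integer bin index, which matches the 0/1/2 coding used here.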

In [None]:
def binary25(x):
  if x < 25: return 0
  else:      return 1

X_train = X_train.assign(Binary_sch_arr = lambda x:  X_train["sch_arr"].apply(binary25))
X_test = X_test.assign(Binary_sch_arr = lambda x: X_test["sch_arr"].apply(binary25))

def binary400(x):
  if x < 400: return 0
  else:      return 1

X_train = X_train.assign(Binary_CRS_ARR_M = lambda x:  X_train["CRS_ARR_M"].apply(binary400))
X_test = X_test.assign(Binary_CRS_ARR_M = lambda x: X_test["CRS_ARR_M"].apply(binary400))

def cet_transformer(x):
  if x < 300:  return 0
  if x < 4000: return 1
  else:        return 2

X_train = X_train.assign(Classes_CRS_ELAPSED_TIME = lambda x:  X_train["CRS_ELAPSED_TIME"].apply(cet_transformer))
X_test = X_test.assign(Classes_CRS_ELAPSED_TIME = lambda x: X_test["CRS_ELAPSED_TIME"].apply(cet_transformer))

### Filling missing values

There are two of them, both in the `Wind` variable.
We're going to fill them with a special value.
We want to use `OrdinalEncoder` anyway, so they will get their own class and perhaps the model will use this information to improve its predictions.

In [None]:
X_train = X_train.fillna("missing")
X_test = X_test.fillna("missing")

We will also create a new column, `Wind_NA`, whose values are equal to `0` if the corresponding value in the `Wind` column was not missing and to `1` if it was.

In [None]:
def was_missing(x):
  if x == "missing": return 1
  else:              return 0

X_train = X_train.assign(Wind_NA = lambda x:  X_train["Wind"].apply(was_missing))
X_test = X_test.assign(Wind_NA = lambda x: X_test["Wind"].apply(was_missing))

### `Wind` variable

The `Wind` variable has a lot of unique values.

In [None]:
X_train["Wind"].unique()

We will encode it with `OrdinalEncoder` anyway; however, we're also going to create two new variables containing the direction of the wind along two axes.
More precisely, this is the mapping we will use:

`E`   -> 1, 0 \\
`ENE` -> 0.92, 0.38 \\
`NE`  -> 0.71, 0.71 \\
`NNE` -> 0.38, 0.92 \\
`N`   -> 0, 1

etc.

So we're assigning an angle $\theta$ to each label and then mapping it to the pair $\cos(\theta), \sin(\theta)$.
This way the directions are actually distributed on a circle, which wouldn't be possible in one dimension.
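
To make the benefit concrete, here is a small sketch of the mapping described above. It checks that neighbouring compass points are equally far apart in the $(\cos, \sin)$ plane, including across the wrap-around from `ESE` back to `E`. The helper name `encode` is hypothetical.

```python
import numpy as np

# Sixteen compass points, counter-clockwise starting from east.
directions = ["E", "ENE", "NE", "NNE", "N", "NNW", "NW", "WNW",
              "W", "WSW", "SW", "SSW", "S", "SSE", "SE", "ESE"]

# Each label gets the angle theta = index * (2 * pi / 16).
angle = {name: i * np.pi / 8 for i, name in enumerate(directions)}

def encode(name):
    """Return the (cos, sin) pair for a compass label."""
    theta = angle[name]
    return np.array([np.cos(theta), np.sin(theta)])

# Neighbouring directions are equally far apart, including across the wrap-around.
dist_e_ene = np.linalg.norm(encode("E") - encode("ENE"))
dist_ese_e = np.linalg.norm(encode("ESE") - encode("E"))
print(np.round(encode("ENE"), 2), dist_e_ene, dist_ese_e)
```

With a plain ordinal code, `E` and `ESE` would sit at opposite ends of the scale even though they are adjacent directions.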

This does not cover all the possible values in the `Wind` column.
We will assign $0$ to `CALM` values.
We will also assign $0$ to missing values, since we created a separate `Wind_NA` column anyway.

We're also going to map `VAR` values to $0$ and create a separate column, `Wind_VAR`, which will indicate whether the `Wind` column contained the `VAR` value in this row.

In [None]:
def was_var(dir):
  if dir == "VAR": return 1
  else:            return 0

wind_order = ["E", "ENE", "NE", "NNE", 
              "N", "NNW", "NW", "WNW",
              "W", "WSW", "SW", "SSW",
              "S", "SSE", "SE", "ESE"]

wind_ang = {k: np.pi/8*i for i,k in enumerate(wind_order)}

def get_cos(dir):
  if dir in set(["VAR", "CALM", "missing"]): 
    return 0
  else:
    return np.cos(wind_ang[dir])

def get_sin(dir):
  if dir in set(["VAR", "CALM", "missing"]): 
    return 0
  else:
    return np.sin(wind_ang[dir])
  
X_train = X_train.assign(Wind_VAR = lambda x:  X_train["Wind"].apply(was_var))
X_test = X_test.assign(Wind_VAR = lambda x: X_test["Wind"].apply(was_var))

X_train = X_train.assign(Wind_COS = lambda x:  X_train["Wind"].apply(get_cos))
X_test = X_test.assign(Wind_COS = lambda x: X_test["Wind"].apply(get_cos))

X_train = X_train.assign(Wind_SIN = lambda x:  X_train["Wind"].apply(get_sin))
X_test = X_test.assign(Wind_SIN = lambda x: X_test["Wind"].apply(get_sin))

### Encoding the categorical variables

The categorical variables, except for `Condition`, are pretty simple.
They tend to have a lot of unique values, so we're simply going to use `sklearn`'s `OrdinalEncoder`.
We have to use the `handle_unknown="use_encoded_value"` parameter, since there are tail numbers in the test set that aren't present in the training set.
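
The effect of this parameter is easy to see on a tiny example. In the sketch below the tail numbers are made up; a category that was not seen during fitting is mapped to the value given by `unknown_value`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical tail numbers; "N999ZZ" never appears in the training column.
train_col = np.array([["N101AA"], ["N202BB"], ["N101AA"]])
test_col = np.array([["N202BB"], ["N999ZZ"]])

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_col)

# Known categories get their ordinal code; the unseen one gets -1.
encoded_test = encoder.transform(test_col)
print(encoded_test.ravel())
```

Without `handle_unknown`, the default behaviour is to raise an error when `transform` meets an unseen category.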

We're also standardizing the rest of the columns here, except for `MONTH`, `DAY_OF_WEEK`, `DAY_OF_MONTH` and the variables we just created.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose       import ColumnTransformer

# changed order of feature names
diff = lambda l1,l2: [x for x in l1 if x not in l2]

to_encode = categorical
to_omit = ["MONTH", 
           "DAY_OF_MONTH", 
           "DAY_OF_WEEK", 
           "Binary_Humidity",	
           "Binary_sch_arr", 
           "Binary_CRS_ARR_M", 
           "Classes_CRS_ELAPSED_TIME", 
           "Wind_NA",
           "Wind_VAR",
           "Wind_COS",
           "Wind_SIN"]

to_scale = diff(diff(X_train.columns, to_encode), to_omit)

names = to_encode + to_scale + to_omit

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scl = StandardScaler()
ct = ColumnTransformer([("encode", enc, to_encode), ("scale", scl, to_scale)], remainder="passthrough")

ct.fit(X_train)
X_train = pd.DataFrame(ct.transform(X_train), columns=names)
X_test = pd.DataFrame(ct.transform(X_test), columns=names)

In [None]:
names

### Month column

We have three values in the `MONTH` column: `1` for January, `11` for November, and `12` for December.
However, the chronological order is November, December, January.
Also, January comes right after December, but in our data the value assigned to December is much bigger than the value assigned to January.
We might benefit from transforming this column in the following fashion:
we'll assign `1` to November, `2` to December, and `3` to January.

In [None]:
def month_transformer(x):
  if x == 1:  return 3
  if x == 11: return 1
  if x == 12: return 2

X_train = X_train.assign(MONTH = lambda x: X_train["MONTH"].apply(month_transformer))
X_test = X_test.assign(MONTH = lambda x: X_test["MONTH"].apply(month_transformer))

Let's see if the linear regression model achieves a lower MSE after feature engineering.

In [None]:
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)

mean_squared_error(lr_preds, y_test)

It is smaller by just a tiny bit.

# Outlier detection

We're going to use a couple of automatic outlier detection algorithms and visualize the outcome.

Notice that the PCA visualization will differ from the one we've seen before, as we've encoded categorical features.

### Isolation Forest

The `threshold` parameter, below which observations are considered outliers, was chosen by trial and error: I tried several values and picked the one that seemed most reasonable.
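
A possible alternative to a hand-picked threshold is to derive it from a quantile of the scores, which fixes the share of flagged points instead of the cut-off itself. This is only a sketch on synthetic data; the 2% share is an assumption, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense cloud plus a few far-away points.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(300, 2))
far_points = rng.uniform(low=8.0, high=10.0, size=(6, 2))
points = np.vstack([inliers, far_points])

forest = IsolationForest(n_estimators=100, random_state=0).fit(points)
scores = forest.score_samples(points)  # lower score = more anomalous

# Assumed rule: flag the lowest 2% of scores as outliers.
threshold = np.quantile(scores, 0.02)
is_outlier = scores < threshold
print(threshold, is_outlier.sum())
```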

In [None]:
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators = 250, random_state=seed)
iforest.fit(X_train)
if_scores = iforest.score_samples(X_train)

def set_labels(x, threshold):
  if x < threshold: return "Outlier"
  else:             return "Inlier" 

set_labels = np.vectorize(set_labels)

In [None]:
steps2d = [("scaler", StandardScaler()), ("PCA", PCA(n_components=2))]
pca2d = Pipeline(steps2d)
pca2d_dt = pca2d.fit_transform(X_train)
pca2d_dt = pd.DataFrame(pca2d_dt)

plt.figure(figsize=(10,10))
g = sns.scatterplot(x=pca2d_dt[0], y=pca2d_dt[1], hue=set_labels(if_scores, -0.595))
g.set(xlabel="PCA1", ylabel="PCA2", title="Isolation Forest outlier detection")

### Local Outlier Factor

In [None]:
from sklearn.neighbors import LocalOutlierFactor

lofact = LocalOutlierFactor()
lofact.fit(X_train)
lof_scores = lofact.negative_outlier_factor_

plt.figure(figsize=(10,10))
g = sns.scatterplot(x=pca2d_dt[0], y=pca2d_dt[1], hue=set_labels(lof_scores, -1.4))
g.set(xlabel="PCA1", ylabel="PCA2", title="Local Outlier Factor outlier detection")

Both algorithms found a few observations that can be considered outliers.
We're going to simply remove them from the training data.

In [None]:
def mask_outliers(x, threshold):
  if x < threshold: return True
  else:             return False

mask_outliers = np.vectorize(mask_outliers)

mask_if = mask_outliers(if_scores, -0.6)
mask_lof = mask_outliers(lof_scores, -1.25)

mask = mask_if | mask_lof

X_train = X_train.drop(X_train.loc[mask].index)
y_train = y_train.drop(y_train.loc[mask].index)


# Building the regressor

### Model selection

The models we're going to try out are:
- random forest
- extra-trees regressor
- gradient boosting
- support vector machine
- elastic net
- multilayer perceptron
- gaussian process regression
- gaussian naive bayes

We're going to use the MSE metric.
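
In scikit-learn's cross-validation helpers, MSE is exposed as the scorer `neg_mean_squared_error`: scorers follow a "higher is better" convention, so the error is returned with a negative sign and has to be flipped back. A minimal sketch on a synthetic regression problem (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# scikit-learn scorers follow "higher is better", so MSE is reported negated.
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          cv=5, scoring="neg_mean_squared_error")

# Flip the sign to get the usual non-negative MSE per fold.
mse_per_fold = -neg_mse
print(mse_per_fold.mean())
```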


In [None]:
from sklearn.ensemble         import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model     import ElasticNet
from sklearn.svm              import SVR
from sklearn.naive_bayes      import GaussianNB
from sklearn.neural_network   import MLPRegressor

models = {
    "gb": GradientBoostingRegressor(random_state=seed),
    "ranger": RandomForestRegressor(random_state=seed),
    "extra": ExtraTreesRegressor(random_state=seed),
    "enet": ElasticNet(),
    "svm": SVR(),
    "bayes": GaussianNB(),
    "mlp": MLPRegressor(random_state=seed)
}

Let's start by testing each model with five-fold cross-validation on the training set.


In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = {key: -cross_val_score(clf, X_train, y_train, cv=5, scoring="neg_mean_squared_error") for key, clf in models.items()}
pd.DataFrame(cv_scores)

Let's also take a look at the means.

In [None]:
cv_scores_means = {key: np.mean(values) for key, values in cv_scores.items()}
cv_scores_means

From now on, let's focus on the extra trees and random forest models, since they achieved the lowest MSE.

### Hyperparameter tuning

Below, we're defining a search space of parameters. 

This is just a placeholder search space for demonstration purposes.
It should definitely be adjusted before actual tuning.

In [None]:
search_space = {
  "extra": {
      "n_estimators": [750],
      "min_samples_split": [2],
      # "auto" was removed in scikit-learn 1.3; for regressors it meant "use all features"
      "max_features": [1.0],
      "ccp_alpha": [0]
  },
  "ranger": {
      "n_estimators": [750],
      "min_samples_split": [2],
      "max_features": [1.0],
      "ccp_alpha": [0]
  }
}

Now, let's use `GridSearchCV` to tune the models.
Note that for search spaces wider than the one we're dealing with here, `RandomizedSearchCV` would likely be the better choice.
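
For reference, a randomized search could look roughly like the sketch below. It runs on a small synthetic problem, and the parameter distributions, `n_iter` and `cv` values are assumptions chosen only to keep the example fast.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search runs quickly.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.2, size=150)

# Assumed distributions, chosen for illustration only.
param_distributions = {
    "n_estimators": randint(20, 60),
    "min_samples_split": randint(2, 10),
}

random_search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_distributions,
    n_iter=4,                          # number of sampled configurations
    cv=3,
    scoring="neg_mean_squared_error",  # negated MSE, higher is better
    random_state=0,
)
random_search.fit(X_toy, y_toy)
print(random_search.best_params_, -random_search.best_score_)
```

Unlike a grid, the cost here is controlled by `n_iter` rather than by the size of the search space.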

In [None]:
from sklearn.model_selection import GridSearchCV

rscv = [GridSearchCV(models[name], search_space[name]) for name in ["extra", "ranger"]]
search = [rs.fit(X_train, y_train) for rs in rscv]

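These are the scores. Since no `scoring` argument was passed, `GridSearchCV` falls back to the regressor's default `score` method, so these are $R^2$ values (higher is better) rather than MSE: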

In [None]:
pd.DataFrame({"extra": [search[0].best_score_], "ranger": [search[1].best_score_]})

We use the better model for the final prediction.

In [None]:
clf_final = search[0].best_estimator_
clf_final.fit(X_train, y_train)
preds = clf_final.predict(X_test)

MSE = mean_squared_error(preds, y_test)

In [None]:
MSE