<img src="../images/logo.png" align='right' width=250px>

# Feature Selection as a Transformer

In this notebook we will create a custom transformer to perform feature selection.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import mean_absolute_error, r2_score


Consider this dataset of food recipes. In total, it has 445 features - way too many! 

![](https://i.pinimg.com/236x/83/d8/52/83d8524ad3deddf5e7045ba795eb3b98--cartoon-cooking-food-doodles.jpg)


In [None]:
food_df = pd.read_csv("../data/food_recipes.csv")
food_df.head()

In [None]:
y = food_df.dropna()["calories"].reset_index(drop=True)
X = (
    food_df.dropna()
    .drop(["calories", "title", "rating"], axis="columns")
    .reset_index(drop=True)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=111
)

### The Baseline Model

Create a modelling pipeline for this dataset. 
* Use `VarianceThreshold` (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) for usage) to eliminate features with little variance. A good threshold would be ~0.05.
* Use a scaling technique. 
* Choose a regression model, preferably one sensitive to irrelevant features, e.g. a k-nearest neighbors regressor.
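
As a rough orientation, the three steps above could be chained as in the sketch below. It runs on synthetic data rather than the recipe dataset; the names `X_demo`, `y_demo` and `baseline_sketch` and the choice of `n_neighbors=5` are assumptions made for illustration, not the reference answer.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the recipe data: 200 rows, 6 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
X_demo[:, 5] = 1.0  # A constant column, i.e. zero variance.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Variance filter -> scaling -> regressor, in that order.
baseline_sketch = Pipeline(
    steps=[
        ("variance", VarianceThreshold(threshold=0.05)),
        ("scaler", MinMaxScaler()),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ]
)
baseline_sketch.fit(X_demo, y_demo)

# Number of features that survived the variance filter.
n_kept = int(baseline_sketch.named_steps["variance"].get_support().sum())
print(f"features kept: {n_kept}")
```

On the real data you would fit on `X_train`, `y_train` and compute `y_pred` from `X_test`.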

In [None]:
# Your code here.

In [None]:
# %load ../answers/variance_threshold.py

In [None]:
# Calculate scores.
r2 = r2_score(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)

# Report.
print(f"r2 score: {r2:.3f}")
print(f"MAE: {MAE:.3f}")

### The Assignment

The goal is to use only those features that pass the `f_regression` test. The code that did this previously is shown below. Keep in mind, however, that it is only compatible with Pandas: while an individual transformer can be compatible with Pandas, a transformer used in a pipeline receives the output of the previous transformer, i.e. a numpy array.
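
To see why this matters, the small check below passes a DataFrame through `MinMaxScaler` with scikit-learn's default output settings. The toy frame `df` is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A tiny DataFrame with named columns.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# With default settings, the transformer returns a plain numpy array.
scaled = MinMaxScaler().fit_transform(df)

print(type(scaled).__name__)
print(hasattr(scaled, "columns"))
```

Because the column labels are gone, anything that relies on `.columns` or `.drop(...)` fails on the next step's input.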


* Create a feature selection transformer based on the `f_regression` test. 
* Make sure the transformer is compatible with `numpy` arrays, not only Pandas DataFrames.
* Let the threshold (0.05 originally) be a tunable hyperparameter.
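
One possible shape for such a transformer is sketched below, under a few assumptions: the class name `PValueSelectorSketch` and the attribute `support_mask_` are hypothetical, and the reference answer may be organised differently. The idea is to store a boolean mask in `fit` and index by position in `transform`, so no column labels are needed.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import f_regression


class PValueSelectorSketch(BaseEstimator, TransformerMixin):
    """Keep the features whose f_regression p-value is <= threshold."""

    def __init__(self, threshold=0.05):
        # Stored under the same name so it can be tuned like any hyperparameter.
        self.threshold = threshold

    def fit(self, X, y):
        # np.asarray accepts both DataFrames and arrays.
        X = np.asarray(X)
        _, p_values = f_regression(X, y)
        # Boolean mask by position; an undefined (NaN) p-value compares
        # False here, so such a feature is dropped in this sketch.
        self.support_mask_ = p_values <= self.threshold
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_mask_]


# Synthetic check: only the first column carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

selector = PValueSelectorSketch(threshold=0.05).fit(X_demo, y_demo)
X_reduced = selector.transform(X_demo)
print(selector.support_mask_, X_reduced.shape)
```

Inheriting from `BaseEstimator` provides `get_params` and `set_params`, which is what lets a pipeline or grid search vary `threshold`.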


In [None]:
# OLD CODE.
from sklearn.feature_selection import f_regression

# Perform test.
_, p_values = f_regression(X_train, y_train)

# Columns to drop.
columns_to_drop = X_train.columns[p_values > 0.05]

# Drop the non-significant columns from both sets.
X_train_reduced = X_train.drop(columns_to_drop, axis="columns")
X_test_reduced = X_test.drop(columns_to_drop, axis="columns")

In [None]:
# YOUR CODE.

In [None]:
# %load ../answers/feature_selector.py

In order to test your code, create a pipeline just like the one before that includes your transformer. 

Experiment with different values for the threshold parameter. What value gives the best test performance?
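
Rather than trying values by hand, the threshold can also be searched systematically. The sketch below is an assumption-laden illustration: it uses scikit-learn's built-in `SelectFpr` (which keeps features with a p-value below `alpha`) as a stand-in for your own transformer, runs on synthetic data, and scores with cross-validation on the training data instead of the test set, so that the test set stays untouched for the final evaluation.

```python
import numpy as np
from sklearn.feature_selection import SelectFpr, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: two informative features, six noise features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 8))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

search_pipeline = Pipeline(
    steps=[
        # Stand-in selector; swap in your own transformer here.
        ("selector", SelectFpr(score_func=f_regression)),
        ("scaler", MinMaxScaler()),
        ("knn", KNeighborsRegressor()),
    ]
)

# "<step name>__<parameter name>" addresses a parameter inside the pipeline.
param_grid = {"selector__alpha": [0.001, 0.01, 0.05, 0.1, 0.5]}

search = GridSearchCV(
    search_pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With your own transformer the grid key would use its parameter name instead, e.g. `selector__threshold` if the parameter is called `threshold`.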

In [None]:
# Your pipeline with feature selection transformer.

In [None]:
# Calculate scores.
r2 = r2_score(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)

# Report.
print(f"r2 score: {r2:.3f}")
print(f"MAE: {MAE:.3f}")