`````{note}
This lecture is going to:
`````

# Features and Feature Selection

We're now at the point where we may have many features in a dataset, or where we want to generate many features to build predictive models. We need ways to generate features systematically and to control how many of them the model uses.

As a reminder, we're currently interested in solving supervised regression problems of the form:
\begin{align*}
\hat{y}=f(\mathbf{X})
\end{align*}
where 
* $\hat{y}$ is a vector of outputs (one per data point)
* $\mathbf{X}$ is a 2D array of features (one row per data point, one column per feature)

We're only going to use linear regression models in today's lecture.
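For a linear model, $f$ is just a weighted sum of the feature columns plus an intercept. Here is a minimal sketch of the shapes involved; the toy `X`, the weights `w`, and the intercept `b` are made-up numbers for illustration, not values from the lecture.

```python
import numpy as np

# Hypothetical toy data: 4 data points (rows), 2 features (columns)
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]])

# Made-up model parameters: one weight per feature, plus an intercept
w = np.array([0.5, -2.0])
b = 1.0

# A linear model predicts one output per row of X
y_hat = X @ w + b
print(X.shape, y_hat.shape)
```

Fitting a linear regression amounts to choosing `w` and `b` so that `y_hat` is as close as possible to the observed labels.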

## Test function (Himmelblau's function)

Let's start with the test function we used in the optimization lecture:
\begin{align}
f(x, y) = (x^2+y-11)^2+(x+y^2-7)^2
\end{align}
For clarity, I'm going to write this as 
\begin{align}
y=f(x_1, x_2) &= (x_1^2+x_2-11)^2+(x_1+x_2^2-7)^2\\
&=x_1^4+x_2^4+2x_1x_2^2+2x_2x_1^2-21x_1^2-13x_2^2-14x_1-22x_2+170
\end{align}
where $y$ is the output/label, and $x_1,x_2$ are potential features (among many that we could choose). This is a nice test function because it has a closed form and is a polynomial in $x_1,x_2$. If we generate polynomial features of $x_1,x_2$ and fit a linear model to them, we should recover the coefficients in the expanded form above.

```{seealso}
https://en.wikipedia.org/wiki/Himmelblau%27s_function
```
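As a quick sanity check, the factored and expanded forms of Himmelblau's function can be compared numerically. This is a small sketch; the helper names `himmelblau` and `himmelblau_expanded` and the random test points are assumptions made for this check only.

```python
import numpy as np


def himmelblau(x1, x2):
    # Factored form, as usually written
    return (x1**2 + x2 - 11) ** 2 + (x1 + x2**2 - 7) ** 2


def himmelblau_expanded(x1, x2):
    # Expanded polynomial form, one term per candidate feature
    return (
        x1**4 + x2**4 + 2 * x1 * x2**2 + 2 * x2 * x1**2
        - 21 * x1**2 - 13 * x2**2 - 14 * x1 - 22 * x2 + 170
    )


# Compare the two forms at random points in [-5, 5] x [-5, 5]
rng = np.random.default_rng(0)
pts = rng.uniform(-5, 5, size=(1000, 2))
factored = himmelblau(pts[:, 0], pts[:, 1])
expanded = himmelblau_expanded(pts[:, 0], pts[:, 1])
print(np.max(np.abs(factored - expanded)))
```

The two forms should agree up to floating-point round-off. Each monomial in the expanded form is a polynomial feature whose coefficient a linear model could learn.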

First, let's define the range of $x_1,x_2$ values we're interested in plotting over, and turn them into a 2D grid like we did in the optimization lecture.

In [None]:
import numpy as np

x1range = np.linspace(-5, 5)
x2range = np.linspace(-5, 5)

# Make 2D arrays covering every combination of x_1 and x_2 values
x1grid, x2grid = np.meshgrid(x1range, x2range)
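To see what `np.meshgrid` produces, here is a small sketch with deliberately tiny inputs. The arrays `a` and `b` are illustrative only. In the cell above, `np.linspace` uses its default of 50 points, so `x1grid` and `x2grid` are each 50 by 50.

```python
import numpy as np

a = np.linspace(-5, 5, 3)  # 3 values along the first axis: -5, 0, 5
b = np.linspace(-5, 5, 2)  # 2 values along the second axis: -5, 5

# With the default indexing, each output has shape (len(b), len(a))
agrid, bgrid = np.meshgrid(a, b)
print(agrid.shape)
print(agrid)  # every row repeats the values of a
print(bgrid)  # every column repeats the values of b
```

Pairing `agrid[i, j]` with `bgrid[i, j]` visits every combination of the two inputs, which is what a surface plot needs.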

Now, let's define a function that returns Himmelblau's function with a little multiplicative Gaussian noise:

In [None]:
import numpy as np


def himmelblau_with_noise(x1, x2, noise=0.1, seed=42):

    # Set the numpy random seed so the results are reproducible
    np.random.seed(seed)

    # Generate the himmelblau function
    himmelblau = (x1**2 + x2 - 11) ** 2 + (x1 + x2**2 - 7) ** 2

    # Multiply by a bit of Gaussian random noise (centered at 1) and return
    noise = np.random.normal(loc=1, scale=noise, size=x1.shape)
    return himmelblau * noise
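Note that `np.random.seed` resets NumPy's global random state every time the function is called. One possible alternative, sketched here under the assumption that you would rather not touch the global state, is a local `np.random.default_rng` generator. The name `himmelblau_with_noise_rng` is hypothetical, and this variant draws different random numbers than the function above, so its samples will not match the rest of the lecture.

```python
import numpy as np


def himmelblau_with_noise_rng(x1, x2, noise=0.1, seed=42):
    # Local generator: reproducible, and leaves the global state alone
    rng = np.random.default_rng(seed)

    # Noise-free Himmelblau function
    himmelblau = (x1**2 + x2 - 11) ** 2 + (x1 + x2**2 - 7) ** 2

    # Multiplicative Gaussian noise centered at 1
    factor = rng.normal(loc=1, scale=noise, size=np.shape(x1))
    return himmelblau * factor


x1 = np.array([0.0, 1.0, -2.0])
x2 = np.array([0.0, 1.0, 3.0])
first = himmelblau_with_noise_rng(x1, x2)
second = himmelblau_with_noise_rng(x1, x2)
exact = himmelblau_with_noise_rng(x1, x2, noise=0)
print(first, exact)
```

Calling it twice with the same seed returns identical values, and `noise=0` returns the exact function.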

Finally, let's plot the function without noise, and overlay 20 noisy samples drawn at random locations in that space.

In [None]:
import plotly.graph_objects as go

fig = go.Figure(
    data=[
        go.Surface(
            x=x1range, y=x2range, z=himmelblau_with_noise(x1grid, x2grid, noise=0)
        )
    ]
)

# Generate 20 samples from the noisy Himmelblau function
nsamples = 20
X = np.random.uniform(low=-5, high=5, size=(nsamples, 2))
y = himmelblau_with_noise(X[:, 0], X[:, 1])

# Plot with plotly
fig.add_scatter3d(
    x=X[:, 0], y=X[:, 1], z=y, mode="markers", marker=dict(size=6, color="#00FF00")
)
fig.update_layout(autosize=False, width=800, height=800)
fig.show()

## Base case ($x_1$, $x_2$ as the only features)

Our goal will be to find the polynomial features that make this an easy function to fit. Before we do something interesting, let's start with the simplest thing we can try: a linear function using just $x_1$ and $x_2$ as features.

We'll use scikit-learn now that we've seen an example through the homework!