<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Feature-Selection-and-Feature-Engineering" data-toc-modified-id="Feature-Selection-and-Feature-Engineering-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Feature Selection and Feature Engineering</a></span></li><li><span><a href="#Objectives" data-toc-modified-id="Objectives-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Model-Selection" data-toc-modified-id="Model-Selection-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model Selection</a></span><ul class="toc-item"><li><span><a href="#Decisions,-Decisions,-Decisions..." data-toc-modified-id="Decisions,-Decisions,-Decisions...-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Decisions, Decisions, Decisions...</a></span></li></ul></li><li><span><a href="#Correlation-and-Multicollinearity" data-toc-modified-id="Correlation-and-Multicollinearity-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Correlation and Multicollinearity</a></span><ul class="toc-item"><li><span><a href="#Multicollinearity" data-toc-modified-id="Multicollinearity-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Multicollinearity</a></span></li></ul></li><li><span><a href="#Recursive-Feature-Elimination" data-toc-modified-id="Recursive-Feature-Elimination-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Recursive Feature Elimination</a></span><ul class="toc-item"><li><span><a href="#Recursive-Feature-Elimination-in-Scikit-Learn" data-toc-modified-id="Recursive-Feature-Elimination-in-Scikit-Learn-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Recursive Feature Elimination in Scikit-Learn</a></span></div>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Feature Selection & Feature Engineering</p>
</div>

Data Science Cohort Live NYC June 2022
<p>Phase 2: Topic 20</p>
<br>
<br>

<div align = "right">
<img src="images/flatiron-school-logo.png" align = "right" width="200"/>
</div>

In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

> We want to do our best to make good predictions

One way we can improve our model is to look closely at the data's features and then _select_ a subset of them, _create new features_ from them (called **feature engineering**), or both.

# Learning Objectives

- Use correlations and other algorithms to inform feature selection
- Address the problem of multicollinearity in regression problems
- Create new features for use in modeling
- Use `PolynomialFeatures` to build compound features

# Model Selection

Let's imagine that I'm going to try to predict wine quality based on the other features.

In [None]:
wine = pd.read_csv('data/wine.csv')

In [None]:
wine.head(10)

## Decisions, Decisions, Decisions...

Now: Which columns (predictors) should I choose? 

There are 12 predictors I could choose from. For each of these predictors, I could either use it or not use it in my model, which means that there are $2^{12} = 4096$ _different_ models I could construct! Well, okay, one of these is the "empty model" with no predictors in it. But there are still 4095 models from which I can choose.
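As a quick sanity check on that arithmetic, here is a minimal sketch that counts the candidate models. The three feature names in the small enumeration are just a hypothetical subset used for illustration.

```python
from itertools import combinations
from math import comb

n_predictors = 12

# Each predictor is either in or out of the model: 2 choices, 12 times over.
total_models = 2 ** n_predictors

# Same count, built up by subset size, leaving out the empty model (k = 0).
non_empty_models = sum(comb(n_predictors, k) for k in range(1, n_predictors + 1))

print(total_models, non_empty_models)

# With only three (hypothetical) predictors we can list every non-empty subset.
toy_features = ["alcohol", "density", "pH"]
toy_subsets = [
    subset
    for k in range(1, len(toy_features) + 1)
    for subset in combinations(toy_features, k)
]
print(len(toy_subsets))
```

Listing every subset is fine for three features, but fitting all 4095 models by hand is clearly not how we want to spend our time.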

How can I decide which predictors to use in my model?

![](images/i_choose_you.gif)

> Data scientist choosing predictors/features to use ~~in battle~~ for the model

We'll explore a few methods in the sections below.

# Correlation and Multicollinearity

Our first attempt might be just to see which features are _correlated_ with the target.

We can use the correlation metric in making a decision.

In [None]:
# Use the .corr() DataFrame method to find out about the
# correlation values between all pairs of variables!

wine.corr()

In [None]:
sns.set(rc={'figure.figsize':(8, 8)})

# Use the .heatmap function to depict the relationships visually!
sns.heatmap(wine.corr());

In [None]:
# Let's look at the correlations with 'quality'
# (our dependent variable) in particular.

wine_corrs = wine.corr()['quality'].map(abs).sort_values(ascending=False)
wine_corrs

The features have quite different correlations with the target. The larger the absolute correlation, the better we'd expect a feature to do as a linear predictor on its own. (Note that `quality` tops the list only because it is perfectly correlated with itself.)

Let's try using only a subset of the most strongly correlated features to make our model.

In [None]:
# Let's choose 'alcohol' and 'density'.

wine_preds = wine[['alcohol', 'density']]
wine_target = wine['quality']

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(wine_preds, wine_target)

In [None]:
# .score() returns R^2, here on the same data we used to fit the model
lr.score(wine_preds, wine_target)

## Multicollinearity

Multicollinearity describes the correlation between distinct predictors. Why might high multicollinearity be a problem for interpreting a linear regression model?


It's problematic for statistics in an inferential mode because, if $x_1$ and $x_2$ are highly correlated with $y$ but also *with each other*, then it will be very difficult to tease apart the effects of $x_1$ on $y$ and the effects of $x_2$ on $y$. If I really want to have a good sense of the effect of $x_1$ on $y$, then I'd like to vary $x_1$ while keeping the other features constant. But if $x_1$ is highly correlated with $x_2$ then this will be a practically impossible exercise!
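To make this concrete, here is a small simulation sketch. It uses synthetic data rather than the wine dataset, and the helper `coef_spread` and the parameter `rho` are names made up for this illustration. The idea is to refit the same two-predictor model on many bootstrap resamples and see how much the fitted coefficient of $x_1$ moves around, once when $x_1$ and $x_2$ are unrelated and once when they are almost copies of each other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200


def coef_spread(rho, n_resamples=200):
    """Std. dev. of the fitted x1 coefficient across bootstrap resamples."""
    x1 = rng.normal(size=n)
    # x2 is built so that its correlation with x1 is roughly rho
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    # In both scenarios the true effect of each predictor on y is 1
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    X = np.column_stack([x1, x2])

    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])
    return np.std(coefs)


spread_independent = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.99)

print(f"spread of x1 coefficient, independent predictors: {spread_independent:.3f}")
print(f"spread of x1 coefficient, collinear predictors:   {spread_collinear:.3f}")
```

The true effect of $x_1$ is the same in both scenarios, but when the predictors are nearly collinear the estimate of that effect is much less stable from sample to sample.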

> We will return to this topic again. For more, see [this post](https://towardsdatascience.com/https-towardsdatascience-com-multicollinearity-how-does-it-create-a-problem-72956a49058).

A further assumption for multiple linear regression is that **the predictors are independent**, in the sense that no predictor is (nearly) a linear combination of the others.

**How can I check for this?**
- Check the model Condition Number.
- Check the correlation values.
- Compute Variance Inflation Factors ([VIFs](https://www.statsmodels.org/devel/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html)).
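
As a rough sketch of the VIF check, the block below computes VIFs straight from the definition, $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The data are synthetic and the column names and the helper `vif` are made up for illustration; the `statsmodels` function linked above is the usual ready-made tool for this.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic predictors: 'a_plus_noise' is almost a copy of 'a'
a = rng.normal(size=n)
b = rng.normal(size=n)
X = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.1 * rng.normal(size=n),
})


def vif(X):
    """VIF for each column: regress it on the other columns, then 1 / (1 - R^2)."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")


vifs = vif(X)
print(vifs)
```

A VIF near 1 means the predictor is not explained by the others. A common rule of thumb treats values above 5 or 10 as a warning sign, though that threshold is a convention rather than a hard rule.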

**What can I do if it looks like I'm violating this assumption?**

- Consider dropping offending predictors
- We'll have much more to say about this topic in future lessons!

# Recursive Feature Elimination

The idea behind recursive feature elimination is to start with all predictive features and then slowly work down to a small set, at each step eliminating the feature whose coefficient is smallest in absolute value. (This comparison only makes sense if the features are on the same scale.)

That is:

1. Start with a model with _all_ $n$ predictors.
2. Find the predictor with the smallest effect (the coefficient smallest in absolute value).
3. Throw that predictor out and build a model with the remaining $n-1$ predictors.
4. Set $n = n-1$ and repeat until $n$ has the value you want!
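
The four steps above can be sketched by hand in a few lines. This is only an illustration on synthetic data, and `manual_rfe` is a made-up helper, not a scikit-learn function. By construction, only the first three columns actually drive the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 6 predictors, but y depends only on columns 0, 1, and 2
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)

# Scale first so that coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X)


def manual_rfe(X, y, n_keep):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    remaining = list(range(X.shape[1]))       # step 1: start with all columns
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))  # step 2: smallest effect
        remaining.pop(weakest)                # steps 3-4: drop it and refit
    return remaining


kept = manual_rfe(X_scaled, y, n_keep=3)
print(kept)
```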

## Recursive Feature Elimination in Scikit-Learn

In [None]:
lr_rfe = LinearRegression()
select = RFE(lr_rfe, n_features_to_select=3)

In [None]:
# Scale the predictors so that coefficient sizes are comparable across features
ss = StandardScaler()
ss.fit(wine.drop('quality', axis=1))

wine_scaled = ss.transform(wine.drop('quality', axis=1))

In [None]:
select.fit(X=wine_scaled, y=wine['quality'])

In [None]:
select.support_

In [None]:
wine.drop('quality', axis=1).head()

In [None]:
select.ranking_

The three selected features (the ones where `support_` is `True` and `ranking_` is 1) are volatile acidity, alcohol, and red_wine.
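
Rather than matching the boolean mask to the column names by eye, we can let pandas do it. The sketch below uses a synthetic DataFrame with hypothetical column names `f0` through `f4`. With the wine data, the same pattern would apply to `wine.drop('quality', axis=1).columns`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: only f1 and f3 actually influence the target
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=["f0", "f1", "f2", "f3", "f4"])
y = 4 * X["f1"] - 3 * X["f3"] + 0.1 * rng.normal(size=300)

X_scaled = StandardScaler().fit_transform(X)
select = RFE(LinearRegression(), n_features_to_select=2).fit(X_scaled, y)

# Use the boolean mask to index the column names directly
selected = X.columns[select.support_]
print(list(selected))

# Attach names to the rankings: 1 means "kept", larger means "dropped earlier"
ranking = pd.Series(select.ranking_, index=X.columns).sort_values()
print(ranking)
```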

> **Caution**: RFE is probably not a good strategy if your initial dataset has many predictors. It will likely be easier to start with a *simple* model and then slowly increase its complexity. This is also good advice for when you're first getting your feet wet with `sklearn`!

For more on feature selection, see [this post](https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2).

# Feature Engineering

> Domain knowledge can be helpful here! 🧠

In practice this aspect of data preparation can constitute a huge part of the data scientist's work. As we move into data modeling, much of the goal will be a matter of finding (**or creating**) features that are predictive of the targets we are trying to model.

There are infinitely many ways of transforming and combining a starting set of features. Good data scientists will have a nose for which engineering operations will be likely to yield fruit and for which operations won't. And part of the game here may be getting someone else on your team who understands what the data represent better than you!