<center><img src=img/MScAI_brand.png width=70%></center>

# Scikit-Learn: Feature Selection/Engineering

* Feature selection: we choose a subset of existing features
* Feature engineering: we construct new features from existing data

In Scikit-Learn, both are implemented as *transformers*: objects with a `transform(X)` method, usually called after `fit(X)`. Optionally, `fit_transform(X)` is a shortcut that does both in one call.
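
As a minimal sketch of the transformer API, the snippet below uses `MinMaxScaler` purely as an illustrative transformer on made-up data (the variable names are ours). It checks that `fit` followed by `transform` gives the same result as `fit_transform`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, invented for illustration: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])

# Two-step version: learn the per-feature min and max, then apply them
scaler = MinMaxScaler()
scaler.fit(X_demo)
X_two_step = scaler.transform(X_demo)

# One-step shortcut on a fresh transformer
X_one_step = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(X_two_step, X_one_step))
```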

### Feature selection

* **Filter**: calculate a statistic per feature and choose those above a threshold.
* **Wrapper**: try different subsets and see which gives best performance

Reference:
* https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection



### Filter approach


In [4]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0.5, 1.0], [0.3, 1.0], [0.1, 1.0], 
              [0.9, 1.0], [0.8, 1.0]])
print(X)

[[0.5 1. ]
 [0.3 1. ]
 [0.1 1. ]
 [0.9 1. ]
 [0.8 1. ]]


`VarianceThreshold` throws away features with too little variance. By default, only features with zero variance (constant columns) are thrown away.

In [23]:
sel = VarianceThreshold() 
sel.fit_transform(X)

array([[0.5],
       [0.3],
       [0.1],
       [0.9],
       [0.8]])
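
A non-default threshold can be sketched as follows. The third column and the value `0.01` are invented for illustration; `variances_` and `get_support()` show what was measured and which columns survive.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same two columns as above, plus an invented, nearly constant third column
X_var = np.array([[0.5, 1.0, 0.0],
                  [0.3, 1.0, 0.0],
                  [0.1, 1.0, 0.1],
                  [0.9, 1.0, 0.0],
                  [0.8, 1.0, 0.0]])

# Keep only features whose variance is above 0.01 (an arbitrary choice)
sel_var = VarianceThreshold(threshold=0.01)
X_kept = sel_var.fit_transform(X_var)

print(sel_var.variances_)     # per-feature variance seen during fit
print(sel_var.get_support())  # boolean mask of the features kept
print(X_kept.shape)
```

Here only the first column is kept: the second is constant and the third varies too little.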

The *chi-squared* ($\chi^2$) score measures the dependence between a non-negative feature (such as a count or frequency) and a discrete target. 

<center><img src=img/feature_selection.svg width=55%></center>

Higher values indicate more informative features. So, it's natural to throw away low-valued features. Conceptually, we sort and then impose a threshold, chosen to keep the best `k` features.


<center><img src=img/feature_selection2.svg width=55%></center>

In [1]:
# https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target

In [2]:
sel = SelectKBest(chi2, k=2).fit(X, y)
X_new = sel.transform(X)
print("Scores", sel.scores_)
print("Shapes", X.shape, X_new.shape)

Scores [ 10.81782088   3.7107283  116.31261309  67.0483602 ]
Shapes (150, 4) (150, 2)
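
To see *which* features were kept rather than just how many, one option is the selector's `get_support()` mask. This is a small sketch; the variable names are ours.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris_data = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris_data.data, iris_data.target)

# Boolean mask over the original features: True means "kept"
mask = selector.get_support()
kept = [name for name, keep in zip(iris_data.feature_names, mask) if keep]
print(kept)
```

The two petal measurements are kept, matching the two highest scores printed above.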


### Feature selection

Two main approaches:

* **Filter**: calculate a statistic per feature and choose those above a threshold (**done**)
* **Wrapper**: try different subsets and see which gives best performance (**see exercise**)


### Feature engineering

* Scaling
* Missing values
* One-hot encoding
* Arithmetic feature transformations
* Text features
* Image features

### Scaling

Some ML methods work better if features are normalised to $[0, 1]$ or standardised to have mean 0 and standard deviation 1. For the latter, Scikit-Learn provides `StandardScaler`. The calculation is simple: $(X - \bar{X}) / \sigma(X)$, applied to each feature separately. 

<center><img src=img/data-leak.jpg width=15%><font size=1><a href="https://meterpreter.org/us-postal-service-website-vulnerability-leaked-60-million/">meterpreter.org</a></font></center>

A slight complication is the rule that we must not leak information about the test set into our training procedure. So, we first calculate the mean and std of the train set, and use them to transform the train set. We then use the same values to transform the test set. We never calculate the mean and standard deviation of the test set!

In [5]:
from sklearn.preprocessing import StandardScaler
X_train = np.array([0, 4, 1, 6, 7, 8, 5, 9.0]
                  ).reshape(-1, 1)
X_test = np.array([3.3, 4.5, 5.5]
                 ).reshape(-1, 1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# do not fit on X_test!
X_test = scaler.transform(X_test) 

As a result, `X_train` is now standardised:

In [6]:
X_train

array([[-1.66666667],
       [-0.33333333],
       [-1.33333333],
       [ 0.33333333],
       [ 0.66666667],
       [ 1.        ],
       [ 0.        ],
       [ 1.33333333]])

But `X_test` will generally not have exactly zero mean and unit variance, because it was scaled using the train set's statistics:

In [7]:
X_test

array([[-0.56666667],
       [-0.16666667],
       [ 0.16666667]])
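
As a check on what the scaler does, we can reproduce the test-set transformation by hand from the train-set mean and standard deviation. This is a sketch with our own variable names; `mean_` and `scale_` are the values the fitted scaler stores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([0, 4, 1, 6, 7, 8, 5, 9.0]).reshape(-1, 1)
test = np.array([3.3, 4.5, 5.5]).reshape(-1, 1)

sc = StandardScaler().fit(train)
print(sc.mean_, sc.scale_)  # train mean is 5, train std is 3

# Manual version: subtract the TRAIN mean, divide by the TRAIN std
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(manual_test, sc.transform(test)))
```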

### Imputing missing values

It's common to have missing values in our data:

In [9]:
X = np.array([0, 4, 1, 6, 7, np.nan, 5, 9.0]
            ).reshape(-1, 1)
X

array([[ 0.],
       [ 4.],
       [ 1.],
       [ 6.],
       [ 7.],
       [nan],
       [ 5.],
       [ 9.]])

A common strategy is just to impute the mean of the values present in the column. 

In [10]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2

array([[0.        ],
       [4.        ],
       [1.        ],
       [6.        ],
       [7.        ],
       [4.57142857],
       [5.        ],
       [9.        ]])
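
The imputer is also a fitted transformer, so the same no-leak rule applies: the fill value is learned from the train set and reused on new data. In this sketch, `X_te` is an invented test column and the median variant is shown only for comparison.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_tr = np.array([0, 4, 1, 6, 7, np.nan, 5, 9.0]).reshape(-1, 1)
X_te = np.array([np.nan, 2.0]).reshape(-1, 1)  # invented test values

imputer = SimpleImputer(strategy='mean').fit(X_tr)
print(imputer.statistics_)  # the learned fill value: mean of the 7 present values

# Missing test values are filled with the TRAIN mean
X_te_filled = imputer.transform(X_te)
print(X_te_filled)

# Alternative strategy: the median is less sensitive to outliers
median_imputer = SimpleImputer(strategy='median').fit(X_tr)
print(median_imputer.statistics_)
```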

### Arithmetic feature transformations

<center><img src=img/arithmetic-feature-transformation.png width=50%></center> 
<font size=2>Derived from PDSH; code in `code/make_arithmetic_transformation_plot.py`</font>

Suppose we have data like the above. We'll find that linear regression $y = a + b_1x$ doesn't model it well (left). But if we added the feature $x^2$ to give the model $y = a + b_1x + b_2x^2$, we could find a good fit (right)!

The same idea can in principle work for $x^3$ and higher. These are called *polynomial features*. 

In [11]:
from sklearn.preprocessing import PolynomialFeatures
X = np.array([0, 1.5, 2, 4, 4.5, 5, 6, 7, 8]
            ).reshape(-1, 1)
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)
print(X2)

[[  0.      0.      0.   ]
 [  1.5     2.25    3.375]
 [  2.      4.      8.   ]
 [  4.     16.     64.   ]
 [  4.5    20.25   91.125]
 [  5.     25.    125.   ]
 [  6.     36.    216.   ]
 [  7.     49.    343.   ]
 [  8.     64.    512.   ]]
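
To connect polynomial features back to regression, here is a sketch that fits a plain linear model and a degree-2 model to the same inputs. The target `y` is invented (exactly quadratic) so that the effect is easy to see; it is not the data in the figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.array([0, 1.5, 2, 4, 4.5, 5, 6, 7, 8]).reshape(-1, 1)
# Hypothetical target, invented for illustration: y = 1 + 2x - 0.5x^2
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2

# Model 1: y = a + b1*x
linear = LinearRegression().fit(x, y)

# Model 2: y = a + b1*x + b2*x^2, with the x^2 column added by the pipeline
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

# R^2 on the training data for each model
print(linear.score(x, y), quadratic.score(x, y))
```

The quadratic pipeline recovers the invented relationship, while the straight line cannot.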


### One-hot Encoding

In *one-hot encoding*, we convert a single categorical feature `f` with $n$ levels to $n$ binary features `f0`, `f1`, etc:

`f` | `f0` | `f1` | `f2`
----|------|------|-----
 `a`|  1   |   0  |  0
 `b`|  0   |   1  |  0 
 `a`|  1   |   0  |  0
 `c`|  0   |   0  |  1

Of course, Scikit-Learn provides that for us:

In [1]:
from sklearn.preprocessing import OneHotEncoder
f = [["a"], ["b"], ["a"], ["c"]]
# .toarray() gives a dense array; the default output is a sparse matrix
OneHotEncoder().fit_transform(f).toarray()

array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])
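
The encoder learns its categories during `fit`, which matters when new data contains a level it has not seen. The sketch below uses `handle_unknown="ignore"` as one possible policy; the unseen level `"d"` is invented for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

f_train = [["a"], ["b"], ["a"], ["c"]]

# "ignore": an unseen level becomes an all-zero row instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore").fit(f_train)
print(enc.categories_)  # the levels learned during fit

# "b" was seen during fit; "d" was not
encoded = enc.transform([["b"], ["d"]]).toarray()
print(encoded)
```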

### `DictVectorizer`

`DictVectorizer` converts a dataset in the form of a list of dicts to a nice rectangular array. It does one-hot encoding on any fields which need it.

In [2]:
# from PDSH 
# https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html
data = [ 
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

In [6]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
vec.get_feature_names_out()

### Text features

When we have text data, we need even more work to convert to rectangular data. Sophisticated methods are covered in the NLP module. 

Here we'll see a simple TF-IDF approach based on [PDSH](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html#Text-Features). 

It assumes that we have a list of strings, and we want to convert each string to a row in an (unlabelled) rectangular dataset.

In [3]:
# from PDSH
sample = ['problem of evil',
          'evil queen',
          'horizon problem']

In [4]:
from sklearn.feature_extraction.text \
    import TfidfVectorizer
import pandas as pd
vec = TfidfVectorizer()
X = vec.fit_transform(sample) 
# get_feature_names_out() replaces get_feature_names() (scikit-learn >= 1.0)
pd.DataFrame(X.toarray(),  
             columns=vec.get_feature_names_out())

|   | evil | horizon | of | problem | queen |
|---|------|---------|----|---------|-------|
| 0 | 0.517856 | 0.0 | 0.680919 | 0.517856 | 0.0 |
| 1 | 0.605349 | 0.0 | 0.0 | 0.0 | 0.795961 |
| 2 | 0.0 | 0.795961 | 0.0 | 0.605349 | 0.0 |
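
Like the other transformers, the vectorizer learns its state (here, the vocabulary and the IDF weights) in `fit` and reuses it in `transform`. A small sketch: the new sentence is invented, and the word "dragon" is deliberately outside the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['problem of evil', 'evil queen', 'horizon problem']
tfidf = TfidfVectorizer().fit(corpus)

# The learned vocabulary maps each word to a column index
print(sorted(tfidf.vocabulary_))

# Transform an unseen sentence: words outside the vocabulary are dropped
new_row = tfidf.transform(['evil horizon dragon']).toarray()
print(new_row)
```

The new row has the same five columns as before, with non-zero entries only for "evil" and "horizon".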


### Image features

Scikit-Learn provides some tools for [image feature extraction](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.image), but we will not cover them. It's usually better to use a convolutional neural network, which is outside our scope here.

### Conclusion

Scikit-Learn gives us lots of methods for feature **selection** and feature **engineering**, and we've seen a wide sample of the most important and simplest ones.

A nice thing about the Scikit-Learn API is that the same code can be used for training a model on a dataset, regardless of whether or how we have done feature selection or feature engineering on that dataset.
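
One way to see this is with a pipeline, sketched below under our own choice of steps (`SelectKBest`, `StandardScaler`, `LogisticRegression`) and an arbitrary `random_state`. The loop that fits and scores is identical for the plain model and for the model with selection and scaling in front of it.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=0)

models = {
    "plain": LogisticRegression(max_iter=1000),
    "selected+scaled": make_pipeline(
        SelectKBest(chi2, k=2),   # feature selection
        StandardScaler(),         # feature engineering
        LogisticRegression(max_iter=1000),
    ),
}

# Same fit/score code for both; the pipeline fits its transformers
# on the train set only, so nothing leaks from the test set
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
print(scores)
```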

In [24]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
X = np.array([[0.1, "red"  ], 
              [0.2, "blue" ], 
              [0.1, "green"], 
              [0.3, "blue" ]])
ohe = OneHotEncoder().fit(X)
print(ohe.transform(X).toarray())
# get_feature_names_out() replaces get_feature_names() (scikit-learn >= 1.0)
print(ohe.get_feature_names_out())

[[1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 0.]
 [0. 0. 1. 1. 0. 0.]]
['x0_0.1' 'x0_0.2' 'x0_0.3' 'x1_blue' 'x1_green' 'x1_red']


### Exercises


**Exercise 1**. We saw polynomial features above, but degrees higher than 2 are often hard to justify (multiple "turns") and could lead to overfitting. Perhaps more reasonable transformations are things like $\log(x)$, $e^x$, and $\sqrt{x}$. To do these transformations, we would use [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer). Look this up and use it to run a square-root transform on this data:

```python
X = np.array([0, 1.5, 2, 4, 4.5, 5, 6, 7, 8]
            ).reshape(-1, 1)
```

**Exercise 2** (background). The *wrapper* approach to feature selection is to try different subsets and see which gives best performance when used inside the ML model we want to use them in. There are at least three approaches:

* Forward
* Backward
* Metaheuristic

In the *forward* approach, we start with 0 features, and try adding one at a time, re-training many times.

In the *backward* approach, also known as *recursive feature elimination*, we train with all features, and try removing one at a time, re-training many times. One way to decide what to eliminate is to use coefficient (`coef_`) values of fitted regression models. Some decision tree models provide `feature_importances_` and these can be used instead of `coef_`.

In the *metaheuristic* approach, we use a search algorithm like a genetic algorithm to try out different subsets. (Full-time MScAI students may study this in CT5141 Optimisation in Semester 2.)


**Exercise 2**. Run this:
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor
iris = load_iris()
X, y = iris.data, iris.target
help(SelectFromModel)
```
and then implement recursive feature elimination on `iris` using `SelectFromModel`.

**Exercise 3**. In the TF-IDF vectorizer, why did we need `X.toarray()`? What was `X` before that? Why does the `TfidfVectorizer` choose to return results in that format?

**Solution 1**.

In [18]:
from sklearn.preprocessing import FunctionTransformer
# the data from Exercise 1
X = np.array([0, 1.5, 2, 4, 4.5, 5, 6, 7, 8]
            ).reshape(-1, 1)
func_trans = FunctionTransformer(np.sqrt, validate=False)
X2 = func_trans.fit_transform(X)
print(X2)

[[0.        ]
 [1.22474487]
 [1.41421356]
 [2.        ]
 [2.12132034]
 [2.23606798]
 [2.44948974]
 [2.64575131]
 [2.82842712]]


Here, we transformed all of `X`. It's a bit more complicated to transform just one column and leave other columns alone, so we won't cover that. 

<center><img src=img/data-leak.jpg width=15%><font size=1><a href="https://meterpreter.org/us-postal-service-website-vulnerability-leaked-60-million/">meterpreter.org</a></font></center>

Transformations like $x^2$ and $\sqrt{x}$ are *stateless*: `fit` learns nothing from the data, so there is no risk of leaking information from the test set into our training, and we can carry out these transformations on the whole dataset up-front. One-hot encoding is nearly so: it learns only the set of categories present, not any statistics of the values.

**Solution 2**.

In [10]:
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape) # check X shape for later comparison

(150, 4)


In [32]:
clf = DecisionTreeRegressor() # usual workflow
clf = clf.fit(X, y)
# special attribute available after fitting 
# see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
print(clf.feature_importances_) 
# prefit=True: we have already fit()ted
model = SelectFromModel(clf, prefit=True)
# remember: feature selection/engineering is a transformer
X_new = model.transform(X)
print(X_new.shape)

[0.         0.00666667 0.78202798 0.21130535]
(150, 1)
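
`SelectFromModel` makes a single cut based on one fit. For comparison, Scikit-Learn also provides an `RFE` class that repeats the fit-and-drop step described in the exercise background. A minimal sketch, where keeping 2 features and the `random_state` are our arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

X_iris, y_iris = load_iris(return_X_y=True)

# Refit the tree repeatedly, dropping the least important feature
# each round (step=1) until 2 features remain
rfe = RFE(DecisionTreeRegressor(random_state=0),
          n_features_to_select=2, step=1)
X_rfe = rfe.fit_transform(X_iris, y_iris)

print(rfe.support_)  # mask of the features kept
print(rfe.ranking_)  # 1 = kept; larger = eliminated earlier
print(X_rfe.shape)
```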


**Solution 3**. `X` alone is a *sparse* matrix:

In [16]:
vec = TfidfVectorizer()
X = vec.fit_transform(sample) 
X

<3x5 sparse matrix of type '<class 'numpy.float64'>'
	with 7 stored elements in Compressed Sparse Row format>

This is chosen because conceptually, the result of TF-IDF vectorisation is an array of mostly zeros: most sentences contain a very small sample of all possible words, and the TF-IDF value is 0 for a word not present in a sentence. So, we save a huge amount of space by storing in "Compressed Sparse Row" format, which just stores the non-zero values. Conceptually that could be a list of tuples:

```python
[ 
    ("evil", 0, 0.517856),
    ("horizon", 2, 0.795961),
    ...
]
```

Or it could be stored as a dict:

```python
{
    ("evil", 0): 0.517856,
    ("horizon", 2): 0.795961,
    # ...
}
```
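
The actual Compressed Sparse Row layout is slightly different from both pictures: it keeps three flat arrays. The sketch below builds a small CSR matrix with SciPy from a dense array whose values are rounded, illustrative stand-ins for the TF-IDF table, and prints those arrays alongside the "list of tuples" view.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rounded stand-in for the 3x5 TF-IDF matrix (values are illustrative)
dense = np.array([[0.5, 0.0, 0.7, 0.5, 0.0],
                  [0.6, 0.0, 0.0, 0.0, 0.8],
                  [0.0, 0.8, 0.0, 0.6, 0.0]])
sparse = csr_matrix(dense)

print(sparse.data)     # the 7 non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i+1]]

# The "list of tuples" view corresponds to coordinate (COO) format
coo = sparse.tocoo()
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(triples[:2])
```

Only 7 of the 15 entries are stored; on a realistic vocabulary of many thousands of words the saving is far larger.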