# Feature Extraction


Assumptions:
- data are represented by a fixed number of features which can be binary, categorical or continuous
- finding a good data representation is very domain specific and related to available measurements
- human expertise, which is often required to convert “raw” data into a set of useful features, can be complemented by automatic feature construction and feature selection methods

We will refer to the combined application of *feature construction* and *feature selection* as **feature extraction**.

## Feature Construction

Feature construction, sometimes called *feature generation*, is a preprocessing step in our data pipeline.

### Types of Feature Construction

##### standardization and normalization

See [2-Standardization_and_Normalization.ipynb](2-Standardization_and_Normalization.ipynb) and [3-Scaling.ipynb](3-Scaling.ipynb).

##### feature discretization

The simplest case is "binning" continuous data: given the feature `age`, for example, we can create the new binary feature `is_minor`, which is true where `age < 18`.

Scikit-learn also includes:

- `preprocessing.Binarizer([threshold, copy])`: binarize data (set feature values to 0 or 1) according to a threshold
- `preprocessing.LabelBinarizer([neg_label, …])`: binarize labels in a one-vs-all fashion
- `preprocessing.LabelEncoder`: encode labels with values between 0 and n_classes-1
- `preprocessing.MultiLabelBinarizer([classes, …])`: transform between an iterable of iterables and a multilabel format
- `preprocessing.OneHotEncoder([categories, …])`: encode categorical features using a one-hot (aka one-of-K) scheme
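
As a minimal sketch of the ideas above, the following builds `is_minor` by hand and then tries three of the listed transformers. The toy `age` and `group` values are made up for illustration, and the `Binarizer` threshold of 17 assumes whole-number ages.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, LabelEncoder, OneHotEncoder

# Hypothetical toy data: one continuous feature and one categorical feature.
age = np.array([[5.0], [17.0], [18.0], [42.0], [70.0]])
group = np.array(["child", "adult", "senior", "adult"])

# Manual binning: is_minor is 1 where age < 18, else 0.
is_minor = (age < 18).astype(int)

# Binarizer maps values strictly greater than the threshold to 1.
# With whole-number ages, threshold=17 gives the complementary feature is_adult.
is_adult = Binarizer(threshold=17.0).fit_transform(age)

# LabelEncoder assigns integer codes to the sorted class names:
# adult -> 0, child -> 1, senior -> 2.
codes = LabelEncoder().fit_transform(group)

# OneHotEncoder expects a 2-D array; its default output is a sparse matrix,
# so we convert to a dense array for display.
one_hot = OneHotEncoder().fit_transform(group.reshape(-1, 1)).toarray()

print(is_minor.ravel())  # [1 1 0 0 0]
print(is_adult.ravel())  # [0. 0. 1. 1. 1.]
print(codes)             # [1 0 2 0]
print(one_hot)
```

Note that `LabelEncoder` and `LabelBinarizer` are intended for target labels, while `OneHotEncoder` is the usual choice for categorical input features.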

     
##### linear and non-linear space embedding
   
We have started to look at one technique, Principal Component Analysis.

You might explore others in the sklearn package. Have a look at [Comparison_of_Manifold_Learning_Methods.ipynb](Comparison_of_Manifold_Learning_Methods.ipynb).
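
As a reminder of what a linear embedding looks like in code, here is a small sketch using `sklearn.decomposition.PCA`. The synthetic data (200 points that vary mostly along one direction in 3-D) is an assumption chosen so that the first component captures almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: a 1-D latent signal embedded in 3-D, plus a little noise.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

# Project the 3-D data onto its first two principal components.
pca = PCA(n_components=2)
X_embedded = pca.fit_transform(X)

print(X_embedded.shape)               # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```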
     
#####  non-linear expansions
   
See [4-Simple_Polynomial_Expansion.ipynb](4-Simple_Polynomial_Expansion.ipynb). Later in the course we will talk about kernels and basis functions. 
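
For a quick feel of a non-linear expansion, the sketch below applies `sklearn.preprocessing.PolynomialFeatures` to a single made-up sample with two features. It is only an illustration and not a substitute for the linked notebook.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features, x1 = 2 and x2 = 3.
X = np.array([[2.0, 3.0]])

# A degree-2 expansion adds a bias column, the squares and the cross term.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Columns are ordered as: 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```

Two input features become six, which is why expansions of this kind enlarge the feature space quickly as the degree grows.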

---

#### Dimensionality

Some methods do not alter the space dimensionality (e.g. signal enhancement, normalization, standardization), while others enlarge it (non-linear expansions, feature discretization), reduce it (space embedding methods) or can act in either direction (extraction of local features).   
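
The effect on dimensionality is easy to check by comparing array shapes. The sketch below assumes a random 100 x 4 matrix and picks one representative transformer for each case; signal enhancement and local feature extraction are not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # hypothetical data: 100 samples, 4 features

# Standardization rescales each column but keeps the number of features.
same = StandardScaler().fit_transform(X)

# A degree-2 polynomial expansion enlarges the space (4 -> 15 columns).
larger = PolynomialFeatures(degree=2).fit_transform(X)

# A space embedding such as PCA reduces it (4 -> 2 columns).
smaller = PCA(n_components=2).fit_transform(X)

print(X.shape, same.shape, larger.shape, smaller.shape)
# (100, 4) (100, 4) (100, 15) (100, 2)
```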

The quality of feature construction largely determines the success of any subsequent statistical or machine learning analysis.

In particular, one should take care not to lose information at the feature construction stage.

It may be a good idea to add the raw features to the preprocessed data or at least to compare the performances obtained with either representation. 
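
One possible way to run such a comparison is sketched below. Everything here is an illustrative assumption: the synthetic dataset from `make_classification`, PCA as the constructed representation, logistic regression as the model, and 5-fold cross-validation as the yardstick. `FeatureUnion` concatenates the raw and constructed features side by side.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 300 samples, 10 features, 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Three candidate representations of the same data.
raw = StandardScaler()
constructed = make_pipeline(StandardScaler(), PCA(n_components=3))
combined = FeatureUnion([
    ("raw", StandardScaler()),
    ("pca", make_pipeline(StandardScaler(), PCA(n_components=3))),
])

def score(features):
    """Mean 5-fold accuracy of logistic regression on a representation."""
    model = make_pipeline(features, LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=5).mean()

scores = {
    "raw": score(raw),
    "constructed": score(constructed),
    "raw + constructed": score(combined),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Wrapping the transformers in a pipeline means they are refit on the training part of each fold, so the comparison is not contaminated by the held-out data.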
 
Adding features seems reasonable, but it comes at a price: it may increase the dimensionality of the patterns and thereby immerse the relevant information in a sea of possibly irrelevant, noisy or redundant features.

How do we know when a feature is relevant or informative? 

---

## Feature Selection

Feature selection is primarily performed to select relevant and informative features. It can also have other motivations:

- general data reduction, to limit storage requirements and increase algorithm speed
- feature set reduction, to save resources in the next round of data collection or during utilization
- performance improvement, to gain in predictive accuracy
- data understanding, to gain knowledge about the process that generated the data or simply to visualize the data

### Selecting Relevant and Informative Features

- individual feature relevance
   
- relevant features that are individually irrelevant
   - a helpful feature may be irrelevant by itself
   - two individually irrelevant features may be relevant in combination
   
   See [5-Individual_Feature_Relevance.ipynb](5-Individual_Feature_Relevance.ipynb).
   
- forward and backward procedures
   - recursive feature elimination 

- redundant features
   - eliminate noisy features
   - correlation does not imply redundancy
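
Two of the points above can be sketched in a few lines. The first part uses a made-up XOR-style dataset in which neither feature predicts the label alone, yet the pair does. The second part runs recursive feature elimination with `sklearn.feature_selection.RFE` on a separate toy problem where only the first feature carries signal. The datasets, the k-nearest-neighbours classifier and the logistic regression estimator are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Part 1: two individually irrelevant features that are relevant together.
# The label depends only on whether x1 and x2 have the same sign (XOR-like).
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)
acc_x1 = cross_val_score(knn, X[:, [0]], y, cv=5).mean()  # x1 alone
acc_x2 = cross_val_score(knn, X[:, [1]], y, cv=5).mean()  # x2 alone
acc_both = cross_val_score(knn, X, y, cv=5).mean()        # x1 and x2

print(f"x1 only: {acc_x1:.2f}, x2 only: {acc_x2:.2f}, both: {acc_both:.2f}")

# Part 2: a backward procedure, recursive feature elimination.
# Feature 0 determines the label; features 1 to 4 are pure noise.
signal = rng.normal(size=(400, 1))
X_lin = np.hstack([signal, rng.normal(size=(400, 4))])
y_lin = (signal[:, 0] > 0).astype(int)

# RFE repeatedly fits the estimator and drops the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X_lin, y_lin)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 marks a selected feature; larger means dropped earlier
```

In the first part the single-feature accuracies should sit near chance while the two-feature accuracy is high, which is why ranking features one at a time can discard useful ones.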
   
