
Feature Selection

Rafael Garcia Leiva edited this page Dec 20, 2019 · 10 revisions

Optimal Feature Selection

Given a dataset X = {x1, ..., xp} composed of p features and a target variable y, the miscoding of a feature xj measures how difficult it is to reconstruct y given xj, and vice versa. We are interested not only in how much information xj contains about y, but also in whether xj contains additional information that is unrelated to y, which is undesirable.

The fastautoml.Miscoding class allows us to compute the relevance of features, assess the quality of a dataset, and select the optimal subset of features to include in a study.

Feature Relevance

Let's generate a synthetic dataset composed of 1000 random points belonging to 10 Gaussian blobs. Each sample is described by 20 features, of which only 4 are informative.
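A dataset of this shape can be sketched with NumPy alone. This is an illustration under stated assumptions, not necessarily the code used to produce the results on this page: the helper name make_blob_dataset, the range of the blob centers, the unit standard deviation, and the choice to place the 4 informative features in the first columns are all hypothetical.

```python
import numpy as np


def make_blob_dataset(n_samples=1000, n_blobs=10, n_features=20,
                      n_informative=4, seed=0):
    """Hypothetical helper: Gaussian blobs that differ only in a few features."""
    rng = np.random.default_rng(seed)

    # One random center per blob, defined only in the informative subspace.
    centers = rng.uniform(-10.0, 10.0, size=(n_blobs, n_informative))

    # Assign each sample to a blob; the blob index is the target variable.
    y = rng.integers(0, n_blobs, size=n_samples)

    # Informative features: the blob center plus unit-variance Gaussian noise.
    informative = centers[y] + rng.normal(size=(n_samples, n_informative))

    # Remaining features: pure Gaussian noise, independent of the target.
    noise = rng.normal(size=(n_samples, n_features - n_informative))

    # Assumption: informative features occupy the first columns.
    X = np.hstack([informative, noise])
    return X, y


X, y = make_blob_dataset()
print(X.shape, y.shape)
```

With this layout, only columns 0 to 3 carry information about y, so a relevance measure such as miscoding would be expected to single them out from the 16 noise columns.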


For more information about how to identify the relevance of features using the Miscoding class, see the following blog entries:

  • Miscoding of Random Distributions
  • Correlation vs Miscoding

Mathematical Formulation
