Feature Selection
Given a dataset X = {x1, ..., xp} composed of p features, and a target variable y, the miscoding of the feature xj measures how difficult it is to reconstruct y given xj, and vice versa. We are interested not only in how much information xj contains about y, but also in whether xj contains additional information that is unrelated to y (which is undesirable).
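To make the idea of a symmetric "difficulty of reconstruction" more concrete, the sketch below uses the normalized compression distance (NCD) with `zlib` as a rough stand-in. This is an illustrative proxy only, and it is an assumption that it resembles what the library does internally; the helper names `compressed_size` and `ncd` are hypothetical and are not part of fastautoml.

```python
import random
import zlib


def compressed_size(values):
    """Length in bytes of the zlib-compressed sequence (values must be ints in 0..255)."""
    return len(zlib.compress(bytes(values), 9))


def ncd(a, b):
    """Normalized compression distance: near 0 for redundant sequences, near 1 for unrelated ones."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(list(a) + list(b))
    return (cab - min(ca, cb)) / max(ca, cb)


rng = random.Random(0)

# Hypothetical discretized target with 10 classes.
y = [rng.randrange(10) for _ in range(1000)]

# A feature that fully determines y, and a feature unrelated to y.
x_copy = list(y)
x_noise = [rng.randrange(10) for _ in range(1000)]

print("NCD(copy, y): ", ncd(x_copy, y))
print("NCD(noise, y):", ncd(x_noise, y))
```

A feature that carries the same information as y should yield a much smaller distance than an unrelated one. Real-valued features would have to be discretized before they can be compared in this way.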
The fastautoml.Miscoding class allows us to compute the relevance of features, assess the quality of a dataset, and select the optimal subset of features to include in a study.
Let's generate a synthetic dataset composed of 1000 random points belonging to 10 Gaussian blobs. The samples of the dataset are described by 20 features, of which only 4 are informative. The next figure shows the blobs projected onto the two-dimensional space defined by the features x8 and x10.
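A dataset with this shape could be built along the following lines. This is a minimal sketch using only the Python standard library; the function `make_blob_dataset`, the random seed, the range of the blob centers, and the positions of the informative columns are all assumptions made for illustration, so the informative features here will not necessarily be x8 and x10.

```python
import random


def make_blob_dataset(n_samples=1000, n_blobs=10, n_features=20,
                      n_informative=4, seed=0):
    """Return (X, y, informative_cols) for a synthetic Gaussian-blob dataset."""
    rng = random.Random(seed)

    # Pick which columns carry information about the blob label (assumption: random positions).
    informative_cols = sorted(rng.sample(range(n_features), n_informative))

    # One center per blob, defined only over the informative dimensions.
    centers = [[rng.uniform(-10.0, 10.0) for _ in range(n_informative)]
               for _ in range(n_blobs)]

    X, y = [], []
    for i in range(n_samples):
        label = i % n_blobs  # equally sized blobs
        # Start with pure noise in every column ...
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # ... then overwrite the informative columns with points around the blob center.
        for k, col in enumerate(informative_cols):
            row[col] = rng.gauss(centers[label][k], 1.0)
        X.append(row)
        y.append(label)
    return X, y, informative_cols


X, y, informative_cols = make_blob_dataset()
print(len(X), "samples,", len(X[0]), "features")
print("informative columns:", informative_cols)
```

The remaining 16 columns are independent Gaussian noise, so a relevance measure should rank them well below the 4 informative ones.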

For more information about how to identify the relevance of features using the Miscoding class see the following blog entries:
- Miscoding of Random Distributions
- Correlation vs Miscoding