Correlation vs Miscoding

A common approach in practical machine learning for identifying the most relevant features of a dataset (that is, the features of X with the highest predictive power over the target variable y) is to use Pearson's correlation coefficient. Unfortunately, correlation only detects linear relations, and it gives misleading results when the dependencies are non-linear. In this entry we compare correlation with our own miscoding metric.
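To make the limitation concrete, here is a minimal sketch that uses only NumPy. It is a toy illustration, not part of the experiment in this entry: the variable names and the choice of a quadratic relation are our own assumptions. The second target is fully determined by x, yet its Pearson coefficient is close to zero because the relation is symmetric and not linear.

```python
import numpy as np

# Evenly spaced inputs, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 1.0   # linear dependency on x
y_quadratic = x ** 2       # deterministic, but non-linear and symmetric

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient of the pair.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    |r| = {abs(r_linear):.3f}")
print(f"quadratic: |r| = {abs(r_quadratic):.3f}")
```

The linear case gives a coefficient of essentially 1, while the quadratic case gives a value that is numerically indistinguishable from 0, even though y can be predicted exactly from x.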

As an example we will use a synthetic dataset in which the target variable y labels a collection of normally distributed clusters of points, and the training set X is composed of both relevant and irrelevant predictors. In particular, we will generate 1,000 samples with 20 features that describe 10 clusters; only 4 of the features are relevant for prediction, and the remaining 16 are just random values.

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=20, n_informative=4, n_redundant=0, n_classes=10, n_clusters_per_class=1, flip_y=0)
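When experimenting, it can help to know in advance which columns are informative. The following sketch is a variation on the call above and is not the configuration used for the figures in this entry. It assumes two extra arguments: `shuffle=False`, which according to the scikit-learn documentation keeps the informative features in the first columns, and an arbitrary `random_state` so that runs are repeatable. The names `X_demo`, `y_demo` and `informative_columns` are ours.

```python
import numpy as np
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=4, n_redundant=0,
    n_classes=10, n_clusters_per_class=1, flip_y=0,
    shuffle=False,    # keep the informative features in the leading columns
    random_state=0,   # arbitrary seed, chosen only for reproducibility
)

# With shuffle=False and no redundant or repeated features,
# the informative columns are the first n_informative ones.
informative_columns = list(range(4))

print(X_demo.shape, np.unique(y_demo).size)
```

With a known ground truth, the output of any feature relevance method can be checked against `informative_columns` instead of being read off a plot.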

The next figure shows the blobs projected onto the two-dimensional space defined by the features x₈ and x₁₀.

Ten Gaussian Blobs

Let's compute the miscoding of each feature with respect to the target variable.

>>> from fastautoml.fastautoml import Miscoding
>>> miscoding = Miscoding()
>>> miscoding.fit(X, y)
>>> msd = miscoding.miscoding_features(type='adjusted')

If we plot the values of the msd array, we will get something like the following figure:

Feature Relevance

We have used the adjusted version of the miscoding for an easier comparison with other feature selection techniques. From the figure we can see that the library has successfully identified the four relevant predictors (x₃, x₈, x₁₀ and x₁₆). Since we are using the adjusted version of the miscoding, the actual values have to be interpreted in relative terms.
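Because the values are relative, a simple way to read them without a plot is to rank the features by score. The sketch below shows one way to do that with NumPy. The helper `top_k_features` is a hypothetical name of ours, not part of the library, and the scores are made-up placeholders rather than real miscoding output.

```python
import numpy as np

def top_k_features(scores, k):
    """Return the indices of the k largest scores, highest first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:k].tolist()

# Placeholder scores for six features; NOT real miscoding output.
example_scores = [0.02, 0.30, 0.01, 0.25, 0.03, 0.39]

selected = top_k_features(example_scores, k=3)
print(selected)
```

In the experiment above, the same helper could presumably be applied to the `msd` array with `k=4`, assuming that array holds one score per feature where higher means more relevant.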

Let's now compute the correlation of each feature with respect to the target variable and compare the results.

>>> import numpy as np
>>> # Columns of X are the variables; y is appended as the last one
>>> corr = np.corrcoef(X, y, rowvar=False)[-1, :-1]

The next figure shows the correlation between each individual feature of X and the target variable y. Correlation fails to properly identify one of the relevant features (x₃).

Correlation Gaussian Blobs
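For reference, this is a small sketch of how the per-feature correlations with the target can be extracted from NumPy's correlation matrix. It runs on hypothetical stand-in data (`X_demo`, `y_demo`) rather than on the dataset of this entry. By default `np.corrcoef` treats rows as variables, so for a samples-by-features matrix we pass `rowvar=False`; the target is then appended as the last variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5

# Hypothetical stand-in data: column 2 drives the target, the rest is noise.
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=n_samples)

# rowvar=False: columns are variables. y is appended as the last variable,
# so the last row (minus its final entry) holds corr(feature_j, y).
corr_matrix = np.corrcoef(X_demo, y_demo, rowvar=False)
feature_corr = corr_matrix[-1, :-1]

# Cross-check against one explicit pairwise computation per column.
pairwise = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
                     for j in range(n_features)])

print(np.round(np.abs(feature_corr), 3))
```

Taking the absolute value before plotting or ranking is usually sensible, since a strong negative correlation is as informative as a strong positive one.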

As we have seen, miscoding is able to detect non-linear dependencies that the correlation coefficient misses.
