Correlation vs Miscoding
A common approach in practical machine learning for identifying the most relevant features of a dataset (that is, those features of X with the highest predictive power over the target variable y) is to use Pearson's correlation coefficient. Unfortunately, correlation only measures linear relationships, so it can give misleading results when the dependency is non-linear. In this entry we are going to compare correlation with our own miscoding metric.
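To make this limitation concrete, here is a minimal sketch (not part of the library) that computes Pearson's coefficient by hand for a purely linear relation and for a quadratic one. The helper name `pearson` and the toy data are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-5, 6))            # toy inputs, symmetric around zero
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x ** 2 for x in xs]   # y is fully determined by x, but not linearly

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear, r_quadratic)
```

The linear relation gives a coefficient of essentially one, while the quadratic relation gives a coefficient of essentially zero even though y is completely determined by x.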
As an example, we will use a synthetic dataset in which the target variable y labels a collection of normally distributed clusters of points, and the training set X is composed of both relevant and irrelevant predictors. In particular, we will generate 1,000 samples with 20 features that describe 10 clusters; only 4 of the features are relevant for prediction, and the remaining 16 are just random values.
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=20, n_informative=4, n_redundant=0, n_classes=10, n_clusters_per_class=1, flip_y=0)
The next figure shows the blobs projected into the two-dimensional space defined by the features x8 and x10.

Let's compute the miscoding of each feature with respect to the target variable.
>>> from fastautoml.fastautoml import Miscoding
>>> miscoding = Miscoding()
>>> miscoding.fit(X, y)
>>> msd = miscoding.miscoding_features(type='adjusted')
If we plot the values of the msd array, we will get something like the following figure:

We have used the adjusted version of the miscoding for an easier comparison with other feature selection techniques. From the figure we can see that the library has successfully identified the four relevant predictors (x3, x8, x10 and x16). Since we are using the adjusted version of the miscoding, the actual values have to be interpreted in relative terms, that is, by comparing the features with one another rather than by reading each value on its own.
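Because only the relative magnitudes matter, one simple way to read such a score array is to rank the features and keep the top k. The sketch below uses NumPy with made-up scores (they are not the output of the library). The helper `top_k_features`, the 0-based column indices, and the assumption that a higher adjusted value means a more relevant feature are all introduced here for illustration.

```python
import numpy as np

def top_k_features(scores, k):
    """Return the indices of the k highest-scoring features, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    return order[:k]

# Hypothetical scores for six features (illustrative values only);
# we assume that a larger score means a more relevant feature.
example_scores = [0.02, 0.31, 0.01, 0.27, 0.03, 0.36]
selected = top_k_features(example_scores, k=3)
print(selected)
```

With a real `msd` array, the same helper could be called as `top_k_features(msd, k=4)`, assuming `msd` is a one-dimensional array with one score per feature.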
Let's now compute the correlation of each feature with respect to the target variable and compare the results.
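>>> import numpy as np
>>> # rowvar=False treats each column of X as a variable; y is appended as the last one
>>> corr = np.corrcoef(X, y, rowvar=False)[:-1, -1]
The next figure shows the correlation between each individual feature of X and the target variable y. Correlation fails to properly identify one of the relevant features (x3).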
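As a hedged sketch of how per-feature correlations can be computed and where they fall short, the snippet below wraps `np.corrcoef` in a small helper and applies it to a toy dataset built here for illustration (it is not the dataset used above). The function name `feature_target_correlation` and the toy variables are assumptions.

```python
import numpy as np

def feature_target_correlation(X, y):
    """Pearson correlation between each column of X and the target y."""
    # rowvar=False treats columns as variables; y is appended as the last one,
    # so the last column of the matrix holds the feature-target correlations.
    corr_matrix = np.corrcoef(X, y, rowvar=False)
    return corr_matrix[:-1, -1]

rng = np.random.default_rng(0)
n = 500
x_linear = rng.normal(size=n)       # linearly related to the target
x_nonlinear = rng.normal(size=n)    # related to the target only through its square
x_noise = rng.normal(size=n)        # unrelated to the target
X_toy = np.column_stack([x_linear, x_nonlinear, x_noise])
y_toy = 2.0 * x_linear + x_nonlinear ** 2

corr = feature_target_correlation(X_toy, y_toy)
print(np.round(corr, 2))
```

In this toy setting the second feature is relevant to the target, yet its correlation is expected to stay close to that of the pure noise feature, which is the kind of failure described above.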

As we have seen, miscoding is able to detect non-linear dependencies that the correlation coefficient misses.