
Correlation vs Miscoding

Rafael Garcia Leiva edited this page Jan 7, 2020 · 4 revisions

A common approach in practical machine learning for identifying the most relevant features of a dataset (that is, those with the highest predictive power) is to use Pearson's correlation coefficient. However, it is well known that this metric only measures linear dependence, so it can give misleading results when the dependencies are non-linear. In this entry we compare correlation against our own miscoding metric.
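The limitation mentioned above can be seen in a small, self-contained sketch. It is not part of the original experiment; the variable names and the choice of a quadratic relationship are assumptions made only for illustration. The target is fully determined by the input in both cases, but only the linear relationship is visible to Pearson's coefficient.

```python
import numpy as np

# A deterministic grid of inputs, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

# y_linear depends linearly on x; y_quadratic is fully determined by x,
# but the dependency is not linear.
y_linear = 2.0 * x + 1.0
y_quadratic = x ** 2

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is Pearson's r.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

The linear case gives a coefficient very close to 1, while the quadratic case gives a coefficient very close to 0, even though x predicts the target perfectly in both.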

For this example we generate a synthetic dataset in which the target variable y identifies a collection of normally distributed clusters of points, and the training set X is composed of both relevant and irrelevant predictors. In particular, we generate 1,000 samples with 10 features that describe 10 clusters; only 4 of the features are relevant for prediction, and the remaining 6 are just random values.

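\begin{sourcecode}
{\scriptsize \begin{verbatim}
from Nescience.Nescience import Miscoding
from sklearn.datasets import make_classification

# 1,000 samples, 10 features (4 informative, 6 noise), 10 classes
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
       n_redundant=0, n_classes=10, n_clusters_per_class=1, flip_y=0)

# Fit the miscoding model and retrieve the adjusted per-feature miscoding
miscoding = Miscoding()
miscoding.fit(X, y)
msd = miscoding.miscoding_features(miscoding='adjusted')
\end{verbatim}}
\end{sourcecode}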

We use the adjusted version of the miscoding to make the comparison with other feature selection techniques easier. If we plot the results (see Figure \ref{figure:miscoding_make_classification}), we see that the library has successfully identified the four relevant predictors ($\mathbf{x}_3$, $\mathbf{x}_8$, $\mathbf{x}_{10}$ and $\mathbf{x}_{16}$). Since we are using the adjusted miscoding, the actual values have to be interpreted in relative terms.
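Reading the scores in relative terms amounts to ranking the features against each other. The sketch below is a minimal illustration of that step. The helper `top_k_features` is hypothetical (it is not part of the Nescience library), the example scores are made up, and it assumes that a higher score means a more relevant feature.

```python
import numpy as np

def top_k_features(scores, k):
    """Return the indices of the k highest scores, best first.

    Assumption: a higher score means a more relevant feature.
    """
    scores = np.asarray(scores, dtype=float)
    if not 0 < k <= scores.size:
        raise ValueError("k must be between 1 and the number of features")
    # Negate the scores so that argsort's ascending order puts the largest
    # score first; a stable sort keeps ties in their original order.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Made-up scores for six features, used only to exercise the helper.
example_scores = [0.02, 0.31, 0.05, 0.27, 0.01, 0.34]
selected = top_k_features(example_scores, k=3)
print(selected)
```

The same helper could be applied to any per-feature score vector, which is what makes a side-by-side comparison of different feature selection techniques straightforward.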

\begin{figure}[h]

\centering

\includegraphics[width=0.6\textwidth]{feature_miscoding.png}

\caption{Miscoding of a Synthetic Dataset.}

\label{figure:miscoding_make_classification}

\end{figure}

We can now compare miscoding with correlation. Figure \ref{figure:correlation_make_classification} shows the correlation between each individual feature of $\mathbb{X}$ and the target variable $\mathbf{y}$. As we can observe, correlation fails to properly identify one of the relevant features ($\mathbf{x}_3$).

\begin{sourcecode}
{\scriptsize \begin{verbatim}
import numpy as np
corr = np.corrcoef(X, y, rowvar=False)[:-1, -1]  # each feature vs. y
\end{verbatim}}
\end{sourcecode}

\begin{figure}[h]

\centering

\includegraphics[width=0.6\textwidth]{feature_correlation.png}

\caption{Correlation of a Synthetic Dataset.}

\label{figure:correlation_make_classification}

\end{figure}
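To make explicit what the per-feature correlation measures, the following sketch computes Pearson's coefficient between each column of a feature matrix and a target vector directly from its definition, and checks it against `np.corrcoef`. The helper name `feature_target_correlation` and the demo data are assumptions introduced here for illustration; they are not part of the original experiment.

```python
import numpy as np

def feature_target_correlation(X, y):
    """Pearson correlation between each column of X and the vector y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim != 2 or y.ndim != 1 or X.shape[0] != y.shape[0]:
        raise ValueError("expected X of shape (n_samples, n_features) "
                         "and y of shape (n_samples,)")
    # Centre every feature and the target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Covariance term divided by the product of the standard deviations
    # (the 1/n factors cancel). A constant column yields nan.
    numerator = Xc.T @ yc
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return numerator / denominator

# Demo data: feature 0 drives the target, features 1 and 2 are noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=500)

r = feature_target_correlation(X_demo, y_demo)

# With rowvar=False, np.corrcoef treats columns as variables; the last
# column of the resulting matrix holds each feature's correlation with y.
reference = np.corrcoef(X_demo, y_demo, rowvar=False)[:-1, -1]
print(np.allclose(r, reference))
```

Note that passing `rowvar=False` matters here: by default `np.corrcoef` treats each row as a variable, which does not match a samples-by-features matrix.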
