## Smart Correlation

Smart correlation selection first identifies groups of correlated features and then retains the most relevant feature from each group, dropping the rest.

- [Feature Selection in Machine Learning Book](https://www.trainindata.com/p/feature-selection-in-machine-learning-book)

In [1]:
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from feature_engine.selection import SmartCorrelatedSelection

In [2]:
# Toy dataset with redundant features

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_redundant=7,
    n_classes=2,
    random_state=10,
)

X = pd.DataFrame(X)
y = pd.Series(y)

X.head()

           0          1          2          3          4          5          6          7          8          9
0  -0.283792   0.471010  -1.343721  -0.336990   0.116821   0.145666  -0.054484  -0.343668  -0.226413  -0.240955
1  -0.448534   0.009435  -2.024315  -0.261384   0.219310   0.345767   0.045181  -0.490948   0.409079  -0.667868
2  -2.387431  -0.281900   0.180289  -1.268721   1.183003   1.892637   0.299812  -2.589595   2.523974  -3.684599
3  -0.479035   0.761899   1.095608  -0.556597   0.198756   0.251093  -0.086045  -0.577749  -0.347582  -0.419675
4   1.119764  -0.803058  -0.083495   0.940198  -0.510735  -0.740669   0.026449   1.281034  -0.207904   1.362914


In [3]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

X_train.shape, X_test.shape

((700, 10), (300, 10))

## Remove correlated features: Feature-engine

### Smart approach

From each group of correlated variables, we retain the one that performs best according to a machine learning model: here, a random forest evaluated with 3-fold cross-validated ROC-AUC.

In [4]:
# To remove correlated features
sel = SmartCorrelatedSelection(
    method="pearson",
    threshold=0.8,
    selection_method='model_performance',
    estimator=RandomForestClassifier(n_estimators=5, random_state=10),
    scoring='roc_auc',
    cv=3,
)

# fit finds the correlated features
sel.fit(X_train, y_train)  

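Besides model performance, `SmartCorrelatedSelection` can retain the feature with the highest variance, the highest cardinality, or the fewest missing values, by setting `selection_method` to `"variance"`, `"cardinality"`, or `"missing_values"`.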
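
As a rough illustration of the variance criterion, the sketch below picks the highest-variance feature from each group using plain pandas. The helper `keep_highest_variance`, the toy frame, and the groups are hypothetical, and this is not Feature-engine's implementation.

```python
import pandas as pd


def keep_highest_variance(X, groups):
    """Return, for each group of correlated columns, the column with the largest variance."""
    variances = X.var()
    return [variances[list(group)].idxmax() for group in groups]


# Hypothetical toy data: "b" mirrors "a", and "d" mirrors "c"
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],  # same signal as "a", larger spread
    "c": [0.5, 0.1, 0.4, 0.2],
    "d": [0.05, 0.01, 0.04, 0.02],  # same signal as "c", smaller spread
})

retained = keep_highest_variance(toy, [{"a", "b"}, {"c", "d"}])
print(retained)
```

Variance depends on the scale of each feature, so this criterion is most meaningful when the features are measured on comparable scales.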

In [5]:
# the features that will be dropped (one feature per group is retained)

sel.features_to_drop_

[0, 3, 4, 5, 9, 6, 8]

In [6]:
# groups of correlated features

sel.correlated_feature_sets_

[{0, 3, 4, 5, 7, 9}, {1, 6, 8}]

In [7]:
# keys are the features that will be retained,
# values are the correlated features that will be dropped

sel.correlated_feature_dict_

{np.int64(7): {0, 3, 4, 5, 9}, np.int64(1): {6, 8}}
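
To get an intuition for how a feature could be chosen by model performance, the sketch below scores a model trained on each feature of a group alone and keeps the best one. It assumes that this is roughly what the `model_performance` criterion does; the data, the group, and the variable names are made up for illustration, and it is not the library's code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data, separate from the notebook's dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_redundant=3, random_state=0
)
X_demo = pd.DataFrame(X_demo)

group = [0, 1, 2]  # hypothetical group of correlated features

# Score a model trained on each feature alone
scores = {}
for feature in group:
    model = RandomForestClassifier(n_estimators=5, random_state=10)
    cv_scores = cross_val_score(
        model, X_demo[[feature]], y_demo, scoring="roc_auc", cv=3
    )
    scores[feature] = cv_scores.mean()

# Retain the feature whose single-feature model scored highest
best_feature = max(scores, key=scores.get)
print(scores, best_feature)
```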

In [8]:
# remove correlated features

X_train_t = sel.transform(X_train)
X_test_t = sel.transform(X_test)

X_train_t.shape, X_test_t.shape

((700, 3), (300, 3))

In [9]:
X_train_t.head()

              1          2          7
105  -3.715633   0.250835   2.115155
 68   3.029661  -1.979157  -1.822263
479   1.192578   1.439996  -0.613533
399  -0.505949  -0.049844   1.577046
434   1.894522  -1.161771  -1.321103


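The transformer found 2 groups of correlated features in the dataset and retained one feature from each (features 7 and 1). Feature 2 is not in any correlated group, so it is kept as well.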

## Remove correlated features: pandas

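We can reproduce the procedure manually with pandas. Correlation is computed between numeric columns, so categorical variables need to be encoded as numbers first.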
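
A minimal sketch of that encoding step, assuming an ordinal variable whose category order is known. The column names and the order are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small", "large"],
    "weight": [1.0, 3.2, 2.1, 0.9, 3.0],
})

# Map ordered categories to integer codes (the order is an assumption)
order = ["small", "medium", "large"]
df["size"] = pd.Categorical(df["size"], categories=order, ordered=True).codes

# Both columns are numeric now, so the correlation can be computed
corr = df.corr()
print(corr)
```

Integer codes only make sense for correlation when the categories have a natural order; for nominal variables a different encoding would be needed.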

In [10]:
# correlation matrix
corrmat = X_train.corr()

# take absolute values and unstack the matrix into a Series
# indexed by (feature1, feature2) pairs
corrmat = corrmat.abs().unstack()

# select highly correlated feature pairs
corrmat = corrmat[corrmat > 0.8]

# remove self-correlations
corrmat = corrmat[corrmat < 1]

# reset index and add column names
corrmat = pd.DataFrame(corrmat).reset_index()
corrmat.columns = ['feature1', 'feature2', 'corr']

# the result
corrmat.head()

   feature1  feature2      corr
0         0         3  0.918920
1         0         4  0.992627
2         0         5  0.958919
3         0         7  0.997633
4         0         9  0.924165
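
Because the correlation matrix is symmetric, unstacking it lists every pair twice, once as (0, 3) and once as (3, 0). If only unique pairs are wanted, one option is to mask everything except the upper triangle before stacking. The sketch below shows this on made-up data; the frame `demo` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is a noisy copy of "a", "c" is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr().abs()

# Keep only the entries strictly above the diagonal; the rest become NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack().reset_index()
pairs.columns = ["feature1", "feature2", "corr"]

# NaN entries never pass the threshold, so masked pairs are excluded
pairs = pairs[pairs["corr"] > 0.8]
print(pairs)
```

Masking the diagonal also removes the self-correlations, so no separate `< 1` filter is needed in this variant.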


In [11]:
# find groups of correlated features

grouped_feature_ls = []
correlated_groups = []

for feature in corrmat['feature1'].unique():

    if feature not in grouped_feature_ls:

        # find all features correlated to a single feature
        correlated_block = corrmat[corrmat['feature1'] == feature]
        grouped_feature_ls = grouped_feature_ls + list(
            correlated_block['feature2'].unique()) + [feature]

        # append the block of features to the list
        correlated_groups.append(correlated_block)

print(
    f"Found {len(correlated_groups)} correlated groups from {X_train.shape[1]} features.")

Found 2 correlated groups from 10 features.
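
The loop above stores each group as a DataFrame block, and a feature can show up in more than one block. As a variation, the sketch below returns disjoint sets by assigning each feature to the first group that claims it. The helper `group_correlated` and the example pairs are hypothetical, and this is not presented as Feature-engine's algorithm.

```python
import pandas as pd


def group_correlated(pairs):
    """Greedily assign each feature to the first group that claims it."""
    grouped = set()
    groups = []
    for feature in pairs["feature1"].unique():
        if feature in grouped:
            continue
        partners = set(pairs.loc[pairs["feature1"] == feature, "feature2"])
        # Drop partners that already belong to an earlier group
        group = ({feature} | partners) - grouped
        if len(group) > 1:
            groups.append(group)
            grouped |= group
    return groups


# Hypothetical correlated pairs, listed in both directions
pairs = pd.DataFrame({
    "feature1": ["a", "a", "b", "c", "d", "e"],
    "feature2": ["b", "c", "a", "a", "e", "d"],
})
groups = group_correlated(pairs)
print(groups)
```

The result depends on the order in which features are visited, so a different column order can produce different groups.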


In [12]:
# now we can print out each group. Each block lists the features
# correlated with the first feature of the group. Note that feature 3
# appears in both blocks: it is correlated with features 0 and 1.

for group in correlated_groups:
    print(group)
    print()

   feature1  feature2      corr
0         0         3  0.918920
1         0         4  0.992627
2         0         5  0.958919
3         0         7  0.997633
4         0         9  0.924165

   feature1  feature2      corr
5         1         3  0.920382
6         1         6  0.940293
7         1         8  0.827460



In [13]:
# we can now take a closer look at the features within one group,
# for example the first one

group = correlated_groups[0]
group

   feature1  feature2      corr
0         0         3  0.918920
1         0         4  0.992627
2         0         5  0.958919
3         0         7  0.997633
4         0         9  0.924165


In [14]:
# feature 0 plus all the features correlated with it
features = [0, 3, 4, 5, 7, 9]
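
Instead of typing the list by hand, it could be derived from the group itself. A small sketch on a hypothetical block with the same column layout (the values are made up):

```python
import pandas as pd

# Hypothetical block in the same layout as the groups above
group_demo = pd.DataFrame({
    "feature1": [0, 0, 0],
    "feature2": [3, 4, 7],
    "corr": [0.92, 0.99, 0.98],
})

# Union of both columns, converted to plain Python ints and sorted
features_demo = sorted(
    set(group_demo["feature1"].tolist()) | set(group_demo["feature2"].tolist())
)
print(features_demo)
```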

In [15]:
# train a random forest using only the features in the group
rf = RandomForestClassifier(n_estimators=5, random_state=39)

rf.fit(X_train[features], y_train)

In [16]:
# Get the feature importance attributed by the 
# random forest model

importance = pd.concat(
    [pd.Series(features),
     pd.Series(rf.feature_importances_)], axis=1)

importance.columns = ['feature', 'importance']

# sort features by importance, most important first
importance.sort_values(by='importance', ascending=False)

   feature  importance
0        0    0.313275
5        9    0.196845
4        7    0.188994
2        4    0.170325
1        3    0.069790
3        5    0.060771


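Based on the random forest importance, we would retain feature 0 and discard the rest of the group. Note that this differs from the feature retained by `SmartCorrelatedSelection` for the same group (feature 7), which used the `model_performance` criterion with cross-validation rather than the importances of a single forest fitted on the whole group.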
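
The same step can be repeated for every group. The sketch below wraps it in a helper that fits a small forest per group and keeps the top-ranked feature. The function name `pick_by_importance`, the demo data, and the groups are hypothetical and only meant to show the shape of the loop.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def pick_by_importance(X, y, groups, random_state=39):
    """For each group, fit a small forest and keep the top-ranked feature."""
    retained = []
    for group in groups:
        cols = sorted(group)
        rf = RandomForestClassifier(n_estimators=5, random_state=random_state)
        rf.fit(X[cols], y)
        importance = pd.Series(rf.feature_importances_, index=cols)
        retained.append(importance.idxmax())
    return retained


# Hypothetical data, separate from the notebook's dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_redundant=3, random_state=0
)
X_demo = pd.DataFrame(X_demo)

groups = [{0, 1, 2}, {3, 4}]  # hypothetical groups, for illustration only
retained = pick_by_importance(X_demo, y_demo, groups)
print(retained)
```

With so few trees, the ranking can change with the random seed, so it may be worth checking the result against a larger forest before dropping features.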