# Check correlation between selected features

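Aim: Check the correlation between the top 8 features chosen by feature selection. `StrokeTeam` is dropped before the correlations are calculated (see the code below), so 7 features are compared. The 8 selected features are: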

* S2BrainImagingTime_min
* S2StrokeType_Infarction
* S2NihssArrival
* S1OnsetTimeType_Precise
* S2RankinBeforeStroke
* StrokeTeam
* AFAnticoagulent_Yes
* S1OnsetToArrival_min

In [1]:
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
from sklearn.preprocessing import StandardScaler

# Import data (combine all data, but drop stroke team)
train = pd.read_csv('../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
data = pd.concat([train, test], axis=0)
data.drop('StrokeTeam', axis=1, inplace=True)
thrombolysis = data['S2Thrombolysis']

# Load features (drop stroke team if present)
number_of_features_to_use = 8
key_features = pd.read_csv('./output/feature_selection.csv')
key_features = list(key_features['feature'])[:number_of_features_to_use]
if 'StrokeTeam' in key_features:
    key_features.remove('StrokeTeam')

# Restrict data
data = data[key_features]

## Standardise data

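Standardising each feature to zero mean and unit variance means that the covariance between two features is their Pearson correlation coefficient. The match is exact up to a factor of n/(n-1), because `StandardScaler` uses the population standard deviation while the pandas `cov` method uses the sample estimate; with around 10,000 rows this difference is negligible.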
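
As a quick illustration of the statement above, the sketch below compares the covariance of standardised columns with the Pearson correlation of the raw columns. It uses small synthetic data: the column names `a` and `b`, the sample size, and the random seed are made up for illustration and are not taken from the study data. Standardisation is done by hand with the population standard deviation, which is an assumption chosen to mirror how `StandardScaler` scales data.

```python
import numpy as np
import pandas as pd

# Synthetic data: two hypothetical, partly related columns
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = 0.5 * a + rng.normal(size=500)
demo = pd.DataFrame({'a': a, 'b': b})

# Standardise by hand: zero mean, unit population variance (ddof=0)
demo_std = (demo - demo.mean()) / demo.std(ddof=0)

# Covariance of standardised data and correlation of the raw data
cov_std = demo_std.cov()
corr_raw = demo.corr()

# pandas cov() divides by n - 1, so rescale by (n - 1) / n before comparing
n = len(demo)
max_diff = np.abs(cov_std * (n - 1) / n - corr_raw).to_numpy().max()
print(f'Largest difference after rescaling: {max_diff:.2e}')
```

After rescaling, the two matrices should agree to within floating-point error, so reading the covariance of standardised data as a correlation is reasonable for large samples.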

In [2]:
sc = StandardScaler()
sc.fit(data)
data_std = sc.transform(data)
data_std = pd.DataFrame(data_std, columns=list(data))

## Calculate correlation between features

In [3]:
# Get covariance
cov = data_std.cov()

# Convert from wide to tall
cov = cov.melt(ignore_index=False)

# Remove self-correlation (copy so that new columns can be added safely)
mask = cov.index != cov['variable']
cov = cov[mask].copy()

# Add absolute value
cov['abs_value'] = np.abs(cov['value'])

# Add R-squared
cov['r-squared'] = cov['value'] ** 2

# Sort by absolute covariance
cov.sort_values('abs_value', inplace=True, ascending=False)

# Round to four decimal places
cov = cov.round(4)

# Remove duplicate pairs of features
result = []
for index, values in cov.iterrows():
    combination = [index, values['variable']]
    combination.sort()
    string = combination[0] + "-" + combination[1]
    result.append(string)
cov['pair'] = result
cov.sort_values('pair', inplace=True)
cov.drop_duplicates(subset=['pair'], inplace=True)
cov.drop('pair', axis=1, inplace=True)

# Sort by r-squared
cov.sort_values('r-squared', ascending=False, inplace=True)
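
The pair de-duplication above can also be sketched by reading only the cells above the diagonal of a correlation matrix, which gives one row per feature pair without building a string key. This is an alternative illustration, not the method used in this notebook; the feature names `f1`, `f2`, `f3`, the output column names, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data with hypothetical feature names
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    'f1': x,
    'f2': x + rng.normal(size=200),
    'f3': rng.normal(size=200),
})

# Pearson correlation matrix (symmetric, ones on the diagonal)
corr = demo.corr()

# Row and column positions strictly above the diagonal: one per pair
i, j = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame({
    'feature_1': corr.index[i],
    'feature_2': corr.columns[j],
    'r': corr.to_numpy()[i, j],
})

# Add R-squared and sort from strongest to weakest pair
pairs['r-squared'] = pairs['r'] ** 2
pairs = pairs.sort_values('r-squared', ascending=False).round(4)
print(pairs)
```

With k features this yields k(k-1)/2 rows, so the 7 features compared here would give 21 unique pairs.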

Display R-squared for each feature pair, sorted from highest to lowest.

In [4]:
cov[['variable', 'r-squared']]

| | variable | r-squared |
|---|---|---|
| S2NihssArrival | S2RankinBeforeStroke | 0.0454 |
| S2StrokeType_Infarction | S2NihssArrival | 0.0386 |
| S1OnsetToArrival_min | S1OnsetTimeType_Precise | 0.0344 |
| S1OnsetToArrival_min | S2NihssArrival | 0.0186 |
| S2RankinBeforeStroke | S1OnsetTimeType_Precise | 0.0131 |
| AFAnticoagulent_Yes | S2RankinBeforeStroke | 0.007 |
| S2StrokeType_Infarction | AFAnticoagulent_Yes | 0.0033 |
| S1OnsetToArrival_min | S2RankinBeforeStroke | 0.0022 |
| S1OnsetTimeType_Precise | S2BrainImagingTime_min | 0.0021 |
| AFAnticoagulent_Yes | S2NihssArrival | 0.0019 |


## Observations

Correlations between the selected features are very weak. The largest R-squared is 0.0454 (between S2NihssArrival and S2RankinBeforeStroke), and no pair has an R-squared greater than 0.05.