# Feature exploration and engineering


After taking a closer look at the targets (see [here](https://www.kaggle.com/miykael/trends-exploration-of-the-targets)), let's also take a closer look at the features.

**First**, let's look at the two feature sets: IC (from morphological data) and FNC (from a functional connectivity analysis). In a **second** step, we will take a closer look at the 3-dimensional MRI images.

- [1. Load targets](#1.-Load-targets)
- [2. IC and FNC features](#2.-IC-and-FNC-features)
    - [2.1. Loading features](#2.1.-Loading-features)
    - [2.2. Explore features](#2.2.-Explore-features)
        - [2.2.1. Explore correlations within feature types](#2.2.1.-Explore-correlations-within-feature-types)
        - [2.2.2. Explore correlations between feature types and targets](#2.2.2.-Explore-correlations-between-feature-types-and-targets)
        - [2.2.3. Explore pairplots between feature types and targets](#2.2.3.-Explore-pairplots-between-feature-types-and-targets)
        - [2.2.4. Visualize all values within a feature](#2.2.4.-Visualize-all-values-within-a-feature)
- [3. Feature engineering using the MRI maps](#3.-Feature-engineering-using-the-MRI-maps)
    - [3.1 Transformation of mat files to NIfTIs (standard for MRI images)](#3.1-Transformation-of-mat-files-to-NIfTIs-(standard-for-MRI-images))
    - [3.2. Extraction of within subject correlations](#3.2.-Extraction-of-within-subject-correlations)
    - [3.3. Extraction of between subject correlations](#3.3.-Extraction-of-between-subject-correlations)
    - [3.4. Additional feature exploration 1: More MRI quality measurements](#3.4.-Additional-feature-exploration-1:-More-MRI-quality-measurements)
    - [3.5. Additional feature exploration 2: Find voxels with high correlation to targets](#3.5.-Additional-feature-exploration-2:-Find-voxels-with-high-correlation-to-targets)
- [4. Dataset merging and outlier removal](#4.-Dataset-merging-and-outlier-removal)
- [5. PCA and UMap exploration](#5.-PCA-and-UMap-exploration)

In [None]:
import os
import numpy as np
import pandas as pd
from glob import glob
from os.path import join as opj

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

from tqdm.notebook import tqdm

In [None]:
path = '/kaggle/input/trends-assessment-prediction/'

# 1. Load targets

However, before anything else, let's quickly recreate the targets from the [other notebook](https://www.kaggle.com/miykael/trends-exploration-of-the-targets).

In [None]:
# Load targets
targets = pd.read_csv(opj(path, 'train_scores.csv')).set_index('Id')

In [None]:
# Let's also create the rotated domain2 targets
rot = 0.90771256655

def rotate_origin(x, y, radians):
    """Rotate a point around the origin (0, 0)."""
    xx = x * np.cos(radians) + y * np.sin(radians)
    yy = -x * np.sin(radians) + y * np.cos(radians)
    return np.array([xx, yy]).T

d2 = rotate_origin(targets.iloc[:, 3].values, targets.iloc[:, 4].values, rot)
targets['d21_rot'] = d2[:, 0]
targets['d22_rot'] = d2[:, 1]

In [None]:
# Let's apply the power transformation to make the value distribution gaussian
pow_age = 1.0
pow_d1v1 = 1.5
pow_d1v2 = 1.5
pow_d2v1 = 1.5
pow_d2v2 = 1.5
pow_d21 = 1.5
pow_d22 = 1
powers = [pow_age, pow_d1v1, pow_d1v2, pow_d2v1, pow_d2v2, pow_d21, pow_d22 ]

for i, col in enumerate(targets.columns):
    targets[col] = np.power(targets[col], powers[i])

In [None]:
from sklearn.preprocessing import StandardScaler

# And last but not least, let's standardize the targets using a StandardScaler
scaler = StandardScaler()
targets.iloc[:, :] = scaler.fit_transform(targets)
targets.head()
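Predictions made on the transformed targets have to be mapped back to the original scale at some point. Here is a minimal sketch of that round trip on synthetic values, assuming the same order of operations as above (power first, then standardization). The array `y` and the exponent are made-up illustration values, and the standardization is written out in NumPy instead of using the fitted `scaler`:

```python
import numpy as np

# Synthetic, strictly positive "target" values (illustration only)
rng = np.random.default_rng(0)
y = rng.uniform(20, 80, size=1000)
power = 1.5

# Forward: power transform, then standardize to zero mean and unit variance
y_pow = np.power(y, power)
mu, sigma = y_pow.mean(), y_pow.std()
y_scaled = (y_pow - mu) / sigma

# Backward: undo the scaling first, then the power transform
y_back = np.power(y_scaled * sigma + mu, 1 / power)

print(np.allclose(y, y_back))
```

With the fitted `scaler` from the cell above, `scaler.inverse_transform` would take care of the first backward step for all columns at once.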

# 2. IC and FNC features

## 2.1. Loading features

In [None]:
# Extract ID to separate train and test set
train_id = targets.index.values
sample_submission = pd.read_csv(opj(path, 'sample_submission.csv'))
test_id = np.unique(sample_submission.Id.str.split('_', expand=True)[0].astype('int'))
print(train_id.shape, test_id.shape)
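To make the ID extraction above a bit more tangible, here is a small sketch on made-up submission IDs. It assumes that each entry has the form `<subject>_<target>`, which is what the split on `_` above implies; the concrete values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical submission IDs of the form "<subject>_<target>"
ids = pd.Series(['10003_age', '10003_domain1_var1',
                 '10006_age', '10006_domain2_var2'])

# Everything before the first underscore is the subject ID
subject_ids = np.unique(ids.str.split('_', expand=True)[0].astype('int'))
print(subject_ids)
```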

In [None]:
# Load ICs from the loading file and separate them into train and test set
df_ic = pd.read_csv(opj(path, 'loading.csv'))
ic_train = df_ic[df_ic.Id.isin(train_id)].set_index('Id')
ic_test = df_ic[df_ic.Id.isin(test_id)].set_index('Id')
print(ic_train.shape, ic_test.shape)

In [None]:
# Load FNCs from file and separate them into train and test set
df_fnc = pd.read_csv(opj(path, 'fnc.csv'))
fnc_train = df_fnc[df_fnc.Id.isin(train_id)].set_index('Id')
fnc_test = df_fnc[df_fnc.Id.isin(test_id)].set_index('Id')
print(fnc_train.shape, fnc_test.shape)

## 2.2. Explore features

### 2.2.1. Explore correlations within feature types

In [None]:
def plot_corr_matrix(df_train, df_test, c_restrict=200):

    # Correlation matrix for ICA components
    fig, ax = plt.subplots(ncols=3, figsize=(20, 10))
    abs_max = 1.0
    sns.heatmap(df_train.iloc[:, :c_restrict].corr(), square=True, vmin=-abs_max, vmax=abs_max, cbar=False, ax=ax[0]);
    sns.heatmap(df_test.iloc[:, :c_restrict].corr(), square=True, vmin=-abs_max, vmax=abs_max, cbar=False, ax=ax[1]);
    sns.heatmap(df_train.iloc[:, :c_restrict].corr()-df_test.iloc[:, :c_restrict].corr(),
                square=True, vmin=-0.33, vmax=0.33, cbar=False, ax=ax[2]);
    ax[0].set_title('Train')
    ax[1].set_title('Test')
    ax[2].set_title('Difference (Train - Test)');

In [None]:
# Correlation matrix for IC features
plot_corr_matrix(ic_train, ic_test, c_restrict=100)

In [None]:
# Correlation matrix for FNC features
plot_corr_matrix(fnc_train, fnc_test, c_restrict=100)

### 2.2.2. Explore correlations between feature types and targets

In [None]:
def plot_corr_matrix_target(targets, df_train, c_restrict=100):

    # Merge target and feature matrix
    df_temp = pd.merge(targets.reset_index(), df_train.reset_index())
    df_temp = df_temp.set_index('Id').iloc[:, :c_restrict]
    
    # Correlation matrix for ICA components
    plt.figure(figsize=(16, 3))
    sns.heatmap(df_temp.corr().iloc[:7, 7:], square=True,
                vmin=-0.5, vmax=0.5, cbar=False, cmap='Spectral');

In [None]:
# Correlation between IC features and targets
plot_corr_matrix_target(targets, ic_train, c_restrict=100)

In [None]:
# Correlation between FNC features and targets
plot_corr_matrix_target(targets, fnc_train, c_restrict=100)

## Observation 1

For both feature types, the correlation with age seems to be the highest. Let's explore this in a bit more detail: what is the highest absolute correlation that any single feature reaches with each of the targets (the 5 original ones plus the 2 rotated ones)?

In [None]:
# Show highest correlation with target variables and IC dataset
df_corr = pd.concat([np.abs(ic_train.corrwith(targets.iloc[:, i])).sort_values(ascending=False).reset_index(drop=True) for i in range(7)], axis=1)
df_corr.columns = targets.columns
df_corr.head(5)

In [None]:
# Show highest correlation with target variables and FNC dataset
df_corr = pd.concat([np.abs(fnc_train.corrwith(targets.iloc[:, i])).sort_values(ascending=False).reset_index(drop=True) for i in range(7)], axis=1)
df_corr.columns = targets.columns
df_corr.head(5)
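The two tables above only show the correlation values, because `reset_index(drop=True)` discards the feature names. If you also want to know which features reach these values, something like the following sketch could be used. `top_correlated_features` is a hypothetical helper and the data is synthetic:

```python
import numpy as np
import pandas as pd

def top_correlated_features(features, target, n=5):
    """Return the n features with the highest absolute correlation to target."""
    corr = features.corrwith(target).abs()
    return corr.sort_values(ascending=False).head(n)

# Synthetic example: 'f0' is built to follow the target, the rest is noise
rng = np.random.default_rng(0)
target = pd.Series(rng.normal(size=200), name='age')
features = pd.DataFrame(rng.normal(size=(200, 4)),
                        columns=['f0', 'f1', 'f2', 'f3'])
features['f0'] = target + 0.1 * rng.normal(size=200)

top = top_correlated_features(features, target, n=2)
print(top)
```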

### Side note: Exploration of correlation after rotation 

For fun (and because I was hoping it could help), I've rotated the two targets in domain2, to see which rotation leads to the highest correlation within the two datasets (IC and FNC).

In [None]:
def plot_rotation_correlations(df_data, targets, ttt=3):

    corr_max = []
    for r in np.linspace(0, 3.14, 100):

        bla = targets.iloc[:, ttt].copy()
        bla.iloc[:] = rotate_origin(targets.iloc[:, ttt].values, targets.iloc[:, ttt+1].values, r)[:, 0]
        corr_max.append([np.rad2deg(r), df_data.corrwith(bla).sort_values(ascending=False).reset_index(drop=True).abs().max()])

    corr_max1 = np.array(corr_max)
    plt.figure(figsize=(14, 4))
    plt.scatter(corr_max1[:, 0], corr_max1[:, 1], s=3);

    corr_max = []
    for r in np.linspace(0, 3.14, 100):

        bla = targets.iloc[:, ttt].copy()
        bla.iloc[:] = rotate_origin(targets.iloc[:, ttt].values, targets.iloc[:, ttt+1].values, r)[:, 1]
        corr_max.append([np.rad2deg(r), df_data.corrwith(bla).sort_values(ascending=False).reset_index(drop=True).abs().max()])

    corr_max2 = np.array(corr_max)
    plt.scatter(corr_max2[:, 0], corr_max2[:, 1], s=3);

    best_corr = corr_max1[np.argmin(np.abs(corr_max1[:, 1] - corr_max2[:, 1])), 1]
    best_rot = corr_max1[np.argmin(np.abs(corr_max1[:, 1] - corr_max2[:, 1])), 0]
    plt.title('Equal correlation of %.4f\nat rotation of %.4f degrees' % (best_corr, best_rot))
    plt.legend(['domain2_var1_rot', 'domain2_var2_rot'])

In [None]:
plot_rotation_correlations(ic_train, targets, ttt=3)

In [None]:
plot_rotation_correlations(fnc_train, targets, ttt=3)

I'm not sure what this kind of exploration brings to the game. My assumption was that if both rotated targets have an equally high correlation with the features, I might be able to predict them well and then invert the rotation. But it never led to anything useful.
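For completeness, inverting the rotation simply means applying the same function with the negative angle. A minimal sketch on random points (the points are synthetic, and `rotate_origin` is repeated here only to keep the example self-contained):

```python
import numpy as np

def rotate_origin(x, y, radians):
    """Same rotation as defined earlier in this notebook."""
    xx = x * np.cos(radians) + y * np.sin(radians)
    yy = -x * np.sin(radians) + y * np.cos(radians)
    return np.array([xx, yy]).T

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
rot = 0.90771256655

# Rotate forward, then rotate back with the negative angle
rotated = rotate_origin(x, y, rot)
restored = rotate_origin(rotated[:, 0], rotated[:, 1], -rot)

print(np.allclose(restored[:, 0], x), np.allclose(restored[:, 1], y))
```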

### 2.2.3. Explore pairplots between feature types and targets

In [None]:
# Number of columns to investigate
n_invest = 10

In [None]:
sns.pairplot(ic_train.iloc[:, :n_invest], diag_kind="kde", corner=True);

In [None]:
sns.pairplot(fnc_train.iloc[:, :n_invest], diag_kind="kde", corner=True);

There seem to be a few interesting correlations between features, but exploring all of them would just take too much time.

### 2.2.4. Visualize all values within a feature

In [None]:
def plot_markers(key, df_temp, ncolmarker=5, split_at=5, plot_max=15):

    # Restrict dataframe to first X features
    df_temp = df_temp.iloc[:, :plot_max]

    # Compute dataset selecters
    ncolumns = np.arange(df_temp.shape[1])
    selecter = np.split(ncolumns, ncolumns[::split_at][1:])

    for s in selecter:

        print(key, s)
        df_temp.iloc[:, s].plot(kind='line',subplots=True, sharex=True, marker='.', lw=0,
                                ms=10, markeredgecolor='k', markeredgewidth=0.3,
                     figsize=(5 * ncolmarker, 4 * df_temp.iloc[:, s].shape[1]//ncolmarker), layout=(-1,ncolmarker));
        plt.show()

In [None]:
plot_markers('Visualization of IC features:', ic_train)

In [None]:
plot_markers('Visualization of FNC features:', fnc_train)

## Conclusion 1


The datasets seem to contain a few outliers. We will take care of them at the very end.

# 3. Feature engineering using the MRI maps

Getting the right features from the MRI images is difficult, as the ICA maps themselves are already a derivative of the original fMRI images. Nonetheless, I've explored multiple approaches, two of which seemed to be useful.

### Note concerning computation time for feature extraction
Executing this notebook in its entirety would take more than 9 hours (due to the feature extraction from the individual MRI maps). I therefore included the precomputed output files in this kernel, but left the code in here for those who are interested.

In [None]:
# To bypass feature extraction and load precomputed files
load_pre_computed_files = True

## 3.1 Transformation of mat files to NIfTIs (standard for MRI images)

First things first, let's transform the data into proper MRI images so that we can profit from the nilearn package.

In [None]:
import h5py
import nilearn as nl
from nilearn import image, plotting

In [None]:
# Load brain mask
mask = nl.image.load_img(opj(path, 'fMRI_mask.nii'))

In [None]:
# This function was inspired by a fellow Kaggler whose original source I can no longer find
def read_img(filename, mask):
    with h5py.File(filename, 'r') as f:
        data = np.array(f['SM_feature'], dtype='float32')

    # It's necessary to reorient the axes, since h5py flips axis order
    data = np.moveaxis(data, [0, 1, 2, 3],
                             [3, 2, 1, 0])

    img = nl.image.new_img_like(mask, data, affine=mask.affine, copy_header=True)
    return img
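As a side note, moving the axes `[0, 1, 2, 3]` to `[3, 2, 1, 0]` is a complete reversal of the axis order, which for a NumPy array is the same as taking `.T`. A tiny sketch on a toy array (the shape is arbitrary and only stands in for the 4D data read from the mat file):

```python
import numpy as np

# Small 4D toy array standing in for the data read by h5py
a = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Same call as in read_img
reordered = np.moveaxis(a, [0, 1, 2, 3], [3, 2, 1, 0])

print(reordered.shape)
print(np.array_equal(reordered, a.T))
```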

In [None]:
# Only convert every n-th subject
sub_sample = 100

This `sub_sample` step is needed for this Kaggle kernel, as it would otherwise take too long (and too much disk space) to convert all mat files to NIfTIs.

In [None]:
# Rewrite mat file to compressed NIfTI
directory='fMRI_train'
if not os.path.exists(directory):
    os.makedirs(directory)
for fname in tqdm(sorted(glob(opj(path, directory, '*.mat')))[::sub_sample]):
    new_filename = fname.replace('.mat', '.nii.gz')
    new_filename = new_filename.replace('/kaggle/input/trends-assessment-prediction/', '')
    read_img(fname, mask).to_filename(new_filename)

In [None]:
# Rewrite mat file to compressed NIfTI
directory='fMRI_test'
if not os.path.exists(directory):
    os.makedirs(directory)
for fname in tqdm(sorted(glob(opj(path, directory, '*.mat')))[::sub_sample]):
    new_filename = fname.replace('.mat', '.nii.gz')
    new_filename = new_filename.replace('/kaggle/input/trends-assessment-prediction/', '')
    read_img(fname, mask).to_filename(new_filename)

## 3.2. Extraction of within subject correlations

My first idea for feature extraction was to take the 53 maps from each subject and simply compute the correlations between all of these maps. Hopefully, this kind of feature will somehow encode within-subject variability.

As an example, for one subject, the output would look something like this:

In [None]:
# Load data from one subject
img = image.load_img(sorted(glob('fMRI_train/*.nii.gz'))[0])

# Mask the image to only look at correlation within voxels which have a value
data = img.get_fdata()[mask.get_fdata()>0]

# Compute correlation matrix
corr_matrix = np.corrcoef(data.T)

In [None]:
plt.figure(figsize=(6, 6))
sns.heatmap(corr_matrix, square=True, cbar=True);

As the upper triangle is a duplicate of the lower one, let's keep only one of them:

In [None]:
# Only keep upper triangular correlation matrix without diagonal
triangular_mask = np.ravel(np.triu(np.ones((53, 53)), k=1))>0.5
corr_values = np.ravel(corr_matrix)[triangular_mask]
print(corr_values.shape, corr_values)
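The same selection can also be written with `np.triu_indices`, which avoids building and flattening the boolean mask. A short sketch that checks both variants against each other; the correlation matrix here is computed from random numbers and only stands in for real subject data:

```python
import numpy as np

n = 53
rng = np.random.default_rng(0)

# Random stand-in for masked voxel data: 500 voxels x 53 maps
corr_matrix = np.corrcoef(rng.normal(size=(500, n)).T)

# Mask-based selection, as in the cell above
triangular_mask = np.ravel(np.triu(np.ones((n, n)), k=1)) > 0.5
values_mask = np.ravel(corr_matrix)[triangular_mask]

# Index-based selection of the same upper triangle (without diagonal)
rows, cols = np.triu_indices(n, k=1)
values_idx = corr_matrix[rows, cols]

print(values_idx.shape, np.array_equal(values_mask, values_idx))
```

Either way, 53 maps lead to 53 * 52 / 2 = 1378 features per subject.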

Now that we know what we want to do, let's repeat that for all NIfTI images in the training and test set.

In [None]:
# Let's create an output folder to store the new features
directory='datasets'
if not os.path.exists(directory):
    os.makedirs(directory)

In [None]:
if load_pre_computed_files:

    # Load precomputed intra subject correlation data for the training set
    hdf_path = opj('/kaggle', 'input', 'corr-features', 'intra_corr_train.h5')
    df_corr_intra_train = pd.read_hdf(hdf_path)

else:

    # Collect results
    corr_results = {}

    # Collect all train files
    train_files = sorted(glob('fMRI_train/*.nii.gz'))

    for t in tqdm(train_files):

        try:
            # Load subject image and keep only voxels inside the brain mask
            img = image.load_img(t)
            data = img.get_fdata()[mask.get_fdata()>0]

            t_id = t.split('/')[1].split('.')[0]
            corr_results[t_id] = np.ravel(np.corrcoef(data.T))

        except Exception:
            print("Wasn't able to load: ", t)

    df_corr = pd.DataFrame(corr_results).T
    df_corr.columns = ['c%02d_c%02d' % (i + 1, j + 1)
                       for i in range(53) for j in range(53)]

    # Only keep upper triangular correlation matrix without diagonal
    triangular_mask = np.ravel(np.triu(np.ones((53, 53)), k=1))>0.5
    df_corr_intra_train = df_corr.loc[:, triangular_mask]

    # Save everything in an HDF5 file
    df_corr_intra_train.to_hdf('datasets/df_corr_intra_train.hdf5', key='df_corr_intra_train', mode='w')

# Plot head of dataframe
df_corr_intra_train.head()

In [None]:
if load_pre_computed_files:

    # Load precomputed intra subject correlation data for the test set
    hdf_path = opj('/kaggle', 'input', 'corr-features', 'intra_corr_test.h5')
    df_corr_intra_test = pd.read_hdf(hdf_path)

else:

    # Collect results
    corr_results = {}

    # Collect all test files
    test_files = sorted(glob('fMRI_test/*.nii.gz'))

    for t in tqdm(test_files):

        try:
            # Load subject image and keep only voxels inside the brain mask
            img = image.load_img(t)
            data = img.get_fdata()[mask.get_fdata()>0]

            t_id = t.split('/')[1].split('.')[0]
            corr_results[t_id] = np.ravel(np.corrcoef(data.T))

        except Exception:
            print("Wasn't able to load: ", t)

    df_corr = pd.DataFrame(corr_results).T
    df_corr.columns = ['c%02d_c%02d' % (i + 1, j + 1)
                       for i in range(53) for j in range(53)]

    # Only keep upper triangular correlation matrix without diagonal
    triangular_mask = np.ravel(np.triu(np.ones((53, 53)), k=1))>0.5
    df_corr_intra_test = df_corr.loc[:, triangular_mask]

    # Save everything in an HDF5 file
    df_corr_intra_test.to_hdf('datasets/intra_corr_test.hdf5', key='intra_corr_test', mode='w')

# Plot head of dataframe
df_corr_intra_test.head()

Now that the intra correlation features are generated, let's take a closer look at them! Just as we did for the IC and FNC features.

In [None]:
# Correlation matrix for intra correlation features
plot_corr_matrix(df_corr_intra_train, df_corr_intra_test, c_restrict=100)

In [None]:
# Correlation between intra correlation features and targets
plot_corr_matrix_target(targets, df_corr_intra_train, c_restrict=100)

In [None]:
# Show highest correlation with target variables and intra correlation dataset
df_corr = pd.concat([np.abs(df_corr_intra_train.corrwith(targets.iloc[:, i])).sort_values(ascending=False).reset_index(drop=True) for i in range(7)], axis=1)
df_corr.columns = targets.columns
df_corr.head(5)

In [None]:
# Pairplots between the first intra correlation features
sns.pairplot(df_corr_intra_train.iloc[:, :n_invest], diag_kind="kde", corner=True);

In [None]:
# Visualization of values in dataset
plot_markers('Visualization of intra correlation features:', df_corr_intra_train)

## Observation 2

These intra subject correlation features seem to have a high correlation with age, but they also seem to be riddled with outliers. This needs to be cleaned up at the end!

## 3.3. Extraction of between subject correlations

Similar to the previous approach, in the "between subject" approach I created 53 features per subject, each representing the correlation between the subject's ICA component map and the average map of that component across all subjects. Hopefully, this kind of feature will somehow encode between-subject variability.

To do this, I first had to compute the average MRI map per component:

In [None]:
from nilearn import image, plotting, masking
from nilearn.regions import connected_regions

In [None]:
# Creates the mean image for a given component
def get_mean_component(filenames, comp_ID=0):
    mean = image.math_img('img * 0', img=mask)
    for f in filenames:
        img = image.load_img(f).slicer[..., comp_ID]
        mean = image.math_img('mean + img', mean=mean, img=img)
    mean = image.math_img('img / %f' % len(filenames), img=mean)
    return mean

In [None]:
# Creating an output folder to store the average maps
directory='fMRI_maps'
if not os.path.exists(directory):
    os.makedirs(directory)

In [None]:
# Extract the mean images
n_maps = 8    # Change this parameter to 53 to get all components

filenames = sorted(glob('fMRI_train/*.nii.gz'))
for idx in tqdm(range(n_maps)):
    mean = get_mean_component(filenames, comp_ID=idx)
    mean.to_filename('fMRI_maps/mean_%02d.nii.gz' % (idx + 1))

In [None]:
# Let's plot the first n average maps (thresholded at the 95th percentile of the non-zero values)
for idx in range(n_maps):
    img = image.load_img('fMRI_maps/mean_%02d.nii.gz' % (idx + 1))
    data = img.get_fdata()
    threshold = np.percentile(data[data!=0], 95)
    img_thr = image.threshold_img(img, threshold=threshold)
    img_regions = image.mean_img(connected_regions(img_thr, min_region_size=4000)[0])
    plotting.plot_glass_brain(img_regions, black_bg=True, display_mode='lyrz',
                              title='mean_%02d' % (idx + 1))
    plt.show()

Now that we have these average maps ready, let's go through the training and test set and collect the correlation between the subject map and the average population map.
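Before going through the real files, here is the idea in its simplest form for a single component. This is only a sketch on synthetic data: every "subject map" is a shared random pattern plus subject-specific noise, and all sizes are made up. As with the training subjects in this notebook, each subject also contributes to the average map it is compared with:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 20, 1000

# Synthetic data: a shared spatial pattern plus subject-specific noise
pattern = rng.normal(size=n_voxels)
maps = pattern + rng.normal(size=(n_subjects, n_voxels))

# Population average map for this component
mean_map = maps.mean(axis=0)

# One feature per subject: correlation of its map with the average map
inter_corr = np.array([np.corrcoef(m, mean_map)[0, 1] for m in maps])

print(inter_corr.shape, inter_corr.min().round(2), inter_corr.max().round(2))
```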

In [None]:
def combine_brain_values(didx='train'):
    """Helper function to combine all 53 component CSV files into one big one"""

    # List of file names
    csv_files = sorted(glob('datasets/inter_corr_%s_*.csv' % didx))

    # Start with the first component file
    merger = pd.read_csv(csv_files[0]).set_index('Id')
    merger.columns = [c + '_%02d' % 1 for c in merger.columns]

    # Go through files and concatenate them
    for i, f in enumerate(csv_files[1:]):

        new_df = pd.read_csv(f).set_index('Id')
        new_df.columns = [c + '_%02d' % (i + 2) for c in new_df.columns]

        merger = pd.merge(merger, new_df, on='Id')

    return merger


In [None]:
if load_pre_computed_files:

    # Load precomputed inter subject correlation data for the training set
    hdf_path = opj('/kaggle', 'input', 'corr-features', 'inter_corr_train.h5')
    df_corr_inter_train = pd.read_hdf(hdf_path)

else:

    # Collect value metrics from images
    train_files = sorted(glob('fMRI_train/*.nii.gz'))

    for idx in tqdm(range(n_maps)):

        # Load mean image
        mean = image.load_img('fMRI_maps/mean_%02d.nii.gz' % (idx + 1))
        data_mean = mean.get_fdata()[mask.get_fdata()>0]

        # Compute binary mask for region
        mask_region = data_mean > np.percentile(data_mean, 99)

        # Store results in results file
        results = {}

        for t in train_files:

            try:
                # Get file name
                t_id = t.split('/')[1].split('.')[0]

                # Load current volume
                img = image.index_img(t, idx)

                # Only extract data values from within mask
                data_img = img.get_fdata()[mask.get_fdata()>0]

                # Collect correlation coefficient to mean image
                corr_coef = np.corrcoef(data_img, data_mean)[0, 1]

                results[t_id] = [t_id, corr_coef]
            except:
                print(t)

        # Store result in CSV file
        df_results = pd.DataFrame(results).T
        df_results.columns = ['Id', 'corr_coef']
        df_results.to_csv('datasets/inter_corr_train_%02d.csv' % (idx + 1), index=False)

    # Load brain value components
    df_corr_inter_train = combine_brain_values(didx='train')
        
df_corr_inter_train.head()

In [None]:
if load_pre_computed_files:

    # Load precomputed inter subject correlation data for the test set
    hdf_path = opj('/kaggle', 'input', 'corr-features', 'inter_corr_test.h5')
    df_corr_inter_test = pd.read_hdf(hdf_path)

else:

    # Collect results
    test_files = sorted(glob('fMRI_test/*.nii.gz'))

    for idx in tqdm(range(n_maps)):

        # Load mean image
        mean = image.load_img('fMRI_maps/mean_%02d.nii.gz' % (idx + 1))
        data_mean = mean.get_fdata()[mask.get_fdata()>0]

        # Compute binary mask for region
        mask_region = data_mean > np.percentile(data_mean, 99)

        # Store results in results file
        results = {}

        for t in test_files:

            try:
                # Get file name
                t_id = t.split('/')[1].split('.')[0]

                # Load current volume
                img = image.index_img(t, idx)

                # Only extract data values from within mask
                data_img = img.get_fdata()[mask.get_fdata()>0]

                # Collect correlation coefficient to mean image
                corr_coef = np.corrcoef(data_img, data_mean)[0, 1]

                results[t_id] = [t_id, corr_coef]
            except:
                print(t)

        # Store result in CSV file
        df_results = pd.DataFrame(results).T
        df_results.columns = ['Id', 'corr_coef']
        df_results.to_csv('datasets/inter_corr_test_%02d.csv' % (idx + 1), index=False)
        
    # Load brain value components
    df_corr_inter_test = combine_brain_values(didx='test')
        
df_corr_inter_test.head()

Now that the inter correlation features are generated, let's take a closer look at them! Just as we did for the IC and FNC features.

In [None]:
# Correlation matrix for inter correlation features
plot_corr_matrix(df_corr_inter_train, df_corr_inter_test, c_restrict=100)

In [None]:
# Correlation between inter correlation features and targets
plot_corr_matrix_target(targets, df_corr_inter_train, c_restrict=100)

In [None]:
# Show highest correlation with target variables and inter correlation dataset
df_corr = pd.concat([np.abs(df_corr_inter_train.corrwith(targets.iloc[:, i])).sort_values(ascending=False).reset_index(drop=True) for i in range(7)], axis=1)
df_corr.columns = targets.columns
df_corr.head(5)

In [None]:
# Pairplots between the first inter correlation features
sns.pairplot(df_corr_inter_train.iloc[:, :n_invest], diag_kind="kde", corner=True);

In [None]:
# Visualization of values in dataset
plot_markers('Visualization of inter correlation features:', df_corr_inter_train)

## Observation 3

These inter subject correlation features also seem to have a high correlation with age and with the var1 features. But here as well, a lot of outliers are still present.

## 3.4. Additional feature exploration 1: More MRI quality measurements

There are a multitude of additional features one can extract from structural and functional MRI images. Luckily, there exists already a great toolbox that can extract them automatically, called [MRIQC](https://mriqc.readthedocs.io/en/latest/measures.html). Even though these metrics are not meant for ICA maps, I've nonetheless investigated a few, such as:


    # Collect euclidean distance to mean image
    euclide_whole = np.linalg.norm(np.subtract(data_img, data_mean))
    euclide_region = np.linalg.norm(np.subtract(data_img[mask_region], data_mean[mask_region]))

    # Collect percentiles from whole image and region
    perc_to_check_r = [0.1, 1, 5, 50, 95, 99, 99.9]
    percentiles_whole = [np.percentile(data_img, p) for p in perc_to_check_r]
    percentiles_region = [np.percentile(data_img[mask_region], p) for p in perc_to_check_r]

    # Download GM, WM and CSF probability maps from ICBM 2009c asymmetric template
    # From: http://www.bic.mni.mcgill.ca/ServicesAtlases/ICBM152NLin2009
    template_gm = nl.image.load_img('templates/mni_icbm152_nlin_asym_09c/mni_icbm152_gm_tal_nlin_asym_09c.nii')
    template_gm = image.resample_to_img(template_gm, mask)
    pve_gm = image.math_img('img * mask', img=template_gm, mask=mask)

    template_wm = nl.image.load_img('templates/mni_icbm152_nlin_asym_09c/mni_icbm152_wm_tal_nlin_asym_09c.nii')
    template_wm = image.resample_to_img(template_wm, mask)
    pve_wm = image.math_img('img * mask', img=template_wm, mask=mask)

    template_csf = nl.image.load_img('templates/mni_icbm152_nlin_asym_09c/mni_icbm152_csf_tal_nlin_asym_09c.nii')
    template_csf = image.resample_to_img(template_csf, mask)
    pve_csf = image.math_img('img * mask', img=template_csf, mask=mask)

    # Create pve mask
    pve_concat = image.concat_imgs([pve_gm, pve_wm, pve_csf])
    pve_mask = image.math_img('np.sum(img, axis=-1)>0.5', img=pve_concat)

    # Find pve binary masks per tissue
    pve_argmax = image.math_img('np.argmax(img, axis=-1) * mask', img=pve_concat, mask=mask)
    pve_mask_gm, pve_mask_wm, pve_mask_csf = [image.math_img('(img==%d)*mask' % i, img=pve_argmax, mask=mask) for i in range(3)]

    # Extract volume data from within PVE masks
    pve_masked_gm_values = img.get_fdata()[pve_mask_gm.get_fdata().astype('bool')]
    pve_masked_wm_values = img.get_fdata()[pve_mask_wm.get_fdata().astype('bool')]
    pve_masked_csf_values = img.get_fdata()[pve_mask_csf.get_fdata().astype('bool')]

    # Collect percentiles from tissue masks
    perc_to_check_t = [1, 5, 50, 95, 99]
    percentiles_gm = [np.percentile(pve_masked_gm_values, p) for p in perc_to_check_t]
    percentiles_wm = [np.percentile(pve_masked_wm_values, p) for p in perc_to_check_t]
    percentiles_csf = [np.percentile(pve_masked_csf_values, p) for p in perc_to_check_t]

    # Compute smoothness to original image difference
    smoothness = np.linalg.norm(image.math_img(
        'img-smooth', img=img, smooth=image.smooth_img(
            img, 6)).get_fdata()[mask.get_fdata()>0])

    # Compute coefficient of joint variation (CJV) within GM and WM
    cjv = (pve_masked_wm_values.std() + pve_masked_gm_values.std()) / \
           np.abs(pve_masked_wm_values.mean() - pve_masked_gm_values.mean())

    # Compute signal to noise ratio
    snr_gm = pve_masked_gm_values.mean() / (pve_masked_gm_values.std() * np.sqrt(len(pve_masked_gm_values)/(len(pve_masked_gm_values)-1)))
    snr_wm = pve_masked_wm_values.mean() / (pve_masked_wm_values.std() * np.sqrt(len(pve_masked_wm_values)/(len(pve_masked_wm_values)-1)))
    snr_csf = pve_masked_csf_values.mean() / (pve_masked_csf_values.std() * np.sqrt(len(pve_masked_csf_values)/(len(pve_masked_csf_values)-1)))

    # Compute wm2max values
    muWM = pve_masked_wm_values.mean()
    wm2max_gm = muWM/np.percentile(pve_masked_gm_values, 99.95)
    wm2max_wm = muWM/np.percentile(pve_masked_wm_values, 99.95)
    wm2max_csf = muWM/np.percentile(pve_masked_csf_values, 99.95)

    # Collect standard deviation from whole image and region
    whole_std = data_img.std()
    region_std = data_img[mask_region].std()

However, none of them seemed to help with the final score.

## 3.5. Additional feature exploration 2: Find voxels with high correlation to targets

A second approach I tried was to look for voxels in the brain that have a high correlation with the target features. Within each of the 53 components and for each target feature, I looked for the top 10 correlating voxels and kept two of them: the top one and the one least correlated with it. At the end, I extracted the values within all of these "top correlating voxels".

This approach was very promising and led to incredible improvements in the (local) score, but the dream collapsed when I realized that it was hopelessly overfitting and reached horrible scores on the public test set.

Here's nonetheless the code connected to this:

    # Compute correlation between data and target
    data_corr = []
    for v in range(data_set.shape[1]):
        dfocus = data_set[:, v]
        dfocus -= np.median(dfocus)
        chigh, clow = np.percentile(dfocus[dfocus>0], 99), np.percentile(dfocus[dfocus<0], 1)
        cselecter = np.logical_and(dfocus<chigh, dfocus>clow)

        data_corr.append(np.corrcoef(dfocus[cselecter], target_set[cselecter])[0, 1])
    data_corr = np.array(data_corr)

    # Find voxels with top target correlation
    top_sort = np.argsort(np.abs(data_corr))[::-1][:10]

    # Find voxel id of orthogonal (most uncorrelated voxel with top correlated voxel)
    orth_corr_id = np.argmin(np.abs(np.corrcoef(data_collection[:, top_sort].T)[0, :]))

    # Specify which voxels to look at
    vox_select = [top_sort[0], top_sort[orth_corr_id]]

    # Add top voxels to voxel_idx
    voxel_to_extract.append(vox_select)
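Why this kind of selection overfits so easily can be illustrated with pure noise. The following is only a sketch with made-up sizes, not the setup used in the competition: no voxel carries any information about the target, and yet the best of several thousand voxels looks clearly correlated on the data it was selected on. `column_corr` is a hypothetical helper:

```python
import numpy as np

def column_corr(X, y):
    """Pearson correlation of every column of X with the vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

rng = np.random.default_rng(0)
n_train, n_holdout, n_voxels = 100, 100, 5000

# Pure noise: no voxel carries any information about the target
X_train = rng.normal(size=(n_train, n_voxels))
y_train = rng.normal(size=n_train)
X_holdout = rng.normal(size=(n_holdout, n_voxels))
y_holdout = rng.normal(size=n_holdout)

# Pick the voxel that looks best on the training data
corr_train = column_corr(X_train, y_train)
best = np.argmax(np.abs(corr_train))

# Compare its correlation on the training data and on unseen data
in_sample = abs(corr_train[best])
out_of_sample = abs(column_corr(X_holdout[:, [best]], y_holdout)[0])

print('in-sample |r|: %.2f, hold-out |r|: %.2f' % (in_sample, out_of_sample))
```

Selecting voxels inside a cross-validation loop, instead of on the full training set, would be one way to notice this effect before submitting.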


# 4. Dataset merging and outlier removal

Now that the 4 datasets are ready (IC, FNC, inter-corr and intra-corr), let's merge them together and remove some outliers.

In [None]:
# Create merged dataset
merge_train = pd.merge(ic_train.reset_index(), fnc_train.reset_index()).set_index('Id')
merge_train = pd.merge(merge_train, df_corr_intra_train, left_index=True, right_index=True)
merge_train = pd.merge(merge_train, df_corr_inter_train, left_index=True, right_index=True)
print(merge_train.shape)

merge_test = pd.merge(ic_test.reset_index(), fnc_test.reset_index()).set_index('Id')
merge_test = pd.merge(merge_test, df_corr_intra_test, left_index=True, right_index=True)
merge_test = pd.merge(merge_test, df_corr_inter_test, left_index=True, right_index=True)
print(merge_test.shape)

Let's now remove samples in which at least x features have an absolute z-score of y or more.

In [None]:
# Detect samples with frequent extreme values, based on z-scores
df_zscore = (merge_train - merge_train.mean())/merge_train.std()

extrem_ids = []
for above_std, how_many_times in [[4, 8], [5, 4], [6, 2]]:

    # Detect extreme values
    extrem_values = np.sum(df_zscore.abs()>=above_std, axis=1)>=how_many_times
    new_extrems = list(np.array(merge_train[extrem_values].index))
    extrem_ids.extend(new_extrems)
    print('Found %d outliers with an absolute z-score of at least %d, at least %d times.' % (len(new_extrems), above_std, how_many_times))

extrem_ids = np.unique(extrem_ids)
print('Total of unique outliers found: %d' % len(extrem_ids))
extrem_ids

To be sure, let's also look for samples that are missing from one of the feature sets (potentially due to problems during feature extraction) and treat them as outliers too.
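
One compact way to express "which expected IDs are absent from a feature table" is `pd.Index.difference`. The sketch below uses toy IDs (`expected_ids` and `toy_features` are illustration-only names with hypothetical values) rather than the notebook's own ID list and feature tables.

```python
import numpy as np
import pandas as pd

# IDs we expect to find, and a toy feature table that lacks two of them
expected_ids = pd.Index([10001, 10002, 10003, 10004], name='Id')
toy_features = pd.DataFrame({'feat': [0.5, 0.7]},
                            index=pd.Index([10001, 10003], name='Id'))

# IDs present in expected_ids but absent from the feature table
missing_ids = expected_ids.difference(toy_features.index)
print(list(missing_ids))

# np.setdiff1d gives the same sorted, unique result on plain arrays
missing_np = np.setdiff1d(expected_ids.values, toy_features.index.values)
print(missing_np)
```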

In [None]:
# Collect IDs that are missing from any feature set, plus the z-score outliers
outliers = np.ravel([np.array([t for t in train_id if not np.isin(t, ic_train.index.values)])])
outliers = np.hstack((outliers, np.array([t for t in train_id if not np.isin(t, fnc_train.index.values)])))
outliers = np.hstack((outliers, np.array([t for t in train_id if not np.isin(t, df_corr_intra_train.index.values)])))
outliers = np.hstack((outliers, np.array([t for t in train_id if not np.isin(t, df_corr_inter_train.index.values)])))
outliers = np.hstack((outliers, extrem_ids))
outliers = np.unique(outliers)
print(len(outliers))
outliers

In [None]:
# Remove outliers from features
ic_train = ic_train.drop(outliers, errors='ignore')
fnc_train = fnc_train.drop(outliers, errors='ignore')
df_corr_intra_train = df_corr_intra_train.drop(outliers, errors='ignore')
df_corr_inter_train = df_corr_inter_train.drop(outliers, errors='ignore')
merge_train = merge_train.drop(outliers, errors='ignore')
print(ic_train.shape, fnc_train.shape, df_corr_intra_train.shape, df_corr_inter_train.shape, merge_train.shape)

In [None]:
# Remove outliers from target
targets = targets.drop(outliers, errors='ignore')
targets.shape

Perfect, so let's store all of this in easily accessible files:

In [None]:
# Store datasets as hdf5 files
merge_train.to_hdf('datasets/merge_train.h5', key='merge_train', mode='w')
merge_test.to_hdf('datasets/merge_test.h5', key='merge_test', mode='w')
targets.to_hdf('datasets/targets.h5', key='targets', mode='w')

In [None]:
# Store scaler in a pickle file
import joblib
joblib.dump(scaler, 'datasets/targets_scaler.pkl');

In [None]:
# Before quitting, let's be mindful of disk space and clean up our working directory
!rm -rf fMRI* datasets/intra* datasets/inter*

# 5. PCA and UMap exploration

To finish, let's quickly look at PCA scree plots of the four datasets and at UMAP plots of the four datasets plus the merged one.
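
When reading a cumulative explained-variance curve, keep in mind that component counts are 1-based while array positions are 0-based. The sketch below uses made-up explained-variance ratios (hypothetical numbers; `n_components_for` is a helper name introduced only for this example) to turn such a curve into a "number of components needed" figure.

```python
import numpy as np

def n_components_for(explained_variance_ratio, thresh):
    """Smallest number of leading components whose cumulative ratio reaches thresh.

    Assumes the threshold is actually reached by the cumulative sum.
    """
    pve_cumsum = np.cumsum(explained_variance_ratio)
    # argmax returns the 0-based position of the first True, so add 1 for a count
    return int(np.argmax(pve_cumsum >= thresh)) + 1

# Made-up ratios that sum to 1 (cumulative: 0.5, 0.75, 0.875, 0.9375, 1.0)
toy_ratio = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
for thresh in [0.5, 0.75, 0.9]:
    print(thresh, n_components_for(toy_ratio, thresh))
```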

In [None]:
from sklearn.decomposition import PCA
for key, df_temp in [['ic features', ic_train],
                     ['fnc features', fnc_train],
                     ['intra corr features', df_corr_intra_train],
                     ['inter corr features', df_corr_inter_train],
                    ]:

    # Standardize features before PCA
    s = StandardScaler()
    X_scaled = s.fit_transform(df_temp)

    # Create PCA reduced dataset
    pca = PCA()
    pca_train = pca.fit_transform(X_scaled)

    # Explore PCA components (component counts are 1-based)
    pve_cumsum = np.cumsum(pca.explained_variance_ratio_)
    plt.figure(figsize=(10, 5))
    plt.title('PCA Explained Variance Ratio: %s' % key)
    plt.step(range(1, len(pve_cumsum) + 1), pve_cumsum)
    plt.show();

    for thresh in [0.8, 0.9, 0.95, 0.99]:
        # argmax gives the 0-based position of the first component reaching
        # the threshold, so add 1 to report a component count
        txt = 'Explained Variance for {}: {}% | Components: {}'.format(
            key, int(thresh * 100),
            np.argmax(pve_cumsum>=thresh) + 1)
        print(txt)

In [None]:
import umap
for key, df_temp in [['ic features', ic_train],
                     ['fnc features', fnc_train],
                     ['intra corr features', df_corr_intra_train],
                     ['inter corr features', df_corr_inter_train],
                     ['merged features', merge_train],
                    ]:

    # Standardize features before PCA
    s = StandardScaler()
    X_scaled = s.fit_transform(df_temp)

    # Reduce to the first 20 principal components
    pca = PCA(20)
    pca_train = pca.fit_transform(X_scaled)
    
    # Transform data with UMAP
    transf = umap.UMAP(n_neighbors=10)
    X_umap = transf.fit_transform(pca_train)

    # Plot UMAP embeddings, colored by each target
    print('Plotting', key)
    fig, ax = plt.subplots(nrows=1, ncols=7, figsize=(21, 3))
    for i, c in enumerate(targets.columns):
        ax[i].scatter(X_umap[:, 0], X_umap[:, 1], s=1, c=targets[c].values, cmap='Spectral')
        ax[i].set_title(c)
        ax[i].axis('off')
    plt.show()

## Observation 4

Hmm... I've never seen these two clusters in the "intra corr features". That is new and very interesting. Perhaps somebody will find an explanation for that?