In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from nilearn import plotting, image 
import nibabel as nb
import h5py
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# First Look at the Data

Hi there, 
I am currently finishing my PhD in cognitive neuroscience, so this is a nice distraction from the writing process, and a little bit of domain knowledge might show up here :) 

On the other hand, age prediction, resting state analysis, and structural imaging are not my area of expertise, so I don't know much about the different ICA algorithms used here. 

I will try to provide a high-level overview of the different kinds of data, and I will update this notebook when I have the time and get around to reading up on some of the literature. 

We got many different files here, which are of course described on the competition web page, but let's see what we can learn by looking at them. 
Also, if you don't know much about nilearn yet: it's a great toolset, and maybe you will find some useful functions here!

(final version so far, still missing links to more and better information - which I still might add later)

Also please forgive me some weird sentences and typos :) 

# The brain mask
> fMRI_mask.nii - a 3D binary spatial map

Well, that one is the most obvious: 
If you want to know which parts of the NIfTI image are inside the brain, especially across many, many participants, this is what the brain mask is for. 
Different algorithms are often applied during pre-processing for this, for example to remove parts of the skull, but also to realign the images to correct for movement in the scanner. We don't have much information right now about what the dimensions are, but just as a note here: functional connectivity is often prone to movement artifacts.

In [None]:
# Nifti MASK image
brain_mask = nb.load('../input/trends-assessment-prediction/fMRI_mask.nii')
plotting.plot_roi(brain_mask, title='fMRI_mask.nii');

# fnc.csv
> static FNC correlation features for both train and test samples 

First up, for people who do not know a lot about brain imaging studies.
#### What is resting state
In resting state measurements, participants lie in the MRI scanner with their eyes open or closed (there is/was a huge debate about which is best...) and are told to do nothing in particular, sometimes with the additional instruction not to think about anything specific or to meditate. This usually goes on for several minutes - it can be really hard not to fall asleep, trust me. While participants lie in the scanner, typically the [**B**lood **O**xygen **L**evel **D**ependent](https://en.wikipedia.org/wiki/Blood-oxygen-level-dependent_imaging) signal is measured, which serves as a proxy of brain activity. 

Although the "task" sounds really error-prone and unspecific, this kind of measurement has been the key to many insights in human neuroscience, and the results are surprisingly stable. 

In resting state you have 4D data: a 3D image (made up of voxels) measured over time. 

### What has been done? 

> The second set are static functional network connectivity (FNC) matrices. These are the subject-level cross-correlation values among 53 component timecourses estimated from GIG-ICA of resting state functional MRI (fMRI).

This is different from - but a not uncommon alternative to - typical resting state functional connectivity. In classical **F**unctional **C**onnectivity you would use a brain atlas and average all voxels within each brain region. Then you calculate the correlation (or another measure) between the regional time series and get a brain connectivity matrix.
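
To make the classical approach concrete, here is a minimal sketch on random numbers. The sizes (200 time points, 30 voxels, 3 regions) and all variable names are made up for illustration; nothing here uses the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 time points, 30 voxels, 3 atlas regions.
n_timepoints, n_voxels, n_regions = 200, 30, 3
voxel_ts = rng.standard_normal((n_timepoints, n_voxels))
# Atlas label (0, 1 or 2) for every voxel; 10 voxels per region.
atlas_labels = np.repeat(np.arange(n_regions), n_voxels // n_regions)

# Step 1: average all voxel time series inside each region.
region_ts = np.column_stack(
    [voxel_ts[:, atlas_labels == r].mean(axis=1) for r in range(n_regions)]
)

# Step 2: correlate the regional time series with each other.
fc_matrix = np.corrcoef(region_ts, rowvar=False)
print(fc_matrix.shape)
```

The result is a symmetric region-by-region matrix with ones on the diagonal, which is the kind of object the FNC features in this challenge describe for ICA components instead of atlas regions.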

Here is an example of an atlas (well, one you wouldn't typically use, but it serves the purpose): 

In [None]:
from nilearn import datasets
# This is just supposed to be an example - so it's not too important
try:
    # Fetching the atlas needs internet access, so keep the download inside the try block
    aal = datasets.fetch_atlas_aal()
    plotting.plot_roi(aal['maps'], title='Example of a Brain Atlas (aal)')
except Exception:
    print("Probably time out")

In this challenge, however, the approach is a bit different. Using ICA, different spatial maps have been created to extract brain networks which are typically present in the resting brain. 
As described, the maps were created on a different dataset to avoid information leakage. The time series of each component has been correlated with the time series of every other component, so we get (53 * 53 - 53) / 2 = 1378 features. 
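
The feature count can be double-checked by counting the entries of the upper triangle of a 53 x 53 matrix without its diagonal. This is just a small arithmetic sketch; the variable names are made up.

```python
import numpy as np

n_components = 53
# Indices of the upper triangle without the diagonal (k=1).
rows, cols = np.triu_indices(n_components, k=1)
n_features = len(rows)
print(n_features)  # 1378
```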

In [None]:
# Only read the first 5 rows to get a quick look at the file
fnc_10 = pd.read_csv('../input/trends-assessment-prediction/fnc.csv', nrows=5)
fnc_10.head()

For every feature we get a description of the connections (i.e. which component with which component).
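
As a sketch of how such a name can be taken apart, the snippet below assumes a naming pattern like `SCN(53)_vs_SCN(69)`. The two example names and the helper `parse_fnc_column` are made up for illustration and should be checked against the real column names.

```python
import re

# Assumed naming pattern, e.g. 'SCN(53)_vs_SCN(69)'; these example names are made up.
example_cols = ['SCN(53)_vs_SCN(69)', 'DMN(17)_vs_VSN(20)']
pattern = re.compile(r'([A-Za-z]+)\((\d+)\)')

def parse_fnc_column(name):
    """Return [(network, component_number), ...] for one column name."""
    return [(net, int(num)) for net, num in pattern.findall(name)]

parsed = [parse_fnc_column(c) for c in example_cols]
print(parsed)
```

A regular expression is a bit more robust here than chained `split` calls, because it does not depend on the exact separator between the two halves of the name.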

Let's extract the different names. 

In [None]:
fnc10_cols = fnc_10.columns.to_list()[1:]
fnc10_cols_filtered = [i.split('_')[0] for i in fnc10_cols]
print(np.unique(fnc10_cols_filtered))

### ICN_numbers.csv

We see there are a number of different names with numbers.
Using the file `ICN_numbers.csv`:
> ICN_numbers.txt - intrinsic connectivity network numbers for each fMRI spatial map; matches FNC names  

We could now assign the different names to indices in the matrices (for example in `fMRI_train`) 

But what do the names mean? Shooting from the hip, I only recognize the first one. As I said, I am not an expert and those abbreviations are a bit ambiguous to me:
1. DMN - Default Mode Network (the idle, non-task network)
Based on the plots below we can try to name the others:
2. SMN - somatosensory/motor network
3. VSN - visual network
4. CON - still not really sure
5. ADN - auditory
6. CBN - something in the cerebellum
7. SCN - something sub-cortical 

In [None]:
# Let's extract the indices for the different networks
# Network index:
ntwk_idx = {}
network_names = np.unique([i[:3] for i in fnc10_cols_filtered])
for ii in network_names:
    ntwk_idx[ii] = np.unique([int(i.split('(')[-1].split(')')[0]) for i in fnc10_cols_filtered if ii in i])
    
# Look up matrix index
icn_number = pd.read_csv('../input/trends-assessment-prediction/ICN_numbers.csv')

icn_idx = {}

for jj in ntwk_idx.keys():
    icn_idx[jj] = np.array(icn_number.index[icn_number.ICN_number.isin(ntwk_idx[jj])])

# SM_features
> fMRI_train - a folder containing 53 3D spatial maps for train samples in .mat format  

We are doing a few leaps and jumps here, but all these aspects of the data are connected. 

If I understand the description correctly, this data is again something different from the functional network connectivity above. 

While the networks in `fnc.csv` have been estimated on a different dataset, the ICA maps in the different .mat files seem to be estimated from the resting state data of each participant:

> The third set of features are the component spatial maps (SM). These are the subject-level 3D images of 53 spatial networks estimated from GIG-ICA of resting state functional MRI (fMRI).

Let's load some data, and maybe we can fill the gaps in the numbered list above.

In [None]:
# We load the data using h5py
test_mat1 = h5py.File('../input/trends-assessment-prediction/fMRI_test/11000.mat', mode='r')
print(test_mat1.keys())
test_mat1 = np.array(test_mat1.get('SM_feature'))
print('Dimensions of ICA feature map')
print(test_mat1.shape)
print('Dimensions of the brain mask')
print(brain_mask.shape)

## Let's also load a second participant
test_mat2 = h5py.File('../input/trends-assessment-prediction/fMRI_test/10006.mat', mode='r')
test_mat2 = np.array(test_mat2.get('SM_feature'))

If we want to visualize the maps, we need to know something about the dimensions of the data. Comparing with the brain mask that was provided, we see that the dimensions are not quite in the same order. 
Secondly, we have two dimensions of size 53, so we need to play around to find the correct ordering. 
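
The axis juggling is easier to follow on an empty array. The shapes below are assumptions chosen for illustration (a 4D array with the component axis first and a 3D mask); compare them with the shapes printed above before relying on them.

```python
import numpy as np

# Hypothetical shapes for illustration only: a 4D array as it might come out of
# h5py (axes reversed relative to MATLAB) and a 3D mask shape.
sm_like = np.zeros((53, 52, 63, 53), dtype=np.float32)  # (component, z, y, x)
mask_shape = (53, 63, 52)                                # (x, y, z)

# Reversing all axes yields (x, y, z, component).
sm_xyzc = sm_like.transpose(3, 2, 1, 0)
print(sm_xyzc.shape)

# A single component map, with its three spatial axes reversed.
single_map = sm_like[0].transpose(2, 1, 0)
print(single_map.shape == mask_shape)
```

Because two axes have length 53, matching shapes alone do not prove the order is right; plotting one map over the mask is still the real check.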

In [None]:
# Somehow nilearn is not happy with plotting plain arrays anymore - so we have to create a NIfTI image first:
def map_for_plotting(mat, brain_mask):
    # Assuming that we provide a 3D image
    # image.new_img_like creates a NIfTI image from an array by copying information
    # such as the affine from the source image (here brain_mask).
    return image.new_img_like(brain_mask, mat.transpose([2, 1, 0]))

In [None]:
# Let's extract the indices for the different average networks
sample_maps1 = {}
sample_maps2 = {}
for ii in icn_idx.keys():
    # icn_idx holds 0-based row positions from the pandas index, so no MATLAB-style offset is applied
    sample_maps1[ii] = map_for_plotting(test_mat1[icn_idx[ii]].mean(0), brain_mask)
    sample_maps2[ii] = map_for_plotting(test_mat2[icn_idx[ii]].mean(0), brain_mask)

## Average SM Feature Maps

In [None]:
fig, axes = plt.subplots(len(sample_maps1), 2, figsize=(20, 10))

for n, ii in enumerate(sample_maps1.keys()):
    # We are plotting glass brains here - a nice way to visualize brain maps
    plotting.plot_glass_brain(sample_maps1[ii], title=ii, axes=axes[n, 0], plot_abs=False)
    plotting.plot_glass_brain(sample_maps2[ii], title=ii, axes=axes[n, 1], plot_abs=False)
axes[0, 0].set_title('Networks for Participant 1');
axes[0, 1].set_title('Networks for Participant 2');

Here are the networks for two participants plotted next to each other. As I alluded to earlier, resting state networks are quite stable and similar, even across participants. 
On the other hand, they also show some individual differences (it is even possible to identify participants based on their network connectivity alone).

I think there will be some information here for the different prediction targets, but for some first piloting of models I am going to skip this data and see how far I get. 
As I said before, I am not too familiar with the literature (right now), and these maps could be **the** key features to focus on later, but for some very first analyses it might be possible to get good results with the connectivity data alone. (Just wildly guessing)

# Back to FNC data
We now have some idea about the different networks involved in the functional network connectivity. We could now try to reshape the FNC data to a connectivity matrix. 

In [None]:
# This is probably totally inefficient - but let's try it
icn_mat_idx = icn_number.T.to_dict('list')
# Invert the mapping: ICN number -> row index
icn_mat_idx = {i[0]: j for i, j in zip(icn_mat_idx.values(), icn_mat_idx.keys())}
# Map names to indices
name_matrix = {}

for fnco in fnc10_cols:
    name_matrix[fnco] = [int(icn_mat_idx[int(i.split(')')[0])]) for i in fnco.split('(')[1:]]
    
# And now create a sample connectivity matrix:
con_matrix1 = np.zeros((53, 53))
con_matrix2 = np.zeros((53, 53))

for n in fnc10_cols:
    r_, c_ = name_matrix[n]
    con_matrix1[c_, r_] = fnc_10.iloc[0, :][n]
    con_matrix2[c_, r_] = fnc_10.iloc[1, :][n]
# And now add the transpose - it's symmetric
con_matrix1 += con_matrix1.T
con_matrix2 += con_matrix2.T

# Prepare labeling:
col_halves = np.array([jj.split('_')[-1]  for jj in name_matrix.keys()])
_, idx = np.unique(col_halves, return_index=True)
col_labels = col_halves[np.sort(idx)]

In [None]:
fig, ax = plt.subplots(1, 2,figsize=(20, 7.5))

sns.heatmap(con_matrix1, cmap='coolwarm', square=True, ax=ax[0], 
            xticklabels=col_labels, 
            yticklabels=col_labels, cbar=False, center=0, vmin=-1, vmax=1)

sns.heatmap(con_matrix2, cmap='coolwarm', square=True, ax=ax[1], 
            xticklabels=col_labels, 
            yticklabels=col_labels, cbar=False, center=0, vmin=-1, vmax=1)

ax[0].set_title('Example 1')
ax[1].set_title('Example 2');

I am not totally sure about the labels and whether everything is in the correct location, but it doesn't seem too shabby.
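
One way to gain some confidence in such a reshaping is a round-trip check on a synthetic vector. The sketch below assumes a plain upper-triangle ordering, which is not necessarily the column order of `fnc.csv`; it only illustrates which properties a correctly filled matrix should have.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 53
rows, cols = np.triu_indices(n, k=1)
values = rng.uniform(-1, 1, size=len(rows))  # stand-in for one row of fnc.csv

# Fill one triangle, then mirror it.
mat = np.zeros((n, n))
mat[rows, cols] = values
mat += mat.T

is_symmetric = np.allclose(mat, mat.T)
diag_is_zero = np.allclose(np.diag(mat), 0)
n_filled = np.count_nonzero(mat) // 2
roundtrip_ok = np.allclose(mat[rows, cols], values)
print(is_symmetric, diag_is_zero, n_filled, roundtrip_ok)
```

If the real matrices fail one of these checks (for example fewer than 1378 filled cells per triangle), two column names were probably mapped to the same cell.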

The two example matrices look much as expected: we have high connectivity (i.e., correlation) between components with the same name, and some connectivity between components of different networks. 

We also see that different participants show different patterns. For example, the two participants differ in how strongly the components of the visual network (VSN) are interconnected.

My first bet is that these networks will provide much of the information we need for this challenge.

# Loading.csv

> loading.csv - sMRI SBM loadings for both train and test samples

Unfortunately, I cannot tell you that much about this one, except:

> The first set of features are source-based morphometry (SBM) loadings. These are subject-level weights from a group-level ICA decomposition of gray  matter concentration maps from structural MRI (sMRI) scans.

We again get values from an ICA decomposition, this time for structural images, which look something like this:

In [None]:
plotting.plot_anat(datasets.load_mni152_template(), title='MNI template');

On a technical note - structural images use a different contrast than BOLD images (which underlie all the data described above, i.e., functional connectivity). 
The aim of these images is to provide a better anatomical picture of the brain. Due to the measurement procedure, BOLD images usually have a relatively low spatial resolution, as you want to squeeze in as many data points over time as possible. For structural images the measurement can take several minutes for one image, achieving a higher spatial resolution. 

So for this type of data we receive information about the concentration of gray matter in the brain. This does not tell us much about the dynamics or interactions in the brain, but it can tell different stories about how a person's life went. If I remember correctly, there was a study many years back showing that playing Tetris increases gray-matter density and volume in some brain areas. I don't know how reliable this particular study was, but let's say you will find traces of some longer-lasting processes here. 

Particularly interesting for this challenge is, I think, that gray matter volume decreases with age. So something might be here for you to find :) 

# Target variables - age and assessments
After getting an overview of the different kinds of data in the challenge, we can now have a look at the target variables in the training set. These are stored in this .csv:

> train_scores.csv - age and assessment values for train samples

We have age (which has been manipulated for privacy reasons) and multiple assessment variables (could be: depression or mental health, working memory, memory, intelligence, personality...). Also, we know there are different sites in this dataset, which can introduce many possible biases, like:
* Different MRI scanners (manufacturers, models) have an effect on the measured data
* Different sample populations
* Different measurement times (maybe one site measures earlier in the morning, the other in the evening)
* Different weather (yup, this has been found to have a slight effect on measurements) 

So we need to make sure that our predictions are robust enough! 

## Distributions

In [None]:
scores = pd.read_csv('../input/trends-assessment-prediction/train_scores.csv')
scores = scores.set_index('Id')
scores.head()

In [None]:
scores.isna().sum(0)

And we already see that there are quite a few missing values in the sample, especially for the first domain. In this quick and dirty approach I will just drop the rows with missing values for now, but will need to look into a proper imputation method later.  

In [None]:
scores = scores.dropna()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 5))

for ax, data in zip(axes.flatten()[:5], scores.columns[:5]):
    deciles = np.percentile(scores[data].values, [10, 20, 30, 40, 50, 60, 70, 80, 90])
    
    sns.distplot(scores[data], ax=ax)
    # Ugly, but whatever
    for de in deciles:
        ax.axvline(de, c='black')
    
    ax.set_title(data)
    
axes.flatten()[-1].set_axis_off()

plt.tight_layout()

We see that most of the data seems to be relatively normally distributed. As you can see in this plot, I also included the deciles of each distribution. Note that filling the `NaN`s with 0 instead of dropping them would add a left skew to the distributions of the assessment scores. 

To not bias our thinking about the whole dataset too much, I will use the deciles to discretize the data distribution, which I will then use to create a stratified sub-sample (this will also be helpful for cross-validation).
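
The discretization idea can be sketched with `np.percentile` and `np.digitize`. The data below is random and only stands in for one target column; the names `age_like`, `cut_points` and `bins` are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stand-in for one target column.
age_like = pd.Series(rng.normal(50, 13, size=1000), name='age')

# Nine cut points split the values into ten bins (labels 0..9).
cut_points = np.percentile(age_like, np.arange(10, 100, 10))
bins = np.digitize(age_like, cut_points)

bin_counts = pd.Series(bins).value_counts().sort_index()
print(bin_counts)
```

Each bin holds roughly a tenth of the values, so the bin label is a compact summary of where a participant sits in the distribution.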

## Correlation between target values
Directly running into an issue - the discretization is too fine, i.e. some combinations occur only once. 

A look at the correlation could help to find a better, less fine, discretization.

In [None]:
sns.heatmap(scores.corr())

Correlation within domain1 is quite high, so dropping one of its variables might be quite safe. For domain2 the two variables are not strongly correlated, which is unfortunate. For a better split I will, however, drop domain2_var1 from the stratification, as domain2_var2 is less correlated with age. 

In [None]:
# Add discretization to data (quintiles instead of deciles, to get coarser bins):
for sc in scores.columns[:5]:
    quintiles = np.percentile(scores.loc[:, sc], [20, 40, 60, 80])
    discr = np.digitize(scores.loc[:, sc], quintiles)
    scores.loc[:, sc + '_discrete'] = discr.astype(str)
    
# Everything to one variable:
scores.loc[:, 'stratify'] = (scores['age_discrete'] + '_'
                             + scores['domain1_var1_discrete'] + '_' 
                             + scores['domain2_var2_discrete'])

In [None]:
scores.stratify.value_counts()

Not great, but could be sufficient. 
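
A stratified split fails if a stratum has only a single member, so it can help to look for such strata explicitly. The labels below are made up, and pooling rare strata into one class is just one possible workaround, not something this notebook does.

```python
import pandas as pd

# Hypothetical composite labels, as produced by joining the discretized columns.
stratify = pd.Series(['0_0_0'] * 5 + ['1_2_3'] * 3 + ['3_3_3'] + ['4_4_4'])

counts = stratify.value_counts()
min_members = 2  # a stratified split needs at least 2 members per class
too_small = sorted(counts[counts < min_members].index)
print(too_small)

# One option (an assumption, not what this notebook does): pool rare strata.
stratify_safe = stratify.where(~stratify.isin(too_small), 'rare')
print(stratify_safe.value_counts())
```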

In [None]:
# And now draw a stratified sample, we will statistically analyse 20% of the data
from sklearn.model_selection import train_test_split

train_idx, _ = train_test_split(scores.index, train_size=0.2, random_state=223, stratify=scores.stratify)

In [None]:
scores_stat = scores.loc[train_idx]

fig, axes = plt.subplots(2, 3, figsize=(15, 5))

for ax, data in zip(axes.flatten()[:5], scores_stat.columns[:5]):
    deciles = np.percentile(scores_stat[data].values, [10, 20, 30, 40, 50, 60, 70, 80, 90])
    
    sns.distplot(scores_stat[data], ax=ax)
    sns.distplot(scores[data], ax=ax)
    ax.legend(['Subsample', 'Original'])
    # Ugly, but whatever
    for de in deciles:
        ax.axvline(de, c='black')
    
    ax.set_title(data)

    
axes.flatten()[-1].set_axis_off()

plt.tight_layout()

The deciles are calculated based on the subsample here. This stratification approach seems to provide a good representation of the original dataset. So we can start to do some basic statistics and learn a bit about the relationship between features and targets. 
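
Besides eyeballing the plots, the match between subsample and full sample can be expressed as a number, for example the gap between their deciles. The sketch below uses random data and made-up names (`full`, `sub`, `decile_gap`); it only illustrates the idea.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
full = pd.DataFrame({'age': rng.normal(50, 13, size=2000)})
# Quintile labels, similar to the discretization step.
full['label'] = np.digitize(full['age'], np.percentile(full['age'], [20, 40, 60, 80]))

# Draw a stratified 20% subsample.
sub_idx, _ = train_test_split(full.index.to_numpy(), train_size=0.2,
                              random_state=0, stratify=full['label'])
sub = full.loc[sub_idx]

# Absolute difference between the deciles of subsample and full sample.
q = np.arange(10, 100, 10)
decile_gap = np.abs(np.percentile(sub['age'], q) - np.percentile(full['age'], q))
print(decile_gap.round(2))
```

Small gaps relative to the spread of the variable suggest that the subsample is a fair miniature of the full data.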

# Statistical Analysis

## FNC - correlations
Let's investigate the functional connectivity. In a non-spatial mass-univariate analysis we can, for example, look at the correlation between each functional connectivity value and the target variables across participants. Based on certain patterns, we could later think about restricting our analysis of the `SM` maps for some targets. 
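
A detail worth knowing for this kind of analysis: `DataFrame.corrwith` aligns on the index before correlating, so features and targets do not need to be in the same row order. A small sketch with made-up Ids and random values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
ids = [101, 102, 103, 104, 105, 106]
features = pd.DataFrame(rng.standard_normal((6, 3)), index=ids,
                        columns=['f1', 'f2', 'f3'])
# Target is a linear function of f1, stored in a shuffled row order.
target = pd.Series(features['f1'].values * 2.0 + 1.0, index=ids)
target = target.sample(frac=1, random_state=0)

# One Pearson correlation per feature column, matched by index.
corr = features.corrwith(target)
print(corr.round(3))
```

The correlation for `f1` comes out as 1 despite the shuffling, which shows that the rows were matched by Id and not by position.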

In [None]:
fnc = pd.read_csv('../input/trends-assessment-prediction/fnc.csv', index_col='Id')

In [None]:
fnc_sample = fnc.loc[train_idx, :]

In [None]:
# Correlations: 
corr_df = []
for col_score in ['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2']:
    tmp_corr = fnc_sample.corrwith(scores_stat.loc[:, col_score]).to_frame().transpose()
    corr_df.append(tmp_corr)

fig, axes = plt.subplots(1, 5, figsize=(15, 5), sharex=True, sharey=True)

for col_score, ax, tmp_df in zip(['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2'], axes.flatten(), corr_df):
    ax.hist(tmp_df.transpose().values.ravel())
    ax.set_title(col_score)
plt.suptitle('Histograms of pearson correlations');

We have really low correlation values, so plotting them all in a connectivity matrix might not provide the best visualization. I am therefore thresholding the matrices at the 90th percentile of the absolute correlation values. 
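
The thresholding step on its own looks like this. The values are random stand-ins for correlations, and the 90th percentile matches the value used in the code cell that follows.

```python
import numpy as np

rng = np.random.default_rng(4)
corr_values = rng.uniform(-0.25, 0.25, size=1378)  # synthetic correlations

# Keep only values whose magnitude reaches the 90th percentile, zero the rest.
threshold = np.percentile(np.abs(corr_values), 90)
thresholded = np.where(np.abs(corr_values) >= threshold, corr_values, 0.0)

kept_fraction = np.count_nonzero(thresholded) / corr_values.size
print(round(kept_fraction, 3))
```

Thresholding on the absolute value keeps strong negative correlations as well as strong positive ones, and about a tenth of the entries survive.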

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15,10), sharex=True, sharey=True)

for col_score, ax, tmp_corr in zip(['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2'], axes.flatten(), corr_df):

    perc_ = np.percentile(np.abs(tmp_corr.values.ravel()), 90)
    tmp_matrix = np.zeros((53, 53))
    for n in fnc10_cols:
        r_, c_ = name_matrix[n]
        tmp_matrix[c_, r_] = tmp_corr.iloc[0, :][n]
            
    tmp_matrix[np.abs(tmp_matrix) < perc_] = 0
    tmp_matrix += tmp_matrix.T

    sns.heatmap(tmp_matrix, cmap='coolwarm', square=True, ax=ax, 
            xticklabels=col_labels, 
            yticklabels=col_labels, cbar=False, center=0, vmin=-0.25, vmax=0.25)

    ax.set_title(col_score)
    
axes.flatten()[-1].set_axis_off()
print(perc_)
plt.tight_layout()

There seems to be a lot of noise in here. But for example:
* Patterns in domain2_var2 and age seem to be going into opposite directions
* The VSN network seems to be important for age. Negative correlation between SMN and higher correlations with DMN
* SCN and correlations to SMN/ADN show stronger patterns in many of the variables.

### Applying dimensionality reduction might help us find multivariate patterns

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.8, whiten=True, svd_solver='full')
pca.fit(fnc_sample.values)

components = pca.transform(fnc_sample.values)
components = pd.DataFrame(components, index=scores_stat.index)

In [None]:
fig, axes = plt.subplots(1,3, figsize=(15,5))

axes[0].plot(pca.explained_variance_ratio_[:21])
axes[0].set(title='Elbowplot of PCA components', ylabel='Explained Variance', xlabel='Components')

tmp_matrix1 = np.zeros((53, 53))
tmp_matrix2 = np.zeros((53, 53))
for i, n in enumerate(fnc10_cols):
    r_, c_ = name_matrix[n]
    tmp_matrix1[c_, r_] = pca.components_[0, i]
    tmp_matrix2[c_, r_] = pca.components_[1, i]

tmp_matrix1 += tmp_matrix1.T
tmp_matrix2 += tmp_matrix2.T

sns.heatmap(tmp_matrix1, cmap='coolwarm', square=True, ax=axes[1], 
        xticklabels=col_labels, 
        yticklabels=col_labels, cbar=False, center=0)
axes[1].set(title='Component 0')
axes[1].set_xticklabels(axes[1].get_xmajorticklabels(),  fontsize=6)

sns.heatmap(tmp_matrix2, cmap='coolwarm', square=True, ax=axes[2], 
        xticklabels=col_labels, 
        yticklabels=col_labels, cbar=False, center=0)
axes[2].set(title='Component 1')
axes[2].set_xticklabels(axes[2].get_xmajorticklabels(), fontsize=6)

plt.tight_layout()

Using an elbow plot of the explained variance ratios doesn't really give us much hope of finding interesting patterns in the PCA analysis. The explained variance of the components tapers off really quickly, with only the first one explaining a reasonable amount.
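
For reference, passing a float as `n_components` makes scikit-learn keep just enough components to pass that share of explained variance. The sketch below shows this on synthetic data with a few strong latent directions; `X` and `pca_demo` are made-up names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Synthetic data with a few strong latent directions plus noise.
latent = rng.standard_normal((300, 5)) * np.array([5.0, 3.0, 2.0, 1.5, 1.0])
mixing = rng.standard_normal((5, 40))
X = latent @ mixing + 0.5 * rng.standard_normal((300, 40))

# Keep as many components as needed to explain 80% of the variance.
pca_demo = PCA(n_components=0.8, svd_solver='full')
pca_demo.fit(X)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(pca_demo.n_components_, cumulative.round(3))
```

The cumulative curve is often easier to read than the per-component elbow plot when deciding how many components to keep.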

In [None]:
pca_corr = []
for kk in range(10):
    pca_corr.append(scores_stat[['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2']].corrwith(components.loc[:,kk]))

pd.concat(pca_corr, axis=1)

Our hunch is confirmed. Only the first component seems to have a small correlation with age.

## T1 images - structural components

So here we have only 26 components. Let's do the same correlation game again!

In [None]:
loadings_samp = pd.read_csv('../input/trends-assessment-prediction/loading.csv', index_col='Id')

In [None]:
loadings_samp = loadings_samp.loc[train_idx, :]

In [None]:
# Correlations: 
fig, axes = plt.subplots(1, 5, figsize=(15,5))
load_corr = []
for ax, col_score in zip(axes.flatten(), ['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2']):
    tmp_corr = loadings_samp.corrwith(scores_stat.loc[:, col_score]).to_frame().transpose()
    load_corr.append(tmp_corr)
    ax.hist(tmp_corr.values.ravel())
    ax.set_title(col_score)
    
load_corr = pd.concat(load_corr)
load_corr.loc[:, 'Assessment'] = ['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2'] 
load_corr.set_index('Assessment', inplace=True)

Here we seem to be a little bit more lucky: there are some ICs that are anti-correlated with age! This matches the hunch I mentioned before, as grey matter density seems to decrease with age. 
domain2_var2 also seems to be related to the grey-matter loadings of some components. 

# Looking at SM maps

I'm trying to make this approach a bit easier. Nilearn does not only have great plotting capabilities, it can also be used to extract large amounts of data. 

I mentioned atlases and parcellations in the FNC section before. To reduce the huge amount of data in the SM files, I am using a relatively coarse parcellation to extract, for each of the networks, average values per region.

In an ADHD prediction challenge, I had the feeling that very high resolutions are not really helpful, so I am going with a middle way. I am not too familiar with the newer atlases, so I took one I had heard about at some point. 

In [None]:
try:
    basc_data = datasets.fetch_atlas_basc_multiscale_2015(version='sym', data_dir=None, resume=True, verbose=1)
except Exception:
    print("Probably time out")

In [None]:
basc_197 = nb.load(basc_data['scale197'])
plotting.plot_roi(basc_197)

There are 197 regions in the atlas. We have 53 components, so we get 197 * 53 = 10441 different features, which is a lot less than using all voxels, but still a lot. To extract the data, we can use a masker from nilearn.
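
Conceptually, a labels masker averages each map within each atlas region. The NumPy sketch below mimics that on toy sizes (4 maps, 1000 voxels, 5 regions, all made up) and is not a replacement for the nilearn masker, which also handles resampling and the brain mask.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical toy sizes: 4 maps, 1000 in-mask voxels, 5 atlas regions.
n_maps, n_voxels, n_regions = 4, 1000, 5
maps = rng.standard_normal((n_maps, n_voxels))
labels = rng.integers(0, n_regions, size=n_voxels)  # region of each voxel

# Average every map within every region -> (n_maps, n_regions).
region_means = np.stack(
    [maps[:, labels == r].mean(axis=1) for r in range(n_regions)], axis=1
)
features = region_means.ravel()
print(region_means.shape, features.shape)

# The same arithmetic for this notebook: 53 maps x 197 regions.
n_features_basc = 53 * 197
print(n_features_basc)  # 10441
```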

In [None]:
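# The maskers live in nilearn.maskers in newer nilearn versions (formerly nilearn.input_data)
try:
    from nilearn.maskers import NiftiLabelsMasker
except ImportError:
    from nilearn.input_data import NiftiLabelsMasker
# We also use the brain_mask from the beginning
basc197_masker = NiftiLabelsMasker(basc_197, mask_img=brain_mask)

def load_matlab(participant_id, masker, path='../input/trends-assessment-prediction/fMRI_train/'):
    # Read the spatial maps of one participant and close the file again
    with h5py.File(f'{path}{participant_id}.mat', mode='r') as f:
        mat = np.array(f.get('SM_feature'))
    # Reverse the axis order to get a 4D image with the components on the last axis
    mat = masker.fit_transform(nb.Nifti1Image(mat.transpose([3,2,1,0]), affine=masker.mask_img.affine))
    return mat.flatten()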

In [None]:
# This takes ages (about 8 min for the 20% sample ... so times 5 for the whole data set)
from joblib import Parallel, delayed

sm_data = Parallel(n_jobs=-1)(delayed(load_matlab)(ii, basc197_masker) for ii in tqdm(list(train_idx)))
sm_data = np.stack(sm_data)

So now we have the data... But I am not really sure what to do with it. Let's try feature reduction and look at some correlations. 

In [None]:
from sklearn.decomposition import PCA
# Fit and transform in a single step
pca_2 = PCA(n_components=0.6, whiten=True)
components2 = pca_2.fit_transform(sm_data)

In [None]:
fig, axes = plt.subplots(1,1, figsize=(15,5))

axes.plot(pca_2.explained_variance_ratio_[:30])
axes.set(title='Elbowplot of PCA components', ylabel='Explained Variance', xlabel='Components')

plt.tight_layout()

### The correlation game

In [None]:
components2 = pd.DataFrame(components2, index=scores_stat.index)
pca2_corr = []
for kk in range(20):
    pca2_corr.append(scores_stat[['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2']].corrwith(components2.loc[:,kk]))

pca2_scorr = pd.concat(pca2_corr, axis=1)
pca2_scorr

Based on the elbow plots the explained variance ratios are decreasing quite quickly, which is expected based on the input-data. 

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(15,5), sharex=True, sharey=True)

for n, ax in enumerate(axes.flatten()):
    ax.plot(pca2_scorr.iloc[n, :])
    ax.set_title(pca2_scorr.index[n])
    ax.axhline(0)

Again some components seem to be quite highly correlated with age, while the correlations with the other assessment variables are lower (as in the other analyses). 

This is not a really meaningful analysis, but it might inform you whether to include the individual scores in the analysis. 

# Conclusion
We have a lot of data in different forms. Integrating all the data in one single model might prove difficult, especially as FNC connectivity and SM values provide many, many different features. 

* Age seems to be correlated with many of the features in the data, more so than the other targets.

All the other variables are less strongly expressed. 

Furthermore, we seem to be able to stratify the data based on quantile bins of a couple of the variables, for a potentially good cross-validation. 

# Predictions
Well for version 11 I said this is the final version. But sometimes I also like to run some basic models :)

Also to see if the partition we are using helps us in terms of getting a good estimate of the prediction error. 
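
As a reminder of what the stratified partition buys us: `StratifiedKFold` keeps the share of each stratum roughly constant across folds. A small sketch with made-up composite labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical composite labels: three strata with 40, 40 and 20 members.
labels = np.array(['0_0_0'] * 40 + ['1_1_1'] * 40 + ['2_2_2'] * 20)
X_dummy = np.zeros((len(labels), 1))  # features are irrelevant for the split

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for _, test_idx in skf.split(X_dummy, labels):
    values, counts = np.unique(labels[test_idx], return_counts=True)
    fold_counts.append({str(v): int(c) for v, c in zip(values, counts)})
print(fold_counts)
```

Every test fold ends up with the same mix of strata, so the error estimate of each fold is based on a comparable slice of the target distributions.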

I am not going to use the `SM` maps, only loadings and FNC data - it's easier I guess.

In [None]:
# Not the best approach but let's load some data again and delete some
import gc
for name in ['fnc', 'sm_data', 'pca', 'pca_2']:
    # Remove the variable if it exists, ignore it otherwise
    globals().pop(name, None)
gc.collect()

In [None]:
# Loading data again - we have the scores with the stratifier variable already
fnc = pd.read_csv('../input/trends-assessment-prediction/fnc.csv', index_col='Id')
loading = pd.read_csv('../input/trends-assessment-prediction/loading.csv', index_col='Id')

As a very basic baseline model let's use ridge regression - I am not the biggest fan of SVMs (don't really know why)

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.decomposition import PCA

In [None]:
fnc_train = fnc.loc[scores.index, :]
loading_train = loading.loc[scores.index, :]

In [None]:
# Test whether indices align
assert np.all(fnc_train.index == loading_train.index) 
assert np.all(fnc_train.index == scores.index) 
assert np.all(loading_train.index == scores.index)

In [None]:
SKF = StratifiedKFold(n_splits=4)
targets = ['age', 'domain1_var1', 'domain1_var2', 'domain2_var1', 'domain2_var2']
weighting = [.3, .175, .175, .175, .175]

In [None]:
# I hope I did the calculation correctly
def absolute_normalized_error(y_true, y_pred, multioutput):
    output_errors = np.sum(np.abs(y_pred - y_true), axis=0) / np.sum(y_pred, axis=0)    
    return np.average(output_errors, weights=multioutput)


In [None]:
REG_FNC = make_pipeline(PCA(n_components=50, whiten=False), RobustScaler(), RidgeCV(alphas=np.logspace(-5, 5, 11))) # Some dimensionality reduction might be in order
REG_LOA = make_pipeline(RobustScaler(), RidgeCV(alphas=np.logspace(-5, 5, 11))) # Not so much here

trues, preds_fnc, preds_load, preds_comb = [], [], [], []
scores_fnc, scores_load, scores_comb = [], [], []

for tr, te in SKF.split(fnc_train, scores.stratify):
    REG_FNC.fit(fnc_train.iloc[tr, :].values, scores.iloc[tr][targets])
    REG_LOA.fit(loading_train.iloc[tr, :].values, scores.iloc[tr][targets])
    
    # Predict on plain arrays, matching the arrays used for fitting
    preds_fnc.append(REG_FNC.predict(fnc_train.iloc[te,:].values))
    preds_load.append(REG_LOA.predict(loading_train.iloc[te,:].values))
    preds_comb.append((preds_fnc[-1] + preds_load[-1]) / 2)
    trues.append(scores.iloc[te][targets])
    scores_fnc.append(absolute_normalized_error(trues[-1], preds_fnc[-1],  multioutput=weighting))
    scores_load.append(absolute_normalized_error(trues[-1], preds_load[-1],  multioutput=weighting))
    scores_comb.append(absolute_normalized_error(trues[-1], preds_comb[-1],  multioutput=weighting))
    

In [None]:
print(f'Error based on FNC: {np.mean(scores_fnc)} +/- {np.std(scores_fnc)}')
print(f'Error based on Load: {np.mean(scores_load)} +/- {np.std(scores_load)}')
print(f'Error based on combination: {np.mean(scores_comb)} +/- {np.std(scores_comb)}')

Score is currently below baseline, but let's see how we do on the leaderboard.

In [None]:
REG_FNC.fit(fnc_train, scores[targets])
REG_LOA.fit(loading_train, scores[targets])

In [None]:
# Get the test data
sample_submission = pd.read_csv('../input/trends-assessment-prediction/sample_submission.csv')

In [None]:
# Extract the test index
test_index = sample_submission.Id.str.split('_', expand=True)[0].unique().astype('int')
fnc_test = fnc.loc[test_index, :]
loading_test = loading.loc[test_index, :]

In [None]:
# Calculate the average prediction value
prediction = (REG_FNC.predict(fnc_test) + REG_LOA.predict(loading_test)) / 2

In [None]:
# Submit prediction
predictions = pd.DataFrame(prediction, index=test_index, columns=targets).reset_index()
predictions = predictions.rename(columns={'index': 'Id'})
predictions = predictions.melt(id_vars='Id', value_vars=targets, value_name='Predicted')
predictions['Id'] = predictions['Id'].astype(str) + '_' + predictions['variable']
predictions = predictions[['Id', 'Predicted']]
predictions.to_csv('ridge_baseline_submission.csv', index=False)
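
To see what the reshaping above produces, here is the same wide-to-long step on a tiny table. The Ids and predicted values are made up purely to show the resulting layout.

```python
import pandas as pd

targets_demo = ['age', 'domain1_var1']
# Made-up Ids and predictions, purely to show the reshaping.
wide = pd.DataFrame({'Id': [10003, 10006],
                     'age': [50.0, 60.0],
                     'domain1_var1': [51.0, 52.0]})

# One row per (participant, target) pair, with a combined Id like '10003_age'.
long = wide.melt(id_vars='Id', value_vars=targets_demo, value_name='Predicted')
long['Id'] = long['Id'].astype(str) + '_' + long['variable']
long = long[['Id', 'Predicted']]
print(long)
```

Comparing the first few rows of the real output with `sample_submission.csv` is a cheap way to catch a mismatched Id format before submitting.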