# TReNDS Neuroimaging - Data exploration

So this is a record of my travels down this rabbit hole. I try to write down all my thoughts when exploring data; that's how I learn. Also, I'm hoping that you guys can help me see just how wrong I am. I don't think you are going to find any contest-winning ideas here: I find most ML models scary and try to stick with Linear Regression as far as I can.

My apologies to @PAB97 who is the original author of this Notebook. I stole it and I've mostly modified it by adding my own mistakes. Actually, I think I have already modified it beyond recognition. Sorry.

**This kernel will be a work in progress, and I will keep on updating it as the competition progresses and I gain more insight about the data.**

If you find this kernel useful, please consider upvoting it. It motivates me to write more quality content.

## <a href='#2'>Data exploration</a>

In [None]:
# Importing dependencies

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

from tqdm.notebook import tqdm
from glob import glob
import gc

import nilearn as nl
import nilearn.plotting as nlplt
import nibabel as nib

import h5py

import lightgbm as lgb

from scipy.stats import skew, kurtosis

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import os, random

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

from tensorflow.keras.utils import Sequence

In [None]:
# Loading train scores

MAIN_DATA_PATH = '/kaggle/input/trends-assessment-prediction/'

train_scores_df = pd.read_csv(MAIN_DATA_PATH + 'train_scores.csv')
icn_numbers_df = pd.read_csv(MAIN_DATA_PATH + 'ICN_numbers.csv')
loading_df = pd.read_csv(MAIN_DATA_PATH + 'loading.csv')
fnc_df = pd.read_csv(MAIN_DATA_PATH + 'fnc.csv')

## SBM Loadings

Per the competition documentation,

> The first set of features are source-based morphometry (SBM) loadings. These are subject-level weights from a group-level ICA (independent component analysis) decomposition of gray matter concentration maps from structural MRI (sMRI) scans.

Let's think about this for a bit. There are a few interesting code words here. These are sMRI scans, as opposed to the fMRI scans below. These scans reveal the structure of the brain, i.e. where all the gray goo is located. The fMRI scans sort of show how the goo is working in different situations.

So, thinking back, we are trying to predict the age and some secret features of subjects. Does the structure of the brain change with age? And is it important in other, secret aspects? This is exciting!

In this context we are looking at a process called Independent Component Analysis. The goal of this process is to find a set of features, 'components', that have the least to do with each other. The classical example is the cocktail party problem, where you want to find out what people at the cocktail party are saying (I honestly can't think of many problems I'm less interested in finding the solution to; cocktail parties are boring).

You have signals from some microphones. If you can find a set of features that are really independent and super not-random (non-Gaussian, in ICA terms) then you can be fairly sure that these features represent the individual voices of the people at the party. I guess the theory here is that there is no correlation between what different people say at cocktail parties, which is a deep insight in itself.
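The cocktail party setup can be sketched with scikit-learn's `FastICA`. This is a toy illustration only, not what the competition organizers ran: the two "voices" (a sine wave and a square wave), the mixing matrix `A`, and all variable names are made up for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical "voices": a sine wave and a square wave, plus a little noise
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]
S += 0.05 * rng.standard_normal(S.shape)

# Each "microphone" records a different (made-up) mix of the two voices
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])
X = S @ A.T

# Ask ICA to unmix the microphone signals into two independent components
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

# Compare each true voice with the recovered components
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
best_match = corr.max(axis=1)
print("Best |correlation| per true source:", best_match.round(3))
```

ICA recovers the sources only up to order, sign and scale, which is why the comparison uses absolute correlations rather than checking the values directly.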

So we're thinking that there are some hidden features that describe our brains. These features are mixed in some secret way and the result of that is what shows up in the sMRI scan. The ICA gives us an indication of what the features might be.

In this case, we are working with group-level ICA. I think that refers to the process of taking a lot of these sMRI scans from lots of people (not just these subjects) and finding the features where these subjects stand out. So these values represent the features where these subjects stand out the most.

A dumb analogy could be that by analyzing lots and lots of cocktail party conversations you could note that most of them are about the weather, the economy, humblebragging about oneself and outright bragging about one's children. An individual cocktail party could be described by how the relative amounts and intensities of these respective conversations differ from what's normal. "Remember the party at the Robinsons, where we talked soo much about the weather? So nice!"

I haven't been to many cocktail parties myself; for some reason I'm never invited.

In [None]:
loading_df.head()
loading_melted = loading_df.melt(id_vars='Id')
loading_melted.head()

Let's have a look at the distributions of these values.

In [None]:
g = sns.FacetGrid(loading_melted, col='variable', col_wrap=3, height=6, sharex=False, sharey=False)
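# sns.distplot is deprecated; histplot with a KDE overlay is the replacement (seaborn >= 0.11)
g.map(sns.histplot, 'value', kde=True, stat='density')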

In [None]:
loading_df.describe()
plt.figure(figsize=(20,5))
sns.violinplot(data=loading_melted, x='variable', y='value')

So we have some features that honestly make no sense to me whatsoever. I was hoping I would end up being able to look at an sMRI scan after this and say something like: "Hmm, that IC_14 sure looks worrying."

Before we continue to the even cooler features, from the fabled **f**MRI scans, let's see if we can already make some predictions.

### <a href='#2-1'>Target distributions</a>

The train_scores.csv file contains the targets that we need to predict.

In [None]:
train_scores_df.head()

In [None]:
# Plot the distribution of the target variables

train_scores_df_melted = train_scores_df.melt(id_vars='Id')

g = sns.FacetGrid(train_scores_df_melted, col='variable', height=4)
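# sns.distplot is deprecated; histplot with a KDE overlay is the replacement (seaborn >= 0.11)
g.map(sns.histplot, 'value', kde=True, stat='density')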

plt.subplots_adjust(top=0.85)
g.set_titles('{col_name}')
g.set_xlabels('')
g.fig.suptitle('Target distributions')

Let's now build our training set composed of multiple sets of features.

In [None]:
features_df = pd.merge(train_scores_df, loading_df, on=['Id'], how='left')
features_df.head()

Since we still have relatively few features, let's investigate the correlation between target variables and features.

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
cols = features_df.columns[1:]
target_cols = features_df.columns[1:6]
feature_cols = features_df.columns[6:]
sns.heatmap(features_df[cols].corr(), ax=ax, cmap='RdYlGn')
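The full heatmap mixes target-target, feature-feature and target-feature correlations. If only the last kind is of interest, the correlation matrix can be sliced down to that block. Here is a minimal sketch on synthetic data; the frame `demo` and its column names are hypothetical stand-ins for `features_df`, not the real competition columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical stand-in for features_df: three features and two targets
demo = pd.DataFrame({
    "IC_a": rng.standard_normal(n),
    "IC_b": rng.standard_normal(n),
    "IC_c": rng.standard_normal(n),
})
demo["target_1"] = 2.0 * demo["IC_a"] + rng.standard_normal(n)  # depends on IC_a
demo["target_2"] = rng.standard_normal(n)                        # pure noise

demo_targets = ["target_1", "target_2"]
demo_features = ["IC_a", "IC_b", "IC_c"]

# Full correlation matrix, then keep rows = features and columns = targets
block = demo.corr().loc[demo_features, demo_targets]

# Rank the features by absolute correlation with one target
ranking = block["target_1"].abs().sort_values(ascending=False)

print(block.round(2))
print("Most correlated feature for target_1:", ranking.index[0])
```

With the real data, the same slicing would presumably be `features_df[cols].corr().loc[feature_cols, target_cols]`, and the resulting block can be passed to `sns.heatmap` just like the full matrix.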

OK, we know some stuff now. Let's see what a simple old Linear Regression model gives us.

In [None]:
X = features_df[feature_cols]
y = features_df['age']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_pred = linreg.predict(X_test)

rmse = mean_squared_error(y_test, y_pred) ** 0.5

print("R^2:", linreg.score(X_test, y_test))
print(f"RMSE: {rmse}")

OK, done! I can't possibly imagine a better result. The prize money is all mine.

Also, let's see if there are some features that are more important than others. I really like Lasso regularization for this. We're still in the good ol' linear regression domain, but we see if we can penalize some parameters and still get a decent result.

The Lasso isn't great for actual modeling, I think, but it has the really nice property of shrinking the coefficients of less useful features all the way to exactly zero while keeping the others. This can give you a shortlist of the most interesting features.
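To see that shortlist effect in isolation, here is a small sketch on synthetic data where only the first three of ten columns actually influence the target. Everything in it (`X_demo`, `y_demo`, `true_coef`, the choice of `alpha=0.1`) is invented for illustration and says nothing about the real SBM loadings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 500, 10

# Synthetic data: only the first three columns matter
X_demo = rng.standard_normal((n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y_demo = X_demo @ true_coef + 0.5 * rng.standard_normal(n_samples)

# Put all features on the same scale so the penalty treats them equally
X_scaled = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=0.1).fit(X_scaled, y_demo)

# Columns whose coefficient survived the L1 penalty
selected = np.flatnonzero(lasso_demo.coef_)
print("Coefficients:", lasso_demo.coef_.round(2))
print("Non-zero coefficients at columns:", selected)
```

The surviving coefficients are also shrunk toward zero compared with `true_coef`, which is one reason the Lasso is often used for picking features rather than as the final model.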

In [None]:
from sklearn.linear_model import Lasso

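# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, drop it and scale X first (e.g. with StandardScaler);
# alpha will then likely need retuning.
lasso = Lasso(alpha=0.05, normalize=True)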

lasso.fit(X,y)

plt.style.use('ggplot')
plt.figure(figsize=(12,8))
plt.plot(X.columns, lasso.coef_)

_ = plt.xticks(rotation=70)

I honestly love this. Just for fun, let's see what happens when you focus only on these three features for our Linear Regression.

In [None]:
cool_features = ['IC_06', 'IC_15', 'IC_22']

linreg = LinearRegression()
linreg.fit(X_train[cool_features], y_train)

y_pred = linreg.predict(X_test[cool_features])

rmse = mean_squared_error(y_test, y_pred) ** 0.5

print("R^2:", linreg.score(X_test[cool_features], y_test))
print(f"RMSE: {rmse}")

Ok, so let's think about these numbers. The $R^2$ is $1 - {SS_{res} \over SS_{tot}}$. The numerator of that fraction is how much my model differs from the observed values, and the denominator is how much the actual values tend to differ from their mean. The result is how much of the variation in the target variable is captured by my model.

The RMSE is the square root of the Mean Squared Error. The MSE is the $SS_{res}$ from above divided by the number of observations, that is, how much the model differs from the observed values on average (in squared terms). The neat thing with taking the square root of the MSE is that you get the same unit as the target variable. This gives us the rather comprehensible insight that our model is typically off by ten years or so when estimating the age of the subjects. I get worse results every day when I try to estimate the age of people.
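A tiny worked example makes these definitions concrete, using $R^2 = 1 - SS_{res}/SS_{tot}$ and $MSE = SS_{res}/n$. The ages and predictions below are made up for illustration; the manual results are then compared against scikit-learn's `r2_score` and `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up ages and predictions, for illustration only
y_true = np.array([25.0, 40.0, 55.0, 60.0, 70.0])
y_hat = np.array([30.0, 38.0, 50.0, 66.0, 64.0])

# Residual sum of squares: how far the predictions are from the observations
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: how far the observations are from their own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y_true)
rmse_manual = np.sqrt(mse_manual)

print("R^2 :", r2_manual, "vs sklearn:", r2_score(y_true, y_hat))
print("MSE :", mse_manual, "vs sklearn:", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse_manual)
```

Note that a model which always predicts the mean gets $SS_{res} = SS_{tot}$ and therefore an $R^2$ of zero, and a model worse than that gets a negative $R^2$.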

This is the reason I choose to focus mainly on age for now. I don't know what domain 1 etc is, so at this exploratory stage I have no idea how to think about the values. Baby steps...

(Also, we all know that they keep them values secret in order to hide that they are analyzing the passengers from Flight 828, and I'm NOT taking part in that dirty business.)

Anyway, an $R^2$ of 34% is not great, but I really like the fact that it is not that much lower when we use just three features. Often, when the features make more sense, these kinds of insights can really make this EDA phase exciting.

## Submission

Let's do something we can submit. We'll go with Ridge Regularized Linear Regression for now, and try to forecast all the target variables.

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

outputs = list()

for target in target_cols:
    scaler = StandardScaler()
    ridge = RidgeCV(alphas=np.logspace(5, -5, 11), cv=5)

    pipeline = Pipeline(steps=[
        ('scaler', scaler),
        ('regression', ridge)
    ])

    y = features_df[target]
    y = y.fillna(y.mean())

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test)

    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print(f"R^2 for {target}:", pipeline.score(X_test, y_test))
    print(f"RMSE for {target}: {rmse}")
    print(f"Best alpha for {target}", ridge.alpha_)
    
    predictions = pipeline.predict(loading_df.iloc[:,1:])
    ids = loading_df.Id.astype(str) + '_' + target

    outputs.append(pd.DataFrame({'Id': ids,
                                'Predicted': predictions}))

output = pd.concat(outputs)
output

We don't get great results for the mysterious variables. This shouldn't come as a surprise - look at the heatmap above. There doesn't seem initially to be that much of a correlation between those variables and our features.

## <a href='#5-1'>Submission</a>

In [None]:
sample_submission = pd.read_csv("/kaggle/input/trends-assessment-prediction/sample_submission.csv")
sample_submission

In [None]:
sample_submission = pd.read_csv("/kaggle/input/trends-assessment-prediction/sample_submission.csv")
output = sample_submission.drop('Predicted',axis=1).merge(output,on='Id',how='left')
output.to_csv('submission.csv',index=False)
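Because the merge above uses `how='left'`, any `Id` in the sample submission without a matching prediction would silently end up as `NaN`. A quick sanity check before writing the file could look like the sketch below. The two small frames (`sample_demo`, `preds_demo`) and their values are hypothetical stand-ins for `sample_submission` and `output`.

```python
import pandas as pd

# Hypothetical miniature versions of the sample submission and the predictions
sample_demo = pd.DataFrame({
    "Id": ["10003_age", "10003_domain1_var1", "10006_age"],
    "Predicted": [0.0, 0.0, 0.0],
})
preds_demo = pd.DataFrame({
    "Id": ["10003_age", "10003_domain1_var1", "10006_age", "99999_age"],
    "Predicted": [48.2, 51.0, 60.4, 33.3],
})

# Same merge pattern as in the cell above: keep the sample's rows and order
merged = sample_demo.drop("Predicted", axis=1).merge(preds_demo, on="Id", how="left")

# Checks: no missing predictions, no extra or duplicated rows, same row order
n_missing = merged["Predicted"].isna().sum()
same_length = len(merged) == len(sample_demo)
same_order = merged["Id"].tolist() == sample_demo["Id"].tolist()

print("Missing predictions:", n_missing)
print("Row count matches:", same_length)
print("Row order matches:", same_order)
```

The extra `99999_age` row in `preds_demo` is dropped by the left merge, which mirrors how predictions for training subjects would be discarded when merging onto the real sample submission.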