# Loading Fake Timeseries Surface Data

This notebook explores loading DataFiles and using Loaders.

This example requires some extra optional libraries, including nibabel and nilearn! Note: although nilearn is not imported directly, importing SingleConnectivityMeasure will raise an ImportError if nilearn is not installed.

We will also use fake data for this example, so no special datasets are required!

In [1]:
import BPt as bp
import nibabel as nib
import numpy as np
import pandas as pd
import os

In [2]:
def save_fake_timeseries_data():
    '''Save fake timeseries and fake surface data.'''

    # 20 subjects, each with an array of shape (100, 10242)
    X = np.random.random(size=(20, 100, 10242))
    os.makedirs('fake_time_data', exist_ok=True)

    # Save the same fake array as both the left and right hemisphere file
    for x in range(len(X)):
        np.save('fake_time_data/' + str(x) + '_lh', X[x])
        np.save('fake_time_data/' + str(x) + '_rh', X[x])
        
save_fake_timeseries_data()

In [3]:
# Init a Dataset
data = bp.Dataset()

Next, we want to load the files into the dataset as data files. There are a few different ways to do this, but we will use the method add_data_files. We will load the timeseries data first.

First we need a dictionary mapping each desired column name to a file location or a file glob (a glob is easier, so let's use that).

In [4]:
# The *'s just mean wildcard
files = {'timeseries_lh': 'fake_time_data/*_lh*',
         'timeseries_rh': 'fake_time_data/*_rh*'}

# Now let's try loading with 'auto' as the file to subject function
data.add_data_files(files, 'auto')

Unnamed: 0,timeseries_lh,timeseries_rh
13,Loc(0),Loc(36)
9,Loc(1),Loc(27)
8,Loc(2),Loc(37)
2,Loc(3),Loc(35)
16,Loc(4),Loc(38)
11,Loc(5),Loc(20)
6,Loc(6),Loc(29)
7,Loc(7),Loc(31)
1,Loc(8),Loc(26)
17,Loc(9),Loc(39)
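
To make the glob patterns concrete, here is a minimal sketch of which files each pattern selects. It uses Python's standard glob module on a throwaway directory; the file names are made up to match the naming above, and whether BPt expands patterns with this exact module internally is an assumption.

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few files named like the ones above
tmp_dir = tempfile.mkdtemp()
for name in ['0_lh.npy', '0_rh.npy', '1_lh.npy', '1_rh.npy']:
    open(os.path.join(tmp_dir, name), 'w').close()

# Each pattern selects one hemisphere; '*' matches any run of characters
patterns = {'timeseries_lh': os.path.join(tmp_dir, '*_lh*'),
            'timeseries_rh': os.path.join(tmp_dir, '*_rh*')}

# Expand every pattern and keep just the file names, sorted for readability
matched = {col: sorted(os.path.basename(p) for p in glob.glob(pattern))
           for col, pattern in patterns.items()}
print(matched)
```

Each column name ends up paired with the list of files for one hemisphere, which is the information add_data_files needs before it assigns files to subjects.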


Or we could be more explicit if 'auto' didn't work:

In [5]:
def file_to_subj(loc):
    return loc.split('/')[-1].split('_')[0]

# Actually load it this time
data = data.add_data_files(files, file_to_subj)
data

Unnamed: 0,timeseries_lh,timeseries_rh
13,Loc(0),Loc(36)
9,Loc(1),Loc(27)
8,Loc(2),Loc(37)
2,Loc(3),Loc(35)
16,Loc(4),Loc(38)
11,Loc(5),Loc(20)
6,Loc(6),Loc(29)
7,Loc(7),Loc(31)
1,Loc(8),Loc(26)
17,Loc(9),Loc(39)


What's this though? Why are the files showing up as Loc(int)? What's going on is that the data files are really stored as just integers, see:

In [6]:
data['timeseries_lh']

13     0.0
9      1.0
8      2.0
2      3.0
16     4.0
11     5.0
6      6.0
7      7.0
1      8.0
17     9.0
19    10.0
15    11.0
10    12.0
3     13.0
14    14.0
0     15.0
18    16.0
5     17.0
4     18.0
12    19.0
Name: timeseries_lh, dtype: float64

They correspond to locations in a stored file mapping (note: you don't need to worry about any of this most of the time):

In [7]:
data.file_mapping[0], data.file_mapping[1], data.file_mapping[2]  

(DataFile(loc='/home/sage/BPt/Examples/Short_Examples/fake_time_data/13_lh.npy'),
 DataFile(loc='/home/sage/BPt/Examples/Short_Examples/fake_time_data/9_lh.npy'),
 DataFile(loc='/home/sage/BPt/Examples/Short_Examples/fake_time_data/8_lh.npy'))
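
The indirection can be pictured with plain dictionaries. This is only an illustrative sketch, not BPt's internal implementation: the names file_mapping, timeseries_lh and resolve below are hypothetical stand-ins, and the paths are shortened versions of the ones shown above.

```python
# Hypothetical stand-in for the file mapping: integer key -> file location
file_mapping = {0: 'fake_time_data/13_lh.npy',
                1: 'fake_time_data/9_lh.npy',
                2: 'fake_time_data/8_lh.npy'}

# The column itself only holds the integer keys, indexed by subject
timeseries_lh = {'13': 0, '9': 1, '8': 2}

def resolve(subject):
    """Look up the file location behind a subject's integer key."""
    return file_mapping[timeseries_lh[subject]]

print(resolve('9'))
```

Storing small integers in the column keeps the table itself lightweight, while the actual arrays are only read from disk when a location is resolved and loaded.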

Let's add a fake target to our dataset now

In [8]:
data['t'] = np.random.random(len(data))
data.set_target('t', inplace=True)
data

Unnamed: 0,timeseries_lh,timeseries_rh
13,Loc(0),Loc(36)
9,Loc(1),Loc(27)
8,Loc(2),Loc(37)
2,Loc(3),Loc(35)
16,Loc(4),Loc(38)
11,Loc(5),Loc(20)
6,Loc(6),Loc(29)
7,Loc(7),Loc(31)
1,Loc(8),Loc(26)
17,Loc(9),Loc(39)

Unnamed: 0,t
13,0.699462
9,0.49149
8,0.9711
2,0.302325
16,0.267458
11,0.086958
6,0.938947
7,0.768547
1,0.954625
17,0.958833


Next we will generate a Loader to apply a parcellation, then extract a measure of connectivity.

In [10]:
from neurotools.transform import SurfLabels

lh_parc = SurfLabels(labels='data/lh.aparc.annot', vectorize=False)
rh_parc = SurfLabels(labels='data/rh.aparc.annot', vectorize=False)

We can see how this object works on example data first.

In [11]:
ex_lh = data.file_mapping[0].load()
ex_lh.shape

(100, 10242)

In [12]:
trans = lh_parc.fit_transform(ex_lh)
trans.shape

(100, 35)

We essentially get a reduction from 10242 features to 35.
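
As a rough illustration of where that reduction comes from, the sketch below averages all vertices that share a label. It assumes a mean-within-label strategy and uses a made-up label vector rather than the real aparc annotation, so details such as how SurfLabels treats background labels are not represented.

```python
import numpy as np

rng = np.random.default_rng(0)

n_timepoints, n_vertices, n_rois = 100, 10242, 35

# Fake surface timeseries with the same shape as ex_lh above
timeseries = rng.random((n_timepoints, n_vertices))

# Hypothetical parcellation: one integer label (0..34) per vertex,
# tiled so that every label occurs at least once
labels = np.arange(n_vertices) % n_rois

# Average all vertices sharing a label, giving one column per ROI
parcellated = np.stack(
    [timeseries[:, labels == roi].mean(axis=1) for roi in range(n_rois)],
    axis=1)

print(parcellated.shape)
```

The time dimension is untouched; only the vertex dimension is collapsed to one value per region.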

Next, we want to transform the matrix into a connectivity matrix (here a covariance matrix, since we pass kind='covariance').

In [13]:
from BPt.extensions import SingleConnectivityMeasure
scm = SingleConnectivityMeasure(kind='covariance', discard_diagonal=True, vectorize=True)

In [14]:
scm.fit_transform(trans).shape

(595,)

The single connectivity measure is just a wrapper designed to let the ConnectivityMeasure from nilearn work with a single subject's data at a time.
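
The output length of 595 follows from keeping each pair of the 35 regions once and dropping the diagonal: 35 * 34 / 2 = 595. The sketch below shows this with a plain NumPy sample covariance; it is an assumption-laden stand-in, since nilearn's ConnectivityMeasure may use a different covariance estimator and a different ordering or scaling of the vectorized elements.

```python
import numpy as np

rng = np.random.default_rng(0)
roi_timeseries = rng.random((100, 35))  # same shape as trans above

# Plain sample covariance between ROIs (columns are the variables)
cov = np.cov(roi_timeseries, rowvar=False)

# Keep each ROI pair once: strictly lower triangle, diagonal discarded
rows, cols = np.tril_indices(cov.shape[0], k=-1)
features = cov[rows, cols]

print(cov.shape, features.shape)
```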

Next, let's use the special input wrapper Pipe to compose these two objects into their own pipeline.

In [15]:
lh_loader = bp.Loader(bp.Pipe([lh_parc, scm]), scope='_lh')
rh_loader = bp.Loader(bp.Pipe([rh_parc, scm]), scope='_rh')
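
Conceptually, a loader of this kind turns one array per subject into one row of features per subject. The following sketch mimics that idea with plain NumPy on small fake data; the functions parcellate, connectivity and load_subject are hypothetical helpers written for illustration and are not part of BPt.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_timepoints, n_vertices, n_rois = 4, 50, 200, 5

# Hypothetical per-vertex labels and per-subject arrays held in memory
labels = np.arange(n_vertices) % n_rois
subject_data = {subj: rng.random((n_timepoints, n_vertices))
                for subj in range(n_subjects)}

def parcellate(ts):
    """Average vertices within each label: (time, vertices) -> (time, rois)."""
    return np.stack([ts[:, labels == r].mean(axis=1) for r in range(n_rois)],
                    axis=1)

def connectivity(ts):
    """Vectorized covariance without the diagonal: (time, rois) -> (pairs,)."""
    cov = np.cov(ts, rowvar=False)
    return cov[np.tril_indices(n_rois, k=-1)]

def load_subject(ts):
    """Chain the two steps, in the spirit of composing them in one pipe."""
    return connectivity(parcellate(ts))

# One row of features per subject, ready for a standard estimator
X = np.vstack([load_subject(subject_data[s]) for s in range(n_subjects)])
print(X.shape)
```

With 5 regions there are 5 * 4 / 2 = 10 region pairs, so each subject contributes a row of 10 features.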

Define a simple pipeline with just our loader steps, then evaluate with mostly default settings.

In [16]:
pipeline = bp.Pipeline([lh_loader, rh_loader, bp.Model('linear')])

results = bp.evaluate(pipeline, data)
results

Predicting target = t
Using problem_type = regression
Using scope = all (defining a total of 2 features).
Evaluating 20 total data points.


Folds:   0%|          | 0/5 [00:00<?, ?it/s]


Training Set: (16, 2)
Validation Set: (4, 2)
Fit fold in 0.5 seconds.
explained_variance: -0.0218
neg_mean_squared_error: -0.1000

Training Set: (16, 2)
Validation Set: (4, 2)
Fit fold in 0.4 seconds.
explained_variance: -0.4561
neg_mean_squared_error: -0.1339

Training Set: (16, 2)
Validation Set: (4, 2)
Fit fold in 0.5 seconds.
explained_variance: -0.1182
neg_mean_squared_error: -0.2164

Training Set: (16, 2)
Validation Set: (4, 2)
Fit fold in 0.7 seconds.
explained_variance: 0.5562
neg_mean_squared_error: -0.2443

Training Set: (16, 2)
Validation Set: (4, 2)
Fit fold in 0.4 seconds.
explained_variance: -0.5722
neg_mean_squared_error: -0.1593



BPtEvaluator
------------
explained_variance: -0.1224 ± 0.3961
neg_mean_squared_error: -0.1708 ± 0.0529

Saved Attributes: ['estimators', 'preds', 'timing', 'train_subjects', 'val_subjects', 'feat_names', 'ps', 'mean_scores', 'std_scores', 'weighted_mean_scores', 'scores', 'fis_', 'coef_', 'cv']

Avaliable Methods: ['to_pickle', 'compare', 'get_X_transform_df', 'get_inverse_fis', 'get_preds_dfs', 'subset_by', 'get_fis', 'get_coef_', 'permutation_importance']

Evaluated With:
target: t
problem_type: regression
scope: all
subjects: all
random_state: 1


Don't be discouraged that this didn't work; we are, after all, trying to predict random noise with random noise...
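
The summary lines in the output above can be checked by hand from the per-fold scores. The sketch below copies the printed fold values and computes their mean and population standard deviation (ddof=0), which matches the printed summary to four decimals; because the per-fold values are themselves rounded in the printout, the agreement is approximate rather than exact.

```python
from statistics import fmean, pstdev

# Per-fold scores copied from the printed output above
explained_variance = [-0.0218, -0.4561, -0.1182, 0.5562, -0.5722]
neg_mse = [-0.1000, -0.1339, -0.2164, -0.2443, -0.1593]

for name, scores in [('explained_variance', explained_variance),
                     ('neg_mean_squared_error', neg_mse)]:
    # Mean across folds, with the population standard deviation as spread
    print(f'{name}: {fmean(scores):.4f} ± {pstdev(scores):.4f}')
```

With only five folds of four validation subjects each, the spread across folds is large relative to the mean, which is another reason not to read much into any single fold's score here.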

In [17]:
# These are the steps of the pipeline
fold0_pipeline = results.estimators[0]
for step in fold0_pipeline.steps:
    print(step[0])

loader_pipe0
loader_pipe1
linear regressor


We can investigate individual pieces, or use special functions like get_X_transform_df:

In [18]:
results.get_X_transform_df(data, fold=0)

Unnamed: 0,timeseries_rh_0,timeseries_rh_1,timeseries_rh_2,timeseries_rh_3,timeseries_rh_4,timeseries_rh_5,timeseries_rh_6,timeseries_rh_7,timeseries_rh_8,timeseries_rh_9,...,timeseries_lh_585,timeseries_lh_586,timeseries_lh_587,timeseries_lh_588,timeseries_lh_589,timeseries_lh_590,timeseries_lh_591,timeseries_lh_592,timeseries_lh_593,timeseries_lh_594
0,4.1e-05,6e-06,-9.1e-05,2.170003e-06,-4.9e-05,3.3e-05,-4.395016e-05,4.2e-05,4.5e-05,3.8e-05,...,1.261718e-06,-8e-06,2e-06,-4.764704e-06,-3.2e-05,-1.6e-05,-1.029797e-05,7.3e-05,-2.130444e-05,3.5e-05
1,4.8e-05,1.5e-05,3.7e-05,-3.172942e-05,-3.2e-05,-7e-06,-2.9112e-05,1.8e-05,2.4e-05,1.2e-05,...,-2.103391e-07,1.9e-05,-7e-06,8.692337e-06,-3e-06,-1.5e-05,-3.171588e-05,-6e-06,1.958359e-05,-6.3e-05
10,1.2e-05,-9e-06,4.3e-05,-3.082831e-05,-3.3e-05,2.2e-05,-2.819585e-05,-0.000108,-3.1e-05,8.5e-05,...,-2.802208e-05,1.7e-05,-6e-06,-9.766284e-06,8e-06,-2e-05,4.961625e-06,7.6e-05,-6.8216e-06,-7e-06
12,-2.2e-05,2e-06,-6.1e-05,2.000923e-05,-1.6e-05,9e-06,4.455068e-05,-1.5e-05,-7.3e-05,3.6e-05,...,9.487218e-06,4.4e-05,7e-06,1.79274e-05,-6e-06,-1.2e-05,9.688726e-06,-4.5e-05,-4.647321e-05,1.6e-05
13,-5.1e-05,-3.4e-05,-2e-05,-8.405775e-06,-5.7e-05,-1.7e-05,1.508181e-07,8.7e-05,3.1e-05,2e-06,...,9.997186e-06,2.6e-05,-4e-06,1.03075e-05,-1e-05,-9e-06,1.847159e-05,-5.2e-05,1.300642e-05,-3.7e-05
15,-5.1e-05,-3.2e-05,1e-05,-5.925282e-05,1e-06,1.9e-05,-6.689757e-06,8.3e-05,9.2e-05,-3.3e-05,...,-2.774693e-06,-7.6e-05,-3.6e-05,-9.361259e-07,5e-06,-1.7e-05,2.055214e-05,3.9e-05,-3.868443e-05,6.9e-05
16,-2.5e-05,6e-06,-1.1e-05,-2.388397e-05,-5.9e-05,2.5e-05,-4.392766e-05,-9e-06,2.4e-05,-4.7e-05,...,8.979452e-07,7e-05,2e-06,-1.632291e-05,-1.4e-05,-1.2e-05,-1.701748e-05,-2e-05,-4.699877e-05,4e-06
17,1.5e-05,6.5e-05,-3.9e-05,-6.399311e-05,-2e-05,3.2e-05,-1.694909e-05,-0.000169,-1.6e-05,4e-06,...,2.768429e-05,3e-05,-2e-06,1.784644e-06,-6e-06,-2.3e-05,-1.059582e-05,-5.2e-05,1.809554e-05,-1.6e-05
19,-1.1e-05,-1.6e-05,-2.2e-05,3.125629e-05,-2.4e-05,-5.1e-05,3.88844e-05,8.7e-05,5.1e-05,3.1e-05,...,-1.270376e-05,-2.6e-05,-8e-06,-1.780856e-05,9e-06,1.5e-05,-2.457031e-05,4.8e-05,2.336591e-05,7e-06
2,-2.2e-05,-3.4e-05,5.4e-05,-9.361137e-06,1.2e-05,7.9e-05,2.344356e-05,7e-05,1.2e-05,-1e-05,...,-2.456365e-05,1e-05,8e-06,-3.172093e-06,-1.8e-05,-1.8e-05,1.593863e-05,-2e-06,-6.267709e-05,-4.7e-05
