### Regularized (banded) CV regression workflow for Neuroscout
This notebook fits an encoding model for a single subject using cross-validated banded ridge regression, as implemented in himalaya (https://github.com/gallantlab/himalaya). In Neuroscout, the same pipeline should be run for every subject.
- Input needed from the user
    - Define datasets (independent model fitting for all datasets)
    - Define cross-validation strategy
        - Across runs
        - Within runs
    - Define estimator
    - Define preprocessing steps (e.g., scaling?)
    - Define bands
    - Pass parameters
    - Output: scores, parameters, predicted time series
- Define outputs

In [1]:
import pyns
import pandas as pd
import nibabel as nib
import numpy as np
import glob
from copy import deepcopy
from pathlib import Path



In [2]:
api = pyns.Neuroscout()

In [3]:
dataset_name = 'Budapest'

### Choose subject

Here, we can explore the runs available in this dataset. Let's choose the first subject we see and analyze all of their runs.

In [4]:
# Select subject from first run available in dataset
api.runs.get(dataset_name='Budapest')[0]

{'acquisition': None,
 'dataset_id': 27,
 'duration': 535.0,
 'id': 1435,
 'number': 3,
 'session': None,
 'subject': 'sid000005',
 'task': 48,
 'task_name': 'movie'}

In [5]:
subject = api.runs.get(dataset_name='Budapest')[0]['subject']
subject

'sid000005'

### Fetch predictors from Neuroscout and create design matrix
Let's retrieve predictor events for multiple sets of predictors. \
For now, let's pick two sets: <b>MFCC</b> + <b>mel</b> features (plus some confounds).

In [17]:
mfccs = [f'mfcc_{i}' for i in range(20)]
mel = [f'mel_{i}' for i in range(64)]
confounds = ['rot_x', 'rot_y', 'rot_z', 'trans_x', 'trans_y', 'trans_z',
             'a_comp_cor_00', 'a_comp_cor_01', 'a_comp_cor_02',
             'a_comp_cor_03','a_comp_cor_04','a_comp_cor_05']

all_vars = mfccs + mel + confounds

In [18]:
from pyns.fetch_utils import fetch_neuroscout_predictors, fetch_images

`fetch_neuroscout_predictors` retrieves the named predictors from the Neuroscout API and (optionally) resamples them to the `TR`. All timepoints are concatenated into a single table, with identifying columns (e.g. `subject`, `run`).

In [None]:
X_vars = fetch_neuroscout_predictors(
    predictor_names=all_vars, dataset_name=dataset_name, subject=subject, 
    resample=True, return_type='df')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['amplitude'] = pd.to_numeric(df['amplitude'])


In [None]:
X_vars.head()

In [None]:
# Split into design variables and meta-data
X = X_vars[all_vars]
X_meta = X_vars[X_vars.columns.difference(all_vars)]
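
The split above relies on `columns.difference` to pick up every column that is not a named predictor. A toy illustration, assuming a table with two predictors and two identifying columns (the values and the `toy_*` names are made up for this sketch):

```python
import pandas as pd

# Toy stand-in for the fetched table: two predictors plus identifying columns.
# The values are invented; only the layout matters here.
toy_vars = pd.DataFrame({
    'mfcc_0': [0.1, 0.2, 0.3, 0.4],
    'mel_0': [1.0, 0.9, 0.8, 0.7],
    'subject': ['sid000005'] * 4,
    'run_id': [1433, 1433, 1434, 1434],
})
toy_predictors = ['mfcc_0', 'mel_0']

# Design matrix: predictor columns, in the order requested
toy_X = toy_vars[toy_predictors]

# Metadata: every remaining column (Index.difference returns them sorted)
toy_meta = toy_vars[toy_vars.columns.difference(toy_predictors)]

print(list(toy_X.columns))
print(list(toy_meta.columns))
```

Both frames keep the same row index, so a row of the metadata frame still describes the matching row of the design matrix.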

### Fetch fMRI data and load images

To retrieve Neuroscout data, we use `datalad` to fetch the preprocessed images from the remote.
pyNS includes a helper function that installs the dataset and fetches the images using datalad:

In [11]:
preproc_dir, img_paths = fetch_images('Budapest', '/tmp/', subject=subject)



action summary:
  get (notneeded: 5)


Here, we use pybids to identify the files we need, and DataLad to fetch them.

Finally, to prepare the images for analysis, we'll load them into a single array, together with an accompanying dataframe that holds the metadata for every volume.

In [None]:
def _stack_images(image_objects):
    """ Stack images into single array, and collect metadata entities into dataframe """
    arrays = []
    entities = []
    image_objects = sorted(image_objects, key=lambda x: x.entities['run'])
    for img in image_objects:
        data = np.asanyarray(nib.load(img.path).dataobj)
        run_y = data.reshape([data.shape[0] * data.shape[1] * data.shape[2], data.shape[3]]).T
        arrays.append(run_y)
        entities += [dict(img.entities)] * run_y.shape[0]
    entities = pd.DataFrame(entities)
    return np.vstack(arrays), entities
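
The reshape in `_stack_images` turns each 4D image into a 2D array with one row per volume and one column per voxel. A quick check on synthetic data (the shapes are arbitrary, and `reshape(-1, ...)` is used here as an equivalent shorthand for multiplying the three spatial dimensions):

```python
import numpy as np

# Synthetic 4D "image": 2 x 3 x 4 voxels, 5 volumes (arbitrary shapes)
data = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Flatten the three spatial axes, then transpose to (volumes, voxels)
run_y = data.reshape(-1, data.shape[3]).T

# Row t of the 2D array is volume t of the 4D array, flattened in C order
assert run_y.shape == (5, 24)
assert np.array_equal(run_y[2], data[..., 2].ravel())
```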

In [None]:
Y, img_idx = _stack_images(img_paths)

In [None]:
Y.shape

In [None]:
# First 100 volumes (rows) of the stacked image data
Y[0:100]

In [24]:
# Meta-data
img_idx.head()

Unnamed: 0,datatype,desc,extension,run,space,subject,suffix,task
0,func,preproc,.nii.gz,1,MNI152NLin2009cAsym,sid000005,bold,movie
1,func,preproc,.nii.gz,1,MNI152NLin2009cAsym,sid000005,bold,movie
2,func,preproc,.nii.gz,1,MNI152NLin2009cAsym,sid000005,bold,movie
3,func,preproc,.nii.gz,1,MNI152NLin2009cAsym,sid000005,bold,movie
4,func,preproc,.nii.gz,1,MNI152NLin2009cAsym,sid000005,bold,movie


### Preprocessing and model fitting

In [25]:
from sklearn.model_selection import KFold, GroupKFold, PredefinedSplit
from himalaya.ridge import GroupRidgeCV
from himalaya.scoring import correlation_score

`_model_cv` performs cross-validated model fitting, prediction, and scoring, loosely based on scikit-learn's [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). It returns a `results` dictionary whose `'coefficients'`, `'test_predictions'`, and `'test_scores'` keys each hold a list of numpy arrays, one per outer cross-validation fold.

In [47]:
def _model_cv(estimator, X_vars, y, bands=None, groups=None,
              scoring=correlation_score, cv=None,
              inner_cv=None, confounds=None, split=None):
    # Container for results
    results = {
        'coefficients': [],
        'test_predictions': [],
        'test_scores': []}
    
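    # If confounds are given, stack them as the last band.
    # Build a new list so the caller's `bands` is not mutated.
    if confounds is not None:
        bands = list(bands) + [confounds]
    
    if bands is not None:
        # One array per band, as expected by banded estimators
        # (`to_numpy` replaces `as_matrix`, which was removed in pandas 1.0)
        X = [X_vars[band].to_numpy() for band in bands]
    else:
        X = X_vars.to_numpy()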

    # Extract number of samples for convenience
    n_samples = y.shape[0]
    
    # Set default cross-validation to KFold if not specified
    cv = KFold() if not cv else cv
    
    # Loop through outer cross-validation folds
    for train, test in cv.split(np.arange(n_samples), groups=groups):
        
        # Select training and test rows (per band when X is a list of bands)
        X_train = [x[train] for x in X] if isinstance(X, list) else X[train]
        X_test = [x[test] for x in X] if isinstance(X, list) else X[test]
        
        # Create inner cross-validation loop if specified
        if inner_cv:
            # Split inner cross-validation with groups if supplied
            # (explicit None check so array-like groups also work)
            inner_groups = np.array(groups)[train] if groups is not None else None
            inner_splits = inner_cv.split(np.arange(n_samples)[train],
                                          groups=inner_groups)
            
            # Update estimator with inner cross-validator
            estimator.set_params(cv=inner_splits)
            print(np.unique(inner_groups))
        
        # Fit the regression model on training data
        estimator.fit(X_train, y[train])
        
        # Zero out coefficients for confounds if provided
        if confounds is not None:
            estimator.coef_[-len(confounds):] = 0
        
        # Compute predictions with optional splitting by band
        kwargs = {}
        if split is not None:
            kwargs['split'] = split
        test_prediction = estimator.predict(X_test, **kwargs)
        
        # Test scores should also optionally split by band
        test_score = scoring(y[test], test_prediction)
        
        # Populate results dictionary
        results['coefficients'].append(estimator.coef_)
        results['test_predictions'].append(test_prediction)
        results['test_scores'].append(test_score)
        
    return results
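
The `scoring` argument is called as `scoring(y_true, y_pred)` and is expected to return one score per target (voxel). As a rough NumPy sketch of that contract, here is a column-wise Pearson correlation; `columnwise_correlation` is a hypothetical stand-in written for illustration, not himalaya's implementation:

```python
import numpy as np

def columnwise_correlation(y_true, y_pred):
    """Pearson correlation between matching columns of two (samples, targets) arrays."""
    y_true = y_true - y_true.mean(axis=0)
    y_pred = y_pred - y_pred.mean(axis=0)
    numerator = (y_true * y_pred).sum(axis=0)
    denominator = np.sqrt((y_true ** 2).sum(axis=0) * (y_pred ** 2).sum(axis=0))
    return numerator / denominator

rng = np.random.default_rng(0)
y_true = rng.standard_normal((50, 3))
y_pred = y_true.copy()
y_pred[:, 1] *= -1                       # perfectly anti-correlated target
y_pred[:, 2] = rng.standard_normal(50)   # unrelated target

scores = columnwise_correlation(y_true, y_pred)
print(scores.shape)  # one score per target
```

In this sketch a constant column has zero variance, so its correlation is undefined and comes out as NaN.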

In [57]:
y = Y[:, :100]

# Default estimator should be GroupRidgeCV
estimator = GroupRidgeCV(groups='input')

bands = [mfccs, mel]
# Note: `confounds` is defined above but is not passed to `_model_cv` in this example

# Default cross-validation should be leave-one-run-out
# Use `run_id` to define cross-validation strategy
n_runs = len(X_meta['run_id'].unique())
cv = GroupKFold(n_splits=n_runs)
inner_cv = GroupKFold(n_splits=n_runs - 1)
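
Setting `n_splits` to the number of runs makes `GroupKFold` behave as leave-one-run-out: each fold holds out exactly one run. A small check on made-up run labels (the `toy_*` names and the four-volumes-per-run layout are assumptions for this sketch):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical run labels: 5 runs with 4 volumes each
toy_runs = np.repeat([1433, 1434, 1435, 1436, 1437], 4)
toy_n_runs = len(np.unique(toy_runs))

outer = GroupKFold(n_splits=toy_n_runs)
held_out = []
for train, test in outer.split(np.arange(len(toy_runs)), groups=toy_runs):
    # Each test fold contains exactly one run, absent from the training fold
    test_runs = np.unique(toy_runs[test])
    assert len(test_runs) == 1
    assert test_runs[0] not in toy_runs[train]
    held_out.append(int(test_runs[0]))

# Every run is held out exactly once
assert sorted(held_out) == [1433, 1434, 1435, 1436, 1437]
```

The inner splitter follows the same logic on the remaining `n_runs - 1` training runs, so hyperparameters are selected without touching the held-out run.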

In [63]:
y.shape

(3052, 100)

In [68]:
# Run banded model with specified cross-validation, groups, and per-band split predictions
results = _model_cv(estimator, X, y, cv=cv, bands=bands, inner_cv=inner_cv, groups=X_meta['run_id'].tolist(), split=True)



[1433 1434 1435 1436]
[........................................] 100% | 11.79 sec | 100 random sampling with cv | 
[1433 1434 1435 1437]
[........................................] 100% | 13.84 sec | 100 random sampling with cv | 
[1434 1435 1436 1437]
[........................................] 100% | 14.90 sec | 100 random sampling with cv | 
[1433 1434 1436 1437]
[........................................] 100% | 14.70 sec | 100 random sampling with cv | 
[1433 1435 1436 1437]
[........................................] 100% | 15.13 sec | 100 random sampling with cv | 


In [77]:
results['coefficients'][0].shape

(84, 100)

In [94]:
results['test_scores'][0]

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.00145748,
        -0.01041323, -0.02745176, -0.05784031, -0.00137035, -0.06062118,
        -0.03022619, -0.07361765, -0.04044312,  0.02127786, -0.00946764,
         0.01171312, -0.02639117, -0.0458949 ,  0.00366606, -0.00655776,
        -0.02817272, -0.02004944,  0.04037386,  0.01514264,  0.03978828,
         0.12336703,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0. 

In [309]:
# Using single-band (non-banded) model with sklearn RidgeCV
from sklearn.linear_model import RidgeCV
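# Pass the design matrix `X` (predictor columns only) rather than `X_vars`, which also holds metadata columns
results = _model_cv(RidgeCV(), X, y, cv=cv, inner_cv=inner_cv, groups=X_meta['run_id'].tolist())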



[1433 1434 1435 1436]
[1433 1434 1435 1437]
[1434 1435 1436 1437]
[1433 1434 1436 1437]
[1433 1435 1436 1437]


### Handling outputs

In [None]:
### 

### Validate against other workflows

In [None]:
### 