# Dataset objects in pyrsa

These exercises show how to load and structure a dataset object.

In this demo, we first walk through loading a single-subject dataset from a .mat file and arranging it into a pyrsa dataset object.

We then demonstrate how to create dataset objects using data from multiple subjects.

In [None]:
# relevant imports
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
import pyrsa
import pyrsa.data as rsd # short alias for the dataset module

## 1. Single-subject dataset example

### Getting started

We will use a dataset where one subject was presented with 92 different visual stimuli while brain responses were measured in 100 voxels.
The different visual stimuli (each row) are the conditions, and the voxels (each column) are the measurement channels.

In [None]:
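# import the measurements for the dataset
# (loadmat returns a dict keyed by MATLAB variable name)
measurements = io.matlab.loadmat('92imageData/simTruePatterns.mat')
measurements = measurements['simTruePatterns']
nCond = measurements.shape[0] # number of conditions (rows)
nVox = measurements.shape[1] # number of voxels (columns)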

# plot the imported data
plt.imshow(measurements, cmap='gray')
plt.xlabel('Voxels')
plt.ylabel('Conditions')
plt.title('Measurements')

### Creating the dataset object

We will now arrange the loaded data into a dataset object for use in pyrsa.

A dataset object contains all the information needed to calculate a representational dissimilarity matrix (RDM). Therefore, the dataset must include:
 - measurements: [NxP] numpy.ndarray. These are the observations (N) from each measurement channel (P).
 - obs_descriptors: dict that defines the condition label associated with each observation in measurements

Because we also want to include helpful information about this dataset, we include the additional information:
 - descriptors: dict with general metadata about this dataset object (e.g. experiment session #, subject #, experiment name)
 - channel_descriptors: dict that identifies each column (channel) in measurements

To start, we will note the session # (e.g. the first scanning session) and the subject # for this dataset. In addition, we will create labels for each of the 92 conditions and 100 voxels. Finally, we package this information into a pyrsa dataset object.

In [None]:
# now create a dataset object
des = {'session': 1, 'subj': 1}
obs_des = {'conds': np.array(['cond_' + str(x) for x in np.arange(nCond)])}
chn_des = {'voxels': np.array(['voxel_' + str(x) for x in np.arange(nVox)])}
# alternative: label conditions and voxels starting from 1 instead of 0
#obs_des = {'conds': np.array(['cond_' + str(x) for x in np.arange(1,nCond+1)])}
#chn_des = {'voxels': np.array(['voxel_' + str(x) for x in np.arange(1,nVox+1)])}
data = rsd.Dataset(measurements=measurements,
                   descriptors=des,
                   obs_descriptors=obs_des,
                   channel_descriptors=chn_des)
print(data)
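
Because obs_descriptors labels each observation (row) and channel_descriptors labels each channel (column), every descriptor array should have one entry per row or column of measurements. The sketch below checks this in plain numpy before a dataset is built. The helper name `descriptor_lengths_match` and the toy arrays are assumptions made for illustration; they are not part of pyrsa, and the check is not meant to mirror whatever validation the library performs internally.

```python
import numpy as np

def descriptor_lengths_match(measurements, obs_descriptors, channel_descriptors):
    """Hypothetical helper (not part of pyrsa): compare descriptor lengths to the data shape."""
    n_obs, n_channel = measurements.shape
    # every observation descriptor needs one entry per row
    obs_ok = all(len(v) == n_obs for v in obs_descriptors.values())
    # every channel descriptor needs one entry per column
    chn_ok = all(len(v) == n_channel for v in channel_descriptors.values())
    return obs_ok and chn_ok

toy_measurements = np.zeros((4, 3))  # 4 observations, 3 channels
toy_obs_des = {'conds': np.array(['cond_' + str(x) for x in range(4)])}
toy_chn_des = {'voxels': np.array(['voxel_' + str(x) for x in range(3)])}

lengths_ok = descriptor_lengths_match(toy_measurements, toy_obs_des, toy_chn_des)
# passing the two dicts in the wrong order should fail the check
swapped_ok = descriptor_lengths_match(toy_measurements, toy_chn_des, toy_obs_des)
print(lengths_ok, swapped_ok)
```

Running a check like this first can make a mismatch between labels and data easier to spot than an error raised deeper in an analysis.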

Sometimes we wish to consider only a subset of the data: either a subset of observations (conditions) or a subset of measurement channels. For example, we might keep only the measurement channels for which all subjects have data, or only the conditions that occur across all subjects or sessions. The dataset methods 'subset_obs' and 'subset_channel' select a subset of the conditions or channels, respectively.

In [None]:
# create an example dataset with random data, subset some conditions
nChannel = 50
nObs = 12
randomData = np.random.rand(nObs, nChannel)
des = {'session': 1, 'subj': 1}
obs_des = {'conds': np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])}
chn_des = {'voxels': np.array(['voxel_' + str(x) for x in np.arange(nChannel)])}
data = rsd.Dataset(measurements=randomData,
                   descriptors=des,
                   obs_descriptors=obs_des,
                   channel_descriptors=chn_des
                   )
# select a subset of the dataset: keep only the observations from conditions 0 to 4
sub_data = data.subset_obs(by='conds', value=[0,1,2,3,4])
print(sub_data)
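
Conceptually, subsetting observations by a descriptor amounts to keeping the rows whose label is in the requested set. The following numpy-only sketch illustrates that idea on toy data. It is an illustration under that assumption, not pyrsa's implementation, and the `toy_*` names are made up for the example.

```python
import numpy as np

toy_data = np.arange(12 * 2).reshape(12, 2)  # 12 observations, 2 channels
toy_conds = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

# boolean mask: True for rows whose condition label is in the requested set
keep = np.isin(toy_conds, [0, 1, 2, 3, 4])
sub_measurements = toy_data[keep]
sub_conds = toy_conds[keep]

print(sub_measurements.shape)
print(sub_conds)
```

The two observations labelled with condition 5 are dropped, and the descriptor array is filtered with the same mask so that labels and rows stay aligned.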

Additionally, you might want to split the data in a certain way and analyze the splits as separate datasets. For instance, if your channels belong to different ROIs, you might wish to perform the subsequent analyses separately for each ROI. Similarly, you could split the observations. The dataset object supports this with the 'split_obs' and 'split_channel' methods.

In [None]:
# Split by channels
nChannel = 3 # three ROIs
nChannelVox = 10 # each ROI has 10 voxels
nObs = 4
randomData = np.random.rand(nObs, nChannel*nChannelVox)
des = {'session': 1, 'subj': 1}
obs_des = {'conds': np.array([0, 1, 2, 3])}
# np.tile repeats the three ROI labels in order (ROI1, ROI2, ROI3, ROI1, ...),
# giving one label per channel; no numpy.matlib import is needed
chn_des = {'ROIs': np.tile(['ROI1', 'ROI2', 'ROI3'], nChannelVox)}
data = rsd.Dataset(measurements=randomData,
                   descriptors=des,
                   obs_descriptors=obs_des,
                   channel_descriptors=chn_des
                   )
split_data = data.split_channel(by='ROIs')
print(split_data)
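
One way to picture a split by channel descriptor is as a grouping of the columns by their label. The numpy-only sketch below builds one measurement block per unique ROI label. It is only meant to convey the idea and makes no claim about how pyrsa implements the split; the `toy_*` names and the `splits` dict are illustrative.

```python
import numpy as np

toy_data = np.arange(4 * 6).reshape(4, 6)  # 4 observations, 6 channels
toy_rois = np.array(['ROI1', 'ROI2', 'ROI3', 'ROI1', 'ROI2', 'ROI3'])

# group the columns by ROI label: one (observations x channels) block per ROI
splits = {}
for roi in np.unique(toy_rois):
    splits[str(roi)] = toy_data[:, toy_rois == roi]

for roi, block in splits.items():
    print(roi, block.shape)
```

Each block keeps all observations but only the channels carrying that ROI label, so the blocks can be analyzed independently.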

## 2. Multi-subject dataset example

First, we generate random data for a number of subjects. For simplicity, here we set each subject to have the same number of voxels and conditions.

In [None]:
# create a dataset with random data
nVox = 50 # 50 voxels/electrodes/measurement channels
nCond = 10 # 10 conditions
nSubj = 5 # 5 different subjects
randomData = np.random.rand(nCond, nVox, nSubj)

We can then build a list of dataset objects, appending one dataset per subject.

In [None]:
obs_des = {'conds': np.array(['cond_' + str(x) for x in np.arange(nCond)])}
chn_des = {'voxels': np.array(['voxel_' + str(x) for x in np.arange(nVox)])}

data = [] # list of dataset objects
for i in np.arange(nSubj):
    des = {'session': 1, 'subj': i+1}
    # append the dataset object for subject i to the data list
    data.append(rsd.Dataset(measurements=randomData[:,:,i],
                            descriptors=des,
                            obs_descriptors=obs_des,
                            channel_descriptors=chn_des
                            )
                )
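
With the random data stored as an array of shape (conditions, voxels, subjects), indexing the last axis yields one 2D measurement matrix per subject. The numpy-only sketch below shows that slicing together with a quick sanity check that the subjects do not share data. The variable names are illustrative, and the seeded generator is an assumption added only to make the example reproducible.

```python
import numpy as np

n_cond, n_vox, n_subj = 10, 50, 5
rng = np.random.default_rng(0)  # seeded only for reproducibility
toy = rng.random((n_cond, n_vox, n_subj))

# one (conditions x voxels) matrix per subject, taken along the last axis
per_subject = [toy[:, :, i] for i in range(n_subj)]
print(len(per_subject), per_subject[0].shape)

# sanity check: no other subject's matrix equals the first subject's
all_distinct = all(
    not np.array_equal(per_subject[0], per_subject[i]) for i in range(1, n_subj)
)
print(all_distinct)
```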