# Estimating dissimilarities

This tutorial shows how to estimate Representational Dissimilarity Matrices (RDMs) from data.

In [None]:
# relevant imports
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
import pyrsa
import pyrsa.data as rsd # abbreviation to deal with dataset
import pyrsa.rdm as rsr

We first generate an example dataset we want to calculate RDM(s) from. If you are unfamiliar with the dataset object in pyrsa, have a look at `example_dataset.ipynb`. 

For this tutorial we use simulated data for the 92-image dataset, which ships with the toolbox and is simply loaded here:

In [None]:
# load the simulated measurements (conditions x voxels)
measurements = io.matlab.loadmat('92imageData/simTruePatterns.mat')
measurements = measurements['simTruePatterns2']
nCond = measurements.shape[0]
nVox = measurements.shape[1]
# now create a dataset object
des = {'session': 1, 'subj': 1}
obs_des = {'conds': np.array(['cond_' + str(x) for x in np.arange(nCond)])}
chn_des = {'voxels': np.array(['voxel_' + str(x) for x in np.arange(nVox)])}
data = rsd.Dataset(measurements=measurements,
                   descriptors=des,
                   obs_descriptors=obs_des,
                   channel_descriptors=chn_des)

### Calculating our first RDM
The main function for calculating RDMs from data is `pyrsa.rdm.calc_rdm`, abbreviated here as `rsr.calc_rdm`. It takes a dataset object as its main input. Here we additionally pass the descriptor 'conds' to specify that we want an RDM of dissimilarities between the conditions labeled by 'conds'. If this argument is omitted, the RDM is calculated assuming that each row is a separate pattern or condition. To avoid confusion, we generally recommend passing the `descriptor` argument.

In [None]:
# calculate an RDM
RDM_euc = rsr.calc_rdm(data, descriptor='conds')
print(RDM_euc)

As you can see, the RDMs object can be printed for easy inspection.
The calculated dissimilarities are stored as a vector of the strung-out upper-triangular elements of the RDM matrix. Note also that the RDMs object inherits the descriptors from the dataset object.
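
To make the vector storage concrete, here is a minimal numpy sketch of how a symmetric dissimilarity matrix can be strung out into its upper-triangular elements and rebuilt again. The helper names `matrix_to_vector` and `vector_to_matrix` are made up for this illustration and are not part of pyrsa, and the row-by-row ordering is an assumption of the sketch.

```python
import numpy as np

def matrix_to_vector(matrix):
    """Return the strictly upper-triangular entries of a square matrix, row by row."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def vector_to_matrix(vector, n_cond):
    """Rebuild a symmetric matrix with a zero diagonal from its upper-triangular vector."""
    matrix = np.zeros((n_cond, n_cond))
    rows, cols = np.triu_indices(n_cond, k=1)
    matrix[rows, cols] = vector
    matrix[cols, rows] = vector
    return matrix

# toy symmetric dissimilarity matrix for 4 conditions measured in 3 channels
rng = np.random.default_rng(0)
points = rng.standard_normal((4, 3))
toy_matrix = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)

toy_vector = matrix_to_vector(toy_matrix)
print(toy_vector.shape)  # (6,) because 4 conditions give 4 * 3 / 2 = 6 pairs
restored = vector_to_matrix(toy_vector, 4)
```

For the 92 conditions used in this tutorial the same logic gives 92 * 91 / 2 = 4186 entries per RDM.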

By default `calc_rdm` computes squared Euclidean distances between mean patterns. To compute a different type of RDM, pass the `method` parameter. See https://rsa3.readthedocs.io/en/latest/distances.html for a discussion of different methods for calculating RDMs.
For example, we can calculate correlation distances like this:

In [None]:
RDM_corr = rsr.calc_rdm(data, method='correlation', descriptor='conds')
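
For intuition about what the two methods measure, the following numpy-only sketch computes both kinds of dissimilarity for a small random pattern matrix. It is an illustration under simplifying assumptions, not the toolbox's implementation: the function names are hypothetical, and the toolbox may apply additional normalization (for instance by the number of channels) that is left out here.

```python
import numpy as np

def squared_euclidean(patterns):
    """Squared Euclidean distance between all pairs of rows."""
    diff = patterns[:, None, :] - patterns[None, :, :]
    return (diff ** 2).sum(axis=2)

def correlation_distance(patterns):
    """One minus the Pearson correlation between all pairs of rows."""
    return 1.0 - np.corrcoef(patterns)

rng = np.random.default_rng(1)
toy_patterns = rng.standard_normal((5, 20))  # 5 conditions x 20 channels

d_euc = squared_euclidean(toy_patterns)
d_corr = correlation_distance(toy_patterns)

# shifting and scaling all patterns changes the Euclidean distances,
# but leaves the correlation distances untouched
d_euc_scaled = squared_euclidean(3.0 * toy_patterns + 2.0)
d_corr_scaled = correlation_distance(3.0 * toy_patterns + 2.0)
print(np.allclose(d_corr, d_corr_scaled))
print(np.allclose(d_euc, d_euc_scaled))
```

The practical consequence is that correlation distance discards the mean and overall amplitude of each pattern, while Euclidean distance keeps both.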

To access the dissimilarities stored in the rdms object, use the `get_matrices` and `get_vectors` functions. Because an rdms object can store multiple rdms, as we discuss below, the first dimension of the returned arrays always indexes the rdms, even when there is only one.

In [None]:
dist_vectors = RDM_euc.get_vectors() # one vector per rdm, here a single one
dist_matrix = RDM_euc.get_matrices()
print(dist_matrix)
print(dist_matrix.shape)
print(dist_vectors.shape)

Also, for a quick look we can plot the RDM using `pyrsa.vis.show_rdm`:

In [None]:
pyrsa.vis.show_rdm(RDM_euc)

If you have already calculated an RDM in some other way, you can turn it into an RDMs object for use in pyrsa with the constructor `pyrsa.rdm.RDMs`. If you want to attach descriptors for the conditions or rdms you put into the object, you need to specify them as dictionaries of lists, as for the dataset object.

The following thus creates a bare RDMs object, which contains only the dissimilarities and no specific descriptors.

In [None]:
# create an RDM object with given entries:
dissimilarities = RDM_euc.get_vectors()
RDM_euc_manual = rsr.RDMs(dissimilarities)

## Creating an RDMs object for several RDMs
When we have multiple datasets we can compute the RDMs for each by simply passing the whole list to the function. This is convenient when we want to compute RDMs for multiple subjects, conditions, brain areas, etc.

To illustrate this let's start by creating a list of 5 datasets with noisy copies of the measurements we already have, labeling them as coming from different subjects in the descriptor `'subj'`:

In [None]:
data_list = []
for i in range(5):
    m_noisy = measurements + np.random.randn(*measurements.shape)
    des = {'session': 1, 'subj': i}
    data_list.append(rsd.Dataset(measurements=m_noisy,
                                 descriptors=des,
                                 obs_descriptors=obs_des,
                                 channel_descriptors=chn_des))

As promised we can now calculate the RDMs for all subjects in one go:

In [None]:
rdms = rsr.calc_rdm(data_list)

Note that `rdms` is a single object, which contains all RDMs. The functions for accessing the vector representation and the matrix representation are still available. Additionally, the number of RDMs and the descriptors we gave to the dataset objects are kept:

In [None]:
print('The number of RDMs is:')
print(rdms.n_rdm)
print()
print('The descriptors for the RDMs are:')
print(rdms.rdm_descriptors)
print()
print('The patterns or conditions are still described at least by their label:')
print(rdms.pattern_descriptors['pattern'])

### Accessing parts of the rdms object
To access only a subset of the rdms in the object, use the `subset` and `subsample` functions. Their inputs are the descriptor used for the selection and a list (or other iterable) of selected values.

The difference between the two functions lies in how they treat repetitions. If you pass a value twice, `subsample` will repeat the rdm in the returned object, while `subset` will return every rdm at most once.
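
The selection logic described above can be sketched in plain numpy as follows. The functions `subset_indices` and `subsample_indices` are hypothetical helpers written only for this illustration; they mimic the described behavior on an array of descriptor values and say nothing about how pyrsa implements it internally.

```python
import numpy as np

def subset_indices(descriptor_values, selection):
    """Keep each matching entry at most once, in the original order."""
    mask = np.isin(descriptor_values, selection)
    return np.flatnonzero(mask)

def subsample_indices(descriptor_values, selection):
    """Return the matching entries once per selected value, repeats included."""
    return np.concatenate(
        [np.flatnonzero(descriptor_values == value) for value in selection])

subj = np.arange(5)  # stand-in for the 'subj' descriptor of five rdms

print(subset_indices(subj, [1, 3, 3, 4]))     # [1 3 4]
print(subsample_indices(subj, [1, 3, 3, 4]))  # [1 3 3 4]
```

Repetition-preserving selection is what makes `subsample` suitable for resampling schemes such as bootstrapping, where the same rdm may be drawn more than once.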

In [None]:
# same output:
r1 = rdms.subset('subj', [1, 3, 4])
r2 = rdms.subsample('subj', [1, 3, 4])
# different output
r3 = rdms.subset('subj', [1, 3, 3, 4])
r4 = rdms.subsample('subj', [1, 3, 3, 4])
# r3 has 3 rdms, r4 has 4 rdms

Equivalent syntax for selecting a subset of the patterns is implemented as `subset_pattern` and `subsample_pattern`.

For repeated values, `subsample_pattern` fills in the dissimilarities between a pattern and its own copies as `np.nan`.
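
As a rough picture of what this means for the matrix, the sketch below resamples the rows and columns of a small toy RDM and marks the entries that compare a pattern with a copy of itself. The function `subsample_patterns` is a hypothetical helper for illustration only, and leaving the diagonal at zero is an assumption of this sketch.

```python
import numpy as np

def subsample_patterns(matrix, indices):
    """Resample rows and columns of a square RDM, marking self-pairs as NaN."""
    indices = np.asarray(indices)
    resampled = matrix[np.ix_(indices, indices)].astype(float)
    # entries comparing two copies of the same original pattern
    same_pattern = indices[:, None] == indices[None, :]
    off_diagonal = ~np.eye(len(indices), dtype=bool)
    resampled[same_pattern & off_diagonal] = np.nan
    return resampled

toy_rdm = np.array([[0., 1., 2.],
                    [1., 0., 3.],
                    [2., 3., 0.]])

# pattern 1 is drawn twice, so the pair (copy 1, copy 2) has no valid dissimilarity
resampled = subsample_patterns(toy_rdm, [0, 1, 1])
print(resampled)
```

Marking these entries as missing keeps the artificial zero distance between the two copies from being counted as a real dissimilarity in later analyses.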

In [None]:
# same output:
r1 = rdms.subset_pattern('pattern', [1, 3, 4, 5, 6, 72])
r2 = rdms.subsample_pattern('pattern', [1, 3, 4, 5, 6, 72])
# different output
r3 = rdms.subset_pattern('pattern', [1, 3, 3, 4, 5, 6, 72])
r4 = rdms.subsample_pattern('pattern', [1, 3, 3, 4, 5, 6, 72])
# r3 has 6 conditions, r4 has 7 conditions

Indexing and iterating over RDMs is also supported, i.e. `rdms[0]` returns the first rdm and `for rdm in rdms:` is a legal command. Note, however, that these commands return copies: `rdms[0]` and `rdm` are copies of the corresponding rdms, and changing them does not affect the original rdms object.

And of course we can still show the rdm in a plot using `pyrsa.vis.show_rdm`:

In [None]:
pyrsa.vis.show_rdm(rdms)

# Crossvalidated dissimilarities
When we have multiple independent measurements of a pattern, we can use crossvalidated distances to obtain an unbiased estimate of the dissimilarities between patterns. Essentially, this counteracts the upward bias that measurement noise introduces into ordinary distance estimates. You may have noticed this bias by comparing the noisy RDMs we just created with the clean RDM we created at the beginning of this tutorial.
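
The bias and its removal can be checked with a small numpy simulation. This is a simplified sketch of a crossvalidated squared Euclidean distance between two patterns. It leaves out the noise normalization that a crossnobis estimate would additionally involve, and all names and sizes in it are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels = 100
pattern_a = rng.standard_normal(n_channels)
pattern_b = rng.standard_normal(n_channels)
true_distance = np.sum((pattern_a - pattern_b) ** 2)

def noisy_difference():
    """Pattern difference a - b from one independent measurement with unit-variance noise."""
    noise_a = rng.standard_normal(n_channels)
    noise_b = rng.standard_normal(n_channels)
    return (pattern_a + noise_a) - (pattern_b + noise_b)

n_repeats = 2000
plain = np.empty(n_repeats)
crossvalidated = np.empty(n_repeats)
for i in range(n_repeats):
    diff_1 = noisy_difference()  # e.g. measured in session 1
    diff_2 = noisy_difference()  # e.g. measured in session 2
    # ordinary estimate: the noise is squared and adds a positive term
    plain[i] = np.sum(diff_1 ** 2)
    # crossvalidated estimate: independent noise averages out in the product
    crossvalidated[i] = np.sum(diff_1 * diff_2)

print('true distance:          ', true_distance)
print('mean plain estimate:    ', plain.mean())
print('mean crossvalidated est:', crossvalidated.mean())
```

On average the plain estimate exceeds the true distance by the summed noise variance of the difference (here 2 per channel), while the crossvalidated estimate is centered on the true value. Individual crossvalidated estimates can be negative, which is the price of unbiasedness.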

To illustrate how to do this using pyrsa, we first create a dataset with multiple (3) measurements for each pattern:

In [None]:
# repeat each pattern 3 times (one row per session) and add independent noise
m_noisy = np.repeat(measurements, 3, axis=0)
m_noisy += np.random.randn(*m_noisy.shape)

# rows are ordered cond_0 x 3, cond_1 x 3, ..., so the descriptors follow that order
conds = np.array(['cond_' + str(x) for x in np.arange(nCond)])
sessions = np.tile([1, 2, 3], nCond)
conds = np.repeat(conds, 3)
obs_des = {'conds': conds, 'sessions': sessions}

des = {'session': 1, 'subj': 1}

dataset = rsd.Dataset(
    measurements=m_noisy,
    descriptors=des,
    obs_descriptors=obs_des,
    channel_descriptors=chn_des)

Importantly, we added a sessions descriptor which marks which measurement comes from which session. We can now compute the crossvalidated distances simply using the `'crossnobis'` rdm calculation method. To specify which measurements come from the same session we pass `'sessions'` as the `cv_descriptor`. 

In [None]:
rdm_cv = pyrsa.rdm.calc_rdm(dataset, method='crossnobis', descriptor='conds', cv_descriptor='sessions')

In [None]:
pyrsa.vis.show_rdm(rdm_cv)

Looking at this rdm, we can see that this indeed removed the overall upward bias, although the rdm is still noisy of course.

If you have multiple datasets, for example one per subject, the crossnobis dissimilarity works with them in the same way.