# Querying and Managing BIDS Datasets

## PyBIDS Library

PyBIDS is a Python library designed to query, manage, and interact with neuroimaging datasets formatted according to the Brain Imaging Data Structure (BIDS) standard.

It facilitates efficient access to file paths, metadata, and data entities, allowing researchers to automate data analysis pipelines and integrate with tools like pandas and NiBabel.

Key features and functionalities include:

* **Dataset Layout:** The **`BIDSLayout`** class provides comprehensive indexing and querying capabilities for BIDS-structured data, including derivatives.

* **File Handling:** The **`BIDSFile`** class allows for easy extraction of metadata, associated files, and paths.

* **Querying:** Users can efficiently retrieve specific files based on various entities (e.g., subject, task, session).

* **Metadata Access:** Reads the JSON sidecar files associated with each data file, so their metadata can be retrieved and used in queries.

* **Reporting:** Includes functionality to generate reports on dataset content.

**Key Components:**

* **`bids.layout`**: Module for indexing and querying datasets.

* **`bids.layout.BIDSLayout`**: Core class for interacting with the dataset.

* **`bids.layout.BIDSFile`**: Represents individual files with methods to retrieve associated metadata and file information.

## Install PyBIDS Library

In [1]:
!pip install -q pybids



## Download BIDS Dataset

We use a BIDS-formatted neuroimaging dataset called `7t_trt`, commonly used to illustrate how to query, navigate, and summarize BIDS datasets.

It provides a concrete example of a 7-Tesla Test-Retest (trt) functional MRI dataset.

It follows the Brain Imaging Data Structure (BIDS) standard, containing subject directories (e.g., `sub-01`), sessions (e.g., `ses-1`), and imaging data (e.g., `func/` folder).

Download the `7t_trt` BIDS dataset:

In [2]:
!git clone https://github.com/tariqzahratahdi/bids-test-dataset-7t_trt

Cloning into 'bids-test-dataset-7t_trt'...
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 161 (delta 99), reused 161 (delta 99), pack-reused 0 (from 0)
Receiving objects: 100% (161/161), 16.78 KiB | 8.39 MiB/s, done.
Resolving deltas: 100% (99/99), done.


## Imports

We import the **`BIDSLayout`** class, which we will use to manage and query the layout of files on disk.

In [3]:
# imports
from bids import BIDSLayout

## The **`BIDSLayout`**

The **`BIDSLayout`** class is the core of PyBIDS.

A **`BIDSLayout`** object represents a BIDS project file tree and provides methods for querying and manipulating BIDS files.

We create a **`BIDSLayout`** object by passing in the BIDS dataset location as an argument:

In [4]:
# set dataset path
data_path = 'bids-test-dataset-7t_trt'

# create a BIDSLayout object
layout = BIDSLayout(data_path)

We print some basic information about the layout:

In [5]:
# print some basic information about the layout
layout

BIDS Layout: ...ntent/bids-test-dataset-7t_trt | Subjects: 10 | Sessions: 20 | Runs: 20

## Querying the **`BIDSLayout`**

We use the **`get_subjects()`** method to get the list of the subjects:

In [6]:
# get list of subjects
layout.get_subjects()

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']

We use the **`get_sessions()`** method to get the list of the sessions:

In [7]:
# get list of sessions
layout.get_sessions()

['1', '2']

We use the **`get_tasks()`** method to get the list of the tasks:

In [8]:
# get list of tasks
layout.get_tasks()

['rest']

### The **`get()`** Method

Called without arguments, the **`get()`** method returns a list of all the BIDS files in our dataset:

In [None]:
# get a list of all files in the layout
all_files = layout.get()

# print the number of files in the layout
print('number of files in the layout: ', len(all_files))

number of files in the layout:  340


Print the first 10 files in the layout:

In [None]:
# print the first 10 files in the layout
print("The first 10 files are:")
all_files[:10]

The first 10 files are:


[<BIDSJSONFile filename='/content/bids-test-dataset-7t_trt/dataset_description.json'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/participants.tsv'>,
 <BIDSFile filename='/content/bids-test-dataset-7t_trt/README'>,
 <BIDSFile filename='/content/bids-test-dataset-7t_trt/README.md'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1map.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1w.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude1.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude2.nii.gz'>,
 <BIDSJSONFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.json'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.nii.gz'>]

The returned object is a Python list. By default, each element in the list is a **`BIDSFile`** object.

To get a list of only the filenames, we set the **`return_type`** argument to **`'filename'`**:

In [None]:
# get a list of first ten filenames in the layout
layout.get(return_type='filename')[:10]

['/content/bids-test-dataset-7t_trt/dataset_description.json',
 '/content/bids-test-dataset-7t_trt/participants.tsv',
 '/content/bids-test-dataset-7t_trt/README',
 '/content/bids-test-dataset-7t_trt/README.md',
 '/content/bids-test-dataset-7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1map.nii.gz',
 '/content/bids-test-dataset-7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1w.nii.gz',
 '/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude1.nii.gz',
 '/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude2.nii.gz',
 '/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.json',
 '/content/bids-test-dataset-7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.nii.gz']

#### Filtering Files by Entities

The **`get()`** method accepts arguments that allow us to filter the result set based on specified criteria.

We can pass any BIDS-defined keywords, which are called *entities* in PyBIDS.

Here are a few of the most common entities:
* **`suffix`**: The part of a BIDS filename just before the extension (e.g., **`'bold'`**, **`'events'`**, **`'physio'`**, etc.).
* **`subject`**: The subject label.
* **`session`**: The session label.
* **`run`**: The run index.
* **`task`**: The task name.

In the following example, we retrieve all BOLD runs with **`.nii.gz`** extensions for subject **`'01'`**:

In [None]:
# Retrieve files of all BOLD runs for subject 01
layout.get(subject='01', extension='.nii.gz', suffix='bold')

[<BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-2_bold.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-prefrontal_bold.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-fullbrain_run-1_bold.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-fullbrain_run-2_bold.nii.gz'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-prefrontal_bold.nii.gz'>]
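
To make the filtering semantics concrete, here is a minimal plain-Python sketch of the matching idea. It does not use PyBIDS and is not how **`get()`** is implemented: the `matches` and `select` helpers and the hand-written entity records are hypothetical, for illustration only. Every keyword must match, and a list of values is treated as "any of these".

```python
# Hypothetical, hand-written entity records standing in for indexed files.
files = [
    {"subject": "01", "session": "1", "suffix": "bold", "extension": ".nii.gz"},
    {"subject": "01", "session": "2", "suffix": "bold", "extension": ".nii.gz"},
    {"subject": "01", "session": "1", "suffix": "physio", "extension": ".tsv.gz"},
    {"subject": "02", "session": "1", "suffix": "bold", "extension": ".nii.gz"},
]


def matches(entities, **filters):
    """Return True if every filter keyword matches the record's entities."""
    for key, wanted in filters.items():
        # A scalar filter is treated as a one-element list of allowed values.
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if entities.get(key) not in allowed:
            return False
    return True


def select(records, **filters):
    """Keep only the records that satisfy all filters."""
    return [rec for rec in records if matches(rec, **filters)]


bold_sub01 = select(files, subject="01", suffix="bold", extension=".nii.gz")
print(len(bold_sub01))  # 2
```

Adding a keyword can only narrow the result, while adding a value to a list can only widen it.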

#### Filtering by Metadata

All of the entities listed above are found in the names of BIDS files.

But sometimes we want to search for files based not just on their names, but also based on metadata defined in JSON files.

When we initialize a **`BIDSLayout`** object, all metadata files associated with BIDS files are automatically indexed.

This means we can pass any key that occurs in any JSON file in our project as an argument to the **`get()`** method.

We can combine these with any number of core BIDS entities (like **`subject`**, **`run`**, etc.).

In the following example, we retrieve all files where:
* (a) the value of **`SamplingFrequency`** (a metadata key) is `100`,
* (b) the **`acquisition`** type is `'prefrontal'`,
* (c) the **`subject`** is `'01'` or `'02'`:

In [None]:
# Retrieve all files where SamplingFrequency (a metadata key) = 100
# and acquisition = prefrontal, for the first two subjects
layout.get(subject=['01', '02'], SamplingFrequency=100, acquisition="prefrontal")

[<BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-prefrontal_physio.tsv.gz'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-prefrontal_physio.tsv.gz'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-02/ses-1/func/sub-02_ses-1_task-rest_acq-prefrontal_physio.tsv.gz'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-02/ses-2/func/sub-02_ses-2_task-rest_acq-prefrontal_physio.tsv.gz'>]

#### Other **`return_type`** Values

While we’ll typically want to work with either **`BIDSFile`** objects or filenames, we can also ask the **`get()`** method to return unique values (or ids) of particular entities.

For example, say we want to know which subjects have at least one T1w file.<br>
We can request that information by setting **`return_type='id'`**.<br>
When using this option, we also need to specify a target entity (or metadata keyword) called **`target`**.<br>
This combination tells the **`BIDSLayout`** object to return the unique values for the specified target entity.

In the following example, we ask for all of the unique subject IDs that have at least one file with a **`T1w`** suffix:

In [None]:
# get the ids of subjects that have T1w files
layout.get(return_type='id', target='subject', suffix='T1w')

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
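
Conceptually, **`return_type='id'`** amounts to collecting the distinct values of the target entity across all files that pass the filters. The following plain-Python sketch shows that idea on a few hand-written records. The `unique_ids` helper and the records are hypothetical and do not reflect PyBIDS internals.

```python
def unique_ids(records, target, **filters):
    """Collect the sorted unique values of `target` among matching records."""
    values = set()
    for rec in records:
        passes = all(rec.get(key) == value for key, value in filters.items())
        if passes and target in rec:
            values.add(rec[target])
    return sorted(values)


# Hypothetical entity records; subject 03 has no T1w file.
records = [
    {"subject": "01", "session": "1", "suffix": "T1w"},
    {"subject": "01", "session": "2", "suffix": "T1w"},
    {"subject": "02", "session": "1", "suffix": "T1w"},
    {"subject": "03", "session": "1", "suffix": "bold"},
]

subjects_with_t1w = unique_ids(records, target="subject", suffix="T1w")
print(subjects_with_t1w)  # ['01', '02']
```

Subject `'01'` appears only once in the result even though it has two matching files, which is the point of asking for ids rather than files.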

If our **`target`** is a BIDS entity that corresponds to a particular directory in the BIDS spec (e.g., **`subject`** or **`session`**) we can also use **`return_type='dir'`** to get all matching subdirectories:

In [None]:
# get the directories of subjects
layout.get(return_type='dir', target='subject')

['/content/bids-test-dataset-7t_trt/sub-01',
 '/content/bids-test-dataset-7t_trt/sub-02',
 '/content/bids-test-dataset-7t_trt/sub-03',
 '/content/bids-test-dataset-7t_trt/sub-04',
 '/content/bids-test-dataset-7t_trt/sub-05',
 '/content/bids-test-dataset-7t_trt/sub-06',
 '/content/bids-test-dataset-7t_trt/sub-07',
 '/content/bids-test-dataset-7t_trt/sub-08',
 '/content/bids-test-dataset-7t_trt/sub-09',
 '/content/bids-test-dataset-7t_trt/sub-10']

## The **`BIDSFile`**

By default, the **`get()`** method of a **`BIDSLayout`** object returns a list of **`BIDSFile`** objects.

A **`BIDSFile`** object is a lightweight container for individual files in a BIDS dataset.

It provides easy access to a variety of useful attributes and methods.

In the following example, we pick a file from our existing layout:

In [None]:
# get list of selected files in the dataset
bf_list = layout.get(subject='01', acquisition='fullbrain', suffix='physio')

bf_list

[<BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_physio.tsv.gz'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-2_physio.tsv.gz'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-fullbrain_run-1_physio.tsv.gz'>,
 <BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-fullbrain_run-2_physio.tsv.gz'>]

In [None]:
# select a file in the list
bf = bf_list[0]

bf

<BIDSDataFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_physio.tsv.gz'>

Here are some of the attributes and methods available in a **`BIDSFile`**:

* **`.path`**: The full path of the associated file.
* **`.filename`**: The associated file’s filename (without directory).
* **`.dirname`**: The directory containing the file.
* **`.get_entities()`**: Returns information about entities associated with this **`BIDSFile`** (optionally including metadata).
* **`.get_image()`**: Returns the file contents as a NiBabel image (only works for image files).
* **`.get_df()`**: Returns the file contents as a pandas DataFrame (only works for TSV files).
* **`.get_metadata()`**: Returns a dictionary of all metadata found in associated JSON files.
* **`.get_associations()`**: Returns a list of all files associated with this one in some way.

Get all the entities associated with the file, and their values:

In [None]:
# Get all the entities associated with the file, and their values
bf.get_entities()

{'acquisition': 'fullbrain',
 'datatype': 'func',
 'extension': '.tsv.gz',
 'run': 1,
 'session': '1',
 'subject': '01',
 'suffix': 'physio',
 'task': 'rest'}

Get all the metadata associated with the file:

In [None]:
# Get all the metadata associated with the file
bf.get_metadata()

{'Columns': ['cardiac', 'respiratory', 'trigger', 'oxygen saturation'],
 'SamplingFrequency': 100,
 'StartTime': 0}

Get all the entities and all the metadata associated with the file in one shot:

In [None]:
# Get all the entities and all the metadata associated with the file
bf.get_entities(metadata='all')

{'Columns': ['cardiac', 'respiratory', 'trigger', 'oxygen saturation'],
 'SamplingFrequency': 100,
 'StartTime': 0,
 'acquisition': 'fullbrain',
 'datatype': 'func',
 'extension': '.tsv.gz',
 'run': 1,
 'session': '1',
 'subject': '01',
 'suffix': 'physio',
 'task': 'rest'}

Get all the files associated with our target file.<br>
Notice how we get back both the JSON sidecar for our target file, and the BOLD run that our target file contains physiological recordings for.

In [None]:
# Get all files associated with the target file
bf.get_associations()

[<BIDSJSONFile filename='/content/bids-test-dataset-7t_trt/task-rest_acq-fullbrain_run-1_physio.json'>,
 <BIDSImageFile filename='/content/bids-test-dataset-7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz'>]
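
The JSON sidecar returned above sits at the dataset root and has no `sub-` or `ses-` part in its name. Under the BIDS inheritance principle, a sidecar placed higher in the tree applies to the files below it that share its suffix and carry all of the entities in the sidecar's name. The sketch below checks only that name-matching rule in plain Python. It ignores the directory check and the precedence rules that apply when several sidecars match, and the `name_parts` and `sidecar_applies` helpers are hypothetical, not PyBIDS functions.

```python
def name_parts(filename):
    """Split a BIDS filename into (set of key-value chunks, suffix)."""
    stem = filename.split(".", 1)[0]   # drop the extension(s)
    *pairs, suffix = stem.split("_")   # the last chunk is the suffix
    return set(pairs), suffix


def sidecar_applies(sidecar, data_file):
    """True if the sidecar's suffix matches and its entities are a subset."""
    side_entities, side_suffix = name_parts(sidecar)
    data_entities, data_suffix = name_parts(data_file)
    return side_suffix == data_suffix and side_entities <= data_entities


sidecar = "task-rest_acq-fullbrain_run-1_physio.json"
run1 = "sub-01_ses-1_task-rest_acq-fullbrain_run-1_physio.tsv.gz"
run2 = "sub-01_ses-1_task-rest_acq-fullbrain_run-2_physio.tsv.gz"

print(sidecar_applies(sidecar, run1))  # True
print(sidecar_applies(sidecar, run2))  # False
```

Because the sidecar names no subject or session, the same file can describe the matching recording of every subject and session.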

### Filename Parsing

The **`parse_file_entities()`** function lets us extract the BIDS entities directly from a filename:

In [None]:
from bids.layout import parse_file_entities

parse_file_entities(bf.path)

{'subject': '01',
 'session': '1',
 'task': 'rest',
 'acquisition': 'fullbrain',
 'run': 1,
 'suffix': 'physio',
 'datatype': 'func',
 'extension': '.tsv.gz'}
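
To see roughly what this parsing involves, here is a simplified plain-Python sketch that splits a BIDS filename into key-value pairs, a suffix, and an extension. It is not the PyBIDS implementation: `parse_name` and the `KEY_NAMES` mapping are hypothetical and cover only the entities used in this example. Since it works on the bare filename, it cannot recover `datatype`, which comes from the `func/` directory in the full path.

```python
# Hypothetical subset of the short-key to entity-name mapping.
KEY_NAMES = {
    "sub": "subject",
    "ses": "session",
    "task": "task",
    "acq": "acquisition",
    "run": "run",
}


def parse_name(filename):
    """Split a bare BIDS filename into a dictionary of entities."""
    stem, dot, ext = filename.partition(".")   # "name", ".", "tsv.gz"
    *pairs, suffix = stem.split("_")           # the last chunk is the suffix
    entities = {}
    for pair in pairs:
        key, _, value = pair.partition("-")
        name = KEY_NAMES.get(key, key)
        # The run index is cast to an integer in this sketch.
        entities[name] = int(value) if name == "run" else value
    entities["suffix"] = suffix
    entities["extension"] = dot + ext
    return entities


parsed = parse_name("sub-01_ses-1_task-rest_acq-fullbrain_run-1_physio.tsv.gz")
print(parsed)
```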