In [None]:
# autoreload modules
%load_ext autoreload
%autoreload 2
%matplotlib ipympl

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from b2aiprep.dataset import VBAIDataset

We have reformatted the data into a [BIDS](https://bids-standard.github.io/bids-starter-kit/folders_and_files/folders.html)-like format. The data is stored in a directory with the following structure:

```
data/
    sub-01/
        ses-01/
            beh/
                sub-01_ses-01_questionnaire.json
    sub-02/
        ses-01/
            beh/
                sub-02_ses-01_questionnaire.json
    ...
```

That is, the data follows a subject-session-datatype hierarchy. Questionnaire data is stored in the `beh` (behavioural) subfolder because it is the closest approximation to the data collected.
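
The hierarchy can be traversed with a single glob pattern. The sketch below is independent of `b2aiprep`: it builds a small throwaway tree in a temporary directory and lists the questionnaire files it finds. The helper name `find_questionnaire_files` is hypothetical and is not part of the library.

```python
import tempfile
from pathlib import Path


def find_questionnaire_files(root, suffix="questionnaire"):
    """Return sorted JSON files matching sub-*/ses-*/beh/*_<suffix>.json.

    Hypothetical helper for illustration; not part of b2aiprep.
    """
    pattern = f"sub-*/ses-*/beh/*_{suffix}.json"
    return sorted(Path(root).glob(pattern))


# Build a tiny example tree mirroring the subject-session-datatype layout.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("sub-01", "sub-02"):
        beh = Path(tmp) / sub / "ses-01" / "beh"
        beh.mkdir(parents=True)
        (beh / f"{sub}_ses-01_questionnaire.json").write_text("{}")

    # Collect the file names while the temporary directory still exists.
    found = [p.name for p in find_questionnaire_files(tmp)]

print(found)
```

Because every level of the hierarchy uses a fixed prefix (`sub-`, `ses-`), the same pattern works regardless of how many subjects or sessions are present.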

We provide utilities that load data from this BIDS-like dataset structure. The only input needed is the path to the folder that stores the data.

In [None]:
# TODO: allow the user to specify the input folder
dataset = VBAIDataset('../output')

In [None]:
# every user has a sessionschema, from which we can get information about that user
qs = dataset.load_questionnaires('sessionschema')
q_dfs = []
for subject_id, questionnaire in qs.items():
    # get the dataframe for this questionnaire
    df = dataset.questionnaire_to_dataframe(questionnaire)
    q_dfs.append(df)

# concatenate all the dataframes
sessionschema_df = pd.concat(q_dfs)
sessionschema_df = pd.pivot(sessionschema_df, index='record_id', columns='linkId', values='valueString')
sessionschema_df

The above process involves: (1) finding all the questionnaire files for a specific named questionnaire, (2) loading in the data from the JSON files, and (3) concatenating the data into a single dataframe. For convenience, the `load_and_pivot_questionnaire` helper function automatically performs these tasks for a given questionnaire. Let's try it with the demographics questionnaire.

In [None]:
demographics_df = dataset.load_and_pivot_questionnaire('demographics')
demographics_df.head()
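
To make the pivot step concrete, here is a minimal sketch on toy data. It assumes the same three column names used earlier (`record_id`, `linkId`, `valueString`); the values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy long-format data: one row per (participant, question) pair.
long_df = pd.DataFrame(
    {
        "record_id": ["p1", "p1", "p2", "p2"],
        "linkId": ["country", "gender_identity", "country", "gender_identity"],
        "valueString": ["USA", "Female", "Canada", "Male"],
    }
)

# Wide format: one row per participant, one column per question.
wide_df = long_df.pivot(index="record_id", columns="linkId", values="valueString")
print(wide_df)
```

Note that `pivot` raises a `ValueError` if the same `record_id` and `linkId` pair appears more than once, so it only works when each participant answers each question at most once.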

We can iterate through a few of the columns and summarize the data.

In [None]:
for column in ['children', 'country', 'ethnicity', 'gender_identity', 'grandparent', 'housing_status']:
    print(demographics_df[column].value_counts(), end='\n\n')

## Participants dataframe

We can get a dataframe summarizing the participants in the dataset.

In [None]:
participant_df = dataset.load_and_pivot_questionnaire('participant')
participant_df.head()

In [None]:
# bar chart of participants by enrollment institution
plt.figure(figsize=(10, 5))
participant_df['enrollment_institution'].value_counts().plot(kind='bar')
plt.ylabel('Number of participants')
plt.tight_layout()
plt.show()

## Session data

Load in the `QuestionnaireResponse` objects for the session schema.

In [None]:
session_schema = dataset.load_questionnaires('sessionschema')
# show the first item
record_id = list(session_schema.keys())[0]

# Each element is a QuestionnaireResponse, a pydantic object.
# You can serialize it to a Python dictionary with .dict()
# and to a JSON string with .json().
# Otherwise, attributes are accessible like on any other Python object.
print(session_schema[record_id].json(indent=2))

In [None]:
# helper function which loads in the above as a dataframe
session_df = dataset.load_and_pivot_questionnaire('sessionschema')
session_df.head()

In a similar way, we can look at a specific questionnaire that is collected for each session.

In [None]:
session_confounders = dataset.load_questionnaires('confounders')
# show the first item
record_id = list(session_confounders.keys())[0]

# Each element is a QuestionnaireResponse, a pydantic object.
# You can serialize it to a Python dictionary with .dict()
# and to a JSON string with .json().
# Otherwise, attributes are accessible like on any other Python object.
print(session_confounders[record_id].json(indent=2))
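
For readers unfamiliar with the FHIR layout, the sketch below flattens a hand-written `QuestionnaireResponse`-shaped dictionary into one row per item. The example payload, its `linkId` values, and the `flatten_items` function are illustrative assumptions; `questionnaire_to_dataframe` may work differently internally.

```python
import json

# Hand-written example shaped like a FHIR QuestionnaireResponse (illustrative only).
raw = """
{
  "resourceType": "QuestionnaireResponse",
  "status": "completed",
  "item": [
    {"linkId": "record_id", "answer": [{"valueString": "abc123"}]},
    {"linkId": "session_status", "answer": [{"valueString": "Completed"}]},
    {"linkId": "session_duration", "answer": []}
  ]
}
"""


def flatten_items(response):
    """Yield one dict per answer; unanswered items get valueString=None."""
    for item in response.get("item", []):
        # Treat a missing or empty answer list as a single empty answer.
        answers = item.get("answer") or [{}]
        for answer in answers:
            yield {"linkId": item["linkId"], "valueString": answer.get("valueString")}


rows = list(flatten_items(json.loads(raw)))
for row in rows:
    print(row)
```

Rows in this long format are what a later pivot on `linkId` turns into one column per question.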

## Acoustic tasks

Let's look at the acoustic tasks now. Acoustic task files are organized in the following way:

```
data/
    sub-01/
        ses-01/
            beh/
                sub-01_ses-01_task-<TaskName>_acoustictaskschema.json
                sub-01_ses-01_task-<TaskName>_rec-<TaskName>-1_recordingschema.json
                ...
```

where `TaskName` is the name of the acoustic task, including:

* `Audio-Check`
* `Cinderella-Story`
* `Rainbow-Passage`

etc. The audio tasks are currently listed in `_AUDIO_TASKS` in `b2aiprep/prepare.py`.
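
Since task names themselves contain hyphens, splitting a filename into its parts takes a little care. The sketch below assumes that entities are separated by underscores, that each key is separated from its value by the first hyphen, and that values never contain underscores. The `parse_entities` helper is hypothetical and is not part of `b2aiprep`.

```python
def parse_entities(filename):
    """Split a BIDS-like filename into its key-value entities and suffix.

    Hypothetical helper for illustration; not part of b2aiprep.
    """
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    *pairs, suffix = stem.split("_")   # the last chunk is the suffix
    entities = {}
    for pair in pairs:
        # Split on the first hyphen only, so "task-Audio-Check" keeps its value whole.
        key, _, value = pair.partition("-")
        entities[key] = value
    return entities, suffix


entities, suffix = parse_entities(
    "sub-01_ses-01_task-Audio-Check_rec-Audio-Check-1_recordingschema.json"
)
print(entities)
print(suffix)
```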

In [None]:
acoustic_tasks = dataset.load_questionnaires('acoustictaskschema')
# show the first item
record_id = list(acoustic_tasks.keys())[0]
print(acoustic_tasks[record_id].json(indent=2))

Each response above corresponds to a single acoustic task: an audio check, prolonged vowels, etc. Loading the responses into a dataframe gives one row per task, and the pandas `value_counts()` method then lets us count how often each unique value occurs in a column.

In [None]:
acoustic_tasks_df = dataset.load_and_pivot_questionnaire('acoustictaskschema')
acoustic_tasks_df.head()

The dataframe above lists all of the acoustic tasks, as every acoustic task is associated with a single `acoustictaskschema` `QuestionnaireResponse` object.

In [None]:
acoustic_tasks_df['acoustic_task_name'].value_counts()
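
As a brief aside on `value_counts()`, the sketch below shows two of its options on a toy column. The task names are invented stand-ins, not values read from the dataset.

```python
import pandas as pd

# Toy column of task names, including one missing value.
tasks = pd.Series(["Audio-Check", "Rainbow-Passage", "Audio-Check", None])

counts = tasks.value_counts()                    # missing values are excluded
shares = tasks.value_counts(normalize=True)      # proportions instead of counts
with_missing = tasks.value_counts(dropna=False)  # count missing values as well

print(counts)
print(shares)
print(with_missing)
```

Passing `dropna=False` is a quick way to check whether any task rows lack a name.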