# Audio tutorial

This tutorial walks through loading individual audio recordings as well as the entire dataset.

In [None]:
import matplotlib.pyplot as plt
import IPython.display as Ipd
import pandas as pd
import torch
from pathlib import Path
import numpy as np

from b2aiprep.process import SpeechToText
from b2aiprep.process import Audio, specgram

from b2aiprep.dataset import VBAIDataset

We have reformatted the data into a [BIDS](https://bids-standard.github.io/bids-starter-kit/folders_and_files/folders.html)-like format. The data is stored in a directory with the following structure:

```
data/
    sub-01/
        ses-01/
            beh/
                sub-01_ses-01_questionnaire.json
            audio/
                sub-01_ses-01_recording.wav
    sub-02/
        ses-01/
            beh/
                sub-02_ses-01_questionnaire.json
            audio/
                sub-02_ses-01_recording.wav
    ...
```

That is, the data follows a subject-session-datatype hierarchy. The `beh` subfolder (for behavioural data) stores the questionnaire data, as it is the closest approximation to the data collected. The `audio` subfolder contains the audio data.
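To make the naming convention concrete, here is a minimal sketch that parses the subject, session, and datatype out of each file path. It uses only the standard library and a throwaway directory that mirrors the tree above; `parse_entities` is a hypothetical helper written for illustration, not part of `b2aiprep`.

```python
import tempfile
from pathlib import Path


def parse_entities(path: Path) -> dict:
    """Split a BIDS-like filename into its key-value entities and suffix."""
    parts = path.stem.split("_")
    # every part except the last is a "key-value" entity, e.g. "sub-01"
    entities = dict(part.split("-", 1) for part in parts[:-1])
    # the last part names the content, e.g. "questionnaire" or "recording"
    entities["suffix"] = parts[-1]
    # the parent folder is the datatype, e.g. "beh" or "audio"
    entities["datatype"] = path.parent.name
    return entities


# build a tiny throwaway dataset that mirrors the tree above
root = Path(tempfile.mkdtemp()) / "data"
for sub in ("sub-01", "sub-02"):
    for datatype, name in (("beh", "questionnaire.json"), ("audio", "recording.wav")):
        folder = root / sub / "ses-01" / datatype
        folder.mkdir(parents=True)
        (folder / f"{sub}_ses-01_{name}").touch()

records = [parse_entities(p) for p in sorted(root.rglob("*")) if p.is_file()]
for record in records:
    print(record)
```

Splitting on the first hyphen only (`split("-", 1)`) keeps values that themselves contain hyphens intact.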

We provide utilities that load data from the BIDS-like dataset structure. The only input needed is the path to the folder that stores the data.

In [None]:
dataset = VBAIDataset('../output')

## Finding data

The dataset functions are convenient, but they require some prior knowledge: we need to know that `sessionschema` is the name of the questionnaire with session information, and that `demographics` is the name of the questionnaire where general demographics were collected. For convenience, the dataset object has another method that lists all of the available questionnaire names. It does this by iterating through every file in the BIDS dataset, which can be an expensive operation. Luckily, with fewer than 10,000 files it runs quickly. Let's try it out.

In [None]:
participant_df = dataset.load_and_pivot_questionnaire('participant')
participant_df.head()
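The file scan described above can be sketched with the standard library alone. This is an illustration of the idea, not the `b2aiprep` implementation: `list_questionnaire_names` is a hypothetical helper, and the filenames for `sessionschema` and `demographics` are assumed to follow the same pattern as the other files in this tutorial.

```python
import tempfile
from collections import Counter
from pathlib import Path


def list_questionnaire_names(root: Path) -> Counter:
    """Count JSON files per questionnaire name (last "_"-separated token of the stem)."""
    names = Counter()
    # rglob visits every file under root, so the cost grows with dataset size
    for path in root.rglob("*.json"):
        names[path.stem.rsplit("_", 1)[-1]] += 1
    return names


# throwaway directory with a handful of empty, hypothetical files
root = Path(tempfile.mkdtemp())
beh = root / "sub-01" / "ses-01" / "beh"
beh.mkdir(parents=True)
for filename in (
    "sub-01_ses-01_sessionschema.json",
    "sub-01_ses-01_demographics.json",
    "sub-01_ses-01_task-Audio-Check_acoustictaskschema.json",
    "sub-01_ses-01_task-Rainbow-Passage_acoustictaskschema.json",
):
    (beh / filename).touch()

names = list_questionnaire_names(root)
print(sorted(names))
```

Because every file is visited once, the runtime scales linearly with the number of files, which is why the scan stays fast for small datasets.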

## Acoustic tasks

Let's look at the acoustic tasks now. Acoustic task files are organized in the following way:

```
data/
    sub-01/
        ses-01/
            beh/
                sub-01_ses-01_task-<TaskName>_acoustictaskschema.json
                sub-01_ses-01_task-<TaskName>_rec-<TaskName>-1_recordingschema.json
                ...
```

where `TaskName` is the name of the acoustic task, including:

* `Audio-Check`
* `Cinderella-Story`
* `Rainbow-Passage`

and so on. The audio tasks are currently listed in `_AUDIO_TASKS` in `b2aiprep/prepare.py`.

In [None]:
acoustic_task_df = dataset.load_and_pivot_questionnaire('acoustictaskschema')
acoustic_task_df.head()

Each row in the table above corresponds to a different acoustic task: an audio check, prolonged vowels, etc. The `value_counts()` method of a pandas Series lets us count how often each unique value occurs in a column.

In [None]:
acoustic_task_df['acoustic_task_name'].value_counts()

The hierarchy for a single session is the following:

```
subject
└── session
    └── acoustic_task
        └── recording
```

That is, a subject has multiple sessions, each session has multiple acoustic tasks, and each acoustic task has multiple recordings. Very often these relationships are 1:1, and so for a given acoustic task there is only one recording, but this is not always the case.
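As a rough illustration of this hierarchy, the sketch below groups recordings under their (subject, session, task) parent and flags the tasks that have more than one recording. The rows are made-up examples that follow the naming pattern shown earlier, not records from the dataset.

```python
from collections import defaultdict

# hypothetical (subject, session, acoustic task, recording) rows
rows = [
    ("sub-01", "ses-01", "Audio-Check", "rec-Audio-Check-1"),
    ("sub-01", "ses-01", "Rainbow-Passage", "rec-Rainbow-Passage-1"),
    ("sub-01", "ses-01", "Rainbow-Passage", "rec-Rainbow-Passage-2"),
    ("sub-02", "ses-01", "Audio-Check", "rec-Audio-Check-1"),
]

# group recordings under their (subject, session, task) parent
recordings_per_task = defaultdict(list)
for subject, session, task, recording in rows:
    recordings_per_task[(subject, session, task)].append(recording)

# tasks with more than one recording are the cases where the 1:1 assumption fails
repeated = {key: recs for key, recs in recordings_per_task.items() if len(recs) > 1}
print(repeated)
```

A check like this is worth running before joining task-level and recording-level tables, since a 1:many relationship duplicates task rows in the join.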

We can load a dataframe for the recordings using the same `load_and_pivot_questionnaire` method.

In [None]:
recording_df = dataset.load_and_pivot_questionnaire('recordingschema')
print(f"{recording_df['recording_id'].nunique()} recordings.")
recording_df.head()

## Audio

We have provided three utilities to support loading in audio and related data:

* `load_recording(recording_id)` - loads in audio data given a `recording_id`
* `load_recordings()` - loads in *all* of the audio data (takes a while!). The list returned will be in the same order as `dataset.recordings_df`
* `load_spectrograms()` - similar to the above, but loads in precomputed spectrograms.

We can first take a look at the `load_recording` function with the first `recording_id` from the dataframe. This should be an audio check.

In [None]:
recording_id = recording_df.loc[0, 'recording_id']
audio = dataset.load_recording(recording_id)

# rescale to the uint32 value range (the array itself stays floating point);
# ideally we would use bits_per_sample from the original metadata instead
signal = audio.signal.squeeze()
signal = (np.iinfo(np.uint32).max * (signal - signal.min())) / (signal.max() - signal.min())

# display a widget to play the audio file
Ipd.display(Ipd.Audio(data=signal, rate=audio.sample_rate))
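The cell above min-max rescales the waveform before playback. To our knowledge, `IPython.display.Audio` also normalises array input by default, so the exact scale matters little for listening. If a fixed-bit-depth signal is needed (for example, to write a 16-bit WAV file), one common approach is peak normalisation followed by quantisation. The sketch below assumes a floating-point mono waveform; `to_int16_pcm` is a hypothetical helper, and the 16 kHz test tone is synthetic.

```python
import numpy as np


def to_int16_pcm(signal: np.ndarray) -> np.ndarray:
    """Peak-normalise a float waveform and quantise it to signed 16-bit PCM."""
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    if peak == 0:  # silent input: avoid dividing by zero
        return np.zeros(signal.shape, dtype=np.int16)
    # scale so the loudest sample maps to the largest int16 value
    scaled = signal / peak * np.iinfo(np.int16).max
    return np.round(scaled).astype(np.int16)


# one second of a quiet 440 Hz tone at a hypothetical 16 kHz sample rate
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = 0.1 * np.sin(2 * np.pi * 440 * t)

pcm = to_int16_pcm(tone)
print(pcm.dtype, pcm.min(), pcm.max())
```

Unlike min-max rescaling, peak normalisation keeps silence at zero, so it does not introduce a DC offset.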

Spectrograms are a useful transformation of the audio data. We can load the precomputed spectrograms for all recordings using the `load_spectrograms` function, and then look at a single one.

In [None]:
spectrograms = dataset.load_spectrograms()

In [None]:
# display a single spectrogram
i = 5

f = plt.figure(figsize=(10, 5))
# convert spectrogram to decibel
log_spec = 10.0 * torch.log10(torch.maximum(spectrograms[i], torch.tensor(1e-10)))
log_spec = torch.maximum(log_spec, log_spec.max() - 80)

ax = f.gca()
ax.matshow(log_spec.T, origin="lower", aspect="auto")

# the time axis depends on the hop between successive windows
# (speech pipelines commonly use a 25 ms window with a 10 ms hop)
plt.xlabel('Time (window index)')
# on this axis, bin k corresponds to k * fs / NFFT Hz, spanning 0 to fs / 2
plt.ylabel('Frequency (NFFT index)')
plt.show()
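The plot above labels both axes in index units. Converting them to physical units only needs the sample rate, the FFT size, and the hop length. The values used below (16 kHz, 512-point FFT, 160-sample hop) are assumptions chosen for illustration; check the settings that were actually used to compute the spectrograms.

```python
def frame_to_seconds(frame_index: int, hop_length: int, sample_rate: int) -> float:
    """Time offset (s) of a spectrogram frame, given the hop in samples."""
    return frame_index * hop_length / sample_rate


def bin_to_hertz(bin_index: int, n_fft: int, sample_rate: int) -> float:
    """Centre frequency (Hz) of an FFT bin in a one-sided spectrum."""
    return bin_index * sample_rate / n_fft


# hypothetical settings: 16 kHz audio, 512-point FFT, 10 ms hop (160 samples)
sample_rate, n_fft, hop_length = 16_000, 512, 160

print(frame_to_seconds(100, hop_length, sample_rate))  # 1.0
print(bin_to_hertz(n_fft // 2, n_fft, sample_rate))    # 8000.0 (the Nyquist frequency)
```

Whether a frame's time refers to its start or its centre depends on how the spectrogram routine pads the signal, so treat the time values as offsets up to that convention.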

Alternatively, we can load all of the recordings at once.

**WARNING**: This takes a little under 15GB of memory!

In [None]:
audio_data = dataset.load_recordings()
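To judge in advance whether loading everything will fit in memory, a back-of-the-envelope estimate is the number of samples times the bytes per sample. The sketch below assumes uncompressed 32-bit float waveforms; the recording count, duration, and sample rate are placeholder values, not measurements from this dataset.

```python
def audio_memory_gb(n_recordings: int, mean_seconds: float, sample_rate: int,
                    bytes_per_sample: int = 4, channels: int = 1) -> float:
    """Approximate RAM (in GB, i.e. 1e9 bytes) for uncompressed waveforms."""
    n_samples = n_recordings * mean_seconds * sample_rate * channels
    return n_samples * bytes_per_sample / 1e9


# hypothetical numbers: 1,000 mono recordings of 30 s each at 44.1 kHz
estimate = audio_memory_gb(n_recordings=1_000, mean_seconds=30, sample_rate=44_100)
print(f"{estimate:.3f} GB")
```

The estimate ignores per-object overhead and any copies made during loading, so actual usage will be somewhat higher.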

## Transcribe a single audio file

Now that we have loaded in the audio recordings, we can try to apply ML models. For example, we can transcribe the audio.

In [None]:
audio = audio_data[5].to_16khz()

# display a widget to play the audio file
Ipd.display(Ipd.Audio(data=audio.signal.squeeze(), rate=audio.sample_rate))

# Select the best device available
device = "cuda" if torch.cuda.is_available() else "cpu"
device = "mps" if torch.backends.mps.is_available() else device

# default arguments for the transcription model
stt_kwargs = {
    "model_id": "openai/whisper-base",
    "max_new_tokens": 128,
    "chunk_length_s": 30,
    "batch_size": 16,
    "device": device,
}
stt = SpeechToText(**stt_kwargs)
transcription = stt.transcribe(audio, language="en")
print(transcription)
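The device selection in the cell above assumes that `torch.backends.mps` exists, which may not hold on older PyTorch releases. A slightly more defensive variant is sketched below. `pick_device` is a hypothetical helper, and unlike the cell above it checks CUDA first; in practice the two backends are not available on the same machine, so the order rarely matters.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so fall back to the CPU
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps may be missing on older PyTorch releases
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed as the `"device"` entry of `stt_kwargs` in place of the inline logic.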