## 🚀 **Building a Custom DataLoader in PyTorch for the ICBHI Respiratory Sound Dataset**

### **Overview**
This notebook guides you through creating a custom PyTorch `Dataset`, ready to be wrapped in a `DataLoader`, for the ICBHI (International Conference on Biomedical and Health Informatics) respiratory sound dataset. The goal is to efficiently load, process, and prepare this dataset for machine learning tasks, particularly for training and evaluating models that classify respiratory conditions from audio data.

### **Objectives**
1. **Data Collection:**
   - Retrieve and organize respiratory sound audio files.
   - Extract patient identifiers from the file names to match them with other relevant data.

2. **Data Integration:**
   - Load patient demographic data and diagnosis information.
   - Merge these data sources with the audio files to create a comprehensive dataset.

3. **Custom DataLoader Creation:**
   - Implement a custom PyTorch `Dataset` class that handles the loading and preprocessing of the dataset.
   - Prepare the data for feeding into a deep learning model, ensuring that all necessary transformations are applied.

### **Structure**
The notebook is structured as follows:
1. **Data Preparation:** Gathering and merging all relevant data sources.
2. **DataLoader Implementation:** Writing the custom PyTorch `Dataset` class.
3. **Dataset Usage:** Instantiating the custom dataset and inspecting a sample, as the starting point for use with a PyTorch `DataLoader` in a training loop.


In [None]:
import pandas as pd
import os
import glob
import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader

In [None]:
# Import the 'files' module from Google Colab to handle file uploads.
from google.colab import files
# Prompt for an upload of 'kaggle.json', which contains the Kaggle API
# credentials necessary for accessing Kaggle datasets.
files.upload()

In [None]:
# Create a directory for Kaggle API configuration files in the user's home directory.
!mkdir -p ~/.kaggle
# Move the uploaded 'kaggle.json' file to the newly created '.kaggle' directory.
!mv /content/kaggle.json ~/.kaggle/
# Restrict 'kaggle.json' to read/write for the owner only (600) so the credentials stay private.
!chmod 600 ~/.kaggle/kaggle.json
# List the '.kaggle' directory to confirm that 'kaggle.json' is in place with the proper permissions.
!ls -la ~/.kaggle/

total 16
drwxr-xr-x 2 root root 4096 Aug 29 11:01 .
drwx------ 1 root root 4096 Aug 29 11:01 ..
-rw------- 1 root root   69 Aug 29 11:01 kaggle.json


In [None]:
# Use the Kaggle API command to download the 'Respiratory Sound Database' dataset.
# The dataset will be downloaded as a zip file to the current working directory.
!kaggle datasets download -d vbookshelf/respiratory-sound-database

Dataset URL: https://www.kaggle.com/datasets/vbookshelf/respiratory-sound-database
License(s): unknown
Downloading respiratory-sound-database.zip to /content
 99% 3.67G/3.69G [00:31<00:00, 158MB/s]
100% 3.69G/3.69G [00:31<00:00, 125MB/s]


## 🗂️ **Extracting the Respiratory Sound Database**

Now that we've successfully downloaded the Respiratory Sound Database, the next step is to extract the contents of the zip file. 📦

The command below will:

- **Unzip** the downloaded file, making the dataset's contents accessible in your Colab environment. 🗃️
- **Extract** the files into the current working directory, so we can begin working with the data right away. 🛠️

In [None]:
# -o overwrites existing files without prompting, so the cell can be re-run safely;
# -q suppresses the per-file listing.
!unzip -o -q /content/respiratory-sound-database.zip
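
If the shell command stops at an overwrite prompt (for example when the cell is run a second time), the archive can also be extracted from Python. The following is a minimal sketch using the standard-library `zipfile` module, which overwrites existing files without asking. The helper name `extract_archive` and the throwaway demo archive are illustrative assumptions, not part of the original notebook.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall() overwrites existing files without prompting
        archive.extractall(dest)
        return archive.namelist()


# Demo on a tiny throwaway archive so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("audio_and_txt_files/101_1b1_Al_sc_Meditron.txt", "placeholder\n")
    first = extract_archive(demo_zip, Path(tmp) / "out")
    second = extract_archive(demo_zip, Path(tmp) / "out")  # re-running does not prompt
    extracted = sorted(p.name for p in (Path(tmp) / "out").rglob("*.txt"))

print(first, extracted)
```

In Colab, the equivalent call for this notebook would be `extract_archive('/content/respiratory-sound-database.zip', '/content')`.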

## 📊 **Data Loading and Preparation**

In this section, we load and prepare the dataset by retrieving audio files and loading patient demographic and diagnosis data. The objective is to set up a comprehensive dataset that combines audio files with relevant patient information for further analysis or machine learning.

1. **Headers Definition:**
   - First, we define a list of headers for the patient demographic data. This ensures that each column in the resulting DataFrame has a clear and meaningful name, which is essential for understanding and analyzing the data.

2. **Retrieving Audio Files:**
   - We use the `glob` library to collect all `.wav` audio files from the specified directory. This method allows us to efficiently gather all audio data for the patients in one step. These audio files will later be linked to the patient data based on the patient number.

3. **Loading Patient Demographic Data:**
   - Next, we load the patient demographic data from a `.txt` file using the `pandas` library. This data includes critical patient information such as age, sex, BMI, and, for children, weight and height. Since the file doesn’t have headers, we manually add them to make the data easier to work with and more interpretable.

4. **Loading Diagnosis Data:**
   - Finally, we load the diagnosis data from a CSV file. This data contains the diagnosis for each patient, which is crucial for our analysis and machine learning tasks. Like the demographic data, this file also lacks headers, so we add them manually to ensure clarity.

Together, these steps prepare our dataset by combining the audio files with patient demographics and diagnosis information, creating a well-structured dataset ready for further processing or machine learning model training.
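
As a small illustration of the header step, the sketch below parses a few whitespace-delimited, headerless rows with `pandas.read_csv` and supplies the column names explicitly. The rows are invented for the demonstration (including the patient numbers) and are not taken from the real `demographic_info.txt`.

```python
import io

import pandas as pd

header = ["Patient number", "Age", "Sex", "Adult BMI (kg/m2)",
          "Child Weight (kg)", "Child Height (cm)"]

# Invented rows in a headerless, whitespace-delimited layout (NA marks a missing value)
sample_text = """\
901 3 F NA 19 99
902 75 F 33 NA NA
903 70 M 28.47 NA NA
"""

# A raw string keeps the regex separator '\s+' free of escape-sequence warnings
demo_df = pd.read_csv(io.StringIO(sample_text), sep=r"\s+", header=None, names=header)

print(demo_df.dtypes)
print(demo_df.isna().sum())  # missing values per column
```

Printing the missing-value counts right after loading is a cheap way to confirm that the child and adult columns are populated for the rows you expect.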


In [None]:
# Define headers for the patient demographic data
header = ["Patient number", "Age", "Sex", "Adult BMI (kg/m2)", "Child Weight (kg)", "Child Height (cm)"]

# Step 1: Retrieve all audio files from the specified directory using glob
# This step gathers all .wav audio files (since the directory contains .txt and .wav files), which will be used in conjunction with the patient data
audio_files = glob.glob('/content/respiratory_sound_database/Respiratory_Sound_Database/audio_and_txt_files/*.wav')

# Step 2: Load patient demographic data from a .txt file
# This data includes information such as age, sex, BMI, etc.
# We add column names using the predefined headers to make the data more interpretable
patient_data = pd.read_csv('/content/demographic_info.txt', sep=r'\s+', header=None, names=header)

# Step 3: Load diagnosis data from a CSV file
# The diagnosis data includes patient numbers and their corresponding diagnoses
# Adding headers ensures that each column is properly labeled for easier analysis
diagnosis_data = pd.read_csv('/content/respiratory_sound_database/Respiratory_Sound_Database/patient_diagnosis.csv', header=None, names=['Patient number', 'Diagnosis'])

## 🔄 **Merging Audio Files with Patient Data**

In this section, we combine the audio file data with patient demographic and diagnosis information to create a comprehensive dataset. This ensures that each audio file is correctly linked with the relevant patient information, which both analysis and model training depend on. The result is effectively an annotations file that links each audio file with the diagnosis for its patient number. This file is later passed as a parameter to our custom dataset object.

1. **Creating a DataFrame for Audio Files:**
   - We start by organizing the audio file data into a DataFrame, which will serve as the base for merging with patient demographic and diagnosis data.

2. **Extracting the Filename from the Full Path:**
   - Since the audio file paths include directory information, we extract only the filename, which contains the patient number. This step simplifies the process of linking the audio files with patient data.

3. **Extracting the Patient Number:**
   - The patient number, crucial for merging data, is embedded within the audio file names. We extract this number to use as a key for merging the audio data with the patient demographic and diagnosis information.

4. **Merging with Diagnosis Data:**
   - We merge the audio file DataFrame with the diagnosis data using the patient number as the key. This ensures that each audio file is linked with the correct diagnosis, providing a foundation for building diagnostic models.

5. **Optionally Merging with Patient Demographic Data:**
   - If additional patient details are needed, we also merge the audio file data with the demographic data. This enriches the dataset with more context, such as age, sex, and BMI, which can be valuable for further analysis.

6. **Final Dataset Preparation:**
   - After merging, we display the final DataFrame to ensure that the data is correctly combined. Finally, we save the merged dataset to a CSV file, which can be used for further processing or as input for machine learning models.
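
The extraction and merge steps above can be sketched on a toy example. The file names follow the naming pattern seen in the archive listing, but the patient numbers and diagnoses are invented for illustration. The `validate` and `indicator` arguments of `pandas.merge` are an optional addition in this sketch, used to catch duplicate patient rows and to flag audio files that have no diagnosis.

```python
import pandas as pd

# Toy inputs (invented values, for illustration only)
toy_audio = pd.DataFrame({"AudioFile": [
    "901_1b1_Al_sc_Meditron.wav",
    "901_2b1_Ar_sc_Meditron.wav",
    "902_1b1_Tc_sc_Meditron.wav",
    "903_1b1_Pl_sc_Meditron.wav",
]})
toy_diagnosis = pd.DataFrame({
    "Patient number": [901, 902],
    "Diagnosis": ["COPD", "Healthy"],
})

# The patient number is the run of digits at the start of the file name
toy_audio["Patient number"] = (
    toy_audio["AudioFile"].str.extract(r"^(\d+)_", expand=False).astype(int)
)

# A left merge keeps every audio file; many files may map to one patient
toy_merged = pd.merge(
    toy_audio, toy_diagnosis,
    on="Patient number", how="left",
    validate="many_to_one", indicator=True,
)

# Files whose patient has no diagnosis row are marked 'left_only'
unmatched = toy_merged.loc[toy_merged["_merge"] == "left_only", "AudioFile"].tolist()
print(toy_merged)
print("Files without a diagnosis:", unmatched)
```

With `how='left'`, an unmatched file stays in the result with a missing diagnosis rather than being dropped, so a check like this is worth running before the merged table is saved.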


In [None]:
# Step 1: Create a DataFrame containing the audio file names
audio_df = pd.DataFrame(audio_files, columns=['AudioFile'])

# Step 2: Extract the filename from the full path
# We use the basename function to isolate the filename, removing the directory path
audio_df['AudioFile'] = audio_df['AudioFile'].apply(lambda x: os.path.basename(x))

# Step 3: Extract the patient number from the audio file name
# The patient number is the three digits at the start of the filename; we extract them and convert to integers
# '^' anchors the match to the start of the name, and expand=False returns a Series rather than a DataFrame
audio_df['Patient number'] = audio_df['AudioFile'].str.extract(r'^(\d{3})', expand=False).astype(int)

# Step 4: Merge the audio DataFrame with the diagnosis data
# This step links each audio file with the corresponding patient diagnosis based on the patient number
merged_df = pd.merge(audio_df, diagnosis_data, on='Patient number', how='left')

# Step 5: Optionally merge the resulting DataFrame with the patient demographic data
# This provides additional context such as age, sex, and BMI for each patient
merged_df = pd.merge(merged_df, patient_data, on='Patient number', how='left')

# Step 6: Display the merged DataFrame to verify the result
print(merged_df)

# Step 7: Save the merged DataFrame to a CSV file for further use
merged_df.to_csv('merged_data.csv', index=False)


## 🛃 **CustomDataset Class**

The `CustomDataset` class is designed to facilitate working with audio data in PyTorch. It handles loading audio files and their corresponding labels, making it easier to integrate custom datasets into PyTorch's data loading pipeline.

- **Initialization (`__init__` method):**
  The class is initialized with:
  - The path to the directory containing audio files.
  - A CSV file containing labels for each audio file.
  - An optional transformation function to be applied to the audio data.

- **Dataset Length (`__len__` method):**
  This method returns the total number of samples in the dataset, which corresponds to the number of rows in the labels CSV file.

- **Fetching Items (`__getitem__` method):**
  Given an index, this method:
  - Retrieves the file path of the audio sample.
  - Loads the audio file using `torchaudio`.
  - Extracts the corresponding label from the labels DataFrame.
  - Returns a tuple containing the audio signal and its label.

- **Audio Path Retrieval (`get_audio_path` method):**
  Constructs and returns the full file path to an audio sample based on its index and the base directory where audio files are stored.

- **Label Retrieval (`get_audio_label` method):**
  Extracts and returns the label for a given audio sample based on its index in the labels DataFrame.

This class provides a structured way to manage and preprocess audio data for machine learning tasks using PyTorch.
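
As one possible `transform`, the sketch below crops or zero-pads a waveform to a fixed number of samples so that items can later be stacked into batches. The class name `FixedLength` and the target length are hypothetical choices for illustration and are not part of the original notebook. The sketch assumes signals shaped `(channels, samples)`, which is the layout `torchaudio.load` returns by default.

```python
import torch
import torch.nn.functional as F


class FixedLength:
    """Hypothetical transform: crop or zero-pad a (channels, samples) tensor."""

    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __call__(self, signal):
        length = signal.shape[-1]
        if length >= self.num_samples:
            # Too long: keep only the first num_samples samples
            return signal[..., :self.num_samples]
        # Too short: append zeros on the right of the last dimension
        return F.pad(signal, (0, self.num_samples - length))


to_fixed = FixedLength(num_samples=8)
short = torch.ones(1, 5)
long = torch.ones(1, 12)
print(to_fixed(short).shape, to_fixed(long).shape)
```

Such an object could be passed as the `transform` argument of `CustomDataset`, provided `__getitem__` applies `self.transform` to the loaded signal.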


In [None]:
class CustomDataset(Dataset):
    def __init__(self, data, labels, transform=None):
        """
        Initializes the CustomDataset.

        Parameters:
        - data (str): Directory path where audio files are stored.
        - labels (str): Path to the CSV file containing labels for the audio files.
        - transform (callable, optional): Optional transformation function to apply to the audio data.
        """
        self.data = data
        self.labels = pd.read_csv(labels)  # Load labels from the CSV file into a DataFrame
        self.transform = transform

    def __len__(self):
        """
        Returns the total number of samples in the dataset.

        Returns:
        - int: Number of samples, which is the length of the labels DataFrame.
        """
        return len(self.labels)

    def __getitem__(self, idx):
        """
        Retrieves an audio sample and its label based on the index.

        Parameters:
        - idx (int): Index of the sample to retrieve.

        Returns:
        - tuple: (signal, label) where `signal` is the loaded audio signal and `label` is the corresponding label.
        """
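        audio_path = self.get_audio_path(idx)  # Get the file path for the audio sample
        label = self.get_audio_label(idx)      # Get the label for the audio sample
        signal, sr = torchaudio.load(audio_path)  # Load the waveform and its sample rate
        if self.transform is not None:
            signal = self.transform(signal)  # Apply the optional transform to the waveform
        return signal, label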

    def get_audio_path(self, idx):
        """
        Constructs the file path for an audio sample based on its index.

        Parameters:
        - idx (int): Index of the sample.

        Returns:
        - str: Full path to the audio file.
        """
        # Column 0 of the labels CSV ('AudioFile' in merged_data.csv) holds the filename
        path = os.path.join(self.data, self.labels.iloc[idx, 0])
        return path

    def get_audio_label(self, idx):
        """
        Retrieves the label for a given audio sample based on its index.

        Parameters:
        - idx (int): Index of the sample.

        Returns:
        - str: Label for the audio sample.
        """
        # Column 2 of the labels CSV ('Diagnosis' in merged_data.csv) holds the label
        label = self.labels.iloc[idx, 2]
        return label


## 🧰 **Dataset usage**

In this section, we perform the following steps:

1. **Define Paths:**
   - Specify the directory where the audio files are stored.
   - Provide the path to the CSV file containing the annotations (labels) for these audio files.

2. **Initialize the Dataset:**
   - Create an instance of the `CustomDataset` class using the defined paths for audio data and annotations.

3. **Print Dataset Size:**
   - Output the total number of samples in the dataset. This helps verify that the dataset is loaded correctly and provides insight into its size.

This process ensures that the dataset is properly set up and gives a clear understanding of the amount of data available for analysis or model training.


In [None]:
# Define the path to the directory containing audio files
audio_data_path = '/content/respiratory_sound_database/Respiratory_Sound_Database/audio_and_txt_files'

# Define the path to the CSV file containing annotations (labels) for the audio files
annotations_path = '/content/merged_data.csv'

# Create an instance of the CustomDataset class with the specified paths
icbhidataset = CustomDataset(audio_data_path, annotations_path)

# Print the total number of samples in the dataset
print(f"There are {len(icbhidataset)} samples in the dataset")

There are 920 samples in the dataset
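
To move from a `Dataset` to batches, the dataset is usually wrapped in `torch.utils.data.DataLoader`. The sketch below uses a small in-memory stand-in instead of the real audio: `ToyAudioDataset` is a hypothetical class that returns random signals of equal length with invented labels. It also maps the string diagnoses to integer class indices, since loss functions such as cross-entropy expect numeric targets.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyAudioDataset(Dataset):
    """Hypothetical stand-in for CustomDataset: random signals, string labels."""

    def __init__(self, labels, num_samples=16):
        self.labels = labels
        self.num_samples = num_samples
        # Map each distinct diagnosis string to an integer class index
        self.class_to_idx = {name: i for i, name in enumerate(sorted(set(labels)))}

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        signal = torch.randn(1, self.num_samples)  # (channels, samples)
        target = self.class_to_idx[self.labels[idx]]
        return signal, target


toy_dataset = ToyAudioDataset(["COPD", "Healthy", "COPD", "URTI", "Healthy"])
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)

batch_shapes = []
for signals, targets in toy_loader:
    # The default collate function stacks signals to (batch, channels, samples)
    batch_shapes.append((tuple(signals.shape), targets.tolist()))

print(toy_dataset.class_to_idx)
print(batch_shapes)
```

The default collate function can only stack tensors of equal shape. If the real recordings differ in length, a fixed-length transform or a custom `collate_fn` would be needed before batching works.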


## ⌨ **Accessing and Inspecting a Dataset Sample**

In this section, we perform the following steps to inspect a sample from the dataset:

1. **Retrieve a Sample:**
   - We access one sample from the dataset by index, here `icbhidataset[56]`. This returns a tuple consisting of the audio signal and its corresponding label.

2. **Inspect Signal and Label:**
   - We then print the shape of the audio signal and the label associated with this sample. This helps us understand the structure of the audio data (e.g., its dimensions) and verify the label information.

By examining a sample, we gain insights into the format of the audio data and the type of labels provided, which is crucial for ensuring data consistency and preparing for further analysis or model training.


In [None]:
# Retrieve one sample from the dataset (index 56)
signal, label = icbhidataset[56]

# Print the shape of the audio signal and the corresponding label
signal.shape, label

(torch.Size([1, 882000]), 'COPD')
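
The first dimension of the signal is the number of channels and the second is the number of samples, so the clip duration follows from the sampling rate that `torchaudio.load` returns alongside the waveform. The following is a worked example that assumes a 44.1 kHz sampling rate for this file; that value is an assumption, since the notebook does not print `sr`.

```python
num_channels, num_samples = 1, 882000   # shape printed above
assumed_sample_rate = 44100             # assumption: not shown in the notebook output

# Duration in seconds = number of samples / samples per second
duration_seconds = num_samples / assumed_sample_rate
print(f"{num_channels} channel(s), {duration_seconds:.1f} s")
```

To check the real value, keep the second element returned by `torchaudio.load` instead of discarding it.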

## 👏 **Conclusion**
By completing these steps, we have a dataset object that pairs each respiratory recording with its patient's diagnosis, and we have verified it by loading a sample. This setup is the foundation for the next stages: data preprocessing, feature extraction, and model training. Feel free to build on it as your research or project requires.