
## Creating the final audio dataset and metadata file

This notebook assembles a clean dataset for training:
- copies audio from `data/raw/<state>/<label>/` into a single `data/dataset_final/audio/` folder
- renames every file to `<label>_<state>_<id>.wav` for traceability
- writes `metadata.csv` linking each file to its label and driving context id (1=braking, 2=moving, 3=startup, 4=idle)

Expected results:

```
dataset_final/
│
├── audio/
│       car_knocking_moving_state_0001.wav
│       normal_brakes_braking_state_0002.wav
│       low_oil_idle_state_0030.wav
│       ...
│
└── metadata.csv
```

`metadata.csv`:

| file                                  | label           | context |
| ------------------------------------- | --------------- | ------- |
| normal_brakes_braking_state_0002.wav  | normal_brakes   | 1       |
| serpentine_belt_idle_state_0010.wav   | serpentine_belt | 4       |
| dead_battery_startup_state_0003.wav   | dead_battery    | 3       |
| car_knocking_moving_state_0012.wav    | car_knocking    | 2       |
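
Because every name follows `<label>_<state>_<id>.wav`, the label and state can be recovered from the filename alone. A minimal sketch, assuming the four state names listed above and an id of at least four digits; `parse_audio_filename` and `FILENAME_RE` are hypothetical names, not part of the notebook:

```python
import re

# Assumed pattern: <label>_<state>_state_<id>.wav, with the four driving
# states used in this notebook. Labels may themselves contain underscores.
FILENAME_RE = re.compile(
    r"^(?P<label>.+)_(?P<state>braking|moving|startup|idle)_state_(?P<id>\d{4,})\.wav$"
)


def parse_audio_filename(filename):
    """Split a dataset filename into (label, state, id); hypothetical helper."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename!r}")
    return match["label"], match["state"] + "_state", int(match["id"])


print(parse_audio_filename("car_knocking_moving_state_0012.wav"))
```

Anchoring the pattern on the known state names is what lets multi-word labels such as `car_knocking` parse unambiguously.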


In [1]:
import os
import csv
import shutil
import pandas as pd


In [None]:
root = "data/raw"                                   # base dataset root with context folders
output_audio = "data/dataset_final/audio"                # final audio output folder
output_csv = "data/dataset_final/metadata.csv"           # final CSV output file

os.makedirs(output_audio, exist_ok=True)

# Map folder names to context ids used by the model
context_map = {
    "breaking_state": "1",
    "moving_state": "2",
    "startup_state": "3",
    "idle_state": "4"
}

rows = []

for state_folder in os.listdir(root):
    state_path = os.path.join(root, state_folder)
    if not os.path.isdir(state_path):
        continue  # skip non-folder entries

    context = context_map.get(state_folder, "")

    # Traverse each label within the driving context
    for label_folder in os.listdir(state_path):
        label_path = os.path.join(state_path, label_folder)
        if not os.path.isdir(label_path):
            continue

        label = label_folder                       # class name

        # Collect audio files inside the label folder (sorted so ids are reproducible across runs)
        wav_files = sorted(f for f in os.listdir(label_path) if f.lower().endswith(('.wav', '.waf')))

        for i, file in enumerate(wav_files):
            src = os.path.join(label_path, file)

            # Build a clean, traceable filename
            new_filename = f"{label}_{state_folder}_{i:04d}.wav"
            dst = os.path.join(output_audio, new_filename)

            # Copy instead of move to keep the original data intact
            shutil.copy(src, dst)

            # Track metadata row for CSV
            rows.append([new_filename, label, context])

# create CSV metadata file
with open(output_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "label", "context"])
    writer.writerows(rows)

print("✅ Final dataset created successfully")
print(f"Total audios: {len(rows)} written to {output_csv}")


✅ Final dataset created successfully
Total audios: 1548 written to dataset_final/metadata.csv
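
The loop above falls back to an empty string when a state folder has no entry in `context_map`, so a misspelled key passes silently. One possible safeguard is to fail on unknown folder names instead. This is only a sketch: `resolve_context` is a hypothetical helper, and the map below is illustrative, with keys assumed to match the folder names on disk exactly.

```python
# Illustrative map; keys are assumed to match the folder names on disk exactly.
CONTEXT_MAP = {
    "braking_state": 1,
    "moving_state": 2,
    "startup_state": 3,
    "idle_state": 4,
}


def resolve_context(state_folder, context_map=CONTEXT_MAP):
    """Return the context id for a folder, failing loudly on unknown names."""
    try:
        return context_map[state_folder]
    except KeyError:
        known = ", ".join(sorted(context_map))
        raise KeyError(
            f"No context id for folder {state_folder!r}; known folders: {known}"
        ) from None


print(resolve_context("idle_state"))
```

Raising early stops the copy before any row is written with a blank context, at the cost of aborting the whole run on a single unexpected folder.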



### Validate the created dataset

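Quick sanity check: make sure no row in the metadata file is missing its context value. A row ends up without a context whenever its state folder has no matching key in `context_map`. Here the affected files come from a folder named `braking_state`, while the map key is spelled `breaking_state`.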


In [7]:
df = pd.read_csv(output_csv)

# Inspect any rows missing context
df[df["context"].isna()]



|     | file                                   | label           | context |
| --- | -------------------------------------- | --------------- | ------- |
| 616 | worn_out_brakes_braking_state_0000.wav | worn_out_brakes | NaN     |
| 617 | worn_out_brakes_braking_state_0001.wav | worn_out_brakes | NaN     |
| 618 | worn_out_brakes_braking_state_0002.wav | worn_out_brakes | NaN     |
| 619 | worn_out_brakes_braking_state_0003.wav | worn_out_brakes | NaN     |
| 620 | worn_out_brakes_braking_state_0004.wav | worn_out_brakes | NaN     |
| ... | ...                                    | ...             | ...     |
| 764 | normal_brakes_braking_state_0072.wav   | normal_brakes   | NaN     |
| 765 | normal_brakes_braking_state_0073.wav   | normal_brakes   | NaN     |
| 766 | normal_brakes_braking_state_0074.wav   | normal_brakes   | NaN     |
| 767 | normal_brakes_braking_state_0075.wav   | normal_brakes   | NaN     |


In [8]:
def extract_context_from_filename(filename):
    """Infer context id from filename if it was missing when copying."""
    name = filename.lower()
    if "braking_state" in name:
        return 1
    elif "moving_state" in name:
        return 2
    elif "startup_state" in name:
        return 3
    elif "idle_state" in name:
        return 4
    else:
        return 0  # unknown


# Fill missing contexts by looking at the filename
df["context"] = df["context"].fillna(df["file"].apply(extract_context_from_filename))
df["context"].unique()


array([4., 1., 2., 3.])
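
The same filename-based fill can be written without the `if`/`elif` chain. Below is a sketch using `Series.str.extract` and `Series.map`. The sample frame and the `STATE_TO_CONTEXT` name are made up for illustration, and unknown states map to 0 as in the function above.

```python
import pandas as pd

# Hypothetical lookup from state token to context id.
STATE_TO_CONTEXT = {"braking": 1, "moving": 2, "startup": 3, "idle": 4}

# Small made-up sample standing in for metadata.csv.
sample = pd.DataFrame(
    {
        "file": [
            "normal_brakes_braking_state_0002.wav",
            "low_oil_idle_state_0030.wav",
            "mystery_clip.wav",
        ],
        "context": [float("nan"), 4.0, float("nan")],
    }
)

# Pull the state token out of each filename (NaN when nothing matches).
state = sample["file"].str.lower().str.extract(
    r"_(braking|moving|startup|idle)_state_", expand=False
)

# Translate tokens to ids, using 0 for unknown states.
inferred = state.map(STATE_TO_CONTEXT).fillna(0)

# Keep existing values, fill only the gaps, then store ids as integers.
sample["context"] = sample["context"].fillna(inferred).astype(int)

print(sample["context"].tolist())
```

Casting to `int` at the end avoids the float ids (`4.`, `1.`, ...) that appear when a column has held missing values.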

In [9]:
# Confirm there are no remaining null context values
df[df["context"].isna()]


|     | file | label | context |
| --- | ---- | ----- | ------- |



No rows with a missing context remain, so the cleaned metadata can be saved.


In [10]:
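# Store context ids as integers (the fill above left them as floats),
# then persist the cleaned metadata file
df["context"] = df["context"].astype(int)
df.to_csv(output_csv, index=False)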
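
After saving, it can be worth confirming that the CSV and the audio folder agree. This is a rough sketch, assuming the `file` column holds bare filenames relative to the audio folder. `check_dataset` is a hypothetical helper, and the demo builds a throwaway folder with made-up filenames instead of touching the real dataset.

```python
import os
import tempfile

import pandas as pd


def check_dataset(audio_dir, metadata):
    """Compare a metadata frame with the files present in audio_dir."""
    listed = set(metadata["file"])
    on_disk = set(os.listdir(audio_dir))
    duplicated = metadata.loc[metadata["file"].duplicated(), "file"]
    return {
        "missing_on_disk": sorted(listed - on_disk),   # CSV row without a file
        "not_in_metadata": sorted(on_disk - listed),   # file without a CSV row
        "duplicate_rows": sorted(set(duplicated)),     # filenames listed twice
    }


# Demo on a temporary folder with made-up filenames.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a_idle_state_0000.wav", "b_idle_state_0000.wav"]:
        open(os.path.join(tmp, name), "wb").close()

    demo = pd.DataFrame(
        {"file": ["a_idle_state_0000.wav", "c_idle_state_0000.wav"]}
    )
    report = check_dataset(tmp, demo)

print(report)
```

On the real dataset the call would look like `check_dataset(output_audio, df)`, and all three lists should come back empty.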
