# Dataset Preparation
This notebook handles the **organization of the dataset** before feature extraction and model training.

## What This Notebook Does:
- Loads metadata (`meta.csv`) to get labels.
- Checks for missing data.
- Copies the audio files into **two separate folders** by label, leaving the originals in place:
  - `data/processed_data/real/` (for real audios)
  - `data/processed_data/fake/` (for fake audios)
- Shows a `tqdm` progress bar while the files are copied.

In [4]:
# Importing required libraries
import os
import pandas as pd
import shutil
from tqdm import tqdm

In [5]:
# Load the metadata CSV file
metadata_path = './data/release_in_the_wild/meta.csv'
metadata = pd.read_csv(metadata_path)

# Display the first few rows of the metadata
metadata.head()

| Unnamed: 0 | file | speaker | label |
|---|---|---|---|
| 0 | 0.wav | Alec Guinness | spoof |
| 1 | 1.wav | Alec Guinness | spoof |
| 2 | 2.wav | Barack Obama | spoof |
| 3 | 3.wav | Alec Guinness | spoof |
| 4 | 4.wav | Christopher Hitchens | bona-fide |
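
The `Unnamed: 0` column in the preview looks like a row index that was written into the CSV when it was saved. That is an assumption, not something the dataset confirms. If it holds, a minimal sketch of reading the file without the extra column could look like this. The in-memory CSV below reuses rows from the preview so the example runs without the dataset.

```python
import io

import pandas as pd

# Sample rows copied from the metadata preview above.
# Assumption: the first header cell is empty in the real file,
# which is what pandas displays as "Unnamed: 0".
sample_csv = io.StringIO(
    ",file,speaker,label\n"
    "0,0.wav,Alec Guinness,spoof\n"
    "1,1.wav,Alec Guinness,spoof\n"
    "4,4.wav,Christopher Hitchens,bona-fide\n"
)

# index_col=0 uses the first CSV column as the DataFrame index
# instead of keeping it as a regular data column.
preview = pd.read_csv(sample_csv, index_col=0)
print(preview.columns.tolist())
```

With the real file, the same `index_col=0` argument would be passed to `pd.read_csv(metadata_path, ...)`.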


In [6]:
# Check for missing data in meta.csv
missing_data = metadata.isnull().sum()

if missing_data.any():
    print("Missing values found:")
    print(missing_data)
else:
    print("No missing values found")

No missing values found
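
A missing-value check does not catch unexpected label spellings. The copy step in this notebook treats every label other than `spoof` as real, so a variant such as `Spoof` would be filed in the wrong folder without any warning. The sketch below is one way to guard against that. `EXPECTED_LABELS` and `find_unexpected_labels` are hypothetical names, and the assumption that the dataset contains only the two labels seen in the preview is not verified here.

```python
import pandas as pd

# Assumption: these are the only two labels, as seen in the preview.
EXPECTED_LABELS = {"spoof", "bona-fide"}


def find_unexpected_labels(df, column="label"):
    """Return the sorted label values that are not in EXPECTED_LABELS."""
    observed = set(df[column].dropna().unique())
    return sorted(observed - EXPECTED_LABELS)


# Small demo frame with one deliberately misspelled label.
demo = pd.DataFrame(
    {
        "file": ["0.wav", "1.wav", "2.wav"],
        "label": ["spoof", "bona-fide", "Spoof"],
    }
)

print(find_unexpected_labels(demo))
print(demo["label"].value_counts())
```

In the notebook, `find_unexpected_labels(metadata)` would be expected to return an empty list before any files are copied, and `value_counts()` gives the class balance.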


In [7]:
# Create folders for real and fake audio files
os.makedirs('./data/processed_data/real', exist_ok=True)
os.makedirs('./data/processed_data/fake', exist_ok=True)

In [8]:
# Path to the audio files
audio_path = './data/release_in_the_wild'

# Iterate over metadata
for _, row in tqdm(metadata.iterrows(), total=len(metadata), desc="Processing"):
    src_path = os.path.join(audio_path, row['file'])

    # Check if the file exists before copying it
    if not os.path.exists(src_path):
        print(f"File not found: {src_path}. Skipping...")
        continue

    # Route 'spoof' to the fake folder; any other label is treated as real
    is_fake = row['label'] == 'spoof'
    dest_folder = './data/processed_data/fake' if is_fake else './data/processed_data/real'
    dest_path = os.path.join(dest_folder, row['file'])

    try:
        # Copy the file to the appropriate destination folder
        shutil.copy(src_path, dest_path)
    except Exception as e:
        print(f"Error copying file {row['file']}: {e}")

Processing: 100%|██████████| 31779/31779 [01:20<00:00, 394.09it/s]
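
The progress bar counts processed metadata rows, including any that were skipped or failed, so it does not confirm how many files actually reached the destination folders. A minimal sketch of a follow-up check is shown below. `count_files` is a hypothetical helper, and the demo uses a temporary directory instead of the real dataset folders.

```python
import os
import tempfile


def count_files(folder):
    """Count regular files directly inside folder (non-recursive)."""
    with os.scandir(folder) as entries:
        return sum(1 for entry in entries if entry.is_file())


# Demo on a throwaway directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0.wav", "1.wav"):
        open(os.path.join(tmp, name), "wb").close()
    demo_count = count_files(tmp)

print(demo_count)
```

In the notebook, `count_files('./data/processed_data/real') + count_files('./data/processed_data/fake')` could be compared with `len(metadata)`. The two should match if every file name is unique, nothing was skipped, and the folders were empty beforehand.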
