# Project: Bangla Dialect-to-Standard Normalization
## Phase 1C (Part 2): Intelligent Audio Segmentation (VAD Pipeline)

**Author:** Swagotam Malakar  
**Affiliation:** Dept. of CSE, United International University  
**Objective:** Transform raw, long-form YouTube audio (30+ minutes) into clean, short speech segments (target: about 15 s) suitable for ASR training by removing silence, music, and noise.

---

### Abstract
Raw audio from dramas contains a large share of non-speech material (background music, silence). To prepare the dataset for Phase 2 (Annotation) and model fine-tuning, we apply a **Voice Activity Detection (VAD)** pipeline. This notebook uses the **Silero VAD** model to locate speech timestamps, packs consecutive speech regions into chunks of roughly 2-20 seconds, and discards the remaining non-speech audio.

### Methodology
1.  **VAD Model:** Pre-trained `silero-vad`, loaded through `torch.hub`.
2.  **Timestamp Calculation:** Identify the start and end sample of every speech region.
3.  **Adaptive Chunking:** Greedily merge consecutive speech regions into training-friendly segments (target: 15 s, cap: 20 s, chunks under 2 s are dropped).
4.  **Manifest Generation:** Export a CSV inventory of the segments for annotation tools (Label Studio).
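
The chunking step can be illustrated on toy data. The following is a minimal sketch, assuming speech regions arrive as dictionaries with `start` and `end` sample offsets (the shape the segmentation cell iterates over); the helper name `pack_regions` and the toy regions are hypothetical and only serve the illustration.

```python
def pack_regions(regions, sampling_rate=16000, min_sec=2.0, max_sec=20.0):
    """Greedily group speech regions (sample offsets) into chunks.

    A chunk is closed as soon as adding the next region would push its
    total speech time past max_sec; closed chunks shorter than min_sec
    are dropped.
    """
    chunks, current, current_sec = [], [], 0.0
    for region in regions:
        region_sec = (region["end"] - region["start"]) / sampling_rate
        if current and current_sec + region_sec > max_sec:
            if current_sec >= min_sec:
                chunks.append(current)
            current, current_sec = [], 0.0
        current.append(region)
        current_sec += region_sec
    # Flush whatever is left once the input is exhausted.
    if current_sec >= min_sec:
        chunks.append(current)
    return chunks


# Toy input: four speech regions lasting 8 s, 9 s, 6 s and 1 s.
sr = 16000
regions = [
    {"start": 0 * sr, "end": 8 * sr},
    {"start": 10 * sr, "end": 19 * sr},
    {"start": 25 * sr, "end": 31 * sr},
    {"start": 40 * sr, "end": 41 * sr},
]
chunks = pack_regions(regions)
durations = [sum(r["end"] - r["start"] for r in c) / sr for c in chunks]
print(durations)  # [17.0, 7.0]
```

The first two regions fit under the 20 s cap together (17 s); the third would raise the total to 23 s, so it opens a new chunk that also absorbs the final 1 s region.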

In [1]:
# CELL 1: Imports and Logging Setup
# Nothing is installed here: torch, torchaudio, pandas and tqdm are assumed to be preinstalled in the environment

import os
import logging
import warnings
import torch
import pandas as pd
from glob import glob
from tqdm.notebook import tqdm
import torchaudio

# Suppress warnings
warnings.filterwarnings('ignore')

# Configure Logging
logging.basicConfig(
    level=logging.INFO,
    format='[%(asctime)s] %(levelname)s: %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger("Segmenter")

logger.info("Installing Silero VAD & Torchaudio...")
# Note: silero-vad is loaded via torch.hub, so internet access is required initially.
logger.info("✓ Dependencies Loaded.")

[06:22:21] INFO: Installing Silero VAD & Torchaudio...
[06:22:21] INFO: ✓ Dependencies Loaded.


In [2]:
# CELL 2: Pipeline Configuration (AUTO-PATH DETECTION)
# Finds where the input files are located automatically.

def find_input_directory():
    # Standard Kaggle Input Paths to check
    potential_paths = [
        "/kaggle/input/phase-1c-youtube-mining/processed_audio", # Most likely
        "/kaggle/input/phase_1c_youtube_mining/processed_audio",
        "/kaggle/input/processed_audio",
        "/kaggle/working/processed_audio" # If running in same session
    ]
    
    for p in potential_paths:
        if os.path.exists(p) and len(glob(f"{p}/*.wav")) > 0:
            return p
            
    # Fallback: Search recursively
    wavs = glob("/kaggle/input/**/*.wav", recursive=True)
    if wavs:
        return os.path.dirname(wavs[0])
        
    return None

INPUT_DIR = find_input_directory()

SEGMENT_CONFIG = {
    "paths": {
        "input_dir": INPUT_DIR,
        "output_dir": "/kaggle/working/segmented_dataset",
        "manifest_path": "/kaggle/working/manifests"
    },
    "params": {
        "sampling_rate": 16000,
        "min_duration_sec": 2.0,   # Drop chunks with less than 2s of speech
        "max_duration_sec": 20.0,  # Close the current chunk before it would exceed 20s
        "target_duration": 15.0,   # Ideal chunk size for ASR (documented target; not read by the chunking code)
        "vad_threshold": 0.5       # Speech-probability threshold (higher = stricter)
    }
}

# Create Output Directories
os.makedirs(SEGMENT_CONFIG['paths']['output_dir'], exist_ok=True)
os.makedirs(SEGMENT_CONFIG['paths']['manifest_path'], exist_ok=True)

# Check Input Availability
if not INPUT_DIR:
    logger.critical("❌ CRITICAL: Could not find input audio files!")
    logger.critical("ACTION: Make sure you added the 'Phase 1C YouTube Mining' output as input to this notebook.")
else:
    input_files = glob(f"{INPUT_DIR}/*.wav")
    logger.info(f"✓ Found Input Directory: {INPUT_DIR}")
    logger.info(f"✓ Found {len(input_files)} raw audio files for processing.")

[06:22:21] INFO: ✓ Found Input Directory: /kaggle/input/phase-1c-youtube-mining/processed_audio
[06:22:21] INFO: ✓ Found 5 raw audio files for processing.
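
Because the chunking logic depends on the duration settings being ordered sensibly, a quick sanity check of the `params` block can catch typos early. This is a minimal sketch: the helper `validate_params` is hypothetical, and the accepted ranges are assumptions chosen for illustration rather than limits taken from the Silero VAD documentation.

```python
def validate_params(params):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if params["sampling_rate"] <= 0:
        problems.append("sampling_rate must be positive")
    # Assumption: the threshold is a probability, so it should lie in (0, 1).
    if not 0.0 < params["vad_threshold"] < 1.0:
        problems.append("vad_threshold must lie strictly between 0 and 1")
    if not (0 < params["min_duration_sec"]
            <= params["target_duration"]
            <= params["max_duration_sec"]):
        problems.append(
            "expected 0 < min_duration_sec <= target_duration <= max_duration_sec"
        )
    return problems


# Same values as the notebook's configuration cell.
params = {
    "sampling_rate": 16000,
    "min_duration_sec": 2.0,
    "max_duration_sec": 20.0,
    "target_duration": 15.0,
    "vad_threshold": 0.5,
}
print(validate_params(params))  # []
```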


In [3]:
# CELL 3: VAD Model Loader
# Load the pre-trained Silero VAD model and its helper functions from PyTorch Hub

def load_vad_model():
    logger.info("Loading Silero VAD Model...")
    try:
        model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', # Repo
                                      model='silero_vad',                # Model Name
                                      force_reload=False,
                                      trust_repo=True)
        (get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
        logger.info("✓ VAD Model Loaded Successfully.")
        return model, get_speech_timestamps, read_audio, collect_chunks
    except Exception as e:
        logger.error(f"❌ Failed to load VAD model: {e}")
        raise

if INPUT_DIR:
    model, get_speech_timestamps, read_audio, collect_chunks = load_vad_model()

[06:22:21] INFO: Loading Silero VAD Model...


Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip


[06:22:22] INFO: ✓ VAD Model Loaded Successfully.


In [4]:
# CELL 4: Intelligent Segmentation Engine
# Core logic to split long audio based on speech timestamps

def process_audio_file(file_path, config, model_utils):
    model, get_timestamps, read_audio, _ = model_utils
    params = config['params']
    
    file_name = os.path.basename(file_path)
    file_id = os.path.splitext(file_name)[0]
    dialect = file_name.split('_')[0] # Assuming filename format: Dialect_ID.wav
    
    logger.info(f"Processing: {file_name}...")
    
    # Read Audio
    wav = read_audio(file_path, sampling_rate=params['sampling_rate'])
    
    # Get speech timestamps (start/end sample offsets of each detected speech region)
    # Regions whose speech probability stays below the threshold (music, silence) are left out
    speech_timestamps = get_timestamps(
        wav, 
        model, 
        sampling_rate=params['sampling_rate'],
        threshold=params['vad_threshold']
    )
    
    segments_meta = []
    chunk_id = 0
    
    # Greedily merge consecutive speech regions into one chunk until max_duration_sec would be exceeded
    current_chunk = []
    current_duration = 0
    
    for ts in speech_timestamps:
        duration = (ts['end'] - ts['start']) / params['sampling_rate']
        
        # If adding this speech part exceeds max duration, save current chunk first
        if current_duration + duration > params['max_duration_sec']:
            if current_duration >= params['min_duration_sec']:
                # Save Logic
                save_chunk(wav, current_chunk, chunk_id, file_id, dialect, config, segments_meta)
                chunk_id += 1
            
            # Reset
            current_chunk = []
            current_duration = 0
        
        current_chunk.append(ts)
        current_duration += duration
    
    # Save remaining
    if current_duration >= params['min_duration_sec']:
        save_chunk(wav, current_chunk, chunk_id, file_id, dialect, config, segments_meta)
        
    logger.info(f"   -> Generated {len(segments_meta)} clean segments.")
    return segments_meta

def save_chunk(wav_tensor, timestamps, chunk_id, parent_id, dialect, config, meta_list):
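    # Concatenate the speech regions into a single tensor
    # Note: the non-speech gaps between regions are dropped, so the chunk is not contiguous source audio
    chunk_audio = torch.cat([wav_tensor[ts['start']:ts['end']] for ts in timestamps])
    
    # Filename: <ParentID>_seg<NNNN>.wav (ParentID already starts with the dialect, e.g. Chittagonian_JvwgOr-K0vQ_seg0000.wav)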
    out_name = f"{parent_id}_seg{chunk_id:04d}.wav"
    out_path = os.path.join(config['paths']['output_dir'], out_name)
    
    # Save
    torchaudio.save(out_path, chunk_audio.unsqueeze(0), config['params']['sampling_rate'])
    
    # Metadata
    meta_list.append({
        "filename": out_name,
        "filepath": out_path,
        "dialect": dialect,
        "source_file": parent_id,
        "duration": len(chunk_audio) / config['params']['sampling_rate']
    })
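
The merge loop closes a chunk before it would pass `max_duration_sec`, but a single speech region that is itself longer than the cap is still written out as one chunk. One way to handle that case is to cut such a region into fixed-length pieces before merging. The sketch below is an assumption-laden illustration, not part of the pipeline above: `split_long_region` is a hypothetical helper, and its hard cuts can fall mid-word. Recent `silero-vad` releases also appear to accept a `max_speech_duration_s` argument in `get_speech_timestamps`, which would be worth checking against the installed version.

```python
def split_long_region(region, sampling_rate=16000, max_sec=20.0):
    """Cut one speech region into consecutive pieces of at most max_sec."""
    max_samples = int(max_sec * sampling_rate)
    pieces = []
    start = region["start"]
    while start < region["end"]:
        end = min(start + max_samples, region["end"])
        pieces.append({"start": start, "end": end})
        start = end
    return pieces


sr = 16000
long_region = {"start": 0, "end": 45 * sr}  # 45 s of uninterrupted speech
pieces = split_long_region(long_region)
lengths = [(p["end"] - p["start"]) / sr for p in pieces]
print(lengths)  # [20.0, 20.0, 5.0]
```

Regions at or under the cap pass through unchanged as a single piece, so the helper could be applied to every timestamp before the merge loop.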

In [5]:
# CELL 5: Execution Loop & Manifest Generation

if INPUT_DIR and input_files:
    full_inventory = []
    
    logger.info("Starting Segmentation Pipeline...")
    model_utils = (model, get_speech_timestamps, read_audio, collect_chunks)
    
    for wav_file in tqdm(input_files):
        try:
            file_meta = process_audio_file(wav_file, SEGMENT_CONFIG, model_utils)
            full_inventory.extend(file_meta)
        except Exception as e:
            logger.error(f"Failed to process {wav_file}: {e}")
            
    # Export Inventory
    df_seg = pd.DataFrame(full_inventory)
    csv_path = f"{SEGMENT_CONFIG['paths']['manifest_path']}/segmented_inventory.csv"
    df_seg.to_csv(csv_path, index=False)
    
    logger.info("=== SEGMENTATION COMPLETE ===")
    logger.info(f"Total Segments Created: {len(df_seg)}")
    if not df_seg.empty:
        logger.info(f"Total Clean Speech Duration: {df_seg['duration'].sum() / 3600:.2f} hours")
    logger.info(f"Manifest Saved: {csv_path}")
    
    # Preview
    print(df_seg.head())
else:
    logger.warning("Skipping execution as no input files were found.")

[06:22:22] INFO: Starting Segmentation Pipeline...


[06:22:22] INFO: Processing: Chittagonian_JvwgOr-K0vQ.wav...
[06:22:59] INFO:    -> Generated 73 clean segments.
[06:22:59] INFO: Processing: Noakhali_wMP0zweZUzA.wav...
[06:24:02] INFO:    -> Generated 154 clean segments.
[06:24:02] INFO: Processing: Chittagonian_mumxd18fIK0.wav...
[06:25:03] INFO:    -> Generated 143 clean segments.
[06:25:03] INFO: Processing: Sylheti_6Ycv4OO9kwo.wav...
[06:25:42] INFO:    -> Generated 117 clean segments.
[06:25:42] INFO: Processing: Sylheti_B8tTlSZo7Z8.wav...
[06:26:11] INFO:    -> Generated 90 clean segments.
[06:26:11] INFO: === SEGMENTATION COMPLETE ===
[06:26:11] INFO: Total Segments Created: 577
[06:26:11] INFO: Total Clean Speech Duration: 2.82 hours
[06:26:11] INFO: Manifest Saved: /kaggle/working/manifests/segmented_inventory.csv


                               filename  \
0  Chittagonian_JvwgOr-K0vQ_seg0000.wav   
1  Chittagonian_JvwgOr-K0vQ_seg0001.wav   
2  Chittagonian_JvwgOr-K0vQ_seg0002.wav   
3  Chittagonian_JvwgOr-K0vQ_seg0003.wav   
4  Chittagonian_JvwgOr-K0vQ_seg0004.wav   

                                            filepath       dialect  \
0  /kaggle/working/segmented_dataset/Chittagonian...  Chittagonian   
1  /kaggle/working/segmented_dataset/Chittagonian...  Chittagonian   
2  /kaggle/working/segmented_dataset/Chittagonian...  Chittagonian   
3  /kaggle/working/segmented_dataset/Chittagonian...  Chittagonian   
4  /kaggle/working/segmented_dataset/Chittagonian...  Chittagonian   

                source_file  duration  
0  Chittagonian_JvwgOr-K0vQ    18.612  
1  Chittagonian_JvwgOr-K0vQ    19.088  
2  Chittagonian_JvwgOr-K0vQ     5.820  
3  Chittagonian_JvwgOr-K0vQ    19.924  
4  Chittagonian_JvwgOr-K0vQ    18.536  
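
Before handing the manifest to annotators, it can help to check how the segments are distributed across dialects. The following is a minimal sketch using only the standard library; `summarize_manifest` is a hypothetical helper, and the sample rows reuse the five durations from the preview above with only the columns the helper reads.

```python
import csv
import io
from collections import defaultdict


def summarize_manifest(csv_file):
    """Return {dialect: (segment_count, total_seconds)} from a manifest."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in csv.DictReader(csv_file):
        entry = totals[row["dialect"]]
        entry[0] += 1
        entry[1] += float(row["duration"])
    return {d: (n, round(sec, 3)) for d, (n, sec) in totals.items()}


# In-memory stand-in for the first five rows of the manifest.
sample = io.StringIO(
    "filename,dialect,duration\n"
    "Chittagonian_JvwgOr-K0vQ_seg0000.wav,Chittagonian,18.612\n"
    "Chittagonian_JvwgOr-K0vQ_seg0001.wav,Chittagonian,19.088\n"
    "Chittagonian_JvwgOr-K0vQ_seg0002.wav,Chittagonian,5.820\n"
    "Chittagonian_JvwgOr-K0vQ_seg0003.wav,Chittagonian,19.924\n"
    "Chittagonian_JvwgOr-K0vQ_seg0004.wav,Chittagonian,18.536\n"
)
summary = summarize_manifest(sample)
print(summary)  # {'Chittagonian': (5, 81.98)}
```

On the real file, the same function could be called with `open(csv_path, newline="")` in place of the in-memory sample.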
