# Project: Bangla Dialect-to-Standard Normalization
## Phase 1C: Unstructured Dialect Corpus Acquisition (YouTube Mining)

**Author:** Swagotam Malakar  
**Affiliation:** Dept. of CSE, United International University  
**Objective:** Acquire high-fidelity regional speech data from public media sources (dramas and vlogs) to augment the failure-prone RegSpeech12 dataset.

---

### Abstract
Following the **Language Identity Crisis** observed in Phase 1B, current acoustic models evidently need domain-specific fine-tuning data. This notebook implements a targeted data mining pipeline that extracts audio from curated YouTube sources (regional natoks, i.e. dramas, and vlogs). The pipeline covers automated audio extraction, downsampling to 16 kHz (a common ASR input rate), and metadata cataloging.

### Methodology
1.  **Source Configuration:** Targeting specific dialects (Sylheti, Chittagonian, Noakhali) via curated URLs.
2.  **Audio Extraction:** Using `yt-dlp` for high-quality audio stream capture.
3.  **Signal Processing:** Conversion of raw streams to WAV with `ffmpeg` (through the `yt-dlp` post-processor), then resampling to 16 kHz and downmixing to mono with `pydub`.
4.  **Inventory Management:** Automated generation of a training manifest.
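
The signal-processing step can also be expressed as a single direct `ffmpeg` call. The sketch below only builds the argument list for such a call; `build_ffmpeg_command` and the example paths are hypothetical names used for illustration and are not part of this notebook, which relies on the `yt-dlp` post-processor and `pydub` instead.

```python
import shlex


def build_ffmpeg_command(src_path, dst_path, sample_rate=16000, channels=1):
    """Return an ffmpeg argument list for a fixed-rate, fixed-channel WAV conversion.

    Hypothetical helper: it builds the command but does not run it.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", src_path,            # input file (e.g. a stream saved by yt-dlp)
        "-ac", str(channels),      # number of audio channels (1 = mono)
        "-ar", str(sample_rate),   # output sampling rate in Hz
        dst_path,                  # output path; the .wav extension selects WAV
    ]


# Example paths are placeholders, not files produced by this notebook.
cmd = build_ffmpeg_command("raw_downloads/example.webm", "processed_audio/example.wav")
print(shlex.join(cmd))
```

To actually run the conversion, the list can be passed to `subprocess.run(cmd, check=True)`, assuming `ffmpeg` is installed and on the `PATH`.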

In [1]:
# CELL 1: Environment Setup & Dependency Installation
# REQUIREMENT: Kaggle 'Internet' access must be enabled in the settings panel.

import os
import sys
import json
import logging
import pandas as pd
from datetime import datetime

# Configure Logging
logging.basicConfig(
    level=logging.INFO,
    format='[%(asctime)s] %(levelname)s: %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger("DataMiner")

logger.info("Installing Extraction Engine (yt-dlp) & Audio Processors (ffmpeg)...")

# Silent Install
!pip install -q yt-dlp pydub
!apt-get -qq install ffmpeg

logger.info("✓ Environment Ready for Data Mining.")

[06:05:52] INFO: Installing Extraction Engine (yt-dlp) & Audio Processors (ffmpeg)...
[06:06:03] INFO: ✓ Environment Ready for Data Mining.


In [2]:
# CELL 2: Mining Configuration (real source URLs)
# NOTE: These links pointed to active dialect-speech videos when this run was executed; availability may change.

MINING_CONFIG = {
    "meta": {
        "session_id": "mining_session_001",
        "target_sampling_rate": 16000,
        "output_format": "wav"
    },
    "targets": [
        {
            "dialect": "Sylheti",
            "source_type": "Drama",
            "urls": [
                "https://www.youtube.com/watch?v=B8tTlSZo7Z8", 
                "https://www.youtube.com/watch?v=6Ycv4OO9kwo"
            ]
        },
        {
            "dialect": "Chittagonian",
            "source_type": "Drama",
            "urls": [
                "https://www.youtube.com/watch?v=JvwgOr-K0vQ",
                "https://www.youtube.com/watch?v=mumxd18fIK0"
            ]
        },
        {
            "dialect": "Noakhali",
            "source_type": "Drama",
            "urls": [
                "https://www.youtube.com/watch?v=wMP0zweZUzA"
            ]
        }
    ],
    "paths": {
        "raw_download": "/kaggle/working/raw_downloads",
        "processed_audio": "/kaggle/working/processed_audio",
        "manifest": "/kaggle/working/manifests"
    }
}

# Create Directories
for path in MINING_CONFIG['paths'].values():
    os.makedirs(path, exist_ok=True)
    logger.info(f"Directory Verified: {path}")

[06:06:03] INFO: Directory Verified: /kaggle/working/raw_downloads
[06:06:03] INFO: Directory Verified: /kaggle/working/processed_audio
[06:06:03] INFO: Directory Verified: /kaggle/working/manifests


In [3]:
# CELL 3: Extraction Engine (The Miner)
# Downloads each URL, converts it to 16 kHz mono WAV, and catalogs it.
# Per-URL failures are caught and logged so one bad link does not abort the run.

import yt_dlp
from pydub import AudioSegment

def download_and_process(config):
    manifest = []
    
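    # yt-dlp options: fetch the best available audio stream and let the
    # FFmpegExtractAudio post-processor transcode it to WAV.
    # 'preferredquality' is omitted: it is a bitrate hint for lossy codecs
    # and should have no effect on uncompressed WAV output.
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
        }],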
        'outtmpl': f"{config['paths']['raw_download']}/%(id)s.%(ext)s",
        'quiet': True,
        'no_warnings': True
    }

    logger.info("Starting Extraction Pipeline...")

    for target in config['targets']:
        dialect = target['dialect']
        logger.info(f"Processing Target Dialect: {dialect}")
        
        for url in target['urls']:
            try:
                logger.info(f"   Mining URL: {url}")
                
                # 1. Download
                with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                    info = ydl.extract_info(url, download=True)
                    file_id = info['id']
                    title = info['title']
                    raw_path = f"{config['paths']['raw_download']}/{file_id}.wav"

                # 2. Process (Resample to 16kHz Mono)
                if os.path.exists(raw_path):
                    audio = AudioSegment.from_wav(raw_path)
                    audio = audio.set_frame_rate(config['meta']['target_sampling_rate'])
                    audio = audio.set_channels(1) # Mono
                    
                    processed_filename = f"{dialect}_{file_id}.wav"
                    processed_path = f"{config['paths']['processed_audio']}/{processed_filename}"
                    
                    audio.export(processed_path, format="wav")
                    
                    # 3. Catalog
                    manifest.append({
                        "filename": processed_filename,
                        "filepath": processed_path,
                        "dialect": dialect,
                        "source_url": url,
                        "original_title": title,
                        "duration_sec": audio.duration_seconds
                    })
                    
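                    # Cleanup Raw
                    os.remove(raw_path)
                    logger.info(f"   ✓ Success: {title[:30]}...")
                else:
                    # Report a missing WAV instead of skipping it silently
                    logger.warning(f"   Expected WAV not found after download: {raw_path}")
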
            except Exception as e:
                logger.error(f"   ❌ Failed to mine {url}: {e}")

    return pd.DataFrame(manifest)

# Execute Miner
df_manifest = download_and_process(MINING_CONFIG)

[06:06:03] INFO: Starting Extraction Pipeline...
[06:06:03] INFO: Processing Target Dialect: Sylheti
[06:06:03] INFO:    Mining URL: https://www.youtube.com/watch?v=B8tTlSZo7Z8
[06:06:24] INFO:    ✓ Success: সিলেটি নাটক | মামু ভাগ্নার ডাব...
[06:06:24] INFO:    Mining URL: https://www.youtube.com/watch?v=6Ycv4OO9kwo
[06:06:47] INFO:    ✓ Success: সিলেটি নাটক | এক দিনের পাগল | ...
[06:06:47] INFO: Processing Target Dialect: Chittagonian
[06:06:47] INFO:    Mining URL: https://www.youtube.com/watch?v=JvwgOr-K0vQ
[06:07:05] INFO:    ✓ Success: চট্টগ্রামের ভাষায় নাটক - আঁর ন...
[06:07:05] INFO:    Mining URL: https://www.youtube.com/watch?v=mumxd18fIK0
[06:07:33] INFO:    ✓ Success: Chatgayara Dhakay | FULL DRAMA...
[06:07:33] INFO: Processing Target Dialect: Noakhali
[06:07:33] INFO:    Mining URL: https://www.youtube.com/watch?v=wMP0zweZUzA
[06:08:01] INFO:    ✓ Success: মিশন নোয়াখালী | Mission Noakha...
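
A quick sanity check on the processed files can confirm that each one really is 16 kHz mono before it goes into training. The sketch below uses only the standard-library `wave` module; `check_wav_format` is a hypothetical helper, not part of the pipeline above, and the demo writes a throwaway file of silence rather than touching the mined audio.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, details) for a PCM WAV file's sampling rate and channel count."""
    with wave.open(path, "rb") as wav:
        details = {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "duration_sec": wav.getnframes() / wav.getframerate(),
        }
    ok = (details["sample_rate"] == expected_rate
          and details["channels"] == expected_channels)
    return ok, details


# Demo on a synthetic file: one second of 16-bit silence at 16 kHz, mono.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(check_wav_format(demo_path))
```

In the notebook, the same function could be applied to every `filepath` in `df_manifest`, assuming the exported files are plain PCM WAV (which is what the `wave` module can read).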


In [4]:
# CELL 4: Export Manifest & Summary

if not df_manifest.empty:
    # Save Manifest CSV
    save_path = f"{MINING_CONFIG['paths']['manifest']}/phase1c_inventory.csv"
    df_manifest.to_csv(save_path, index=False)
    
    logger.info("=== MINING REPORT ===")
    logger.info(f"Total Files Acquired: {len(df_manifest)}")
    logger.info(f"Total Duration: {df_manifest['duration_sec'].sum() / 60:.2f} minutes")
    logger.info(f"Storage Location: {MINING_CONFIG['paths']['processed_audio']}")
    
    # Preview
    print("\nDataset Sample:")
    print(df_manifest[['dialect', 'filename', 'duration_sec']].head())
else:
    logger.warning("No data mined. Check internet access and the extraction errors logged above.")

[06:08:01] INFO: === MINING REPORT ===
[06:08:01] INFO: Total Files Acquired: 5
[06:08:01] INFO: Total Duration: 231.19 minutes
[06:08:01] INFO: Storage Location: /kaggle/working/processed_audio



Dataset Sample:
        dialect                      filename  duration_sec
0       Sylheti       Sylheti_B8tTlSZo7Z8.wav   1782.410187
1       Sylheti       Sylheti_6Ycv4OO9kwo.wav   2364.371938
2  Chittagonian  Chittagonian_JvwgOr-K0vQ.wav   2208.043563
3  Chittagonian  Chittagonian_mumxd18fIK0.wav   3687.526188
4      Noakhali      Noakhali_wMP0zweZUzA.wav   3828.889312
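
Because the corpus is meant to cover several dialects, the per-dialect balance matters as much as the overall total. The sketch below sums the durations per dialect using the values printed in the sample above; `minutes_per_dialect` is a hypothetical helper, and the rows are copied by hand rather than read from the manifest.

```python
from collections import defaultdict


def minutes_per_dialect(rows):
    """Sum 'duration_sec' per 'dialect' and return the totals in minutes."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["dialect"]] += row["duration_sec"]
    return {dialect: seconds / 60 for dialect, seconds in totals.items()}


# Durations copied from the dataset sample printed above.
rows = [
    {"dialect": "Sylheti", "duration_sec": 1782.410187},
    {"dialect": "Sylheti", "duration_sec": 2364.371938},
    {"dialect": "Chittagonian", "duration_sec": 2208.043563},
    {"dialect": "Chittagonian", "duration_sec": 3687.526188},
    {"dialect": "Noakhali", "duration_sec": 3828.889312},
]

for dialect, minutes in minutes_per_dialect(rows).items():
    print(f"{dialect}: {minutes:.1f} min")
```

With the durations reported in this run, that comes to roughly 69.1 minutes of Sylheti, 98.3 of Chittagonian, and 63.8 of Noakhali. Inside the notebook, `df_manifest.groupby('dialect')['duration_sec'].sum() / 60` gives the same breakdown.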
