# Project: Bangla Dialect-to-Standard Normalization
## Phase 2A: Robust Automated Draft Transcription (P100 Optimized)

**Optimization:** Uses the Whisper `medium` model plus aggressive memory management (periodic `torch.cuda.empty_cache()` and `gc.collect()`) to prevent Kaggle kernel freezes.

---

### Abstract
This notebook generates draft transcriptions for the segmented dialect audio. It is tuned for the Kaggle P100 GPU environment: it uses the Whisper `medium` model and writes a checkpoint CSV every 50 files, so that most progress is preserved on disk even if the kernel times out.


In [1]:
# CELL 1: Environment Setup (CORRECTED)
# Installs Whisper FIRST, then imports it.

!pip install -q openai-whisper

import os
import gc
import logging
import warnings
import pandas as pd
import torch
import whisper
from glob import glob
from tqdm.notebook import tqdm

# Suppress warnings
warnings.filterwarnings('ignore')

# Configure Logging
logging.basicConfig(
    level=logging.INFO,
    format='[%(asctime)s] %(levelname)s: %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger("Transcriber")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"✓ Hardware Device: {DEVICE.upper()}", flush=True)

if DEVICE == "cpu":
    print("⚠️ WARNING: Running on CPU! This will be extremely slow.", flush=True)

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for openai-whisper (pyproject.toml) ... done
✓ Hardware Device: CUDA


In [2]:
# CELL 2: Robust Data Discovery
# Finds your segmented inventory and audio files automatically

def locate_data():
    # 1. Find CSV
    csv_files = glob("/kaggle/input/**/segmented_inventory.csv", recursive=True)
    if not csv_files:
        print("❌ Error: 'segmented_inventory.csv' not found. Please add Phase 1C output as input.", flush=True)
        return None, None
    
    csv_path = csv_files[0]
    print(f"✓ Found Inventory: {csv_path}", flush=True)
    
    # 2. Find Audio Directory
    df = pd.read_csv(csv_path)
    if df.empty:
        print("❌ Error: Inventory CSV is empty!", flush=True)
        return None, None

    # Check first file location
    sample_file = df.iloc[0]['filename']
    audio_files = glob(f"/kaggle/input/**/{sample_file}", recursive=True)
    
    if not audio_files:
        # Fallback for same-session
        if os.path.exists(f"/kaggle/working/segmented_dataset/{sample_file}"):
            return df, "/kaggle/working/segmented_dataset"
        print(f"❌ Error: Audio file '{sample_file}' not found.", flush=True)
        return df, None
    
    audio_dir = os.path.dirname(audio_files[0])
    print(f"✓ Found Audio Directory: {audio_dir}", flush=True)
    return df, audio_dir

df_inventory, AUDIO_DIR = locate_data()

✓ Found Inventory: /kaggle/input/phase-1c-part2-segmentation/manifests/segmented_inventory.csv
✓ Found Audio Directory: /kaggle/input/phase-1c-part2-segmentation/segmented_dataset
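
`locate_data()` infers the audio directory from the first inventory row only. A minimal sketch of a fuller check, which lists every inventory filename that is absent from that directory, is shown here. The helper name `find_missing_files` is hypothetical and not part of the original notebook.

```python
import os

def find_missing_files(filenames, audio_dir):
    """Return the inventory filenames that do not exist under audio_dir."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(audio_dir, name))
    ]

# Hypothetical usage inside this notebook (names taken from Cell 2):
# missing = find_missing_files(df_inventory["filename"], AUDIO_DIR)
# print(f"{len(missing)} of {len(df_inventory)} files missing", flush=True)
```

Running this before inference would surface missing segments up front rather than leaving them to be skipped inside the transcription loop.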


In [3]:
# CELL 3: Load Optimized Model (Medium)
# Using 'medium' ensures stability on P100

MODEL_SIZE = "medium"

if AUDIO_DIR:
    print(f"⏳ Loading Whisper '{MODEL_SIZE}' model...", flush=True)
    try:
        model = whisper.load_model(MODEL_SIZE, device=DEVICE)
        print("✓ Model Loaded Successfully.", flush=True)
    except Exception as e:
        print(f"❌ Model Load Failed: {e}", flush=True)
        raise

⏳ Loading Whisper 'medium' model...


100%|██████████████████████████████████████| 1.42G/1.42G [00:06<00:00, 232MiB/s]


✓ Model Loaded Successfully.


In [4]:
# CELL 4: Fail-Safe Transcription Loop

CHECKPOINT_PATH = "/kaggle/working/draft_transcriptions_checkpoint.csv"

if AUDIO_DIR and df_inventory is not None:
    results = []
    total_files = len(df_inventory)
    
    print(f"🚀 Starting Inference on {total_files} files...", flush=True)
    
    for idx, row in tqdm(df_inventory.iterrows(), total=total_files):
        file_name = row['filename']
        file_path = os.path.join(AUDIO_DIR, file_name)
        
        try:
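            if not os.path.exists(file_path):
                # Report the skip instead of dropping the file silently
                print(f"⚠️ Missing file, skipped: {file_path}", flush=True)
                continue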

            # Transcribe (Beam size 1 for speed)
            result = model.transcribe(file_path, language="bn", beam_size=1)
            
            results.append({
                "filename": file_name,
                "dialect": row['dialect'],
                "duration": row['duration'],
                "machine_transcript": result['text'].strip(),
                "human_correction": "" 
            })
            
            # PROGRESS UPDATE (Crucial for Save Version Logs)
            if (idx + 1) % 20 == 0:
                print(f"   --> Processed {idx + 1}/{total_files} files...", flush=True)
                
                # Memory Cleanup
                torch.cuda.empty_cache()
                gc.collect()
            
            # Checkpoint Save (Every 50 files)
            if (idx + 1) % 50 == 0:
                pd.DataFrame(results).to_csv(CHECKPOINT_PATH, index=False)

        except Exception as e:
            print(f"⚠️ Error on {file_name}: {e}", flush=True)
            continue

    # Final Save
    final_path = "/kaggle/working/draft_transcriptions_FINAL.csv"
    pd.DataFrame(results).to_csv(final_path, index=False)
    print(f"✅ Job Complete. Saved to {final_path}", flush=True)
else:
    print("❌ Skipping transcription due to missing setup.", flush=True)

🚀 Starting Inference on 577 files...



   --> Processed 20/577 files...
   --> Processed 40/577 files...
   --> Processed 60/577 files...
   --> Processed 80/577 files...
   --> Processed 100/577 files...
   --> Processed 120/577 files...
   --> Processed 140/577 files...
   --> Processed 160/577 files...
   --> Processed 180/577 files...
   --> Processed 200/577 files...
   --> Processed 220/577 files...
   --> Processed 240/577 files...
   --> Processed 260/577 files...
   --> Processed 280/577 files...
   --> Processed 300/577 files...
   --> Processed 320/577 files...
   --> Processed 340/577 files...
   --> Processed 360/577 files...
   --> Processed 380/577 files...
   --> Processed 400/577 files...
   --> Processed 420/577 files...
   --> Processed 440/577 files...
   --> Processed 460/577 files...
   --> Processed 480/577 files...
   --> Processed 500/577 files...
   --> Processed 520/577 files...
   --> Processed 540/577 files...
   --> Processed 560/577 files...
✅ Job Complete. Saved to /kaggle/working/draft_tra
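
The loop in Cell 4 writes a checkpoint but never reads it back, so a restarted run would begin again from the first file. A minimal sketch of the missing resume step is shown here. The helper names `load_done` and `pending` are hypothetical, and the sketch assumes the checkpoint CSV is still present at the given path when the notebook restarts, which may require attaching it as an input dataset.

```python
import csv
import os

def load_done(checkpoint_path):
    """Return the set of filenames already recorded in the checkpoint CSV."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, newline="", encoding="utf-8") as f:
        return {row["filename"] for row in csv.DictReader(f)}

def pending(filenames, checkpoint_path):
    """Return the filenames that still need transcription, in original order."""
    done = load_done(checkpoint_path)
    return [name for name in filenames if name not in done]

# Hypothetical usage inside this notebook (names taken from Cell 4):
# todo = pending(df_inventory["filename"], CHECKPOINT_PATH)
# print(f"Resuming: {len(todo)} files left", flush=True)
```

To complete the resume, `results` would also need to be seeded with the rows already in the checkpoint, so that the final CSV contains both the old and the new transcriptions.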

In [5]:
# CELL 5: Verify Output
import os
import pandas as pd

out_file = "/kaggle/working/draft_transcriptions_FINAL.csv"
if os.path.exists(out_file):
    df = pd.read_csv(out_file)
    print(f"\n📄 Final File Generated: {out_file}")
    print(f"📊 Total Rows: {len(df)}")
    print("Preview:")
    print(df[['filename', 'machine_transcript']].head())
else:
    print("❌ Final CSV not found!")


📄 Final File Generated: /kaggle/working/draft_transcriptions_FINAL.csv
📊 Total Rows: 577
Preview:
                               filename  \
0  Chittagonian_JvwgOr-K0vQ_seg0000.wav   
1  Chittagonian_JvwgOr-K0vQ_seg0001.wav   
2  Chittagonian_JvwgOr-K0vQ_seg0002.wav   
3  Chittagonian_JvwgOr-K0vQ_seg0003.wav   
4  Chittagonian_JvwgOr-K0vQ_seg0004.wav   

                                  machine_transcript  
0                                                NaN  
1  নেশিরক্লেন কেনা আদেনা যাকাডি আকাডি পেনি আফারাখ...  
2  तोईले तार फ्रेमी के भुईश्दो तारे अभुयला गरी की...  
3  तुआर भुद्धि। तुआर मतल वचस्तर भुद्धि तोरी। अर भ...  
4  फारा फूती भेशिरे जाना, इते रातो चाने जाइबु गोई...  
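
The preview suggests two things worth checking across all 577 rows. Row 0 is `NaN`, which is how pandas reads an empty CSV field by default, and rows 2 to 4 appear to be in Devanagari rather than Bengali script even though `language="bn"` was requested. A rough sketch of a per-row script check is shown here. The function name `dominant_script` is hypothetical and the classification uses only Unicode block ranges, so it is a screening aid rather than a language detector.

```python
from collections import Counter

def dominant_script(text):
    """Classify a transcript as 'bengali', 'devanagari', 'other', or 'empty'."""
    if not isinstance(text, str) or not text.strip():
        return "empty"  # covers NaN (a float) as well as blank strings
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x0980 <= cp <= 0x09FF:      # Unicode Bengali block
            counts["bengali"] += 1
        elif 0x0900 <= cp <= 0x097F:    # Unicode Devanagari block (includes the danda)
            counts["devanagari"] += 1
        elif ch.isalpha():
            counts["other"] += 1
    if not counts:
        return "other"  # digits or punctuation only
    return counts.most_common(1)[0][0]

# Hypothetical usage inside this notebook (df from Cell 5):
# print(df["machine_transcript"].map(dominant_script).value_counts())
```

Rows flagged as `empty` or `devanagari` could then be prioritized for the `human_correction` pass or queued for re-transcription.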
