<a href="https://colab.research.google.com/github/w8091032/re0-web/blob/master/ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies
!pip install datasets evaluate jiwer librosa soundfile
!pip install --upgrade bitsandbytes transformers==4.50.0 accelerate
!pip install ctranslate2==4.4.0 whisperx
!apt-get update
!apt-get install -y libcudnn8 libcudnn8-dev

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading

In [None]:
import os
import pandas as pd
import torch
import numpy as np
from datasets import Dataset, Audio
from transformers import (
    WhisperProcessor,  # loads the pretrained processor
)
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import evaluate  # Hugging Face evaluate
from tqdm import tqdm
# from transformers import EarlyStoppingCallback  # for fine-tuning
from ctranslate2.converters import TransformersConverter  # converts HF models to CTranslate2
import whisperx  # WhisperX, used for inference and timestamps
# from transformers import WhisperTokenizer  # WhisperProcessor usually already includes the tokenizer
import json
import zipfile
from jiwer import wer  # computes WER

print("All libraries imported.")

All libraries imported.
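
The `wer` import above is not exercised anywhere else in this notebook. As a rough illustration of what a word error rate measures, the sketch below computes it from a word-level edit distance in plain Python. `simple_wer` is a hypothetical helper written for this note, not jiwer's implementation, and it applies no text normalization.

```python
def simple_wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words
print(simple_wer("the cat sat on the mat", "the cat sit on mat"))
```

For real evaluation, `jiwer.wer(reference, hypothesis)` is the call the import above provides; the sketch only shows the arithmetic behind the number.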


In [None]:
import librosa

def load_and_preprocess_audio(audio_path, target_sampling_rate=16000):
    """Load an audio file, resample it if needed, and return it in the form the Whisper processor expects."""
    try:
        # librosa handles many audio formats; sr=None keeps the file's native sampling rate
        speech_array, sampling_rate = librosa.load(audio_path, sr=None)

        if sampling_rate != target_sampling_rate:
            speech_array = librosa.resample(speech_array, orig_sr=sampling_rate, target_sr=target_sampling_rate)
            sampling_rate = target_sampling_rate

        return {"array": speech_array, "sampling_rate": sampling_rate}
    except Exception as e:
        print(f"Failed to load audio [{audio_path}]: {e}")
        return None

# Define example_audio_path before using it; the file must already be uploaded to /content/
example_audio_path = "/content/test_audio_1.wav"  # or "/content/test_audio_2.wav"

print(f"Testing single-file audio load: {example_audio_path}")
audio_input_data = load_and_preprocess_audio(example_audio_path)

if audio_input_data:
    print(f"Audio loaded, sampling rate: {audio_input_data['sampling_rate']}, array length: {len(audio_input_data['array'])}")
else:
    print("Audio load failed; check the path or the file.")

Testing single-file audio load: /content/test_audio_1.wav
Audio loaded, sampling rate: 16000, array length: 56320
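
The array length and sampling rate reported above are enough to recover the clip duration, which is a quick sanity check that the right file was loaded. A minimal sketch, where `duration_seconds` is a hypothetical helper and not part of librosa:

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of a mono waveform in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# Values taken from the cell output: 56320 samples at 16000 Hz
print(f"{duration_seconds(56320, 16000):.2f} s")  # prints "3.52 s"
```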


In [None]:

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [None]:
def setup_directories(task_name: str = "task1", version: str = "v1_colab_test"):
    """Create the results directory and its versioned subdirectories."""
    base_result_dir = "model_result"
    task_dir = os.path.join(base_result_dir, task_name)
    version_dir = os.path.join(task_dir, version)
    model_dir = os.path.join(version_dir, "model")  # holds the Hugging Face model
    ct2_model_dir = os.path.join(version_dir, "ct2_model")  # holds the CTranslate2 model


    os.makedirs(task_dir, exist_ok=True)
    os.makedirs(version_dir, exist_ok=True)
    os.makedirs(model_dir, exist_ok=True)
    os.makedirs(ct2_model_dir, exist_ok=True)

    print(f"Base result directory: {base_result_dir}")
    print(f"Task directory: {task_dir}")
    print(f"Version directory: {version_dir}")
    print(f"HuggingFace Model directory: {model_dir}")
    print(f"CTranslate2 Model directory: {ct2_model_dir}")
    return version_dir, model_dir, ct2_model_dir


current_version = "v1_inference"
version_dir, hf_model_save_path, ct2_model_save_path = setup_directories(version=current_version)

Base result directory: model_result
Task directory: model_result/task1
Version directory: model_result/task1/v1_inference
HuggingFace Model directory: model_result/task1/v1_inference/model
CTranslate2 Model directory: model_result/task1/v1_inference/ct2_model
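
The same layout can also be expressed with `pathlib`, which creates intermediate directories in one call. This is only a sketch of an alternative, run inside a temporary directory so it leaves nothing behind; `result_layout` and its dictionary keys are hypothetical names, not something the notebook defines.

```python
import tempfile
from pathlib import Path


def result_layout(root: Path, task_name: str, version: str) -> dict:
    """Create model_result/<task>/<version>/{model,ct2_model} under root."""
    version_dir = root / "model_result" / task_name / version
    layout = {
        "version": version_dir,
        "model": version_dir / "model",          # Hugging Face model files
        "ct2_model": version_dir / "ct2_model",  # CTranslate2 model files
    }
    for path in layout.values():
        path.mkdir(parents=True, exist_ok=True)  # parents=True also creates missing ancestors
    return layout


with tempfile.TemporaryDirectory() as tmp:
    for name, path in result_layout(Path(tmp), "task1", "v1_inference").items():
        print(name, path.relative_to(tmp))
```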


In [None]:

validation_audio_files = [
    "/content/test_audio_1.wav",  # Colab's /content/ directory
    "/content/test_audio_2.wav"
]
print(f"Using test audio files (make sure they are uploaded): {validation_audio_files}")


if 'hf_model_save_path' not in globals():
    version_dir, hf_model_save_path, ct2_model_save_path = setup_directories(version="v_nozip_test")


Using test audio files (make sure they are uploaded): ['/content/test_audio_1.wav', '/content/test_audio_2.wav']


In [None]:
try:
    MODEL_ID_FROM_HUB = "openai/whisper-base"
    processor = WhisperProcessor.from_pretrained(MODEL_ID_FROM_HUB)
    print(f"WhisperProcessor loaded from the Hugging Face Hub ({MODEL_ID_FROM_HUB}).")
except Exception as e:
    print(f"Failed to load WhisperProcessor from the Hugging Face Hub ({MODEL_ID_FROM_HUB}): {e}")
    processor = None

def prepare_dataset(audio_files_list: List[str], dummy_texts: List[str] = None):

    if dummy_texts is None:
        dummy_texts = [""] * len(audio_files_list)
    if len(audio_files_list) != len(dummy_texts):
        raise ValueError("The audio file list and the text list must have the same length!")
    val_dict = { "audio": audio_files_list, "text": dummy_texts }
    dataset = Dataset.from_dict(val_dict)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    return dataset

def prepare_features(batch, current_processor, language_code="en"):

    if not current_processor:
        raise ValueError("Processor has not been initialized!")
    audio = batch["audio"]
    batch["input_features"] = current_processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features[0]
    batch["labels"] = current_processor.tokenizer(
        batch["text"], max_length=448, truncation=True, padding="max_length"
    ).input_ids
    batch["file_path"] = audio["path"]
    if audio["path"]:
      batch["audio_file_name"] = os.path.splitext(os.path.basename(audio["path"]))[0]
    else:
      batch["audio_file_name"] = "unknown_file"
    return batch

TARGET_LANGUAGE = "en"

WhisperProcessor loaded from the Hugging Face Hub (openai/whisper-base).


In [None]:
# Cell 6


MODEL_ID_FROM_HUB = "openai/whisper-base"
try:
    processor = WhisperProcessor.from_pretrained(MODEL_ID_FROM_HUB)
    print(f"WhisperProcessor loaded from the Hugging Face Hub ({MODEL_ID_FROM_HUB}).")
except Exception as e:
    print(f"Failed to load WhisperProcessor from the Hugging Face Hub ({MODEL_ID_FROM_HUB}): {e}")
    processor = None


def prepare_dataset(audio_files_list: List[str], dummy_texts: List[str] = None):
    """
    Build a Hugging Face Dataset from a list of audio paths and an optional list of texts.
    If dummy_texts is not provided, empty strings are used.
    """
    if dummy_texts is None:
        dummy_texts = [""] * len(audio_files_list)

    if len(audio_files_list) != len(dummy_texts):
        raise ValueError("The audio file list and the text list must have the same length!")

    val_dict = {
        "audio": audio_files_list,
        "text": dummy_texts
    }
    dataset = Dataset.from_dict(val_dict)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    return dataset


def prepare_features(batch, current_processor, language_code="en"):  # defaults to English
    """Compute log-mel input features and encode the target text as label IDs."""
    if not current_processor:
        raise ValueError("Processor has not been initialized!")

    # The Hugging Face Dataset decodes the audio automatically on access
    audio = batch["audio"]

    # Compute the log-mel input features
    batch["input_features"] = current_processor.feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        # language=language_code,  # the tokenizer takes a language; the feature extractor usually does not
        return_tensors="pt"
    ).input_features[0]  # take the first element to drop the batch dimension


    batch["labels"] = current_processor.tokenizer(
        batch["text"],  # the dummy text supplied by prepare_dataset
        max_length=448,  # matches the setting in the image
        truncation=True,
        padding="max_length"  # or "longest"
    ).input_ids

    # Keep the file path and base name (optional, for tracking)
    batch["file_path"] = audio["path"]
    if audio["path"]:  # audio["path"] can be None
        batch["audio_file_name"] = os.path.splitext(os.path.basename(audio["path"]))[0]
    else:
        batch["audio_file_name"] = "unknown_file"

    return batch

# Language setting (adjust to your model and data, e.g. "en", "zh")
TARGET_LANGUAGE = "en"
print(f"Target language set to: {TARGET_LANGUAGE}")

WhisperProcessor loaded from the Hugging Face Hub (openai/whisper-base).
Target language set to: en
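
For orientation, the shape of each `input_features` entry follows from the feature extractor's settings. The sketch below assumes the usual Whisper defaults for `openai/whisper-base` (a 30-second window, 16 kHz audio, a hop length of 160 samples, 80 mel bins); these values are stated as assumptions here rather than read from the loaded processor, and `whisper_feature_shape` is a hypothetical helper.

```python
def whisper_feature_shape(n_mels: int = 80,
                          chunk_seconds: int = 30,
                          sampling_rate: int = 16000,
                          hop_length: int = 160) -> tuple:
    """Shape of one log-mel feature matrix: (mel bins, frames)."""
    n_samples = chunk_seconds * sampling_rate  # every clip is padded or trimmed to this length
    n_frames = n_samples // hop_length         # one frame per hop
    return (n_mels, n_frames)


print(whisper_feature_shape())  # (80, 3000)
```

If those assumptions hold, a short clip and a long clip both produce the same fixed-size matrix, because padding happens before the mel transform.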


In [None]:
if 'validation_audio_files' in globals() and validation_audio_files and 'processor' in globals() and processor:
    print(f"Preparing the 'val' dataset from {len(validation_audio_files)} audio files...")
    raw_val_dataset = prepare_dataset(validation_audio_files)
    dataset_dict = {"val": raw_val_dataset}

    processed_dataset = {}
    for split in ["val"]:
        if split in dataset_dict and len(dataset_dict[split]) > 0:  # only map non-empty splits
            print(f"Processing the {split} dataset...")
            processed_dataset[split] = dataset_dict[split].map(
                prepare_features,
                fn_kwargs={"current_processor": processor, "language_code": TARGET_LANGUAGE},
                remove_columns=dataset_dict[split].column_names,
                num_proc=1
            )
            print(f"Finished processing the {split} dataset. Resulting columns: {processed_dataset[split].column_names}")
        elif split in dataset_dict and len(dataset_dict[split]) == 0:
            print(f"The {split} dataset is empty; skipping map.")
            processed_dataset[split] = dataset_dict[split]  # keep the empty Dataset
        else:
            print(f"The {split} dataset was not found.")
elif not ('validation_audio_files' in globals() and validation_audio_files):
    print("Note: `validation_audio_files` is undefined or empty; skipping dataset processing.")
    processed_dataset = {'val': Dataset.from_dict({"audio":[], "text":[]})}  # empty placeholder with the same structure
else:
    # Reached only when the audio list is fine but the processor is missing
    print("Error: 'processor' did not load. Check the earlier cells.")

Preparing the 'val' dataset from 2 audio files...
Processing the val dataset...


Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Finished processing the val dataset. Resulting columns: ['input_features', 'labels', 'file_path', 'audio_file_name']


In [None]:

def convert_hf_model_to_ct2(model_name_or_path: str,
                              output_dir: str,
                              quantization: str = "float16",
                              trust_remote_code: bool = True,
                              force: bool = True):

    print(f"Converting model '{model_name_or_path}' to CTranslate2 format...")
    print(f"Output directory: {output_dir}")
    print(f"Quantization: {quantization}")
    files_to_copy = ['preprocessor_config.json']  # only used if copy_files is enabled below

    converter = TransformersConverter(
        model_name_or_path=model_name_or_path,  # a Hub ID is passed here
        # copy_files=files_to_copy,  # may matter less, or be handled by the converter, when model_name_or_path is a Hub ID
        trust_remote_code=trust_remote_code,
    )
    converter.convert(output_dir=output_dir, quantization=quantization, force=force)
    print(f"Model converted and saved to {output_dir}")

# --- Run the conversion ---
# MODEL_ID_FROM_HUB comes from cell 6
# ct2_model_save_path comes from cell 3
print(f"\nConverting the Hugging Face Hub model ({MODEL_ID_FROM_HUB}) to CTranslate2 format...")
try:
    convert_hf_model_to_ct2(
        model_name_or_path=MODEL_ID_FROM_HUB,
        output_dir=ct2_model_save_path,
        quantization="float16",
        trust_remote_code=True
    )
except Exception as e:
    print(f"CTranslate2 model conversion failed: {e}")


Converting the Hugging Face Hub model (openai/whisper-base) to CTranslate2 format...
Converting model 'openai/whisper-base' to CTranslate2 format...
Output directory: model_result/task1/v1_inference/ct2_model
Quantization: float16


config.json:   0%|          | 0.00/1.98k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/290M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.81k [00:00<?, ?B/s]

Model converted and saved to model_result/task1/v1_inference/ct2_model
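
The download log above reports `model.safetensors` at 290M. A back-of-the-envelope estimate of what `quantization="float16"` saves is sketched below. It assumes the checkpoint stores 4-byte float32 weights and ignores non-weight files; both are assumptions rather than measured values, and `estimate_quantized_mb` is a hypothetical helper.

```python
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def estimate_quantized_mb(fp32_size_mb: float, quantization: str) -> float:
    """Rough weight size after quantization, given the float32 size in MB."""
    weights_in_millions = fp32_size_mb / BYTES_PER_WEIGHT["float32"]
    return weights_in_millions * BYTES_PER_WEIGHT[quantization]


print(estimate_quantized_mb(290, "float16"))  # 145.0
print(estimate_quantized_mb(290, "int8"))     # 72.5
```

The actual size of the converted directory can be checked on disk; the estimate is only meant to show why float16 roughly halves it.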


In [None]:
# Cell 9: run ASR with the Hugging Face pipeline, then use WhisperX align for precise timestamps

from torch.utils.data import DataLoader
import json
import os
from tqdm import tqdm
import torch
import whisperx  # used only for loading audio and alignment
from transformers import pipeline  # used for ASR

# --- Check and load the required variables and models ---

# 1. Load the Hugging Face ASR pipeline (used for the initial transcription)
hf_asr_pipeline = None
print("\n--- Preparing the Hugging Face ASR pipeline ---")
if 'MODEL_ID_FROM_HUB' in globals() and MODEL_ID_FROM_HUB and 'device' in globals():
    try:
        hf_asr_pipeline = pipeline(
            "automatic-speech-recognition",
            model=MODEL_ID_FROM_HUB,    # defined in cell 6 (e.g. "openai/whisper-base")
            device=str(device),         # defined in cell 4 ("cuda" or "cpu")
            chunk_length_s=30,          # Whisper works on 30-second windows
            # return_timestamps="word"  # or True for segment-level timestamps
        )
        print(f"Hugging Face ASR pipeline ({MODEL_ID_FROM_HUB}) loaded.")
    except Exception as e:
        print(f"Failed to load the Hugging Face ASR pipeline: {e}")
else:
    print("Error: MODEL_ID_FROM_HUB or device is undefined. Cannot load the Hugging Face ASR pipeline.")


# 2. Load the WhisperX alignment model (only once)
align_model_global = None
align_metadata_global = None
print("\n--- Preparing the WhisperX alignment model ---")
if 'TARGET_LANGUAGE' in globals() and TARGET_LANGUAGE and 'device' in globals():
    try:
        print(f"Loading the WhisperX alignment model for language '{TARGET_LANGUAGE}'...")
        align_model_global, align_metadata_global = whisperx.load_align_model(language_code=TARGET_LANGUAGE, device=str(device))
        print("WhisperX alignment model loaded.")
    except Exception as e:
        print(f"Failed to load the WhisperX alignment model for language {TARGET_LANGUAGE}: {e}")
        print("Precise timestamp alignment will not be available.")
else:
    print("Error: TARGET_LANGUAGE or device is undefined. Cannot load the WhisperX alignment model.")


# --- Define the inference and save function ---
def calculate_output_with_hf_pipeline_and_align(
    asr_pipeline_instance,  # the Hugging Face ASR pipeline
    align_model,
    align_meta,
    dataloader,
    output_version_dir,
    target_lang_code="en"  # used in the pipeline's generate_kwargs and recorded in the JSON output
):
    print("\n--- Computing outputs (HF pipeline + WhisperX align) ---")
    predictions_texts = []
    filenames_list = []
    export_json_data = {}

    if not asr_pipeline_instance:
        print("Error: the Hugging Face ASR pipeline is not loaded; cannot run inference.")
        return
    if not align_model or not align_meta:
        print("Warning: the WhisperX alignment model is not loaded; timestamps will not be precisely aligned.")

    for batch in tqdm(dataloader, desc="Processing audio files", unit="file"):
        audio_file_path = batch["file_path"][0]
        audio_file_name = batch["audio_file_name"][0]

        full_transcript_for_task1 = f"ERROR_PROCESSING_{audio_file_name}"
        final_word_segments_for_task2 = []
        detected_language_for_json = target_lang_code  # the language is forced in the pipeline call, not detected

        with torch.no_grad():
            try:

                pipeline_output = asr_pipeline_instance(
                    audio_file_path,  # the pipeline accepts a file path directly
                    generate_kwargs={"language": target_lang_code, "task": "transcribe"},
                    return_timestamps="word"  # word-level chunks with preliminary timestamps
                )

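                full_transcript_from_pipeline = pipeline_output["text"]
                initial_segments_for_align = []  # segments in the format whisperx.align expects

                if "chunks" in pipeline_output and pipeline_output["chunks"]:
                    for chunk in pipeline_output["chunks"]:
                        initial_segments_for_align.append({
                            "text": chunk["text"].strip(),  # whisperx.align needs plain text
                            "start": chunk["timestamp"][0],
                            "end": chunk["timestamp"][1]  # can be None when Whisper predicts no ending timestamp
                        })
                else:  # The pipeline returned no chunks, so build one segment covering the whole file.
                      # whisperx.align reads "start" and "end" from every segment.
                    print(f"  Note: the HF pipeline returned no 'chunks' for {audio_file_name}. "
                          "Aligning against the full transcript (alignment quality may suffer).")

                    # whisperx.load_audio resamples to 16 kHz, so the sample count gives the duration
                    duration_s = len(whisperx.load_audio(audio_file_path)) / 16000
                    initial_segments_for_align = [{"text": full_transcript_from_pipeline, "start": 0.0, "end": duration_s}]


                full_transcript_for_task1 = full_transcript_from_pipeline  # default Task 1 output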


                if align_model and align_meta and initial_segments_for_align:
                    audio_waveform_for_align = whisperx.load_audio(audio_file_path)  # align needs the waveform
                    try:

                        alignment = whisperx.align(
                            initial_segments_for_align,
                            align_model,
                            align_meta,
                            audio_waveform_for_align,
                            device=str(device)  # must be a string
                        )
                        # Use the aligned text if the result carries a "text" key; otherwise keep the pipeline transcript
                        full_transcript_for_task1 = alignment.get("text", full_transcript_for_task1)

                        if "word_segments" in alignment and alignment["word_segments"]:
                            for word_info in alignment["word_segments"]:
                                if word_info.get("word") is not None and word_info.get("start") is not None and word_info.get("end") is not None:
                                    final_word_segments_for_task2.append({
                                        "text": word_info.get("word", ""),
                                        "timestamp": (word_info.get("start"), word_info.get("end")),
                                        "score": word_info.get("score", None)
                                    })
                        else:
                             print(f"  Note: no 'word_segments' found in the WhisperX align result for {audio_file_name}.")
                             # Without aligned words, fall back to the preliminary pipeline chunks
                             if "chunks" in pipeline_output and pipeline_output["chunks"]:
                                 for chunk in pipeline_output["chunks"]:
                                     final_word_segments_for_task2.append({
                                         "text": chunk["text"].strip(),
                                         "timestamp": chunk["timestamp"],
                                         "score": None  # the HF pipeline gives no word alignment score
                                     })


                    except Exception as e_align:
                        print(f"  WhisperX align failed for {audio_file_name}: {e_align}. Task 1 uses the initial transcript; Task 2 timestamps come from the initial transcript (if any).")
                        # If alignment fails, fall back to the pipeline's word timestamps (if available)
                        if "chunks" in pipeline_output and pipeline_output["chunks"]:
                            for chunk in pipeline_output["chunks"]:
                                final_word_segments_for_task2.append({
                                    "text": chunk["text"].strip(),
                                    "timestamp": chunk["timestamp"],
                                    "score": None
                                })
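                else:  # alignment skipped (model not loaded or no initial segments)
                    print(f"  Note: WhisperX align was skipped for {audio_file_name}. Task 1 uses the initial transcript; Task 2 timestamps come from the initial transcript (if any).")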
                    if "chunks" in pipeline_output and pipeline_output["chunks"]:
                         for chunk in pipeline_output["chunks"]:
                            final_word_segments_for_task2.append({
                                "text": chunk["text"].strip(),
                                "timestamp": chunk["timestamp"],
                                "score": None
                            })

            except Exception as e_pipeline_transcribe:
                print(f"  HF pipeline transcription or the main flow failed for {audio_file_name}: {e_pipeline_transcribe}")
                # full_transcript_for_task1 already holds the error marker
                final_word_segments_for_task2 = []

            predictions_texts.append(full_transcript_for_task1)
            filenames_list.append(audio_file_name)
            export_json_data[audio_file_name] = {
                "language": detected_language_for_json,
                "full_text_for_task1": full_transcript_for_task1,
                "word_timestamps_for_task2": final_word_segments_for_task2
            }

    # --- Save the results ---
    task1_output_filepath = os.path.join(output_version_dir, "task1_answer_hf_pipeline_align.txt")
    print(f"\nSaving Task 1 predictions ({len(predictions_texts)} entries) to: {task1_output_filepath}")
    with open(task1_output_filepath, "w", encoding="utf-8") as f:
        for pred_text in predictions_texts:
            f.write(str(pred_text).strip() + "\n")

    json_output_filepath = os.path.join(output_version_dir, "val_detailed_timestamps_hf_pipeline_align.json")
    print(f"Saving detailed timestamp information to: {json_output_filepath}")
    with open(json_output_filepath, "w", encoding="utf-8") as f:
        json.dump(export_json_data, f, ensure_ascii=False, indent=2)

    print("All predictions and timestamps saved.")


# --- Build the DataLoader and run inference ---
print("\n--- Starting the inference flow (HF pipeline + WhisperX align) ---")

if ('processed_dataset' in globals() and 'val' in processed_dataset and
    processed_dataset['val'] and len(processed_dataset['val']) > 0 and
    'hf_asr_pipeline' in globals() and hf_asr_pipeline and  # now uses hf_asr_pipeline
    'align_model_global' in globals() and  # the alignment model is optional, but preferred
    'version_dir' in globals() and version_dir and
    'TARGET_LANGUAGE' in globals() and TARGET_LANGUAGE):

    columns_to_keep = ["file_path", "audio_file_name"]
    inference_dataset = None
    try:
        inference_dataset = processed_dataset['val'].select_columns(columns_to_keep)
    except ValueError as e:
        print(f"Error selecting columns from processed_dataset['val']: {e}.")
        print(f"Available columns: {processed_dataset['val'].column_names if processed_dataset['val'] else 'N/A'}")

    if inference_dataset and len(inference_dataset) > 0:
        eval_dataloader = DataLoader(inference_dataset, batch_size=1)

        calculate_output_with_hf_pipeline_and_align(
            asr_pipeline_instance=hf_asr_pipeline,  # the HF pipeline
            align_model=align_model_global,
            align_meta=align_metadata_global,
            dataloader=eval_dataloader,
            output_version_dir=version_dir,
            target_lang_code=TARGET_LANGUAGE
        )
    else:
        print("Error: could not build an inference_dataset with valid data. Cannot run inference.")
else:
    print("Error: required variables (processed_dataset['val'], hf_asr_pipeline, version_dir, TARGET_LANGUAGE) are not ready, or there is no audio data.")
    print("Run all previous cells in order.")


--- Preparing the Hugging Face ASR pipeline ---


Device set to use cuda


Hugging Face ASR pipeline (openai/whisper-base) loaded.

--- Preparing the WhisperX alignment model ---
Loading the WhisperX alignment model for language 'en'...


Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|██████████| 360M/360M [00:01<00:00, 192MB/s]


WhisperX alignment model loaded.

--- Starting the inference flow (HF pipeline + WhisperX align) ---

--- Computing outputs (HF pipeline + WhisperX align) ---


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
Processing audio files:  50%|█████     | 1/2 [00:02<00:02,  2.95s/file]

  WhisperX align failed for test_audio_1: tensors used as indices must be long, int, byte or bool tensors. Task 1 uses the initial transcript; Task 2 timestamps come from the initial transcript (if any).


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
Processing audio files: 100%|██████████| 2/2 [00:04<00:00,  2.08s/file]

  WhisperX align failed for test_audio_2: tensors used as indices must be long, int, byte or bool tensors. Task 1 uses the initial transcript; Task 2 timestamps come from the initial transcript (if any).

Saving Task 1 predictions (2 entries) to: model_result/task1/v1_inference/task1_answer_hf_pipeline_align.txt
Saving detailed timestamp information to: model_result/task1/v1_inference/val_detailed_timestamps_hf_pipeline_align.json
All predictions and timestamps saved.
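
The warning `Whisper did not predict an ending timestamp` in the log above means a word chunk can come back with `None` as its end time. Whether that is what triggers the alignment error is not established here, but segments with missing times are worth filtering out before alignment in any case. A minimal sketch, where `usable_segments` is a hypothetical helper and the chunk layout (`text`, `timestamp`) mirrors what the cell above reads from the pipeline output:

```python
from typing import Any, Dict, List


def usable_segments(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only chunks that have text and both a start and an end time."""
    segments = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        if start is None or end is None or not text:
            continue  # skip chunks an aligner could not place on the waveform
        segments.append({"text": text, "start": float(start), "end": float(end)})
    return segments


# Made-up chunks for illustration only; not taken from the test audio
sample_chunks = [
    {"text": " hello", "timestamp": (0.0, 0.4)},
    {"text": " world", "timestamp": (0.4, None)},
]
print(usable_segments(sample_chunks))
```

Dropping a chunk loses its word from the aligned output, so an alternative under the same assumptions would be to substitute the audio duration for a missing end time instead of skipping the chunk.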



