# Library

In [1]:
!pip install transformers datasets torchaudio accelerate --upgrade



In [2]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset



# Model

In [3]:
# Detect device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Model ID
model_id = "openai/whisper-large-v3-turbo"

In [4]:
# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Build the ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)


Device set to use cpu


# Predict

In [5]:
import librosa

audio_path = "Kamu Harus Punya Hal Ini Biar Gak Diganti AI_processed.wav"

# librosa.load resamples to 16 kHz (sr=16000) and downmixes to mono by default
waveform, sample_rate = librosa.load(audio_path, sr=16000)

# Format for the Hugging Face pipeline
sample = {"array": waveform, "sampling_rate": sample_rate}

# Run transcription with timestamps
print("⏳ Transcribing with timestamps...")

result = pipe(
    sample,
    return_timestamps=True,       # <- Enable timestamp output
    chunk_length_s=30,            # <- Optional: force chunking to 30s per segment
    stride_length_s=(5, 5),       # <- Overlap between chunks
)

# Print result
print("✅ Hasil Transkripsi:")
segments = result["chunks"]
for seg in segments:
    start, end = seg["timestamp"]
    # The last chunk's end timestamp may be None if the audio stops mid-segment
    end_label = "?" if end is None else round(end, 2)
    text = seg["text"].strip()
    print(f"🕒 {round(start, 2):>6}s - {end_label:>6}s: {text}")
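To see what `chunk_length_s=30` and `stride_length_s=(5, 5)` imply for a long recording, the sketch below lays out the chunk windows in plain Python. It assumes each window advances by the chunk length minus both strides (20 s here) and that the first and last chunks have no overlap on their outer edge. `chunk_windows` is a hypothetical helper for illustration, not a transformers function.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=(5.0, 5.0)):
    """Return (start, end, left_overlap, right_overlap) per chunk, in seconds.

    Assumption: windows advance by chunk_s - left - right, the usual layout
    for overlapping chunked inference.
    """
    left, right = stride_s
    step = chunk_s - left - right
    if step <= 0:
        raise ValueError("chunk_s must be larger than the sum of both strides")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        is_last = start + chunk_s >= total_s
        windows.append((
            start,
            end,
            0.0 if start == 0 else left,   # no left overlap on the first chunk
            0.0 if is_last else right,     # no right overlap on the last chunk
        ))
        if is_last:
            break
        start += step
    return windows


# Example: a 70-second recording
for window in chunk_windows(70.0):
    print(window)
```

Under these assumptions, larger strides give the model more shared context at each boundary but require more chunks to cover the same audio.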


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
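The warnings above point at the `task` and `language` options. A minimal sketch of passing them through the pipeline's `generate_kwargs` follows. `build_generate_kwargs` is a hypothetical helper, and the language value is only an assumption about the recording, so substitute whatever the audio actually contains.

```python
def build_generate_kwargs(language=None, task="transcribe"):
    """Assemble Whisper generation options for the ASR pipeline.

    language=None leaves language detection to the model.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


# Example values only; the language here is an assumption about the audio.
generate_kwargs = build_generate_kwargs(language="english")
print(generate_kwargs)

# With the `pipe` and `sample` objects from the cells above:
# result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
```

Setting the language explicitly skips automatic detection, which can help when a recording mixes languages or opens with music or silence.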


⏳ Transcribing with timestamps...


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


✅ Hasil Transkripsi:
🕒    0.0s -   25.2s: I want to show you two video clips. One of them is real and one of them is fake. Look at these two clips. One of them is real and one of them is fake. Can you tell which one it is? Living at the dangers of AI. About AI. Artificial intelligence. The threat AI poses to the social. I've never seen it quite like this. This technology is spreading rapidly. It's really mind-blowing. Deep fakes. Deep fakes.
🕒   25.2s -  28.46s: Deep Tom Cruise was a tipping point for deep fakes.
🕒  28.46s -  31.88s: We're increasingly in a world where AI is everywhere.
🕒  31.88s -  35.08s: But do we actually know what's really going on?
🕒  35.08s -  37.24s: Let's dig into what AI really is with me,
🕒  37.24s -  105.6s: Krystal Wijaya at Klas Bakar. Yeah, AI has actually been around for many years now. Coming from the tech field, I started doing machine learning with a Python script in my laptop maybe 20 years ago. But today, we're increasingly exposed to AI in the wil