Description
1. Did you update?
Yes. Installed fresh in clean runtime:
pip install --upgrade unsloth unsloth_zoo
2. Colab or Kaggle or local / cloud?
Colab Pro (paid user)
3. Number of GPUs used (nvidia-smi)?
None. The crash happens even with a CPU-only runtime. No GPU involved.
4. Which notebook?
Unsloth Official Whisper Notebook
5. Paste Unsloth printout with 🦥
The Colab runtime crashes instantly before this can print. I never see the 🦥 printout. The notebook dies with:
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
I tried:
import os
os.environ['PYDEVD_DISABLE_FILE_VALIDATION'] = '1'
But the crash still occurs.
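Since the runtime dies before anything useful is printed, one way to get more information might be Python's standard `faulthandler` module, which dumps a traceback when the interpreter hits a fatal signal such as a segfault. This is only a sketch: the log path is a hypothetical choice, and it would not capture anything if the process is killed by the system for running out of memory.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path would do.
log_path = os.path.join(tempfile.gettempdir(), "crash_traceback.log")

# Keep a reference so the file stays open for the life of the process.
crash_log = open(log_path, "w")

# Pass a real file object: notebook kernels may replace sys.stderr with
# a stream that has no file descriptor, which faulthandler cannot use.
faulthandler.enable(file=crash_log, all_threads=True)
```

Running this in the first cell and reading the log file after the runtime restarts may show which call was executing at the time of the crash.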
6. Which trainer? SFTTrainer, GRPOTrainer etc?
No training starts. Crash happens before any trainer is invoked.
7. Minimal code to reproduce (no HF token):
from datasets import load_dataset, Audio
import tqdm

# Dataset is small (~15k examples)
dataset = load_dataset("Private dataset", split="train+test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.train_test_split(test_size=0.06)

model.generation_config.language = "<||>"
model.generation_config.task = "transcribe"
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None

def formatting_prompts_func(example):
    audio_arrays = example['path']['array']
    sampling_rate = example["path"]["sampling_rate"]
    features = tokenizer.feature_extractor(audio_arrays, sampling_rate=sampling_rate)
    tokenized_text = tokenizer.tokenizer(example["text"])
    return {
        "input_features": features.input_features[0],
        "labels": tokenized_text.input_ids,
    }

train_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset['train'], desc='Train split')]
test_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset['test'], desc='Test split')]

What I Already Tried
- Restarted runtime and used fresh notebook
- Set PYDEVD_DISABLE_FILE_VALIDATION = 1 to bypass debug validation
- Tried using .map() instead of list comprehension (same crash)
- Dataset is ~15k examples, not large
- No GPU / CUDA involved
- Crash is instant and consistent before training starts
- Suspect cause might be Unsloth + Whisper generation config or tokenizer/feature interaction
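One unconfirmed possibility worth ruling out is host RAM: the dataset has few examples, but the list comprehension keeps every example's extracted features in memory at once. The sketch below is a back-of-the-envelope estimate, assuming each example becomes an 80 x 3000 float32 log-mel array (the usual padded 30-second shape for most Whisper checkpoints; large-v3 uses 128 mel bins). These shape and dtype values are assumptions, not measurements from this notebook.

```python
# Rough memory estimate for holding all extracted features in a Python list.
# Assumed (not measured): 80 mel bins x 3000 frames per example, float32.

def estimate_feature_memory_gib(num_examples, n_mels=80, n_frames=3000, bytes_per_value=4):
    """Approximate size in GiB of num_examples feature arrays."""
    total_bytes = num_examples * n_mels * n_frames * bytes_per_value
    return total_bytes / 1024**3

total_examples = 15_000   # "~15k examples" from this report
test_fraction = 0.06      # test_size passed to train_test_split

test_examples = round(total_examples * test_fraction)
train_examples = total_examples - test_examples

train_gib = estimate_feature_memory_gib(train_examples)
test_gib = estimate_feature_memory_gib(test_examples)
print(f"train: {train_gib:.1f} GiB, test: {test_gib:.1f} GiB")
```

Under these assumptions the train split alone comes to roughly 12.6 GiB, so comparing that figure against the runtime's available RAM would help confirm or rule out memory exhaustion.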
Request
Please confirm if:
- unsloth fully supports Whisper + audio datasets
- This crash is known or related to tokenizer/feature_extractor + patched model
- Any fix or stable workaround is available
Thanks for the great work, happy to help debug or test!