[Crash] Colab Instantly Crashes with Whisper + unsloth: Small Dataset, CPU Only, No Traceback #2575

@C0deXG

Description

1. Did you update?

Yes. Installed fresh in a clean runtime:

pip install --upgrade unsloth unsloth_zoo

2. Colab or Kaggle or local / cloud?

Colab Pro (paid user)


3. Number of GPUs used (nvidia-smi)?

None. The crash happens even on a CPU-only runtime, so no GPU is involved.


4. Which notebook?

Unsloth Official Whisper Notebook


5. Paste Unsloth printout with 🦥

The Colab runtime crashes instantly, before this can print, so I never see the 🦥 printout. The notebook dies with:

0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.

I tried:

import os
os.environ['PYDEVD_DISABLE_FILE_VALIDATION'] = '1'

But the crash still occurs.
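
The PYDEVD line appears to be an informational note from the debugger rather than the cause of the crash, which would explain why silencing it changes nothing. Since the runtime dies without a Python traceback, one way to try to capture more detail is the standard-library `faulthandler` module, which dumps Python stack frames when a fatal signal such as SIGSEGV arrives. A minimal sketch, assuming a log file in the temp directory (`crash_trace.log` is a hypothetical name) that is still readable after the runtime restarts:

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
LOG_PATH = os.path.join(tempfile.gettempdir(), "crash_trace.log")

# Keep the file object alive for the life of the process: faulthandler
# writes to its file descriptor when a fatal signal arrives.
crash_log = open(LOG_PATH, "w")
faulthandler.enable(file=crash_log, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
print("crash traces will be written to", LOG_PATH)
```

This would go in the first notebook cell, before any other import. If the log is empty after a crash, the process was probably killed from outside (for example with SIGKILL when memory runs out), which `faulthandler` cannot intercept.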


6. Which trainer? SFTTrainer, GRPOTrainer etc?

No training starts. Crash happens before any trainer is invoked.


7. Minimal code to reproduce (no HF token):

from datasets import load_dataset, Audio
import tqdm

# Dataset is small (~15k examples)
dataset = load_dataset("Private dataset", split="train+test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.train_test_split(test_size=0.06)

model.generation_config.language = "<||>"
model.generation_config.task = "transcribe"
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None

def formatting_prompts_func(example):
    audio_arrays = example['path']['array']
    sampling_rate = example["path"]["sampling_rate"]
    features = tokenizer.feature_extractor(audio_arrays, sampling_rate=sampling_rate)
    tokenized_text = tokenizer.tokenizer(example["text"])
    return {
        "input_features": features.input_features[0],
        "labels": tokenized_text.input_ids,
    }

train_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset['train'], desc='Train split')]
test_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset['test'], desc='Test split')]
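
The two list comprehensions above build every example's features in RAM before training starts. A possible alternative is to format each example only when it is requested. The sketch below is an assumption for illustration: `LazyFormattedDataset` is a hypothetical helper, not part of Unsloth or `datasets`, and the stand-in data and formatter exist only so the block runs on its own.

```python
from typing import Any, Callable, Dict, Sequence


class LazyFormattedDataset:
    """Map-style dataset that formats one example per access.

    Hypothetical helper, not part of Unsloth or datasets.
    """

    def __init__(self, examples: Sequence[Any],
                 format_fn: Callable[[Any], Dict[str, Any]]):
        self.examples = examples
        self.format_fn = format_fn

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # Features are computed here, on demand, and are not cached.
        return self.format_fn(self.examples[index])


# Stand-in data and formatter so the sketch runs on its own; in the
# notebook these would be dataset['train'] and formatting_prompts_func.
demo_examples = [{"text": "hello"}, {"text": "world"}]


def demo_format(example):
    return {"labels": [ord(c) for c in example["text"]]}


lazy_train = LazyFormattedDataset(demo_examples, demo_format)
print(len(lazy_train), lazy_train[1]["labels"][:2])
```

In the notebook this would be `LazyFormattedDataset(dataset['train'], formatting_prompts_func)`. Whether the trainer accepts such an object is untested here. The trade-off is that features are recomputed on every access instead of being held in memory.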

✅ What I Already Tried

  • Restarted runtime and used fresh notebook
  • Set PYDEVD_DISABLE_FILE_VALIDATION = 1 to bypass debug validation
  • Tried using .map() instead of list comprehension (same crash)
  • Dataset is ~15k examples, not large
  • No GPU / CUDA involved
  • Crash is instant and consistent before training starts
  • Suspect cause might be Unsloth + Whisper generation config or tokenizer/feature interaction
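
On the "not large" point, a back-of-envelope check may be worth doing, because the example count alone does not show how much RAM the preprocessed features need. The sketch below assumes (this is not from the report) that the Whisper feature extractor pads each clip to 30 s and returns a float32 log-mel array of 80 x 3000 values, or 128 x 3000 for large-v3 checkpoints:

```python
# Assumptions (not from the report): 80 mel bins x 3000 frames per
# example, stored as float32; 128 mel bins for large-v3.
N_FRAMES = 3000
BYTES_PER_FLOAT32 = 4


def feature_bytes(num_examples: int, n_mels: int = 80) -> int:
    """Approximate RAM held by input_features alone."""
    return num_examples * n_mels * N_FRAMES * BYTES_PER_FLOAT32


total_examples = 15_000  # "~15k examples" from the report
# test_size=0.06 leaves 94% of the examples in the train split.
train_examples = total_examples - round(total_examples * 0.06)

for n_mels in (80, 128):
    gib = feature_bytes(train_examples, n_mels) / 2**30
    print(f"n_mels={n_mels}: ~{gib:.1f} GiB for the train split")
```

Under those assumptions the train split alone needs roughly 12.6 GiB with 80 mel bins, before counting the labels, the test split, or the decoded audio. Depending on the runtime tier, that may be enough to exhaust Colab's RAM, and a process killed for running out of memory typically leaves no Python traceback. This is only one hypothesis and does not rule out a problem in the patched model.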

Request

Please confirm if:

  • unsloth fully supports Whisper + audio datasets
  • This crash is known or related to tokenizer/feature_extractor + patched model
  • Any fix or stable workaround is available

Thanks for the great work. Happy to help debug or test!
