[Crash] Colab Instantly Crashes with Whisper + unsloth — Small Dataset, CPU Only, No Traceback #2575

### 1. Did you update?

Yes. I installed the latest versions in a clean runtime:

```bash
pip install --upgrade unsloth unsloth_zoo
```
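Because the runtime dies before Unsloth prints its banner, the installed versions can be read without importing the packages at all. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list is an assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Reads package metadata only, so it works even if `import unsloth` itself crashes.
print(installed_versions(["unsloth", "unsloth_zoo", "transformers", "datasets", "torch"]))
```

Pasting this output into the report stands in for the missing banner.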

---

### 2. Colab or Kaggle or local / cloud?

**Colab Pro** (paid user)

---

### 3. Number of GPUs used (`nvidia-smi`)?

**None.** The crash happens even on a CPU-only runtime, so no GPU is involved.

---

### 4. Which notebook?

[Unsloth Official Whisper Notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Whisper.ipynb)

---

### 5. Paste `Unsloth` printout with :sloth:

The Colab runtime crashes **instantly**, before the `:sloth:` printout can appear, so I never see it. The only output before the notebook dies is:

```
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
```

I tried setting the variable that message mentions:

```python
import os
os.environ['PYDEVD_DISABLE_FILE_VALIDATION'] = '1'
```

But the crash still occurs.

---

### 6. Which trainer? `SFTTrainer`, `GRPOTrainer` etc?

No training starts. Crash happens before any trainer is invoked.

---

### 7. Minimal code to reproduce (no HF token):

```python
from datasets import load_dataset, Audio
import tqdm

# `model` and `tokenizer` are not defined in this snippet; they are assumed to
# come from the notebook's earlier model-loading cell.

# Dataset is small (~15k examples)
dataset = load_dataset("Private dataset", split="train+test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.train_test_split(test_size=0.06)

model.generation_config.language = "<||>"
model.generation_config.task = "transcribe"
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None

def formatting_prompts_func(example):
    # Convert one example into Whisper input features plus tokenized labels.
    audio_arrays = example["path"]["array"]
    sampling_rate = example["path"]["sampling_rate"]
    features = tokenizer.feature_extractor(audio_arrays, sampling_rate=sampling_rate)
    tokenized_text = tokenizer.tokenizer(example["text"])
    return {
        "input_features": features.input_features[0],
        "labels": tokenized_text.input_ids,
    }

# Both list comprehensions keep every processed example in memory at once.
train_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset["train"], desc="Train split")]
test_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset["test"], desc="Test split")]
```
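For triage, it may help to estimate how much RAM the two list comprehensions above would hold at once. The sketch below assumes Whisper's usual feature shape of 80 mel bins by 3000 frames stored as float32 (128 bins for large-v3 checkpoints). These figures are assumptions about the feature extractor, not measurements from this notebook, and `estimate_feature_memory_gib` is a hypothetical helper.

```python
def estimate_feature_memory_gib(num_examples, mel_bins=80, frames=3000, bytes_per_value=4):
    """Rough RAM needed to keep every `input_features` array in a Python list."""
    bytes_per_example = mel_bins * frames * bytes_per_value
    return num_examples * bytes_per_example / 1024**3


# ~15k examples with test_size=0.06 leaves roughly 94% of them for training.
train_examples = 15_000 * 94 // 100

print(f"80 mel bins:  {estimate_feature_memory_gib(train_examples):.1f} GiB")
print(f"128 mel bins: {estimate_feature_memory_gib(train_examples, mel_bins=128):.1f} GiB")
```

Under these assumptions the training split alone comes to about 12.6 GiB with 80 mel bins, which can be compared against the RAM shown in Colab's resource panel.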

---

### ✅ What I Already Tried

- Restarted the runtime and used a fresh notebook
- Set `PYDEVD_DISABLE_FILE_VALIDATION=1` to bypass the debugger's file validation
- Used `.map()` instead of the list comprehensions (same crash)
- Confirmed the dataset is not large (~15k examples)
- Confirmed no GPU / CUDA is involved
- Confirmed the crash is instant and consistent, and happens before training starts
- My suspicion is that the cause is the Unsloth + Whisper generation config or the tokenizer / feature-extractor interaction
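Since the crash leaves no traceback, one more thing that could be tried is the standard library's `faulthandler`, which writes a Python traceback when the process receives a fatal signal such as SIGSEGV or SIGABRT. The sketch below is an assumption about what might help, not a confirmed diagnosis; `enable_crash_diagnostics`, `peak_rss_mib`, and the log file name are hypothetical. It would not capture anything if the kernel is killed with SIGKILL, as happens when the system runs out of memory, which is why it also reports peak memory.

```python
import faulthandler
import resource


def enable_crash_diagnostics(log_path="crash_trace.log"):
    """Dump Python tracebacks for all threads to `log_path` on fatal signals."""
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


def peak_rss_mib():
    """Peak resident memory of this process in MiB (ru_maxrss is in KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


log_file = enable_crash_diagnostics()
print(f"Peak memory so far: {peak_rss_mib():.0f} MiB")
```

Running this in the first cell and calling `peak_rss_mib()` periodically inside the preprocessing loop would show whether memory climbs steadily before the runtime dies.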

---

### Request

Please confirm whether:
- `unsloth` fully supports Whisper with audio datasets
- This crash is known, or is related to the `tokenizer` / `feature_extractor` and the patched model
- A fix or stable workaround is available

Thanks for the great work. I'm happy to help debug or test!

