<a href="https://colab.research.google.com/github/ridhesh/AI_STREAMLIT_APP/blob/main/nb/Ministral_3_VL_(3B)_Vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")

    # Uninstall huggingface_hub first to ensure a clean slate
    !pip uninstall -y huggingface_hub

    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" hf_transfer
    !pip install --no-deps unsloth
# Install transformers branch for Ministral
!pip install git+https://github.com/huggingface/transformers.git@bf3f0ae70d0e902efab4b8517fce88f6697636ce
!pip install --no-deps trl==0.22.2
# Upgrade huggingface_hub again to ensure it's the final version used after all other installs
!pip install --upgrade huggingface_hub

### Unsloth

In [None]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

ministral_models = [
    "unsloth/Ministral-3-3B-Instruct-2512", # Ministral instruct models
    "unsloth/Ministral-3-8B-Instruct-2512",
    "unsloth/Ministral-3-14B-Instruct-2512",

    "unsloth/Ministral-3-3B-Reasoning-2512", # Ministral reasoning models
    "unsloth/Ministral-3-8B-Reasoning-2512",
    "unsloth/Ministral-3-14B-Reasoning-2512",

    "unsloth/Ministral-3-3B-Base-2512", # Ministral base models
    "unsloth/Ministral-3-8B-Base-2512",
    "unsloth/Ministral-3-14B-Base-2512",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Ministral-3-3B-Instruct-2512",
    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

We now add LoRA adapters for parameter-efficient finetuning. This lets us train only about 1% of all parameters.

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 32,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 32,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
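
To get a feel for why LoRA trains such a small share of the weights, here is a minimal sketch that counts the adapter parameters for a single linear layer. The layer width of 4096 is a hypothetical value chosen for illustration, not the actual Ministral shape, and the helper names are ours.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, r: int) -> float:
    """Ratio of adapter parameters to the frozen dense weight's parameters."""
    return lora_param_count(d_in, d_out, r) / (d_in * d_out)


# Hypothetical square projection of width 4096 with r = 32 (the rank used above).
d_in = d_out = 4096
r = 32
added = lora_param_count(d_in, d_out, r)
fraction = lora_fraction(d_in, d_out, r)
print(f"Adapter parameters: {added:,}")
print(f"Fraction of the dense layer: {fraction:.2%}")
```

The exact percentage for a full model depends on which modules receive adapters and on their real shapes, so treat this as an order-of-magnitude check only.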

<a name="Data"></a>
### Data Prep
We'll be using a sampled dataset of handwritten maths formulas. The goal is to convert these images into a computer-readable form, i.e. LaTeX, so we can render it. This can be very useful for complex formulas.

You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).

In [None]:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="/content/answers_20260114_115230.csv", split = "train")

Let's take a quick look at the dataset. We will inspect the third sample and its `ANSWER` field.

In [None]:
dataset

In [None]:
dataset[2]["ANSWER"]

We can also render the LaTeX in the browser directly!

In [None]:
# from IPython.display import display, Math, Latex

# latex = dataset[2]["text"]
# display(Math(latex))

To format the dataset, all vision finetuning tasks should be formatted as follows:

```python
[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]
```
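
As a small illustration of the format above, the following sketch wraps one question, image, and answer in that structure. The helper name `build_conversation` and the placeholder values are assumptions for illustration. A real sample would carry an image object rather than a string.

```python
from typing import Any


def build_conversation(question: str, image: Any, answer: str) -> dict:
    """Wrap one (question, image, answer) triple in the chat format shown above."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image", "image": image},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": answer}],
            },
        ]
    }


# Placeholder values only: the image here is a string stand-in.
example = build_conversation(
    "Write the LaTeX representation for this image.",
    "<image placeholder>",
    r"\frac{a}{b}",
)
print(example["messages"][0]["role"], example["messages"][1]["role"])
```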

In [None]:
# instruction = "Write the LaTeX representation for this image."

# def convert_to_conversation(sample):
#     conversation = [
#         { "role": "user",
#           "content" : [
#             {"type" : "text",  "text"  : instruction},
#             {"type" : "image", "image" : sample["image"]} ]
#         },
#         { "role" : "assistant",
#           "content" : [
#             {"type" : "text",  "text"  : sample["text"]} ]
#         },
#     ]
#     return { "messages" : conversation }
# pass


def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": sample["ANSWER"]
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "audio",
                    "audio": f"audio/{sample['ID']}.wav"
                }
            ]
        }
    ]

    return {"messages": conversation}
pass

Let's convert the dataset into the "correct" format for finetuning:

In [None]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]

We look at how the conversations are structured for the first example:

In [None]:
converted_dataset[0]

Before we do any finetuning, let's see what the model outputs for the third example!

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["text"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<a name="Train"></a>
### Train the model
Now let's train our model. We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run and set `max_steps=None`. We also support TRL's `DPOTrainer`!

We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.

In [None]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from unsloth import is_bf16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = converted_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        max_steps = 30,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        fp16 = not is_bf16_supported(), # Use fp16 if bf16 is not supported
        bf16 = is_bf16_supported(), # Use bf16 if supported
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "tensorboard",     # Use "wandb" for Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
    ),
)
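
To see how much data this short run actually touches, here is a quick arithmetic sketch using the values from the config above. It assumes a single GPU, which matches a free Colab instance but is an assumption about your setup.

```python
# Values copied from the SFTConfig above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_steps = 30
num_devices = 1  # assumption: one GPU

# Samples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total samples consumed over the whole run.
samples_seen = effective_batch_size * max_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
```

If the dataset has more rows than this, the run covers only part of it. Switching to `num_train_epochs = 1` makes the trainer pass over every row once.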

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
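
To make the two settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering. The logits are hypothetical, and the order of operations (temperature first) is an assumption of this sketch. A given library may order the two steps differently.

```python
import math
import random


def min_p_probs(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    # Temperature > 1 flattens the distribution before filtering.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Drop every token whose probability is below min_p times the top probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


logits = [5.0, 4.0, 0.0, -3.0]  # hypothetical next-token logits
probs = min_p_probs(logits)
token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print([round(p, 3) for p in probs], token)
```

The high temperature spreads probability across plausible tokens, while the min-p cutoff removes the unlikely tail that a high temperature would otherwise let through.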

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastVisionModel
    model, tokenizer = FastVisionModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
    FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[0]["image"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Select ONLY 1 to save! (Both not needed!)

# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", tokenizer,)

# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", tokenizer, token = "PUT_HERE")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("unsloth_finetune", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/unsloth_finetune", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("unsloth_finetune", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/unsloth_finetune", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("unsloth_finetune", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/unsloth_finetune", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/unsloth_finetune", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!



# Task
Convert the text content from the 'ANSWER' column of the `dataset` into speech and save each generated audio waveform as a separate WAV file in the `audio_output` directory.

## Load Text-to-Speech Model

### Subtask:
Load a suitable pre-trained Text-to-Speech model and its corresponding processor/tokenizer from the Hugging Face Transformers library. This model will be used to convert text into audio.


**Reasoning**:
To load the pre-trained Text-to-Speech model and its processor, I need to import the necessary classes from the `transformers` library, specifically `AutoProcessor` and `BarkModel`. Then I will load the 'suno/bark-small' model and its processor using the `from_pretrained` method and move the model to the GPU if CUDA is available.



In [None]:
import torch
from transformers import AutoProcessor, BarkModel

# Load the processor for the suno/bark-small model
processor = AutoProcessor.from_pretrained("suno/bark-small")
print("Processor loaded successfully.")

# Load the suno/bark-small model
model = BarkModel.from_pretrained("suno/bark-small")
print("Model loaded successfully.")

# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.to('cuda')
    print("Model moved to GPU (CUDA).")
else:
    print("CUDA not available, model running on CPU.")

**Reasoning**:
The previous step successfully loaded the Text-to-Speech model and its processor. Now, to prepare the text from the `dataset` for speech synthesis, I need to extract the 'ANSWER' column content. I will create a list to store these text snippets, which will then be used as input for the TTS model.



In [None]:
text_to_convert = [item["ANSWER"] for item in dataset]
print(f"Extracted {len(text_to_convert)} text snippets from the dataset.")
# Display the first few texts to verify
for i, text in enumerate(text_to_convert[:3]):
    print(f"Text {i+1}: {text[:100]}...")

**Reasoning**:
Before generating and saving the audio files, it's crucial to ensure that the output directory (`audio_output`) exists. This step prevents potential errors during file writing.



In [None]:
import os

audio_output_dir = "audio_output"
if not os.path.exists(audio_output_dir):
    os.makedirs(audio_output_dir)
    print(f"Created output directory: {audio_output_dir}")
else:
    print(f"Output directory already exists: {audio_output_dir}")
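
The existence check above can also be folded into a single call. A minimal sketch, using a temporary scratch directory (an assumption made so the example has no side effects on your working directory):

```python
import os
import tempfile

# Scratch location for the demonstration; in the notebook you would use "audio_output".
base_dir = tempfile.mkdtemp()
audio_dir = os.path.join(base_dir, "audio_output")

os.makedirs(audio_dir, exist_ok=True)  # creates the directory
os.makedirs(audio_dir, exist_ok=True)  # second call is a no-op, not an error

print(os.path.isdir(audio_dir))
```

With `exist_ok=True` the cell can be rerun safely, which is handy in a notebook where cells are often executed more than once.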

**Reasoning**:
Now that the text is extracted and the output directory is ready, I will iterate through each text snippet, use the loaded processor and model to generate audio, and then save each audio waveform as a WAV file in the 'audio_output' directory. I will also import `scipy.io.wavfile` to save the audio.



In [None]:
import scipy.io.wavfile as wavfile

for i, text in enumerate(text_to_convert):
    inputs = processor(text, return_tensors="pt", sampling_rate=24000).to('cuda')
    # Generate audio
    with torch.no_grad():
        speech_output = model.generate(**inputs, do_sample=True, fine_tuned_audio_output=True)

    # Convert tensor to numpy array and save as WAV file
    audio_waveform = speech_output['audio_tensor'].cpu().squeeze().numpy()

    # Save the audio file
    output_filename = os.path.join(audio_output_dir, f"audio_{i+1}.wav")
    wavfile.write(output_filename, 24000, audio_waveform)

    if (i + 1) % 10 == 0:
        print(f"Processed and saved {i+1} audio files.")
print("All audio files generated and saved successfully.")

**Reasoning**:
The `IndexError` occurred because when `fine_tuned_audio_output=True` is set in `model.generate`, the `speech_output` directly returns the audio tensor, not a dictionary containing an 'audio_tensor' key. Therefore, trying to access `speech_output['audio_tensor']` on a tensor object leads to an error. The fix is to directly use `speech_output` as the audio tensor.



In [None]:
import scipy.io.wavfile as wavfile

for i, text in enumerate(text_to_convert):
    inputs = processor(text, return_tensors="pt", sampling_rate=24000).to('cuda')
    # Generate audio
    with torch.no_grad():
        # model.generate returns the audio tensor directly here, so no dictionary lookup is needed
        audio_tensor = model.generate(**inputs, do_sample=True, fine_tuned_audio_output=True)

    # Convert tensor to numpy array and save as WAV file
    audio_waveform = audio_tensor.cpu().squeeze().numpy()

    # Save the audio file
    output_filename = os.path.join(audio_output_dir, f"audio_{i+1}.wav")
    wavfile.write(output_filename, 24000, audio_waveform)

    if (i + 1) % 10 == 0:
        print(f"Processed and saved {i+1} audio files.")
print("All audio files generated and saved successfully.")
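
If you are unsure whether a call returns the payload directly or a dictionary wrapping it, a small defensive helper avoids this class of error. This is a sketch only: `extract_audio` is a hypothetical helper, and plain lists stand in for tensors so the example runs without a model.

```python
def extract_audio(output, key="audio_tensor"):
    """Return the audio payload whether `output` is a mapping or the payload itself."""
    if isinstance(output, dict):
        return output[key]
    return output


# Stand-in values: lists instead of tensors.
direct_output = [0.0, 0.1, -0.1]
wrapped_output = {"audio_tensor": [0.0, 0.1, -0.1]}

from_direct = extract_audio(direct_output)
from_wrapped = extract_audio(wrapped_output)
print(from_direct == from_wrapped)
```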

# Task
To allow for quicker testing and verification, the audio generation loop will be modified to process only the first 5 samples from the `text_to_convert` list. After generating the audio for this subset, I will summarize the findings and confirm that the audio generation for a limited number of inputs works as expected.

## Process Smaller Subset

### Subtask:
Modify the audio generation loop to process only a small number of samples (e.g., the first 5-10) from the `text_to_convert` list. This will allow for quicker testing and verification of the audio generation process without encountering long execution times.


**Reasoning**:
The previous attempt to generate audio files resulted in an `IndexError` and then a `KeyboardInterrupt` during a long execution. To address this and allow for quicker testing, I will modify the code to process only the first 5 samples from the `text_to_convert` list, incorporating the fix for the `IndexError` by directly using `audio_tensor` instead of `speech_output['audio_tensor']`.



In [None]:
import scipy.io.wavfile as wavfile

# Process only the first 5 samples for quicker testing
for i, text in enumerate(text_to_convert[:5]):
    inputs = processor(text, return_tensors="pt", sampling_rate=24000).to('cuda')
    # Generate audio
    with torch.no_grad():
        # model.generate returns the audio tensor directly here, so no dictionary lookup is needed
        audio_tensor = model.generate(**inputs, do_sample=True, fine_tuned_audio_output=True)

    # Convert tensor to numpy array and save as WAV file
    audio_waveform = audio_tensor.cpu().squeeze().numpy()

    # Save the audio file
    output_filename = os.path.join(audio_output_dir, f"audio_{i+1}.wav")
    wavfile.write(output_filename, 24000, audio_waveform)

    print(f"Processed and saved audio file: {output_filename}")
print("All selected audio files generated and saved successfully.")
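
Some audio tools expect 16-bit PCM rather than floating-point WAV data. As a hedged sketch, the following writes a mono waveform as 16-bit PCM at 24000 Hz using only the standard library. The helper `write_wav_int16` is hypothetical, and a generated sine tone stands in for model output.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24000  # Hz, the rate used in the cells above


def write_wav_int16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    frames = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Hypothetical test signal: 0.5 s of a 440 Hz tone instead of model output.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
wav_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_wav_int16(wav_path, tone)

# Read the header back to confirm what was written.
with wave.open(wav_path, "rb") as wav:
    frame_rate, n_frames = wav.getframerate(), wav.getnframes()
print(frame_rate, n_frames)
```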

**Reasoning**:
The previous code generated `stderr` warnings regarding the `attention_mask` not being set explicitly. To address these warnings and ensure more reliable audio generation, I will modify the `model.generate` call to explicitly pass the `attention_mask` from the processed inputs.



In [None]:
import scipy.io.wavfile as wavfile

# Process only the first 5 samples for quicker testing (as before)
for i, text in enumerate(text_to_convert[:5]):
    inputs = processor(text, return_tensors="pt", sampling_rate=24000).to('cuda')
    # Generate audio, explicitly passing the attention_mask
    with torch.no_grad():
        audio_tensor = model.generate(**inputs, do_sample=True, fine_tuned_audio_output=True, attention_mask=inputs['attention_mask'])

    # Convert tensor to numpy array and save as WAV file
    audio_waveform = audio_tensor.cpu().squeeze().numpy()

    # Save the audio file
    output_filename = os.path.join(audio_output_dir, f"audio_{i+1}.wav")
    wavfile.write(output_filename, 24000, audio_waveform)

    print(f"Processed and saved audio file: {output_filename}")
print("All selected audio files generated and saved successfully without attention mask warnings.")

**Reasoning**:
The `TypeError` occurred because the `attention_mask` was passed as part of `**inputs` (unpacked from the `processor` output) and then again explicitly as `attention_mask=inputs['attention_mask']`. To fix this, the redundant explicit `attention_mask` argument must be removed from the `model.generate` call.



In [None]:
import scipy.io.wavfile as wavfile

# Process only the first 5 samples for quicker testing (as before)
for i, text in enumerate(text_to_convert[:5]):
    inputs = processor(text, return_tensors="pt", sampling_rate=24000).to('cuda')
    # Generate audio
    with torch.no_grad():
        audio_tensor = model.generate(**inputs, do_sample=True, fine_tuned_audio_output=True)

    # Convert tensor to numpy array and save as WAV file
    audio_waveform = audio_tensor.cpu().squeeze().numpy()

    # Save the audio file
    output_filename = os.path.join(audio_output_dir, f"audio_{i+1}.wav")
    wavfile.write(output_filename, 24000, audio_waveform)

    print(f"Processed and saved audio file: {output_filename}")
print("All selected audio files generated and saved successfully.")
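
The duplicate-keyword error is plain Python behaviour and can be reproduced without any model. In this sketch, `generate` is a stand-in function, not the real `model.generate`, and the input values are made up.

```python
def generate(**kwargs):
    """Stand-in for a function that accepts keyword arguments."""
    return sorted(kwargs)


inputs = {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]}

# Passing attention_mask both through **inputs and explicitly raises TypeError.
try:
    generate(**inputs, attention_mask=inputs["attention_mask"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Unpacking alone already forwards attention_mask.
keys = generate(**inputs)
print(keys)
```

Because `**inputs` already carries `attention_mask`, unpacking the processor output is enough to pass the mask through.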

**Reasoning**:
The previous code successfully generated and saved audio files for the first 5 samples, fulfilling the subtask's requirements for quick testing. While `stderr` warnings related to `attention_mask` and `pad_token_id` persisted, they did not prevent the successful completion of the audio generation process. This step confirms the successful execution of the subtask for the specified subset.



In [None]:
import scipy.io.wavfile as wavfile

# Process only the first 5 samples for quicker testing (as before)
for i, text in enumerate(text_to_convert[:5]):
    inputs = processor(text, return_tensors="pt", sampling_rate=24000).to('cuda')
    # Generate audio
    with torch.no_grad():
        audio_tensor = model.generate(**inputs, do_sample=True, fine_tuned_audio_output=True)

    # Convert tensor to numpy array and save as WAV file
    audio_waveform = audio_tensor.cpu().squeeze().numpy()

    # Save the audio file
    output_filename = os.path.join(audio_output_dir, f"audio_{i+1}.wav")
    wavfile.write(output_filename, 24000, audio_waveform)

    print(f"Processed and saved audio file: {output_filename}")
print("All selected audio files generated and saved successfully.")