To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to Hugging Face format via `standardize_sharegpt`
3. Trains on completions / assistant responses only via `train_on_responses_only`
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
# Fix for fp16 vs fp32 error; taken from https://github.com/unslothai/unsloth/issues/1333
!pip uninstall -y triton
!pip install triton==2.3.1

Found existing installation: triton 3.1.0
Uninstalling triton-3.1.0:
  Successfully uninstalled triton-3.1.0
Collecting triton==2.3.1
  Downloading triton-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading triton-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.1/168.1 MB 6.5 MB/s eta 0:00:00
Installing collected packages: triton
Successfully installed triton-2.3.1


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
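
To get a rough feel for why 4-bit loading matters on a GPU with about 15 GB of memory, the sketch below estimates weight storage alone from a parameter count and a bit width. The helper name `weight_memory_gb` and the 3.2 billion parameter figure are illustrative assumptions; real checkpoints also hold quantization constants and some unquantized tensors, so actual sizes will differ.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3.2e9  # assumed parameter count, for illustration only
for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

This only covers the frozen base weights; optimizer state, activations, and the LoRA adapters come on top.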

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit", # NOTE: RetRobust uses text completion, NOT instruct model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 2.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/121 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

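We now add LoRA adapters so that only a fraction of all parameters is updated (typically 1 to 10%). Because this run also trains `embed_tokens` and `lm_head` for continued pretraining and uses a high rank (`r = 128`), the trainable count is larger: the training log below reports 982,515,712 trainable parameters.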

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM


  offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)


Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2024.11.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
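
As a rough illustration of the configuration above: a LoRA adapter on a linear layer of shape `d_out x d_in` adds two matrices with `r * (d_in + d_out)` parameters in total, and the low-rank update is scaled by `lora_alpha / r` in standard LoRA or by `lora_alpha / sqrt(r)` with rank-stabilized LoRA. The helper names and the 3072-wide layer in this sketch are assumptions for illustration, not values read from the model.

```python
import math

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

r, lora_alpha = 128, 32   # values from the cell above
d_in = d_out = 3072       # assumed square projection, for illustration only

print(lora_param_count(d_in, d_out, r))             # adapter parameters for one layer
print(lora_scale(lora_alpha, r, use_rslora=False))  # standard LoRA scale
print(lora_scale(lora_alpha, r, use_rslora=True))   # rank-stabilized scale
```

With `r = 128` and `lora_alpha = 32`, the rank-stabilized scale is noticeably larger than the standard one, which is worth keeping in mind when choosing a learning rate.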


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
from datasets import Dataset
import pandas as pd

# Load the .csv files from Drive and build a dict of the form
# {"text": ["text1", "text2", "text3"]}
data_dir = "/content/drive/MyDrive/data"
csv_names = [
    # Our data
    "nq_random_own.csv",
    "nq_top10_own.csv",
    "nq_top1_own.csv",
    # RetRobust data
    "nq_retrobust.csv",
    "strategyqa_retrobust.csv",
    "wikihop_retrobust.csv",
]

EOS_TOKEN = tokenizer.eos_token

def clean_example(text):
    # Cast to str in case a cell is not a string, then replace the "###"
    # example separator with the EOS token so the model learns where to stop.
    return str(text).replace(' ###', EOS_TOKEN).replace('\n###', EOS_TOKEN).strip()

data_dict = { "text": [] }
for name in csv_names:
    df = pd.read_csv(f"{data_dir}/{name}")
    data_dict["text"].extend(clean_example(text) for text in df["text"])

dataset = Dataset.from_dict(data_dict)

def formatting_prompts_func(examples):
    # Pass-through for now; add any extra per-example formatting here.
    return { "text" : [example for example in examples["text"]] }

dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/4758 [00:00<?, ? examples/s]

In [25]:
for row in dataset[:5]["text"]:
    print("=========================")
    print(row)

Context1: Howcast | How to Start Dead Lifting: Chapter 4: If you want your back to look strong, picking up an activity like deadlifting could be beneficial for you. Many bodybuilders and athletes integrate deadlifts into their routines to increase strength and muscle mass in the hamstrings, glutes, and lower back.
Question: what are the main muscles worked during deadlifts
Are follow up questions needed here: No.
So the final answer is: Hamstrings, glutes, and lower back<|end_of_text|>
Context1: Stanley Cup Finals: The Stanley Cup Finals refers to the championship series of the National Hockey League. It is a best-of-seven playoff series and is commonly referred to as the Stanley Cup Final.
Question: how many games in the stanley cup finals
Are follow up questions needed here: No.
So the final answer is: Best of seven<|end_of_text|>
Context1: Gallery of Woman (1979): "Woman" is a 1979 song by British singer John Lennon from his studio album "Double Fantasy". The song was written as an 

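And we see that each example is plain text ending with the EOS token (`<|end_of_text|>`). No chat template is applied, since this notebook uses a base (text completion) model.

**[Notice]** If you switch to Llama 3.1 Instruct, its default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!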
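
The separator handling in the data cell above can be tried on a toy string without loading a tokenizer. This is a minimal sketch: the function name `replace_separator` is hypothetical, the sample text is made up, and the EOS string is assumed to be `<|end_of_text|>` as seen in the printed samples.

```python
EOS_TOKEN = "<|end_of_text|>"  # assumed EOS string, matching the printed samples

def replace_separator(text: str, eos_token: str = EOS_TOKEN) -> str:
    """Swap every ' ###' or newline + '###' occurrence for the EOS token."""
    return str(text).replace(" ###", eos_token).replace("\n###", eos_token).strip()

# Made-up example in the same layout as the training rows.
sample = "Question: example question\nSo the final answer is: example answer\n###"
print(replace_separator(sample))
```

Ending each example with the EOS token gives the model a stopping signal after the final answer.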

<a name="Train"></a>
### Continued Pretraining
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for one full epoch (`num_train_epochs = 1`). For a quick test, set `max_steps` instead (for example `max_steps = 20`).

Also set `embedding_learning_rate` to a value 2x to 10x smaller than `learning_rate` to make continued pretraining work! Here it is 10x smaller (`5e-6` versus `5e-5`).
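
To see how long a full epoch is before launching it, the step count can be estimated from the dataset size and batch settings. The sketch below assumes a single GPU and that optimizer steps are counted as whole gradient-accumulation groups per epoch, which is consistent with the `Total steps = 297` line in this run's training log. The helper name `steps_per_epoch` is hypothetical, and the warmup rounding (ceiling) is an assumption.

```python
import math

def steps_per_epoch(num_examples: int, per_device_batch: int, grad_accum: int) -> int:
    """Optimizer steps in one epoch on a single device."""
    # Batches per epoch, keeping the last partial batch.
    batches = math.ceil(num_examples / per_device_batch)
    # Only whole accumulation groups count as optimizer steps.
    return max(batches // grad_accum, 1)

num_examples = 4758                # dataset size from the Map progress line
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_train_epochs = 1
warmup_ratio = 0.1

total_steps = num_train_epochs * steps_per_epoch(
    num_examples, per_device_train_batch_size, gradient_accumulation_steps
)
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding

print(f"effective batch size = {per_device_train_batch_size * gradient_accumulation_steps}")
print(f"total steps = {total_steps}, warmup steps = {warmup_steps}")
```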

In [26]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        warmup_ratio = 0.1,
        num_train_epochs = 1,

        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=8):   0%|          | 0/4758 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
6.947 GB of memory reserved.


In [27]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,758 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 297
 "-____-"     Number of trainable parameters = 982,515,712


Unsloth: Setting lr = 5.00e-06 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 5.00e-06 instead of 5.00e-05 for lm_head.


| Step | Training Loss |
|------|---------------|
| 1 | 2.4184 |
| 2 | 2.5404 |
| 3 | 2.3069 |
| 4 | 2.3991 |
| 5 | 2.4268 |
| 6 | 2.6133 |
| 7 | 2.1828 |
| 8 | 2.1326 |
| 9 | 2.2478 |
| 10 | 2.0645 |


In [28]:
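#@title Show final memory and time stats
# NOTE: run the "Show current memory stats" cell first; it defines start_gpu_memory and max_memory.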
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined

<a name="Inference"></a>
### Inference
Let's run the model! The prompt is a few-shot, text-completion prompt: you can change the context and question in the last example, and leave its final answer blank!

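The template notebook recommends `min_p = 0.1` and `temperature = 1.5`; read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why. The cell below does not pass either value to `model.generate`, so add them to `generation_kwargs` if you want to try them.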
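
For intuition about what these two settings do, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering: tokens whose probability falls below `min_p` times the top probability are dropped, and the rest are renormalized. The function name `min_p_filter` and the toy logits are assumptions for illustration; this is not the library's implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only tokens at least min_p times as likely as the most likely token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

toy_logits = [5.0, 4.0, 0.0]  # made-up scores for three tokens
print(min_p_filter(toy_logits))
```

A higher temperature flattens the distribution, while the min-p cutoff removes the unlikely tail that the flattening would otherwise expose.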

In [29]:
from transformers import TextIteratorStreamer
from threading import Thread
text_streamer = TextIteratorStreamer(tokenizer)
import textwrap
max_print_width = 100

from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)

inputs = tokenizer(
[
"""Given the following question, answer it by providing follow up questions and intermediate answers. If intermediate questions are not necessary, answe the question directly. You are provided with evidence that can help you arrive at the answer before the question.
#
Context1: The Big Red One: Fuller was a World War II veteran and served with the 1st Infantry Division, which is nicknamed "The Big Red One" for the red numeral "1" on the division's shoulder patch. He received the Silver Star, Bronze Star, and Purple Heart during his service.
Question: how did the big red one get its name
Are follow up questions needed here: No.
So the final answer is: its shoulder patch
#
Context1: Location Map of Cayman Islands: The given Cayman Islands location map shows that the Cayman Islands are located in the western Caribbean Sea. Location Map of Cayman Islands. Where is Cayman ...
Question: where are the cayman islands on the map
Are follow up questions needed here: No.
So the final answer is: western Caribbean Sea
#
Context1: Korean War | Combatants, Summary, Years, Map ... - Britannica: After more than a million combat casualties had been suffered on both sides, the fighting ended in July 1953 with Korea still divided into two hostile states. Negotiations in 1954 produced no further agreement, and the front line has been accepted ever since as the de facto boundary between North and South Korea.
Question: who won the war between north korea and south korea
Are follow up questions needed here: No.
So the final answer is: technically still at war
#
Context1: It's Always Sunny in Philadelphia (season 13): The thirteenth season of the American comedy television series It's Always Sunny in Philadelphia premiered on FXX on September 5, 2018. ... The season consists of ...
Question: when does it's always sunny in philadelphia season 13 start
Are follow up questions needed here: No.
So the final answer is: September 5, 2018
#
Context1: You've Got a Friend in Me: "You've Got a Friend in Me" is a song by Randy Newman. Used as the theme song for the 1995 Disney/Pixar animated film Toy Story, it has since become a major ...
Question: who sang you got a friend in me from toy story
Are follow up questions needed here: No.
So the final answer is: Randy Newman
#
Context1: April 1961: Yuri Gagarin from the Soviet Union was the first human in space. His vehicle, Vostok 1 circled Earth at a speed of 27,400 kilometers per hour with the flight lasting 108 minutes.
Question: when was the first person sent to space
Are follow up questions needed here: No.
So the final answer is: 12 April 1961
#
Context1: marc benioff spent 13 years at oracle, before launching salesforce.
Question: where did the ceo of salesforce previously work?
Are follow up questions needed here: No.
So the final answer is:"""
]*1, return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")
    pass
pass

<|begin_of_text|>Given the following question, answer it by providing follow up questions and
intermediate answers. If intermediate questions are not necessary, answe the question directly. You
are provided with evidence that can help you arrive at the answer before the question. # Context1:
The Big Red One: Fuller was a World War II veteran and served with the 1st Infantry Division, which
is nicknamed "The Big Red One" for the red numeral "1" on the division's shoulder patch. He received
the Silver Star, Bronze Star, and Purple Heart during his service. Question: how did the big red one
get its name Are follow up questions needed here: No. So the final answer is: its shoulder patch #
Context1: Location Map of Cayman Islands: The given Cayman Islands location map shows that the
Cayman Islands are located in the western Caribbean Sea. Location Map of Cayman Islands. Where is
Cayman... Question: where are the cayman islands on the map Are follow up questions needed here: No.
So the final

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [30]:
#model.save_pretrained("lora_model") # Local saving
#tokenizer.save_pretrained("lora_model")

from google.colab import userdata

model.push_to_hub("scoliono/retrobust_plusplus_combined_3b", token = userdata.get('hf_token')) # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

README.md:   0%|          | 0.00/579 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Saved model to https://huggingface.co/scoliono/retrobust_plusplus_combined_3b


Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
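
Conceptually, a merged save folds each low-rank update back into its base weight, `W' = W + scale * (B @ A)`, so the result loads like an ordinary model with no adapter. The sketch below shows that arithmetic on tiny nested lists; the helper names `matmul` and `merge_lora` and all values are illustrative assumptions, not the library's code.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(W, A, B, scale):
    """Return W + scale * (B @ A) without modifying the inputs."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight (2 x 2)
B = [[1.0], [2.0]]            # toy LoRA B (2 x 1), rank 1
A = [[3.0, 4.0]]              # toy LoRA A (1 x 2)

print(merge_lora(W, A, B, scale=0.5))
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint is larger than the adapter-only upload.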

In [31]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)

from google.colab import userdata
if True: model.push_to_hub_merged("scoliono/retrobust_plusplus_combined_3b_f16", tokenizer, save_method = "merged_16bit", token = userdata.get('hf_token'))

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Unsloth: You are pushing to hub, but you passed your HF username = scoliono.
We shall truncate scoliono/retrobust_plusplus_combined_3b_f16 to retrobust_plusplus_combined_3b_f16
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.11 out of 12.67 RAM for saving.


 25%|██▌       | 7/28 [00:00<00:01, 19.37it/s]We will save to Disk and not RAM now.
100%|██████████| 28/28 [00:29<00:00,  1.06s/it]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving retrobust_plusplus_combined_3b_f16/pytorch_model-00001-of-00002.bin...
Unsloth: Saving retrobust_plusplus_combined_3b_f16/pytorch_model-00002-of-00002.bin...


README.md:   0%|          | 0.00/585 [00:00<?, ?B/s]

  0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/2.25G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/scoliono/retrobust_plusplus_combined_3b_f16
