# Trained on RunPod.io

- GPU - RTX 3090 24GB / A5000 24 GB
- RAM - 21 GB 
- HDD - 200 GB

Price: 0.50 $/hour

## 4-bit training

- training took approx. 15 minutes = 0.11 $

## 16-bit merged model

- merge took approx. 2 minutes = 0.02 $
- push took approx. 2 minutes = 0.02 $
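
The costs listed above follow from the hourly price and the run time. A minimal sketch of that arithmetic, where `run_cost` is a hypothetical helper and not part of the notebook; note that exactly 15 minutes at 0.50 $/hour gives 0.125 $, so the 0.11 $ figure corresponds to a run of roughly 13 minutes:

```python
def run_cost(price_per_hour: float, minutes: float) -> float:
    """Cost in dollars of renting a machine for `minutes` at `price_per_hour`."""
    return price_per_hour * minutes / 60.0


PRICE_PER_HOUR = 0.50  # RunPod price quoted in this section

training_cost = run_cost(PRICE_PER_HOUR, 15)  # 4-bit training, approx. 15 minutes
merge_cost = run_cost(PRICE_PER_HOUR, 2)      # merge (or push), approx. 2 minutes

print(f"training: {training_cost:.3f} $")
print(f"merge:    {merge_cost:.3f} $")
```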


# Inference on TGI 
https://ui.endpoints.huggingface.co/

GPU - L4 24GB VRAM

Price: 0.8 $/hour

In [1]:
import torch
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

print(torch.cuda.is_available())

print(torch.cuda.current_device())

ModuleNotFoundError: No module named 'torch'

# Install libraries

In [2]:
%pip install torch
%pip install bitsandbytes
%pip install accelerate
%pip install transformers
%pip install peft
%pip install datasets
%pip install evaluate
%pip install trl
%pip install matplotlib
%pip install tensorboard
%pip install sentencepiece
%pip install hf_transfer

Defaulting to user installation because normal site-packages is not writeable
Collecting torch
  Downloading torch-2.8.0-cp39-none-macosx_11_0_arm64.whl (73.6 MB)
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting fsspec
  Downloading fsspec-2025.10.0-py3-none-any.whl (200 kB)
Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
Collecting networkx
  Downloading networkx-3.2.1-py3-none-any.whl (1.6 MB)
Collecting sympy>=1.13.3
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536

In [1]:
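import os

from huggingface_hub import login

# Read the Hugging Face token from the HF_TOKEN environment variable
# instead of hard-coding it in the notebook
API_TOKEN = os.environ["HF_TOKEN"]
login(token=API_TOKEN)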



### Load the model

In [4]:
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

In [23]:
from transformers import (
    BitsAndBytesConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
)
import torch

# Model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    dtype=torch.float16,
    device_map="auto",
)
print(base_model.get_input_embeddings())
print(base_model.get_output_embeddings())
print("Model Vocabulary Size:", base_model.config.vocab_size)

base_tokenizer = AutoTokenizer.from_pretrained(model_name)
print("before", len(base_tokenizer))
base_tokenizer.add_special_tokens({"pad_token": "<pad>"})
print("after", len(base_tokenizer))

print("before 2", base_model.config.pad_token_id)
base_model.config.pad_token_id = base_tokenizer.pad_token_id
base_model.generation_config.pad_token_id = base_tokenizer.pad_token_id
print("after 2", base_model.config.pad_token_id)

print("Model Vocabulary Size:", base_model.config.vocab_size)
base_model.resize_token_embeddings(len(base_tokenizer))
print("Model Vocabulary Size:", base_model.config.vocab_size)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Embedding(32768, 4096)
Linear(in_features=4096, out_features=32768, bias=False)
Model Vocabulary Size: 32768
before 32768
after 32769
before 2 None
after 2 32768
Model Vocabulary Size: 32768
Model Vocabulary Size: 32769


### Test the model

In [24]:
# Function to test the model
def test_model(model, tokenizer, prompt):
    # Set model to eval mode
    model.eval()

    # Format the prompt as a conversation
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate without gradients
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=300,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens (skip the input)
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response


# Test 1 - simple prompt
prompt = "Tell me a joke about programmers."
print(f"\nPrompt: {prompt}")
response = test_model(base_model, base_tokenizer, prompt)
print(f"Response: {response}")

# Test 2 - Autogen related prompt
prompt = "What is Autogen?"
print(f"\nPrompt: {prompt}")
response = test_model(base_model, base_tokenizer, prompt)
print(f"Response: {response}")


Prompt: Tell me a joke about programmers.
Response: Why did the programmer write a message on the bathroom wall in the office? He wanted to "make a clean break" from the old code. (A play on words, as in "clean" as in a clean break, but also "clean" as in writing on a bathroom wall.)

Prompt: What is Autogen?
Response: Autogen, short for Autogenous training, is a type of machine learning technique used in reinforcement learning. In this method, the agent learns a policy directly from demonstrations, without a reward signal. The demonstrations are typically provided by a human expert or another agent.

In other words, Autogen training involves using a model that has been pre-trained on a dataset to generate its own data. This self-generated data is then used to further train the model. This technique can be useful when the reward signal is sparse or difficult to define, as is often the case in complex, real-world problems.

Autogen is a popular method in video game AI, where it is used
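
The second response above stops mid-sentence, which is what you would expect when generation hits `max_new_tokens=300`. The "decode only the newly generated tokens" step in `test_model` relies on `generate` returning the prompt tokens followed by the continuation for a decoder-only model. A minimal sketch of that slicing on plain lists; the token ids are made up for illustration:

```python
# Hypothetical token ids; real ids come from the tokenizer.
prompt_ids = [1, 3, 1010, 2020, 4]          # the formatted prompt
output_ids = prompt_ids + [501, 502, 2]     # what generate() would return: prompt + continuation

# Same idea as outputs[0][inputs.input_ids.shape[1]:] in test_model:
# drop the first len(prompt_ids) positions and keep only the new tokens.
generated_ids = output_ids[len(prompt_ids):]

print(generated_ids)
```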

Log model and tokenizer

In [8]:
# Model
print("---Model---")
print("Type:", type(base_model))
print("Architecture:", base_model)
print("Config:", base_model.config)
print("Generation Config:", base_model.generation_config)
print("Model Vocabulary Size:", base_model.config.vocab_size)
print("Input embeddings:")
print(base_model.get_input_embeddings())
print("Output embeddings:")
print(base_model.get_output_embeddings())

# Tokenizer
print("---Tokenizer---")
print("Type:", type(base_tokenizer))
# print(tokenizer_loaded)
print("Special tokens:", base_tokenizer.special_tokens_map)
print("All tokens count:", len(base_tokenizer))
print("Padding side:", base_tokenizer.padding_side)

---Model---
Type: <class 'transformers.models.mistral.modeling_mistral.MistralForCausalLM'>
Architecture: MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32769, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm((4096

### Load the dataset

In [25]:
from datasets import load_dataset, Dataset, DatasetDict

dataset = load_dataset("lukaskellerstein/autogen", split="train")
print(dataset)

dataset = dataset.rename_column("text", "messages") 
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 222
})
Dataset({
    features: ['messages'],
    num_rows: 222
})


#### Create a final dataset

In [26]:
# final dataset
final_datasets = dataset.train_test_split(test_size=0.2)
print(final_datasets)

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 177
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 45
    })
})
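
The 177/45 split above is consistent with rounding the test share up and giving the remainder to the training set. A small sketch of that arithmetic; the rounding rule is an assumption inferred from the printed sizes, not taken from the `datasets` source:

```python
import math

num_rows = 222     # size of the dataset printed above
test_size = 0.2    # fraction passed to train_test_split

# Assumed rule: round the test share up, the rest goes to train.
n_test = math.ceil(num_rows * test_size)
n_train = num_rows - n_test

print(f"train: {n_train}, test: {n_test}")
```

Because `train_test_split` shuffles by default, passing a fixed `seed` argument should make the split repeatable between runs.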


### PEFT

In [41]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# ----------------------------------
# Adding the adapters to the layers
# ----------------------------------

# PEFT
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    # target_modules=[
    #     "q_proj",
    #     "k_proj",
    #     "down_proj",
    #     "v_proj",
    #     "gate_proj",
    #     "o_proj",
    #     "up_proj",
    # ],
    lora_dropout=0.1,
    bias="none",
    # modules_to_save=[
    #     "lm_head",
    #     "embed_tokens",
    # ],
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # https://huggingface.co/docs/peft/en/developer_guides/lora#qlora-style-training
)
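
To get a feel for what `r` costs, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer with shape `(in, out)` adds `r * (in + out)` weights (one `r x in` matrix and one `out x r` matrix). The sketch below assumes that `target_modules="all-linear"` adapts the seven projection layers of every decoder block and leaves `lm_head` untouched; the shapes are copied from the architecture printout in this notebook, and `lora_trainable_params` is a hypothetical helper:

```python
# (in_features, out_features) of each linear layer in one decoder block
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # "(0-31): 32 x MistralDecoderLayer"


def lora_trainable_params(rank: int) -> int:
    """LoRA weights added across all decoder blocks for a given rank."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in LINEAR_SHAPES.values())
    return per_block * NUM_LAYERS


for rank in (16, 64):
    print(f"r={rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions `r=16` gives 41,943,040 parameters and `r=64` gives 167,772,160.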

### Training

Trainer

In [42]:
from datetime import timedelta, datetime

# Create timestamped run directory for this training session
run_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
run_name = f"run_{run_timestamp}"
output_dir = f"/workspace/runs/{run_name}/model"
logging_dir = f"/workspace/runs/{run_name}/logs"

print(f"Training run: {run_name}")
print(f"Output directory: {output_dir}")
print(f"Logging directory: {logging_dir}")

Training run: run_20251027_000226
Output directory: /workspace/runs/run_20251027_000226/model
Logging directory: /workspace/runs/run_20251027_000226/logs


In [44]:
from trl import SFTTrainer, SFTConfig

# ----------------------------------
# Training WITH evaluation (metrics)
# ----------------------------------

lr = 0.0001 # learning rate
bs = 1  # batch size
ga_steps = 4  # gradient acc. steps
epochs = 5
steps_per_epoch = len(final_datasets["train"]) // (bs * ga_steps)

training_args = SFTConfig(
    output_dir=output_dir,
    num_train_epochs=epochs,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    gradient_accumulation_steps=ga_steps,
    learning_rate=lr,
    save_steps=steps_per_epoch,
    save_total_limit=1,
    eval_strategy="steps",
    eval_steps=steps_per_epoch,  # eval and save once per epoch
    logging_steps=10,
    logging_dir=logging_dir,
    report_to="tensorboard",  # Enable TensorBoard logging
    lr_scheduler_type="cosine",
    # lr_scheduler_type="linear",
    warmup_steps=10,  # Gradual warmup
    fp16=True,
    # bf16=True,
)

trainer = SFTTrainer(
    model=base_model,
    args=training_args,
    processing_class=base_tokenizer,
    train_dataset=final_datasets["train"],
    eval_dataset=final_datasets["test"],
    peft_config=peft_config,
)
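
The `steps_per_epoch` value drives both `save_steps` and `eval_steps`, so it is worth checking by hand. A minimal sketch using the numbers from this notebook (177 training samples, batch size 1, 4 gradient accumulation steps, 5 epochs):

```python
train_samples = 177          # len(final_datasets["train"])
bs, ga_steps, epochs = 1, 4, 5

effective_batch = bs * ga_steps                     # samples per optimizer step
steps_per_epoch = train_samples // effective_batch  # same floor division as in the cell above
eval_points = [steps_per_epoch * e for e in range(1, epochs + 1)]

print("effective batch size:", effective_batch)
print("steps per epoch:", steps_per_epoch)
print("eval/save at steps:", eval_points)
```

These are the step numbers at which evaluation rows should appear in the training log.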



#### Log TRAINER - Model, dataset

In [45]:
print("--- Trainer model ---")
print(trainer.model)
print("Config:", trainer.model.config)
print("Generation Config:", trainer.model.generation_config)

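print("Get Trainable Parameters")
# print_trainable_parameters() prints the summary itself and returns None,
# so it should not be wrapped in print()
trainer.model.print_trainable_parameters()
# Recorded output (this count matches a LoRA rank of 64; with r=16 as configured above, expect 41,943,040 trainable params):
# trainable params: 167,772,160 || all params: 7,415,803,904 || trainable%: 2.2624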

print("--- Trainer tokenizer ---")
print(trainer.processing_class)
print("Type:", type(trainer.processing_class))
# print(tokenizer_loaded)
print("Special tokens:", trainer.processing_class.special_tokens_map)
print("All tokens count:", len(trainer.processing_class))
print("Padding side:", trainer.processing_class.padding_side)


print("--- Trainer dataset ---")
print(trainer.train_dataset)

for t in trainer.train_dataset["messages"][:10]:
    print(t)

for t in trainer.train_dataset["input_ids"][:10]:
    print(t)


print("--- Trainer data collation ---")
print(trainer.data_collator)
collated_data = trainer.data_collator(trainer.train_dataset)
print(collated_data)

for t in collated_data["input_ids"][:10]:
    print(t)

for t in collated_data["labels"][:10]:
    print(t)

for t in collated_data["attention_mask"][:10]:
    print(t)


--- Trainer model ---
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32769, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )

Tensorboard logging

In [46]:
# --- LOG HYPERPARAMETERS TO TENSORBOARD ---
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=logging_dir)

# Create markdown summary of hyperparameters
hyperparams_summary = f"""
# Training Run: {run_name}

## Run Information
- **Timestamp**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
- **Model**: {model_name}
- **Dataset**: lukaskellerstein/autogen
- **Train samples**: {len(final_datasets["train"])}
- **Eval samples**: {len(final_datasets["test"])}

## Model Configuration
- **Quantization**: 4-bit (NF4)
- **Compute dtype**: {bnb_config.bnb_4bit_compute_dtype}
- **Double quantization**: True
- **Base vocab size**: 32768
- **Extended vocab size**: {base_model.config.vocab_size}
- **Pad token ID**: {base_tokenizer.pad_token_id}

## LoRA/PEFT Configuration
- **LoRA rank (r)**: {peft_config.r}
- **LoRA alpha**: {peft_config.lora_alpha}
- **LoRA dropout**: {peft_config.lora_dropout}
- **Target modules**: {peft_config.target_modules}
- **Bias**: {peft_config.bias}
- **Task type**: {peft_config.task_type}

## Training Hyperparameters
- **Learning rate**: {lr}
- **Batch size**: {bs}
- **Gradient accumulation steps**: {ga_steps}
- **Effective batch size**: {bs * ga_steps}
- **Epochs**: {epochs}
- **Steps per epoch**: {steps_per_epoch}
- **Total training steps**: {steps_per_epoch * epochs}
- **LR scheduler**: {training_args.lr_scheduler_type}
- **FP16**: {training_args.fp16}
- **Eval strategy**: {training_args.eval_strategy}
- **Eval steps**: {training_args.eval_steps}
- **Logging steps**: {training_args.logging_steps}
- **Save steps**: {training_args.save_steps}
- **Save total limit**: {training_args.save_total_limit}

## Directories
- **Output dir**: {output_dir}
- **Logging dir**: {logging_dir}
"""

writer.add_text("Hyperparameters", hyperparams_summary, 0)
writer.close()

print("✓ Hyperparameters logged to TensorBoard")

✓ Hyperparameters logged to TensorBoard


Training

In [47]:
import time

start = time.time()

print("Start training...")
startTrain = time.time()
trainer.train()
td = timedelta(seconds=(time.time() - startTrain))
print(f"Training takes: {td}")


# Total time for the script
td = timedelta(seconds=(time.time() - start))
print(f"Total takes: {td}")

Start training...


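| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 44 | 1.7724 | 1.537762 | 1.48507 | 10437.0 | 0.662592 |
| 88 | 1.0666 | 1.252368 | 1.209841 | 20630.0 | 0.693283 |
| 132 | 0.6967 | 1.216455 | 0.886209 | 30969.0 | 0.721956 |
| 176 | 0.4716 | 1.326223 | 0.715053 | 41183.0 | 0.729845 |
| 220 | 0.3723 | 1.381443 | 0.674828 | 51380.0 | 0.728997 |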
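
In the log above the training loss keeps falling while the validation loss reaches its minimum and then rises again, which usually points to overfitting in the later epochs. A small sketch that picks the evaluation step with the lowest validation loss; the values are copied from the log, and the tuple layout is just an illustration:

```python
# (step, training_loss, validation_loss) copied from the training log
log = [
    (44, 1.7724, 1.537762),
    (88, 1.0666, 1.252368),
    (132, 0.6967, 1.216455),
    (176, 0.4716, 1.326223),
    (220, 0.3723, 1.381443),
]

# Pick the row with the smallest validation loss
best_step, best_train_loss, best_val_loss = min(log, key=lambda row: row[2])

print(f"lowest validation loss {best_val_loss} at step {best_step}")
```

Step 132 is the end of the third epoch, so fewer epochs (or keeping the best checkpoint rather than the last one) might be worth trying.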



Training takes: 0:09:56.767539
Total takes: 0:09:56.768095


### Test the adapter - OK

In [48]:
# Function to test the model
def test_model(model, tokenizer, prompt):
    # Set model to eval mode
    model.eval()

    # Format the prompt as a conversation
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate without gradients
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens (skip the input)
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response


# Test 1 - simple prompt
prompt = "Tell me a joke about programmers."
print(f"\nPrompt: {prompt}")
response = test_model(base_model, base_tokenizer, prompt)
print(f"Response: {response}")

# Test 2 - Autogen related prompt
prompt = "What is Autogen?"
print(f"\nPrompt: {prompt}")
response = test_model(base_model, base_tokenizer, prompt)
print(f"Response: {response}")


Prompt: Tell me a joke about programmers.
Response: Why don't programmers like nature? Because it has too many bugs.

Prompt: What is Autogen?
Response: Autogen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks.


### Save the adapter (to disk)

In [None]:
trainer.model.save_pretrained("SAVED_ADAPTER")
trainer.processing_class.save_pretrained("SAVED_ADAPTER")



('SAVED_ADAPTER/tokenizer_config.json',
 'SAVED_ADAPTER/special_tokens_map.json',
 'SAVED_ADAPTER/tokenizer.model',
 'SAVED_ADAPTER/added_tokens.json',
 'SAVED_ADAPTER/tokenizer.json')

### Push adapter (to hub)

In [None]:
trainer.model.push_to_hub(
    repo_id="lukaskellerstein/autogen-mistral-4bit-lora-adapter",
    token=API_TOKEN,
)
trainer.processing_class.push_to_hub(
    repo_id="lukaskellerstein/autogen-mistral-4bit-lora-adapter",
    token=API_TOKEN,
)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/1.74G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/lukaskellerstein/autogen-mistral-4bit-lora-adapter/commit/391d0b23c746eca2094f461985a9ac07c0b96ce9', commit_message='Upload tokenizer', commit_description='', oid='391d0b23c746eca2094f461985a9ac07c0b96ce9', pr_url=None, pr_revision=None, pr_num=None)

# MERGED model

### Merge LoRA adapter and base model => merged model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Merge LoRA adapters with base model
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
adapter_path = "lukaskellerstein/autogen-mistral-4bit-lora-adapter"  # input: adapters

# ------------------------------------------------
# NOTE: do not merge the LoRA adapter into the 4-bit quantized model.
# Load the base model in 16-bit (no quantization_config) and merge into that.
# ------------------------------------------------
# Model
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_quant_type="nf4",
# )

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
print(base_model.get_input_embeddings())
print(base_model.get_output_embeddings())
print("Model Vocabulary Size:", base_model.config.vocab_size)

base_tokenizer = AutoTokenizer.from_pretrained(model_name)
print("before", len(base_tokenizer))
base_tokenizer.add_special_tokens({"pad_token": "<pad>"})
print("after", len(base_tokenizer))

print("before 2", base_model.config.pad_token_id)
base_model.config.pad_token_id = base_tokenizer.pad_token_id
base_model.generation_config.pad_token_id = base_tokenizer.pad_token_id
print("after 2", base_model.config.pad_token_id)

print("Model Vocabulary Size:", base_model.config.vocab_size)
base_model.resize_token_embeddings(len(base_tokenizer))
print("Model Vocabulary Size:", base_model.config.vocab_size)


# Load PEFT model
peft_model_loaded = PeftModel.from_pretrained(
    model=base_model,
    model_id=adapter_path,
    device_map="cuda",
)
print(type(peft_model_loaded))
print(peft_model_loaded)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Embedding(32768, 4096)
Linear(in_features=4096, out_features=32768, bias=False)
Model Vocabulary Size: 32768
before 32768
after 32769
before 2 None
after 2 32768
Model Vocabulary Size: 32768
Model Vocabulary Size: 32769


adapter_config.json:   0%|          | 0.00/736 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/1.74G [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<class 'peft.peft_model.PeftModelForCausalLM'>
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32769, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear(

Unloading and merging model: 100%|██████████| 678/678 [00:00<00:00, 5534.64it/s]


#### Test the base model + adapter

In [None]:
# Function to test the model
def test_model(model, tokenizer, prompt):
    # Set model to eval mode
    model.eval()

    # Format the prompt as a conversation
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate without gradients
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens (skip the input)
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response


prompt = "Tell me a joke about programmers."
print(f"\nPrompt: {prompt}")
response = test_model(peft_model_loaded, base_tokenizer, prompt)
print(f"Response: {response}")

#### Merge the model

In [None]:
# ---------------------------------------------------------------
# Merge base model and LoRA adapter together into one full model
# ---------------------------------------------------------------
merged_model = peft_model_loaded.merge_and_unload(progressbar=True)
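
Conceptually, merging folds each adapter into its base weight: `W_merged = W + (alpha / r) * B @ A`, after which the extra LoRA matrices are no longer needed. The toy example below is a simplified picture of that idea, not PEFT's actual implementation; the matrices are made up, and the helpers `matmul` and `add_scaled` are hypothetical. It checks that the merged weight gives the same output as base plus adapter:

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: in_features=3, out_features=2, rank r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape (out, in)
A = [[0.5, -1.0, 2.0]]                   # lora_A, shape (r, in)
B = [[2.0], [1.0]]                       # lora_B, shape (out, r)
alpha, r = 2.0, 1
scale = alpha / r                        # standard LoRA scaling

# Merge: fold the low-rank update into the base weight
W_merged = add_scaled(W, matmul(B, A), scale)

x = [[1.0], [2.0], [3.0]]                # input column vector, shape (in, 1)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
y_merged = matmul(W_merged, x)

print("base + adapter:", y_adapter)
print("merged weight: ", y_merged)
```

The merged weight is an ordinary dense matrix, which is one way to see why the merge in this notebook is done on the 16-bit base model: a 4-bit weight would have to be dequantized and re-quantized around the addition.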

Log model and tokenizer

In [None]:
# Model
print("---Model---")
print("Type:", type(merged_model))
print("Architecture:", merged_model)
print("Config:", merged_model.config)
print("Generation Config:", merged_model.generation_config)
print("Model Vocabulary Size:", merged_model.config.vocab_size)
print("Input embeddings:")
print(merged_model.get_input_embeddings())
print("Output embeddings:")
print(merged_model.get_output_embeddings())

# Tokenizer
print("---Tokenizer---")
print("Type:", type(base_tokenizer))
# print(tokenizer_loaded)
print("Special tokens:", base_tokenizer.special_tokens_map)
print("All tokens count:", len(base_tokenizer))
print("Padding side:", base_tokenizer.padding_side)

---Model---
Type: <class 'transformers.models.mistral.modeling_mistral.MistralForCausalLM'>
Architecture: MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32769, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        

### Test "merged" model

In [None]:
# Function to test the model
def test_model(model, tokenizer, prompt):
    # Set model to eval mode
    model.eval()

    # Format the prompt as a conversation
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate without gradients
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens (skip the input)
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response

# Test 1 - simple prompt
prompt = "Tell me a joke about programmers."
print(f"\nPrompt: {prompt}")
response = test_model(merged_model, base_tokenizer, prompt)
print(f"Response: {response}")

# Test 2 - tool use prompt
prompt = "The system should select a function from the following list if it can solve the user's question: [{'type': 'function', 'function': {'name': 'get_natural_phenomenon_details', 'description': 'Get details about a specific natural phenomenon', 'parameters': {'type': 'object', 'properties': {'phenomenon_name': {'type': 'string', 'description': 'The name of the phenomenon'}}, 'required': ['phenomenon_name']}}}, {'type': 'function', 'function': {'name': 'get_top_natural_phenomena', 'description': 'Get the names of the top N natural phenomena by scientific interest', 'parameters': {'type': 'object', 'properties': {'number': {'type': 'integer', 'description': 'The number of top natural phenomena to get'}}, 'required': ['number']}}}]. If calling a function can answer the question, return only a JSON with the call of that function. If not, return the answer as usual.\nWhat is the Aurora Borealis?"
print(f"\nPrompt: {prompt}")
response = test_model(merged_model, base_tokenizer, prompt)
print(f"Response: {response}")

### Save "merged" model (to disk)

In [17]:
merged_model.save_pretrained("MERGED")

base_tokenizer.save_pretrained("MERGED")

('MERGED/tokenizer_config.json',
 'MERGED/special_tokens_map.json',
 'MERGED/tokenizer.model',
 'MERGED/added_tokens.json',
 'MERGED/tokenizer.json')

### Push "merged" model (to hub)

In [18]:
merged_model.push_to_hub(
    repo_id="lukaskellerstein/autogen-mistral-16bit-merged",
    token=API_TOKEN,
)
base_tokenizer.push_to_hub(
    repo_id="lukaskellerstein/autogen-mistral-16bit-merged",
    token=API_TOKEN,
)

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/lukaskellerstein/autogen-mistral-16bit-merged/commit/193ebadf5fcfeb8df01c915820a1109d20b28cc7', commit_message='Upload tokenizer', commit_description='', oid='193ebadf5fcfeb8df01c915820a1109d20b28cc7', pr_url=None, pr_revision=None, pr_num=None)
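
As a sanity check, the uploaded shard sizes should match a 16-bit model of roughly 7 billion parameters. A rough sketch of that estimate, assuming the progress bars report decimal gigabytes (10^9 bytes) and that every weight is stored as bfloat16 (2 bytes):

```python
# Shard sizes in GB, copied from the upload progress bars above
shard_sizes_gb = [4.95, 5.00, 4.55]
BYTES_PER_PARAM = 2  # bfloat16

total_bytes = sum(shard_sizes_gb) * 1e9
approx_params = total_bytes / BYTES_PER_PARAM

print(f"total size: {sum(shard_sizes_gb):.2f} GB")
print(f"approx. parameters: {approx_params / 1e9:.2f} B")
```

The result of about 7.25 billion is in line with the "all params" figure recorded earlier in the notebook once the adapter weights are subtracted.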

# Merged model from HUB

In [1]:
model_from_hub_name = "lukaskellerstein/autogen-mistral-16bit-merged"

### Load the model

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_loaded_fromHF = AutoModelForCausalLM.from_pretrained(
    model_from_hub_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

tokenizer_loaded_fromHF = AutoTokenizer.from_pretrained(model_from_hub_name)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/552 [00:00<?, ?B/s]

Log model and tokenizer

In [None]:
# Model
print("---Model---")
print("Type:", type(model_loaded_fromHF))
print("Architecture:", model_loaded_fromHF)
print("Config:", model_loaded_fromHF.config)
print("Generation Config:", model_loaded_fromHF.generation_config)
print("Model Vocabulary Size:", model_loaded_fromHF.config.vocab_size)
print("Input embeddings:")
print(model_loaded_fromHF.get_input_embeddings())
print("Output embeddings:")
print(model_loaded_fromHF.get_output_embeddings())

# Tokenizer
print("---Tokenizer---")
print("Type:", type(tokenizer_loaded_fromHF))
# print(tokenizer_loaded)
print("Special tokens:", tokenizer_loaded_fromHF.special_tokens_map)
print("All tokens count:", len(tokenizer_loaded_fromHF))
print("Padding side:", tokenizer_loaded_fromHF.padding_side)

---Model---
Type: <class 'transformers.models.mistral.modeling_mistral.MistralForCausalLM'>
Architecture: MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32769, 4096, padding_idx=32768)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_la

### Test the model - OK

In [None]:
# Function to test the model
def test_model(model, tokenizer, prompt):
    # Set model to eval mode
    model.eval()

    # Format the prompt as a conversation
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate without gradients
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens (skip the input)
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response


# Test 1 - simple prompt
prompt = "Tell me a joke about programmers."
print(f"\nPrompt: {prompt}")
response = test_model(model_loaded_fromHF, tokenizer_loaded_fromHF, prompt)
print(f"Response: {response}")

# Test 2 - tool use prompt
prompt = "The system should select a function from the following list if it can solve the user's question: [{'type': 'function', 'function': {'name': 'get_natural_phenomenon_details', 'description': 'Get details about a specific natural phenomenon', 'parameters': {'type': 'object', 'properties': {'phenomenon_name': {'type': 'string', 'description': 'The name of the phenomenon'}}, 'required': ['phenomenon_name']}}}, {'type': 'function', 'function': {'name': 'get_top_natural_phenomena', 'description': 'Get the names of the top N natural phenomena by scientific interest', 'parameters': {'type': 'object', 'properties': {'number': {'type': 'integer', 'description': 'The number of top natural phenomena to get'}}, 'required': ['number']}}}]. If calling a function can answer the question, return only a JSON with the call of that function. If not, return the answer as usual.\nWhat is the Aurora Borealis?"
print(f"\nPrompt: {prompt}")
response = test_model(model_loaded_fromHF, tokenizer_loaded_fromHF, prompt)
print(f"Response: {response}")
