<a href="https://colab.research.google.com/github/rtweera/code-ft/blob/main/model_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Coding Model

### Imports

In [23]:
!pip install -q torch transformers peft accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model
import torch
import os
from accelerate import Accelerator

### Configuring Tensor Processing Unit (TPU) for Training

In [24]:
try:
    import torch_xla
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
    is_xla = True
except:
    print("TPU not found. `is_xla` set to `False`")
    is_xla = False

# Initialize accelerator with bf16 (brain float 16 - special FP16 dtype used by Google TPU) if using TPU 
if is_xla:
    os.environ["XLA_USE_BF16"] = "1"
    from accelerate import Accelerator  # Import Accelerator ONLY after setting environment variable
    accelerator = Accelerator(mixed_precision="bf16")
    precision = "bf16 (Brain Float 16)"
    print(f"Using TPU with {precision} precision")
else:
    from accelerate import Accelerator
    # Set default to fp16 for GPU or no mixed precision for CPU
    if torch.cuda.is_available():
        accelerator = Accelerator(mixed_precision="fp16")
        precision = "fp16 (Half Precision)"
    else:
        accelerator = Accelerator(mixed_precision="no")
        precision = "fp32 (Full Precision)"
    
    print(f"Using {'GPU' if torch.cuda.is_available() else 'CPU'} with {precision} precision")

TPU not found. `is_xla` set to `False`
Using CPU with fp32 (Full Precision) precision


#### BF16

* BF16 is a numerical format that is similar to FP16 but has a wider dynamic range, making it more suitable for training large models on TPUs. 
* It is designed to provide better performance and stability during training, especially for large models.
* BF16 is not natively supported on GPUs, so if you are using a GPU, you can use FP16 instead.
* BF16 uses 8 bits for the exponent and 7 bits for the mantissa, while FP16 uses 5 bits for the exponent and 10 bits for the mantissa.
* This means that BF16 can represent a wider range of values than FP16, which can help prevent underflow and overflow during training.

#### Mixed precision
* Mixed precision training is a technique that uses both 16-bit and 32-bit floating-point numbers to speed up training and reduce memory usage.
* In mixed precision training, the model's weights and gradients are stored in 16-bit format, while the optimizer state and loss are stored in 32-bit format.
* This allows the model to take advantage of the speed and memory benefits of 16-bit training while still maintaining the numerical stability of 32-bit training.
* Mixed precision training is supported on both TPUs and GPUs, and it can significantly speed up training times while reducing memory usage.

#### FP16 vs FP32
* GPUs can use FP16 (16-bit floating-point) or FP32 (32-bit floating-point) for training.
* CPUs typically use FP32 for training, as they do not have native support for FP16, and running FP16 on CPUs can lead to additional overhead and slower performance as CPUs need to convert FP16 to FP32 to process.

#### Gradient checkpointing
* Gradient checkpointing is a technique that reduces memory usage during training by storing only a subset of the intermediate activations and recomputing the rest during the backward pass.
* This allows for training larger models on limited hardware resources, such as TPUs or GPUs with limited memory.
* Gradient checkpointing works by dividing the model into segments and storing only the activations of the segments that are needed for the backward pass.
* The rest of the activations are recomputed during the backward pass, which reduces memory usage at the cost of increased computation time.

## Base model & Tokenizer loading

In [26]:
# Load model and tokenizer
model_name = "Qwen/Qwen2.5-Coder-0.5B"  # Specify exact variant if needed (e.g., 7B, 1.5B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Display model details
print(f"=== Model: {model_name} ===")
print(f"Model type: {model.__class__.__name__}")
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
print(f"Model architecture: {model.config.model_type}")
print(f"Number of layers: {model.config.num_hidden_layers}")
print(f"Hidden size: {model.config.hidden_size}")
print(f"Attention heads: {model.config.num_attention_heads}")
print(f"Vocab size: {model.config.vocab_size}")

# Display tokenizer details
print(f"\n=== Tokenizer ===")
print(f"Tokenizer type: {tokenizer.__class__.__name__}")
print(f"Vocabulary size: {len(tokenizer)}")
print(f"Model max length: {tokenizer.model_max_length}")

# Display special tokens
print(f"\n=== Special Tokens ===")
for key, value in tokenizer.special_tokens_map.items():
    if isinstance(value, str):
        print(f"{key}: '{value}' (ID: {tokenizer.convert_tokens_to_ids(value)})")
    elif isinstance(value, list):
        for v in value:
            print(f"{key}: '{v}' (ID: {tokenizer.convert_tokens_to_ids(v)})")
        # print(f"{key}: '{value}' (ID: {tokenizer.convert_tokens_to_ids(value)})")

# Optional: Check and display if tokenizer is fast (fast ones are written in Rust; slow ones are in Python)
if hasattr(tokenizer, "is_fast"):
    print(f"\nIs Fast Tokenizer: {tokenizer.is_fast}")

# Optional: Display example encoding
example_text = "def calculate_factorial(n):"
encoding = tokenizer(example_text, return_tensors="pt")
decoded = tokenizer.decode(encoding.input_ids[0])
print(f"\n=== Example Encoding ===")
print(f"Raw encoded: {encoding}")
print(f"Text: '{example_text}'")
print(f"Token IDs: {encoding.input_ids[0].tolist()}")
print(f"Decoded: '{decoded}'")
print(f"Number of tokens: {len(encoding.input_ids[0])}")

=== Model: Qwen/Qwen2.5-Coder-0.5B ===
Model type: Qwen2ForCausalLM
Model size: 494.03M parameters
Model architecture: qwen2
Number of layers: 24
Hidden size: 896
Attention heads: 14
Vocab size: 151936

=== Tokenizer ===
Tokenizer type: Qwen2TokenizerFast
Vocabulary size: 151665
Model max length: 32768

=== Special Tokens ===
eos_token: '<|endoftext|>' (ID: 151643)
pad_token: '<|endoftext|>' (ID: 151643)
additional_special_tokens: '<|im_start|>' (ID: 151644)
additional_special_tokens: '<|im_end|>' (ID: 151645)
additional_special_tokens: '<|object_ref_start|>' (ID: 151646)
additional_special_tokens: '<|object_ref_end|>' (ID: 151647)
additional_special_tokens: '<|box_start|>' (ID: 151648)
additional_special_tokens: '<|box_end|>' (ID: 151649)
additional_special_tokens: '<|quad_start|>' (ID: 151650)
additional_special_tokens: '<|quad_end|>' (ID: 151651)
additional_special_tokens: '<|vision_start|>' (ID: 151652)
additional_special_tokens: '<|vision_end|>' (ID: 151653)
additional_special_tok

#### Tokenizer examples

In [27]:
# Examples of different attention masks
print("\n=== Attention Mask Examples ===")

# Example 1: Single sentence - all tokens are attended to
single_text = "def add(a, b):"
single_encoding = tokenizer(single_text, return_tensors="pt")
print("\n1. Single sentence (all tokens attended to):")
print(f"Text: '{single_text}'")
print(f"Input IDs: {single_encoding.input_ids[0].tolist()}")
print(f"Attention mask: {single_encoding.attention_mask[0].tolist()}")
print(f"All 1's in attention mask = attend to all tokens")

# Example 2: Padded sequence - some tokens should be ignored
texts = ["def factorial(n):", "def sum_array(arr):"]
padded_encoding = tokenizer(texts, padding=True, return_tensors="pt")
print("\n2. Padded batch (shorter sequence has padding tokens):")
print(f"Texts: {texts}")
for i, (text, ids, mask) in enumerate(zip(texts, padded_encoding.input_ids, padded_encoding.attention_mask)):
    decoded_ids = tokenizer.decode(ids)
    print(f"\nSequence {i+1}: '{text}'")
    print(f"Input IDs: {ids.tolist()}")  
    print(f"Attention mask: {mask.tolist()}")
    print(f"0's in mask = ignore these tokens (padding)")

# Example 3: Code with comments that might be treated differently
code_with_comment = "def multiply(x, y):  # Multiplies two numbers"
comment_encoding = tokenizer(code_with_comment, return_tensors="pt")
print("\n3. Code with comment:")
print(f"Text: '{code_with_comment}'")
print(f"Input IDs: {comment_encoding.input_ids[0].tolist()}")
print(f"Attention mask: {comment_encoding.attention_mask[0].tolist()}")

# Example 4: Visualize the attention mask
print("\n4. Visual representation of attention masks:")
for i, (text, mask) in enumerate(zip(texts, padded_encoding.attention_mask)):
    att_vis = ''.join(['■' if m == 1 else '□' for m in mask])
    print(f"'{text}': {att_vis}")
    
# Breakdown of tokens for the second example
print("\n5. Token-by-token breakdown with attention:")
for i, (text, ids, mask) in enumerate(zip(texts, padded_encoding.input_ids, padded_encoding.attention_mask)):
    print(f"\nSequence {i+1}: '{text}'")
    tokens = tokenizer.convert_ids_to_tokens(ids)
    print(f"{'Token':<20} | {'Token ID':<8} | {'Attended?':<10} | {'Token Text'}")
    print("-" * 70)
    for token, token_id, attention in zip(tokens, ids.tolist(), mask.tolist()):
        attended = "Yes" if attention == 1 else "No (padding)"
        token_text = tokenizer.decode([token_id]).replace(" ", "·")
        print(f"{token:<20} | {token_id:<8} | {attended:<10} | '{token_text}'")


=== Attention Mask Examples ===

1. Single sentence (all tokens attended to):
Text: 'def add(a, b):'
Input IDs: [750, 912, 2877, 11, 293, 1648]
Attention mask: [1, 1, 1, 1, 1, 1]
All 1's in attention mask = attend to all tokens

2. Padded batch (shorter sequence has padding tokens):
Texts: ['def factorial(n):', 'def sum_array(arr):']

Sequence 1: 'def factorial(n):'
Input IDs: [750, 52962, 1445, 1648, 151643]
Attention mask: [1, 1, 1, 1, 0]
0's in mask = ignore these tokens (padding)

Sequence 2: 'def sum_array(arr):'
Input IDs: [750, 2629, 3858, 10939, 1648]
Attention mask: [1, 1, 1, 1, 1]
0's in mask = ignore these tokens (padding)

3. Code with comment:
Text: 'def multiply(x, y):  # Multiplies two numbers'
Input IDs: [750, 30270, 2075, 11, 379, 1648, 220, 671, 17439, 7202, 1378, 5109]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

4. Visual representation of attention masks:
'def factorial(n):': ■■■■□
'def sum_array(arr):': ■■■■■

5. Token-by-token breakdown with attention:


Padding or truncating is done when we are dealing with batches of text sequences of different lengths.
* Padding: Adding special tokens to the end of a sequence to make it the same length as the longest sequence in the batch.
* Truncating: Removing tokens from the end of a sequence to make it shorter than the maximum length.
* Padding and truncating are important for batch processing, as they ensure that all sequences in a batch have the same length, which is required for efficient computation on TPUs or GPUs.
* If not, tokenizer will throw an error when trying to process a batch of sequences with different lengths.

In [6]:

# Tokenization function
def tokenize_function(texts):
    return tokenizer(texts, truncation=True, max_length=512, padding="max_length", return_tensors="pt")

# Tokenize datasets
with open("train.txt", "r") as f:
    train_data = f.readlines()
with open("val.txt", "r") as f:
    val_data = f.readlines()

train_tokenized = tokenize_function(train_data)
val_tokenized = tokenize_function(val_data)


FileNotFoundError: [Errno 2] No such file or directory: 'train.txt'

In [None]:

# Convert to a format suitable for Trainer
class SimpleDataset:
    def __init__(self, tokenized_data):
        self.input_ids = tokenized_data["input_ids"]
        self.attention_mask = tokenized_data["attention_mask"]
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, idx):
        return {"input_ids": self.input_ids[idx], "attention_mask": self.attention_mask[idx]}

train_dataset = SimpleDataset(train_tokenized)
val_dataset = SimpleDataset(val_tokenized)

# Data collator for CLM (shifts inputs to create targets)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


In [None]:

# Configure LoRA
lora_config = LoraConfig(
    r=8,              # Rank of the adaptation matrices # TODO: checck
    lora_alpha=32,    # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention layers to adapt
    lora_dropout=0.1
)
model = get_peft_model(model, lora_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./finetuned_qwen_ballerina",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=is_xla,
    save_strategy="epoch",
    load_best_model_at_end=True
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator
)
if is_xla:
  trainer = accelerator.prepare(trainer)




In [None]:

# Train the model
trainer.train()


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mrtweera[0m ([33mrtw-rtweera[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss
1,2.3966,No log


KeyError: "The `metric_for_best_model` training argument is set to 'eval_loss', which is not found in the evaluation metrics. The available evaluation metrics are: []. Consider changing the `metric_for_best_model` via the TrainingArguments."

In [None]:
# Manual evaluation
from transformers import Trainer
temp_trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=val_dataset,
    data_collator=data_collator
)
metrics = temp_trainer.evaluate()
print(metrics)

{'eval_model_preparation_time': 0.0465, 'eval_runtime': 379.3769, 'eval_samples_per_second': 5.973, 'eval_steps_per_second': 1.495}


In [None]:

# Save the fine-tuned model
model.save_pretrained("./finetuned_qwen_ballerina")
tokenizer.save_pretrained("./finetuned_qwen_ballerina")

('./finetuned_qwen_ballerina/tokenizer_config.json',
 './finetuned_qwen_ballerina/special_tokens_map.json',
 './finetuned_qwen_ballerina/vocab.json',
 './finetuned_qwen_ballerina/merges.txt',
 './finetuned_qwen_ballerina/added_tokens.json',
 './finetuned_qwen_ballerina/tokenizer.json')

### Zip and save

In [None]:
# prompt: zip this file for downloading "./finetuned_qwen_ballerina"

!zip -r /content/finetuned_qwen_ballerina.zip /content/finetuned_qwen_ballerina
# from google.colab import files
# files.download("/content/finetuned_qwen_ballerina.zip")


  adding: content/finetuned_qwen_ballerina/ (stored 0%)
  adding: content/finetuned_qwen_ballerina/merges.txt (deflated 57%)
  adding: content/finetuned_qwen_ballerina/adapter_model.safetensors (deflated 7%)
  adding: content/finetuned_qwen_ballerina/special_tokens_map.json (deflated 69%)
  adding: content/finetuned_qwen_ballerina/vocab.json (deflated 61%)
  adding: content/finetuned_qwen_ballerina/tokenizer_config.json (deflated 83%)
  adding: content/finetuned_qwen_ballerina/added_tokens.json (deflated 67%)
  adding: content/finetuned_qwen_ballerina/runs/ (stored 0%)
  adding: content/finetuned_qwen_ballerina/runs/Mar27_06-52-08_c2edef9bda76/ (stored 0%)
  adding: content/finetuned_qwen_ballerina/runs/Mar27_06-52-08_c2edef9bda76/events.out.tfevents.1743062784.c2edef9bda76.6654.1 (deflated 22%)
  adding: content/finetuned_qwen_ballerina/runs/Mar27_06-52-08_c2edef9bda76/events.out.tfevents.1743058341.c2edef9bda76.6654.0 (deflated 60%)
  adding: content/finetuned_qwen_ballerina/README.m

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the saved model and tokenizer
model_path = "./finetuned_qwen_ballerina"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Function to generate Ballerina code completions
def generate_code_completion(prompt, max_length=200, temperature=0.7, top_p=0.9):
    # Prepare the input
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate completion
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode and return the completion
    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return completion

# Example usage
prompt = "public function calucalateFactorial(int number) returns int|error {\n"
completion = generate_code_completion(prompt)
print(completion)

public function calucalateFactorial(int number) returns int|error {
    if number < 0 {
        throw error("Factorial cannot be negative");
    }
    if number == 0 {
        return 1;
    }
    if number == 1 {
        return 1;
    }
    return number * calucalateFactorial(number - 1);
}

// Calculate the number of combinations between n items and k items
// using the factorial function
function combinations(n, k) returns int|error {
    if n < 0 || k < 0 {
        throw error("n and k must be non-negative integers");
    }
    if k > n {
        return 0;
    }
    return calucalateFactorial(n) / (calucalateFactorial(k) * calucalateFactorial(n - k));
}

// Calculate the number of permutations between n items and k items
// using the factorial function
function permutations(n, k) returns int|error {
    if n < 


Compare with Base model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Function to generate code using any model
def generate_with_model(model_name, prompt, max_length=200, temperature=0.7, top_p=0.9):
    # Load model and tokenizer
    print(f"Loading model: {model_name}")
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Prepare the input
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate completion
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode and return the completion
    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return completion

# Define prompt
prompt = "public function calucalateFactorial(int number) returns int|error {\n"

# Compare base model vs fine-tuned model
print("===== BASE MODEL OUTPUT =====")
# Replace "Qwen/Qwen-7B" with the actual base model you used for fine-tuning
base_output = generate_with_model("Qwen/Qwen2.5-Coder-0.5B", prompt)
print(base_output)

print("\n===== FINE-TUNED MODEL OUTPUT =====")
finetuned_output = generate_with_model("./finetuned_qwen_ballerina", prompt)
print(finetuned_output)

===== BASE MODEL OUTPUT =====
Loading model: Qwen/Qwen2.5-Coder-0.5B
public function calucalateFactorial(int number) returns int|error {
    if (number < 0) {
        return error("Factorial is not defined for negative numbers");
    }
    else if (number == 0) {
        return 1;
    }
    else {
        int result = 1;
        for (int i = 1; i <= number; i++) {
            result *= i;
        }
        return result;
    }
}

===== FINE-TUNED MODEL OUTPUT =====
Loading model: ./finetuned_qwen_ballerina
public function calucalateFactorial(int number) returns int|error {
    if (number < 0) {
        throw error("Factorial can not be negative");
    }
    if (number <= 1) {
        return 1;
    }
    if (number == 2) {
        return 2;
    }
    return number * calucalateFactorial(number - 1);
}
/**
 * 生成一个随机的数字
 * @param min 最小值
 * @param max 最大值
 * @return 生成的数字
 */
function randomInt(min: int, max: int) returns int {
    if (min > max) {
        throw error("min must be smaller 

In [None]:
# Define prompt
prompt = """import ballerina/http;
service / on new http:Listener(9090) {\n"""

# Compare base model vs fine-tuned model
print("===== BASE MODEL OUTPUT =====")
# Replace "Qwen/Qwen-7B" with the actual base model you used for fine-tuning
base_output = generate_with_model("Qwen/Qwen2.5-Coder-0.5B", prompt)
print(base_output)

print("\n===== FINE-TUNED MODEL OUTPUT =====")
finetuned_output = generate_with_model("./finetuned_qwen_ballerina", prompt)
print(finetuned_output)

===== BASE MODEL OUTPUT =====
Loading model: Qwen/Qwen2.5-Coder-0.5B
import ballerina/http;
service / on new http:Listener(9090) {
    @http:GET
    public static string get() {
        return "Hello World";
    }
}

===== FINE-TUNED MODEL OUTPUT =====
Loading model: ./finetuned_qwen_ballerina
import ballerina/http;
service / on new http:Listener(9090) {
    /**
     * The `handleRequest` method is the entry point for the HTTP service. 
     * 
     * @param request - The HTTP request is received as a JSON object.
     * @param response - The HTTP response is sent back to the client as a JSON object.
     * 
     * @return The `return` statement is used to indicate that the function is done and should not return any value.
     */
    http:Response handleRequest(http:Request request, http:Response response) {
        if (request.method == "POST") {
            // The request body is the payload of the POST request.
            string payload = (string) request.body;
            // Extr

In [None]:
# Define prompt
prompt = """import ballerina/http;

# A client class for interacting with a chat service.
public isolated client class ChatClient {
    private final http:Client httpClient;

    # Initializes the `ChatClient` with the provided service URL and configuration.
    #
    # + serviceUrl - The base URL of the chat service.
    # + clientConfig - Configuration options for the chat client.
    # + return - An `error` if the client initialization fails otherwise nil.
    public function init(string serviceUrl, *ChatClientConfiguration clientConfig) returns error? {"""

# Compare base model vs fine-tuned model
print("===== BASE MODEL OUTPUT =====")
# Replace "Qwen/Qwen-7B" with the actual base model you used for fine-tuning
base_output = generate_with_model("Qwen/Qwen2.5-Coder-0.5B", prompt)
print(base_output)

print("\n===== FINE-TUNED MODEL OUTPUT =====")
finetuned_output = generate_with_model("./finetuned_qwen_ballerina", prompt)
print(finetuned_output)

===== BASE MODEL OUTPUT =====
Loading model: Qwen/Qwen2.5-Coder-0.5B
import ballerina/http;

# A client class for interacting with a chat service.
public isolated client class ChatClient {
    private final http:Client httpClient;

    # Initializes the `ChatClient` with the provided service URL and configuration.
    #
    # + serviceUrl - The base URL of the chat service.
    # + clientConfig - Configuration options for the chat client.
    # + return - An `error` if the client initialization fails otherwise nil.
    public function init(string serviceUrl, *ChatClientConfiguration clientConfig) returns error? {
        if (serviceUrl is not string) {
            return error("Service URL must be a string");
        }

        if (clientConfig is not ChatClientConfiguration) {
            return error("Client configuration must be an instance of `ChatClientConfiguration`");
        }

        httpClient = new(http.Client { serviceUrl: serviceUrl });
        return nil;
    }

    # Se

### Export to Ollama

In [None]:
!pip install -q torch transformers peft accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model
import torch
import os

# Load the fine-tuned model and tokenizer from the saved directory
model_path = "./finetuned_qwen_ballerina"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Merge LoRA weights into the base model
# Note: If your model is already a PeftModel with LoRA, this step merges it
if hasattr(model, "merge_and_unload"):
    print("Merging LoRA weights...")
    model = model.merge_and_unload()  # Merges LoRA weights into the base model and unloads adapters
else:
    print("Model does not have LoRA weights to merge, proceeding with base model.")

# Ensure the model is in evaluation mode
model.eval()

# Save the merged model and tokenizer to a new directory for Ollama
ollama_model_path = "./ollama_finetuned_qwen_ballerina"
model.save_pretrained(ollama_model_path)
tokenizer.save_pretrained(ollama_model_path)

print(f"Model and tokenizer saved to {ollama_model_path}. Ready for Ollama export.")

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m62.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m37.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m33.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ValueError: Unrecognized model in ./finetuned_qwen_ballerina. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, aria, aria_text, audio-spectrogram-transformer, autoformer, aya_vision, bamba, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, cohere2, colpali, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dab-detr, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, depth_pro, deta, detr, diffllama, dinat, dinov2, dinov2_with_registers, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, emu3, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, gemma3, gemma3_text, git, glm, glpn, got_ocr2, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, granitemoeshared, granitevision, graphormer, grounding-dino, groupvit, helium, hiera, hubert, ibert, idefics, idefics2, idefics3, idefics3_vision, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mistral3, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, modernbert, moonshine, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prompt_depth_anything, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_5_vl, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rt_detr_v2, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, shieldgemma2, siglip, siglip2, siglip_vision_model, smolvlm, smolvlm_vision, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superglue, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, textnet, time_series_transformer, timesformer, timm_backbone, timm_wrapper, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vitpose, vitpose_backbone, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zamba2, zoedepth