<a href="https://colab.research.google.com/github/wjudeh/course-spring-brasb-build-a-rest-api-code/blob/main/llm_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Fine-Tuning Large Language Models: A Practical Guide**

#### **Introduction**

Welcome to this tutorial on **fine-tuning large language models (LLMs)**! In this hands-on guide, you will learn how to fine-tune a **pre-trained language model** using **LLaMA-Factory**, **Hugging Face Transformers**, and **vLLM**. The tutorial provides step-by-step instructions for data preparation, model fine-tuning, evaluation, and deployment.

#### **What You Will Learn**

By the end of this tutorial, you will have worked through the following steps:

- **Setting Up the Environment**: Installing necessary libraries and dependencies for fine-tuning.
- **Dataset Preparation**: Formatting data for **Supervised Fine-Tuning (SFT)** using Pydantic schemas.
- **Fine-Tuning with LLaMA-Factory**: Training a model on a dataset, configuring **LoRA adapters**, and logging experiments with **Weights & Biases (WandB)**.
- **Evaluation and Inference**: Using the fine-tuned model for **Arabic news story processing**, including **entity extraction** and **translation**.
- **Optimized Inference with vLLM**: Deploying the model with **vLLM for fast inference** and testing its performance.
- **Cost Estimation**: Evaluating the token consumption and cost-effectiveness of fine-tuning.

---

#### **Who Should Take This Tutorial?**

This tutorial is designed for:

- **Machine Learning Engineers** who want to fine-tune LLMs efficiently.
- **NLP Practitioners** working with Arabic text processing.
- **AI Enthusiasts** looking to understand **parameter-efficient fine-tuning (LoRA)**.
- **Developers** interested in **deploying optimized LLMs using vLLM**.

---

#### **Prerequisites**

Before starting, ensure you have:

- Basic knowledge of **Python** and **NLP**.
- Familiarity with **Hugging Face Transformers**.
- Access to a **GPU-enabled environment** (Google Colab, AWS, etc.).

---

This tutorial provides a **practical workflow** for fine-tuning LLMs with real-world examples. Get ready to explore **cutting-edge NLP techniques** and optimize your models for better performance! 🚀

Author:
- [Abu Bakr Soliman](https://www.linkedin.com/in/bakrianoo/)

## Mount Your Google Drive

In [None]:
from google.colab import drive
drive.mount('/gdrive')

## Setup

In [None]:
!pip install -qU transformers==4.48.3 datasets==3.2.0 optimum==1.24.0
!pip install -qU openai==1.61.0 wandb
!pip install -qU json-repair==0.29.1
!pip install -qU faker==35.2.0
!pip install -qU vllm==0.7.2

In [None]:
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
!cd LLaMA-Factory && pip install -e .

In [None]:
from google.colab import userdata
import wandb

wandb.login(key=userdata.get('wandb'))
hf_token = userdata.get('huggingface')
!huggingface-cli login --token {hf_token}

## Imports

In [None]:
import json
import os
from os.path import join
import random
from tqdm.auto import tqdm
import requests

from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from datetime import datetime

import json_repair

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

data_dir = "/gdrive/MyDrive/youtube-resources/llm-finetuning"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"

device = "cuda"
torch_dtype = None

def parse_json(text):
    """Parse LLM output as JSON, repairing minor syntax errors; return None on failure."""
    try:
        return json_repair.loads(text)
    except Exception:
        return None
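
The prompts in this notebook end with an opening JSON code fence, so a model reply may carry a closing fence (or a full fenced block) around the JSON. As a rough, standard-library-only sketch of what happens before parsing, the hypothetical helpers `strip_code_fence` and `strict_parse` below remove fence markers and then call `json.loads`. They are illustrative names, not part of `json_repair`, and unlike `parse_json` they do not repair malformed JSON.

```python
import json
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the content of the first fenced block, or the text with stray fences removed."""
    match = FENCE_RE.search(text)
    if match:
        return match.group(1)
    # Only one fence marker (or none) is present: drop whatever is left of it.
    return text.replace(FENCE + "json", "").replace(FENCE, "").strip()


def strict_parse(text: str):
    """Parse fenced or bare JSON strictly; return None if it is not valid JSON."""
    try:
        return json.loads(strip_code_fence(text))
    except json.JSONDecodeError:
        return None
```

A strict parser like this can be useful for measuring how often a model emits valid JSON without any repair step.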

## Tasks

In [None]:
story = """
ذكرت مجلة فوربس أن العائلة تلعب دورا محوريا في تشكيل علاقة الأفراد بالمال،
 حيث تتأثر هذه العلاقة بأنماط السلوك المالي المتوارثة عبر الأجيال.

التقرير الذي يستند إلى أبحاث الأستاذ الجامعي شاين إنيت حول
الرفاه المالي يوضح أن لكل شخص "شخصية مالية" تتحدد وفقا لطريقة
 تفاعله مع المال، والتي تتأثر بشكل مباشر بتربية الأسرة وتجارب الطفولة.

 الأبعاد الثلاثة للعلاقة بالمال
بحسب الدراسة، هناك ثلاثة أبعاد رئيسية تشكّل علاقتنا بالمال:

الاكتساب (A): يميل الأفراد الذين ينتمون لهذا
 البعد إلى اعتبار المال سلعة قابلة للجمع، حيث يرون
في تحقيق الثروة هدفا بحد ذاته. والجانب السلبي لهذا
 النمط هو إمكانية التحول إلى هوس بالثروة أو العكس،
 أي رفض تام لاكتساب المال باعتباره مصدرا للفساد.

الاستخدام (U): يرى هؤلاء الأشخاص المال أداة للتمتع بالحياة، حيث يربطون قيمته بقدرته على توفير
المتعة والراحة. ومع ذلك، قد يصبح
البعض مدمنا على الإنفاق، في حين يتجه آخرون إلى التقشف المفرط خوفا من المستقبل.

الإدارة (M): أصحاب هذا النمط يعتبرون المال مسؤولية تتطلب التخطيط الدقيق. لكن في بعض الحالات،
 قد يتحول الأمر إلى هوس مفرط بإدارة الإنفاق، مما يؤثر سلبا على العلاقات الشخصية.

 كيف تؤثر العائلة على علاقتنا بالمال؟
يشير التقرير إلى أن التجارب الأسرية تلعب دورا رئيسيا في تحديد
 "الشخصية المالية" لكل فرد، على سبيل المثال، إذا كان أحد الوالدين يعتمد على المال
كمكافأة للسلوك الجيد، فقد يتبنى الطفل لاحقا النمط نفسه في حياته البالغة.

لتحليل هذه التأثيرات بشكل دقيق، طورت رابطة العلاج المالي
(Financial Therapy Association) أداة تسمى مخطط الجينوم المالي (Money Genogram)،
وهو نموذج يُستخدم لتحديد الأنماط المالية داخل العائلة.

تتضمن هذه الأداة:

رسم شجرة عائلية.
تصنيف أفراد العائلة وفقا للأبعاد الثلاثة للعلاقة بالمال (A ،U ،M).
تحديد ما إذا كان السلوك المالي لكل فرد صحيا (+) أو غير صحي (-).
على سبيل المثال، إذا نشأ شخص في عائلة
اعتادت على الإنفاق المفرط، فقد يكون لديه ميل قوي إلى اتباع النمط نفسه،
 أو العكس تماما، حيث يصبح مقتصدا بشكل مبالغ فيه كرد فعل نفسي.
"""

In [None]:
# # # Another Sample

# story = """
# قرر المجلس القومي للأجور في مصر، زيادة الحد الأدنى لأجر العاملين بالقطاع الخاص إلى 7 آلاف جنيه شهريًا مقابل 6 آلاف جنيه، على أن يتم تطبيق الزيادة اعتبارًا من 1 مارس 2025.
# كما قرر المجلس أن يكون الحد الأدنى لقيمة العلاوة الدورية للعاملين بالقطاع الخاص 250 جنيهًا شهريًا، ولأول مرة يقرر المجلس القومي للأجور وضع حد أدنى للأجر للعمل المؤقت "جزء من الوقت"، بحيث لا يقل أجرهم عن 28 جنيهًا صافيًا في الساعة، وذلك وفقًا لتعريفهم الوارد في قانون العمل.
# وقالت وزيرة التخطيط والتنمية الاقتصادية والتعاون الدولي، رانيا المشاط، إن رفع الحد الأدنى للأجور يأتي في إطار الحرص على الاستجابة للمستجدات الاقتصادية الراهنة، بما يعزز الاستقرار الاقتصادي والاجتماعي، مضيفة أن ذلك يتسق مع المعايير الدولية، حيث تؤكد منظمة العمل الدولية على ضرورة مراجعة الحد الأدنى للأجور على أساس دوري، لحماية القوة الشرائية للأسر، واستيعاب التغيرات الاقتصادية التدريجية.
# """

### Details Extraction

In [None]:
StoryCategory = Literal[
    "politics", "sports", "art", "technology", "economy",
    "health", "entertainment", "science",
    "not_specified"
]

EntityType = Literal[
    "person-male", "person-female", "location", "organization", "event", "time",
    "quantity", "money", "product", "law", "disease", "artifact", "not_specified"
]

class Entity(BaseModel):
    entity_value: str = Field(..., description="The actual name or value of the entity.")
    entity_type: EntityType = Field(..., description="The type of recognized entity.")

class NewsDetails(BaseModel):
    story_title: str = Field(..., min_length=5, max_length=300,
                             description="A fully informative and SEO optimized title of the story.")

    # Pydantic v2 uses min_length/max_length for list sizes (min_items/max_items are deprecated).
    story_keywords: List[str] = Field(..., min_length=1,
                                      description="Relevant keywords associated with the story.")

    story_summary: List[str] = Field(
                                    ..., min_length=1, max_length=5,
                                    description="Summarized key points about the story (1-5 points)."
                                )

    story_category: StoryCategory = Field(..., description="Category of the news story.")

    story_entities: List[Entity] = Field(..., min_length=1, max_length=10,
                                        description="List of identified entities in the story.")


In [None]:
details_extraction_messages = [
    {
        "role": "system",
        "content": "\n".join([
            "You are an NLP data parser.",
            "You will be provided with an Arabic text associated with a Pydantic schema.",
            "Generate the output in the same language as the story.",
            "You have to extract JSON details from the text according to the Pydantic details.",
            "Extract details as mentioned in the text.",
            "Do not generate any introduction or conclusion."
        ])
    },
    {
        "role": "user",
        "content": "\n".join([
            "## Story:",
            story.strip(),
            "",

            "## Pydantic Details:",
            json.dumps(
                NewsDetails.model_json_schema(), ensure_ascii=False
            ),
            "",

            "## Story Details:",
            "```json"
        ])
    }
]
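
The same message structure is rebuilt later for every story in the distillation loop. One way to avoid the duplication is a small builder function. The sketch below is an assumption about how that could look: `build_extraction_messages` and `EXTRACTION_SYSTEM_PROMPT` are hypothetical names, and the schema is passed in as a plain dict (for example the result of `NewsDetails.model_json_schema()`).

```python
import json

FENCE = "`" * 3  # opening code-fence marker appended at the end of the prompt

EXTRACTION_SYSTEM_PROMPT = "\n".join([
    "You are an NLP data parser.",
    "You will be provided with an Arabic text associated with a Pydantic schema.",
    "Generate the output in the same language as the story.",
    "You have to extract JSON details from the text according to the Pydantic details.",
    "Extract details as mentioned in the text.",
    "Do not generate any introduction or conclusion.",
])


def build_extraction_messages(story_text: str, schema: dict) -> list:
    """Build the system/user chat messages for the details-extraction task."""
    user_content = "\n".join([
        "## Story:",
        story_text.strip(),
        "",
        "## Pydantic Details:",
        json.dumps(schema, ensure_ascii=False),
        "",
        "## Story Details:",
        FENCE + "json",
    ])
    return [
        {"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
```

With such a helper, both the single-story evaluation and the per-story loop could call `build_extraction_messages(story_text, NewsDetails.model_json_schema())`.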

### Translation

In [None]:
class TranslatedStory(BaseModel):
    translated_title: str = Field(..., min_length=5, max_length=300,
                                  description="Suggested translated title of the news story.")
    translated_content: str = Field(..., min_length=5,
                                    description="Translated content of the news story.")

targeted_lang = "English"

translation_messages = [
    {
        "role": "system",
        "content": "\n".join([
            "You are a professional translator.",
            "You will be provided with an Arabic text.",
            "You have to translate the text into the `Targeted Language`.",
            "Follow the provided schema to generate a JSON.",
            "Do not generate any introduction or conclusion."
        ])
    },
    {
        "role": "user",
        "content":  "\n".join([
            "## Story:",
            story.strip(),
            "",

            "## Pydantic Details:",
            json.dumps( TranslatedStory.model_json_schema(), ensure_ascii=False ),
            "",

            "## Targeted Language:",
            targeted_lang,
            "",

            "## Translated Story:",
            "```json"

        ])
    }
]

## Evaluation

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype = torch_dtype
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

In [None]:
model

In [None]:
text = tokenizer.apply_chat_template(
    details_extraction_messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,  # pass attention_mask together with input_ids
    max_new_tokens=1024,
    do_sample=False, top_k=None, temperature=None, top_p=None,
)

generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


In [None]:
print(response)

In [None]:
text = tokenizer.apply_chat_template(
    translation_messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,  # pass attention_mask together with input_ids
    max_new_tokens=1024,
    do_sample=False, top_k=None, temperature=None, top_p=None,
)

generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

## Evaluate OpenAI

In [None]:
from openai import OpenAI
from google.colab import userdata

openai_client = OpenAI(
    api_key=userdata.get('openai-colab'),
    # base_url="http://localhost:8000"
)

openai_model_id = "gpt-4o-mini"

In [None]:
chat_completion = openai_client.chat.completions.create(
    messages=details_extraction_messages,
    model=openai_model_id,
    temperature=0.2,
)

print(chat_completion.choices[0].message.content)

In [None]:
chat_completion = openai_client.chat.completions.create(
    messages=translation_messages,
    model=openai_model_id,
    temperature=0.2,
)

print(chat_completion.choices[0].message.content)

In [None]:
# The reply may include a closing code fence, so use the tolerant parser instead of json.loads.
parse_json(chat_completion.choices[0].message.content)

## Knowledge Distillation

In [None]:
raw_data_path = join(data_dir, "datasets", "news-sample.jsonl")

raw_data = []
with open(raw_data_path, encoding="utf8") as src:
    for line in src:
        if line.strip() == "":
            continue

        raw_data.append(
            json.loads(line.strip())
        )

random.Random(101).shuffle(raw_data)

print(f"Raw data: {len(raw_data)}")

Raw data: 2400


In [None]:
raw_data[0]['content']

In [None]:
cloud_model_id = "gpt-4o-mini"
price_per_1m_input_tokens = 0.150
price_per_1m_output_tokens = 0.600

prompt_tokens = 0
completion_tokens = 0

save_to = join(data_dir, "datasets", "sft.jsonl")

ix = 0
for story in tqdm(raw_data):

    sample_details_extraction_messages = [
        {
            "role": "system",
            "content": "\n".join([
                "You are an NLP data parser.",
                "You will be provided with an Arabic text associated with a Pydantic schema.",
                "Generate the output in the same language as the story.",
                "You have to extract JSON details from the text according to the Pydantic details.",
                "Extract details as mentioned in the text.",
                "Do not generate any introduction or conclusion."
            ])
        },
        {
            "role": "user",
            "content": "\n".join([
                "## Story:",
                story['content'].strip(),
                "",

                "## Pydantic Details:",
                json.dumps(
                    NewsDetails.model_json_schema(), ensure_ascii=False
                ),
                "",

                "## Story Details:",
                "```json"
            ])
        }
    ]

    response = openai_client.chat.completions.create(
        messages=sample_details_extraction_messages,
        model=cloud_model_id,
        temperature=0.2,
    )

    # Every request is billed, so count its tokens before any early `continue`.
    prompt_tokens += response.usage.prompt_tokens
    completion_tokens += response.usage.completion_tokens

    # Skip truncated or otherwise incomplete generations.
    if response.choices[0].finish_reason != "stop":
        continue

    llm_response = response.choices[0].message.content
    llm_resp_dict = parse_json(llm_response)

    # Skip replies that could not be parsed into JSON.
    if not llm_resp_dict:
        continue

    with open(save_to, "a", encoding="utf8") as dest:
        dest.write(json.dumps({
            "id": ix,
            "story": story['content'].strip(),
            "task": "Extract the story details into a JSON.",
            "output_scheme": json.dumps( NewsDetails.model_json_schema(), ensure_ascii=False ),
            "response": llm_resp_dict,
        }, ensure_ascii=False, default=str)  + "\n" )

    ix += 1

    # Report the running cost every 3 saved samples.
    if ix % 3 == 0:
        cost_input = (prompt_tokens / 1_000_000) * price_per_1m_input_tokens
        cost_output = (completion_tokens / 1_000_000) * price_per_1m_output_tokens
        total_cost = cost_input + cost_output

        print(f"Iteration {ix}: Total Cost = ${total_cost:.4f} ")
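
The running cost printed in the loop is plain arithmetic over the token counters. As a standalone illustration, the hypothetical helper `estimate_cost` below applies the same formula. The default prices are the per-million-token values set in the cell above and should be treated as assumptions that may not match current pricing.

```python
def estimate_cost(prompt_tokens: int,
                  completion_tokens: int,
                  price_per_1m_input_tokens: float = 0.150,
                  price_per_1m_output_tokens: float = 0.600) -> float:
    """Return the estimated API cost in USD for the given token counts."""
    cost_input = (prompt_tokens / 1_000_000) * price_per_1m_input_tokens
    cost_output = (completion_tokens / 1_000_000) * price_per_1m_output_tokens
    return cost_input + cost_output


# Worked example: 1M input tokens and 0.5M output tokens.
example_cost = estimate_cost(1_000_000, 500_000)
print(f"Estimated cost: ${example_cost:.4f}")
```

In this example the input side contributes 0.15 USD and the output side 0.30 USD.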

In [None]:
cloud_model_id = "gpt-4o-mini"
price_per_1m_input_tokens = 0.150
price_per_1m_output_tokens = 0.600

prompt_tokens = 0
completion_tokens = 0

save_to = join(data_dir, "datasets", "sft.jsonl")

ix = 0
for story in tqdm(raw_data):

    for targeted_lang in ["English", "French"]:
        sample_translation_messages = [
            {
                "role": "system",
                "content": "\n".join([
                    "You are a professional translator.",
                    "You will be provided with an Arabic text.",
                    "You have to translate the text into the `Targeted Language`.",
                    "Follow the provided schema to generate a JSON.",
                    "Do not generate any introduction or conclusion."
                ])
            },
            {
                "role": "user",
                "content": "\n".join([
                    "## Pydantic Details:",
                    json.dumps( TranslatedStory.model_json_schema(), ensure_ascii=False ),
                    "",

                    "## Targeted Language or Dialect:",
                    targeted_lang,
                    "",

                    "## Story:",
                    story['content'].strip(),
                    "",

                    "## Translated Story:",
                    "```json"
                ])
            }
        ]

        response = openai_client.chat.completions.create(
            messages=sample_translation_messages,
            model=cloud_model_id,
            temperature=0.2,
        )

        # Every request is billed, so count its tokens before any early `continue`.
        prompt_tokens += response.usage.prompt_tokens
        completion_tokens += response.usage.completion_tokens

        # Skip truncated or otherwise incomplete generations.
        if response.choices[0].finish_reason != "stop":
            continue

        llm_response = response.choices[0].message.content
        llm_resp_dict = parse_json(llm_response)

        # Skip replies that could not be parsed into JSON.
        if not llm_resp_dict:
            continue

        with open(save_to, "a", encoding="utf8") as dest:
            dest.write(json.dumps({
                "id": ix,
                "story": story['content'].strip(),

                "output_scheme": json.dumps( TranslatedStory.model_json_schema(), ensure_ascii=False ),
                "task": f"You have to translate the story content into {targeted_lang} associated with a title into a JSON.",

                "response": llm_resp_dict,
            }, ensure_ascii=False, default=str)  + "\n" )

        ix += 1

        # Report the running cost every 3 saved samples.
        if ix % 3 == 0:
            cost_input = (prompt_tokens / 1_000_000) * price_per_1m_input_tokens
            cost_output = (completion_tokens / 1_000_000) * price_per_1m_output_tokens
            total_cost = cost_input + cost_output

            print(f"Iteration {ix}: Total Cost = ${total_cost:.4f} ")

## Format Finetuning Datasets

In [None]:

sft_data_path = join(data_dir, "datasets", "sft.jsonl")
llm_finetunning_data = []

system_message = "\n".join([
    "You are a professional NLP data parser.",
    "Follow the provided `Task` by the user and the `Output Scheme` to generate the `Output JSON`.",
    "Do not generate any introduction or conclusion."
])

for line in open(sft_data_path, encoding="utf8"):
    if line.strip() == "":
        continue

    rec = json.loads(line.strip())

    llm_finetunning_data.append({
        "system": system_message,
        "instruction": "\n".join([
            "# Story:",
            rec["story"],

            "# Task:",
            rec["task"],

            "# Output Scheme:",
            rec["output_scheme"],
            "",

            "# Output JSON:",
            "```json"

        ]),
        "input": "",
        "output": "\n".join([
            "```json",
            json.dumps(rec["response"], ensure_ascii=False, default=str),
            "```"
        ]),
        "history": []
    })

random.Random(101).shuffle(llm_finetunning_data)

In [None]:
len(llm_finetunning_data)

2766

In [None]:
train_sample_sz = 2700

train_ds = llm_finetunning_data[:train_sample_sz]
eval_ds = llm_finetunning_data[train_sample_sz:]

os.makedirs(join(data_dir, "datasets", "llamafactory-finetune-data"), exist_ok=True)

with open(join(data_dir, "datasets", "llamafactory-finetune-data", "train.json"), "w", encoding="utf8") as dest:
    json.dump(train_ds, dest, ensure_ascii=False, default=str)

with open(join(data_dir, "datasets", "llamafactory-finetune-data", "val.json"), "w", encoding="utf8") as dest:
    json.dump(eval_ds, dest, ensure_ascii=False, default=str)

In [None]:
join(data_dir, "datasets", "llamafactory-finetune-data", "val.json")

'/gdrive/MyDrive/youtube-resources/llm-finetuning/datasets/llamafactory-finetune-data/val.json'
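
Before training, it can help to confirm that every record carries the fields written in the formatting cell (`system`, `instruction`, `input`, `output`, `history`). The check below is a minimal sketch: `find_invalid_records` is a hypothetical helper, and the sample records are made up for illustration.

```python
REQUIRED_KEYS = ("system", "instruction", "input", "output", "history")


def find_invalid_records(records: list) -> list:
    """Return the indices of records with missing keys or an empty instruction/output."""
    invalid = []
    for index, rec in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing or not rec.get("instruction") or not rec.get("output"):
            invalid.append(index)
    return invalid


# Illustrative records only; real data would come from train.json / val.json.
sample_records = [
    {"system": "s", "instruction": "i", "input": "", "output": "o", "history": []},
    {"system": "s", "instruction": "i", "input": "", "output": "", "history": []},
    {"instruction": "i", "output": "o"},
]
print(find_invalid_records(sample_records))
```

In the notebook, the same function could be applied to `train_ds` and `eval_ds` before they are written to disk.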

## Finetuning

In [None]:
# Configure LLaMA-Factory for the new datasets:
# update /content/LLaMA-Factory/data/dataset_info.json and append the two entries below.
#
#    "news_finetune_train": {
#         "file_name": "/gdrive/MyDrive/youtube-resources/llm-finetuning/datasets/llamafactory-finetune-data/train.json",
#         "columns": {
#             "prompt": "instruction",
#             "query": "input",
#             "response": "output",
#             "system": "system",
#             "history": "history"
#         }
#     },
#     "news_finetune_val": {
#         "file_name": "/gdrive/MyDrive/youtube-resources/llm-finetuning/datasets/llamafactory-finetune-data/val.json",
#         "columns": {
#             "prompt": "instruction",
#             "query": "input",
#             "response": "output",
#             "system": "system",
#             "history": "history"
#         }
#     }

# https://wandb.ai/mr-bakrianoo/llamafactory/runs/apwbkni9
# https://wandb.ai/mr-bakrianoo/llamafactory/runs/c5tf0q90
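
Editing `dataset_info.json` by hand is easy to get wrong, so the registration step can also be scripted. The following is a sketch under stated assumptions: `alpaca_entry` and `register_datasets` are hypothetical helpers, the column mapping mirrors the entries listed above, and the demo writes to a temporary directory rather than the real LLaMA-Factory file.

```python
import json
import tempfile
from pathlib import Path


def alpaca_entry(file_name: str) -> dict:
    """Build one dataset entry using the column mapping from this notebook."""
    return {
        "file_name": file_name,
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
            "system": "system",
            "history": "history",
        },
    }


def register_datasets(info_path, entries: dict) -> dict:
    """Merge `entries` into the JSON registry at `info_path` and write it back."""
    info_path = Path(info_path)
    info = json.loads(info_path.read_text(encoding="utf8")) if info_path.exists() else {}
    info.update(entries)
    info_path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf8")
    return info


# Demo against a temporary file (placeholder paths, not the real registry).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "dataset_info.json"
    demo_path.write_text(json.dumps({"existing": {"file_name": "a.json"}}), encoding="utf8")
    demo_info = register_datasets(demo_path, {
        "news_finetune_train": alpaca_entry("/path/to/train.json"),
        "news_finetune_val": alpaca_entry("/path/to/val.json"),
    })
    demo_on_disk = json.loads(demo_path.read_text(encoding="utf8"))

print(sorted(demo_info))
```

In the notebook, `info_path` would presumably be `/content/LLaMA-Factory/data/dataset_info.json`, with the `file_name` values pointing at the `train.json` and `val.json` files written earlier.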

In [None]:
%%writefile /content/LLaMA-Factory/examples/train_lora/news_finetune.yaml

### model
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 64
lora_target: all

### dataset
dataset: news_finetune_train
eval_dataset: news_finetune_val
template: qwen
cutoff_len: 3500
# max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

### output
# resume_from_checkpoint: /gdrive/MyDrive/youtube-resources/llm-finetuning/models/checkpoint-1500
output_dir: /gdrive/MyDrive/youtube-resources/llm-finetuning/models/
logging_steps: 10
save_steps: 500
plot_loss: true
# overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
# val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100

report_to: wandb
run_name: newsx-finetune-llamafactory

push_to_hub: true
export_hub_model_id: "bakrianoo/news-analyzer"
hub_private_repo: true
hub_strategy: checkpoint


Overwriting /content/LLaMA-Factory/examples/train_lora/news_finetune.yaml


In [None]:
!cd LLaMA-Factory/ && llamafactory-cli train /content/LLaMA-Factory/examples/train_lora/news_finetune.yaml
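
To get a feel for how long the run is, the batch settings in the YAML can be turned into an approximate step count. This is a back-of-the-envelope sketch, assuming a single GPU and that all 2,700 training samples survive preprocessing. The trainer's exact rounding may differ, and `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_samples: int,
                      per_device_batch_size: int,
                      grad_accum_steps: int,
                      num_epochs: float,
                      warmup_ratio: float,
                      num_devices: int = 1) -> dict:
    """Approximate the number of optimizer steps and warmup steps for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the YAML config and the train split size used above.
schedule = training_schedule(
    num_samples=2700,
    per_device_batch_size=1,
    grad_accum_steps=4,
    num_epochs=3.0,
    warmup_ratio=0.1,
)
print(schedule)
```

Under these assumptions the run takes roughly 2,025 optimizer steps. With `save_steps: 500`, checkpoints would land at steps 500, 1000, 1500, and 2000, and evaluation would run every 100 steps.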

## New Finetuned Model Evaluation

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype = torch_dtype
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

In [None]:
finetuned_model_id = "/gdrive/MyDrive/youtube-resources/llm-finetuning/models"
model.load_adapter(finetuned_model_id)

In [None]:
def generate_resp(messages):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        **model_inputs,  # pass attention_mask together with input_ids
        max_new_tokens=1024,
        do_sample=False, top_k=None, temperature=None, top_p=None,
    )

    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response

response = generate_resp(translation_messages)

In [None]:
parse_json(response)

#### Tip for Qwen2.5

Qwen2.5 often produces Chinese characters in some responses. To avoid this, use the following class to generate responses.

Source:
`https://jupyter267.medium.com/how-to-eliminate-the-chance-of-generating-chinese-in-qwen-2-5-2cf919bb0fdc`



In [None]:
class Generator:
    def __init__(self, model, tokenizer):

        self.model, self.tokenizer = model, tokenizer
        self.mask = None

    def generate(self, messages:list, max_new_tokens: int=2000, temperature:float=0.1):

        def logits_processor(token_ids, logits):
          # The logits processor receives the score matrix (logits) at each time step.
          """
              A processor that bans Chinese characters.
          """
          if self.mask is None:
              # We don't know the vocabulary indices of the Chinese tokens,
              # but we do know the Unicode ranges their characters fall in.

              # decode all the tokens in the vocabulary in order
              token_ids = torch.arange(logits.size(-1))
              decoded_tokens = self.tokenizer.batch_decode(token_ids.unsqueeze(1), skip_special_tokens=True)

              # create a mask tensor to exclude positions of Chinese characters.
              # since this process uses a for loop and is time-consuming,
              # the result will be stored as a property for later use to ensure it only runs once.
              self.mask = torch.tensor([
                  # loop through each token in the vocabulary and compare it to Chinese characters.
                  any(0x4E00 <= ord(c) <= 0x9FFF or 0x3400 <= ord(c) <= 0x4DBF or 0xF900 <= ord(c) <= 0xFAFF for c in
                      token)
                  for token in decoded_tokens
              ])

          # mask the score by - inf
          logits[:, self.mask] = -float("inf")
          return logits

        # this step transforms the messages into a single string,
        # adding special tokens, e.g. separator tokens between the system content and user queries
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )

        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)

        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            # add the logits_processor here
            logits_processor=[logits_processor]
        )
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        return response

In [None]:
# define an object
llm = Generator(model, tokenizer)

# generate a response without chinese characters
response = llm.generate(details_extraction_messages)
print( parse_json(response) )

response = llm.generate(translation_messages)
print( parse_json(response) )
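
The mask inside `Generator` relies on a character-range test. Pulled out as a standalone function, the same test can also be used to check whether a generated response still contains Chinese characters. The ranges below are the three used in the class above. `contains_cjk` is a hypothetical helper name, and the ranges do not cover every CJK block in Unicode.

```python
# Unicode ranges taken from the logits processor above.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk_char(char: str) -> bool:
    """Return True if the character falls into one of the ranges above."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def contains_cjk(text: str) -> bool:
    """Return True if any character in the text is a CJK ideograph."""
    return any(is_cjk_char(char) for char in text)


print(contains_cjk("مرحبا hello"))
print(contains_cjk("مرحبا 你好"))
```

A check like this could be run over a batch of responses to measure how often the issue occurs with and without the logits processor.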

## Cost Estimation

In [None]:
from tqdm.auto import tqdm
from faker import Faker
import random
from datetime import datetime

start_time = datetime.now()
fake = Faker('ar')

input_tokens = 0
output_tokens = 0

for i in tqdm(range(30)):
    prompt = fake.text(max_nb_chars=random.randint(150, 200))

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    response = generate_resp(messages)

    input_tokens += len(tokenizer.apply_chat_template(messages))
    output_tokens += len(tokenizer.encode(response))

total_time = (datetime.now() - start_time).total_seconds()

print(f"Total Time: {total_time} seconds")
print(f"Input Tokens: {input_tokens}")
print(f"Output Tokens: {output_tokens}")
print(f"Total Tokens: {input_tokens + output_tokens}")

In [None]:
# Total Time: 387.838115 seconds
# Input Tokens: 2446
# Output Tokens: 7375
# Total Tokens: 9821

In [None]:
9821 / 388  # total tokens / elapsed seconds = tokens per second

25.311855670103093
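
The throughput figure can be turned into a rough self-hosting cost per million tokens once a GPU price is assumed. The sketch below uses the measured numbers from the run above. The hourly GPU price is a purely hypothetical placeholder, and both helper names are illustrative.

```python
def tokens_per_second(total_tokens: int, total_seconds: float) -> float:
    """Average throughput over the whole run."""
    return total_tokens / total_seconds


def cost_per_million_tokens(throughput_tps: float, gpu_price_per_hour: float) -> float:
    """Estimated USD cost to process one million tokens at a given throughput."""
    tokens_per_hour = throughput_tps * 3600
    return gpu_price_per_hour * 1_000_000 / tokens_per_hour


# Measured values from the run above.
measured_tps = tokens_per_second(9821, 387.838115)

# Hypothetical price, for illustration only.
ASSUMED_GPU_PRICE_PER_HOUR = 1.0

print(f"Throughput: {measured_tps:.2f} tokens/s")
print(f"Cost per 1M tokens: ${cost_per_million_tokens(measured_tps, ASSUMED_GPU_PRICE_PER_HOUR):.2f}")
```

Replacing the placeholder price with the actual hourly rate of the GPU in use gives a figure that can be compared against hosted API pricing.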

## vLLM

In [None]:
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model_id = "/gdrive/MyDrive/youtube-resources/llm-finetuning/models"


!nohup vllm serve "{base_model_id}" --dtype=half --gpu-memory-utilization 0.8 --max_lora_rank 64 --enable-lora --lora-modules news-lora="{adapter_model_id}" &


nohup: appending output to 'nohup.out'


In [None]:
!tail -n 30 nohup.out
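
Because the server starts in the background, requests sent too early will fail. Instead of checking `nohup.out` by hand, a small polling loop can wait until the server answers. The sketch below assumes the vLLM OpenAI-compatible server exposes a `/health` endpoint on port 8000. `http_ok` and `wait_until_ready` are hypothetical helpers, and the `probe` and `sleep` parameters exist only to make the loop easy to test.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str,
                     timeout_s: float = 300.0,
                     poll_s: float = 5.0,
                     probe=http_ok,
                     sleep=time.sleep) -> bool:
    """Poll `url` until the probe succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(poll_s)
    return False
```

A typical call would be `wait_until_ready("http://localhost:8000/health")`, placed before the first inference request.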

### Inference

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

prompt = tokenizer.apply_chat_template(
    translation_messages,
    tokenize=False,
    add_generation_prompt=True
)

In [None]:
vllm_model_id = "news-lora"

llm_response = requests.post("http://localhost:8000/v1/completions", json={
    "model": vllm_model_id,
    "prompt": prompt,
    "max_tokens": 1000,
    "temperature": 0.3
})

llm_response.json()
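
The JSON returned by the completions endpoint follows the OpenAI-style layout, with the generated text under `choices[0]["text"]`. The sketch below pulls out the fields of interest. `extract_completion` is a hypothetical helper, and the sample payload is hand-written for illustration and is not real server output.

```python
def extract_completion(payload: dict) -> dict:
    """Extract the generated text and token usage from a completions response."""
    choice = payload["choices"][0]
    usage = payload.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written payload shaped like an OpenAI-compatible completions response.
sample_payload = {
    "model": "news-lora",
    "choices": [{"index": 0, "text": '{"translated_title": "Example"}', "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
}

result = extract_completion(sample_payload)
print(result["text"])
```

In the notebook, `extract_completion(llm_response.json())["text"]` could then be passed to `parse_json`.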

## Load Testing

In [None]:
%%writefile locust.py

import random
import json
from locust import HttpUser, task, between
from faker import Faker

fake = Faker('ar')

class CompletionLoadTest(HttpUser):
    wait_time = between(1, 3)

    @task
    def post_completion(self):
        model_id = "news-lora"
        prompt = fake.text(max_nb_chars=random.randint(150, 200))

        message = {
            "model": model_id,
            "prompt": prompt,
            "max_tokens": 512,
            "temperature": 0.3
        }

        llm_response = self.client.post("/v1/completions", json=message)

        if llm_response.status_code == 200:
            with open("./vllm_tokens.txt", "a") as dest:
                dest.write(json.dumps({
                    "prompt": prompt,
                    "response": llm_response.json()["choices"][0]["text"],
                }, ensure_ascii=False) + "\n")


Overwriting locust.py


In [None]:
!pip install -q locust
!locust --headless -f locust.py --host=http://localhost:8000 -u 20 -r 1 -t "60s" --html=locust_results.html

In [None]:
vllm_tokens = [
    json.loads(line.strip())
    for line in open("./vllm_tokens.txt") if line.strip() != ""
]

In [None]:
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

total_input_tokens = sum([ len(tokenizer.encode(rec['prompt'])) for rec in vllm_tokens ])
total_output_tokens = sum([ len(tokenizer.encode(rec['response'])) for rec in vllm_tokens ])

print(f"Total Input Tokens: {total_input_tokens}")
print(f"Total Output Tokens: {total_output_tokens}")

Total Input Tokens: 2662
Total Output Tokens: 37840


In [None]:
37840 / 60  # output tokens / test duration in seconds = tokens per second

630.6666666666666
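
Putting the two measurements side by side gives a rough sense of the gain from vLLM. The comparison below is only indicative. The Transformers run was sequential and counted input plus output tokens, while the vLLM run used 20 concurrent users and counts output tokens only. `speedup` is a hypothetical helper.

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Ratio of optimized throughput to baseline throughput."""
    return optimized_tps / baseline_tps


# Figures computed earlier in this notebook.
transformers_tps = 9821 / 388   # sequential generation with Transformers
vllm_tps = 37840 / 60           # 60-second load test against vLLM

ratio = speedup(transformers_tps, vllm_tps)
print(f"vLLM throughput is about {ratio:.1f}x the baseline")
```

Most of this gap likely comes from vLLM batching concurrent requests, so a single-user comparison would show a smaller difference.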