In [3]:
dataset_name = "tuanlda78202/leo_summarization_task"
model_name = "tuanlda78202/Qwen3-1.7B-Leo-Summarization"

In [1]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=40960,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-30 03:30:10 [__init__.py:243] Automatically detected platform cuda.
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.0.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.638 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 2048, padding_idx=151654)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=6144, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=6144, bias=False)
          (down_proj): Linear4bit(in_features=6144, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
    

In [4]:
from datasets import load_dataset
dataset = load_dataset(dataset_name, split="test")

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
{}

### Response:
{}"""
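
A minimal sketch of how the two `{}` slots in the template get filled. It assumes the first slot receives the document to summarize and the second stays empty at inference time (or would hold a reference summary when building training examples). `build_prompt` and the shortened `template` are hypothetical names used only for illustration; they are not part of this notebook.

```python
# Hypothetical, shortened stand-in for alpaca_prompt so the sketch runs on its own.
template = """### Instruction:
Summarize the document.

### Input:
{}

### Response:
{}"""


def build_prompt(template: str, document: str, response: str = "") -> str:
    """Fill the Input slot with `document` and the Response slot with `response`."""
    # str.format only interprets braces in the template itself, so braces
    # inside the document are inserted verbatim.
    return template.format(document, response)


prompt = build_prompt(template, "vLLM is a library for serving LLMs.")
print(prompt)
```

Leaving the Response slot empty makes the prompt end right after `### Response:`, so the model continues from there.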

In [6]:
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)


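def generate_text(
    document, streaming: bool = True, trim_input_message: bool = False
):
    # The template's first slot is "### Input:" (the document to summarize);
    # the second slot, "### Response:", is left empty for the model to complete.
    message = alpaca_prompt.format(
        document,
        "",
    )
    inputs = tokenizer([message], return_tensors="pt").to("cuda")

    if streaming:
        # Tokens are printed as they are generated; the prompt is echoed first.
        return model.generate(
            **inputs, streamer=text_streamer, max_new_tokens=256, use_cache=True
        )
    else:
        output_tokens = model.generate(**inputs, max_new_tokens=256, use_cache=True)

        if trim_input_message:
            # Slice off the prompt tokens before decoding. Trimming the decoded
            # string by len(message) can misalign if decoding alters the prompt text.
            output_tokens = output_tokens[:, inputs["input_ids"].shape[1] :]
        return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]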
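
The prompt asks for at most 512 characters, but generation is capped by `max_new_tokens`, which counts tokens rather than characters, so nothing in the code enforces that limit. Below is a minimal sketch of a post-processing check, assuming the summary has already been separated from the prompt (for example with `trim_input_message=True`). `enforce_char_limit` is a hypothetical helper, not part of this notebook or of any library.

```python
def enforce_char_limit(summary: str, limit: int = 512) -> str:
    """Return `summary` if it fits within `limit`, else a shortened copy ending in '...'."""
    summary = summary.strip()
    if len(summary) <= limit:
        return summary
    # Reserve three characters for the ellipsis, then back up to the last space
    # so the cut does not land in the middle of a word.
    cut = summary[: limit - 3]
    head, sep, _ = cut.rpartition(" ")
    return (head if sep else cut).rstrip() + "..."


long_summary = "word " * 200  # well over 512 characters
short = enforce_char_limit(long_summary)
print(len(short))
```

Truncation is only one option; regenerating with a smaller `max_new_tokens` or flagging over-length summaries for review would avoid cutting content mid-thought.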

In [11]:
_ = generate_text(dataset[11]["instruction"], streaming=True)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
![preloader](https://ploomber.io/images/favicon_hu25b163a2750ceab2fb56e04dc19e17ac_2757_36x0_resize_q90_h2_box_2.webp)
[![Ploomber](https://ploomber.io/images/logo.png) Ploomber](https://ploomber.io/)
  * [Home](https://ploomber.io/)
  * [Blog](https://ploomber.io/blog/)
  * [Pricing](https://ploomber.io/pricing/)
  * [AI Editor](https://editor.ploomber.io/)
  * [Docs](https://docs.cloud.ploomber.io)
  * [Contact](https://ploomber.io/contact/)
  * [Explore](https://ploomber.io/blog/vllm-deploy/)
    * [Sample Apps](https://www.platform.ploomber.io/explore)
    

[The key findings from the provided documents include the deployment of vLLM as a scalable solution for serving large language models (LLMs) at scale, with Ploomber offering a streamlined deployment process. The guide provides a step-by-step approach to installing and configuring vLLM, emphasizing the importance of prerequisites like CUDA compatibility and proper API key authentication. It also highlights the integration of vLLM with OpenAI and other APIs, showcasing its versatility in providing REST endpoints for model inference. Additionally, the document addresses potential issues, such as the bug in the transformers package, and offers a one-click deployment option on Ploomber Cloud for secure and efficient infrastructure. The focus is on production readiness, including handling crashes and ensuring compatibility with various models and environments. The summary should be concise, capturing the main points without exceeding 512 characters.]
The key findings from the provided docume