# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
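
If the same code has to run both as a plain script and inside a notebook, one option is to apply the patch only when an event loop is already running. The sketch below uses the standard library to detect that case. The helper names `in_running_loop` and `maybe_apply_nest_asyncio` are illustrative assumptions and are not part of SGLang.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (for example, a plain script).
        return False
    return True


def maybe_apply_nest_asyncio() -> bool:
    """Apply nest_asyncio only when a loop is already running.

    Returns True if the patch was applied, False otherwise.
    """
    if not in_running_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

Calling `maybe_apply_nest_asyncio()` once before creating the engine should be enough. In a plain script it does nothing and returns `False`.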

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.07it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Stephanie Hill, and I'm a young writer. I write short stories and poetry. I love to write about the world and the world of fiction. My work is collected in collections of poetry and short stories, including "The Blue Rainbow" and "The Woman Who Wasn't There."
The story of "The Woman Who Wasn't There" is the story of an old woman who has been misunderstood and the consequences of that misunderstanding. The woman has been portrayed as a "dark, ugly, and censored" woman. She has been labeled as a liar and a fraud.
It is true that the author has made it clear that
Prompt: The president of the United States is
Generated text:  elected for a term of 4 years. If there are 44 eligible voters, how many people have been on the job of president for 5 years or more? Let's start by noting that the president is elected for a term of 4 years, which is equivalent to 12 years of service. We need to determine how many people have been on the job of president fo
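
For batch jobs it is often convenient to keep each prompt together with its completion. The sketch below relies only on what the cell above shows: `generate` takes a list of prompts and returns one dict per prompt with a `text` key. `EchoEngine`, `generate_records`, and `to_jsonl` are hypothetical names introduced here so the snippet runs without loading a model. With a real engine you would pass `llm` instead of `EchoEngine()`.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine, used only to make this sketch runnable."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape shown above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    """Run one batch and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = generate_records(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```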

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. 

This statement encapsulates the key points about Paris, including its capital status, notable landmarks, and cultural attractions. It provides a brief overview of the city's significance in French history and culture. 

To ensure accuracy, the statement should be clear and concise, using appropriate terminology and avoiding any potential confusion. It should also be suitable for a general audience, providing a basic understanding of the city's importance in French society and its cultural landscape. 

Please provide the statement in a format that can be easily copied and pasted into a document

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to evolve and improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some potential future trends in AI include:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn from and adapt to human behavior and decision-making processes.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be a greater need for measures to protect user privacy and ensure that AI
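
"Overlap removal" here means dropping text that a new chunk repeats from what has already been received. The following is a minimal sketch of that idea and is not necessarily how `stream_and_merge` is implemented in SGLang. The names `strip_overlap` and `merge_chunks` are chosen for illustration.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that overlaps the merged result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris, known", " known for"]))
```

One trade-off of this simple approach is that a genuine repetition at a chunk boundary is also treated as overlap: `merge_chunks(["ha", "ha"])` yields `"ha"`.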



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm [Last Name], a [Occupation] [Current Location] [Reason for Joining] and [Name of the Last Professional Experience]. Thank you for the opportunity to meet you.

I am a [First Name] with [Last Name] and I'm a [Occupation]. I'm a [First Name] with [Last Name] and I'm a [Occupation]. My [First Name] with [Last Name] and I'm a [Occupation]. 

I'm [First Name] with [Last Name] and I'm a [Occupation]. My [First Name] with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Palace of Versailles. The city is also renowned for its rich culture, including the annual "Marche Nationale" parade, which takes place every September. Additionally, Paris has a diverse culinary scene with a wide variety of food options and
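
Because `async_generate` is awaited, several batches can be awaited together with `asyncio.gather`. The sketch below illustrates that pattern with a hypothetical `FakeAsyncEngine` in place of `sgl.Engine`, so it runs without a model. `generate_batches` is also an illustrative name. Whether submitting several batches concurrently helps in practice depends on the engine's own scheduling and is not measured here.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with the same async_generate call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Await all batches concurrently; results keep the order of `batches`."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


async def demo():
    batches = [
        ["Hello, my name is"],
        ["The capital of France is", "The future of AI is"],
    ]
    return await generate_batches(FakeAsyncEngine(), batches, {"temperature": 0.8})


results = asyncio.run(demo())
for batch_outputs in results:
    for output in batch_outputs:
        print(output["text"])
```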

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old [Occupation] living in [Your City/Town]. I have always had a special interest in [Your Hobby/Interest/Interest] and I'm always looking for opportunities to [Your Goal]. I believe that [Reason Why You're Good At Something] makes me unique and sets me apart from others. Whether it's [A Unique Skill/Passion/Interest] or [Your Unique Strength], I have it! I believe that being able to [Your Skill/Passion/Interest/Unique Strength] is what makes me special and that's why I'm here.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, as well as its rich history and cultural heritage.

(Note: The statement needs to be crafted in French, with a clear and concise structure.)

Le capital de France est Paris, connu pour ses landmarks comme l'Éiffel Tower, le Musée de la Revue et le Château de Notre-Dame, ainsi que son richesse historique et culturelle.

(So simple et précisément, mais avec une meilleure profondeur que l'autre réponse.)

Le capital de



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by the development of new technologies, applications, and techniques that will further advance our understanding of the world and enable us to make more accurate, efficient, and personalized decisions. Here are some possible future trends in AI:

1. Improved accuracy and reliability: AI systems are becoming increasingly accurate and reliable, but there are still challenges to overcome, such as noise, bias, and uncertainty. Researchers are working on developing more sophisticated algorithms and techniques to improve the accuracy and reliability of AI systems.

2. Integration with other technologies: AI is already integrated into a wide range of applications, from self-driving cars to smart homes. As AI




In [6]:
llm.shutdown()
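
To make sure `shutdown()` runs even when generation raises an exception, the engine can be wrapped in a context manager. This is a sketch that assumes only that the engine exposes `shutdown()` as used above. `engine_session` and `DummyEngine` are hypothetical names introduced here, not SGLang APIs.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine that only records shutdown calls."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, yield it, and always shut it down afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(DummyEngine) as engine:
    pass  # call engine.generate(...) here when using a real engine

print(engine.closed)  # True: shutdown ran when the with-block exited
```

With a real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the model path from the first cell.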