# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
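
The four modes differ only in how the caller receives results: all at once or chunk by chunk, and through a blocking call or a coroutine. The sketch below illustrates those calling patterns with a hypothetical `StandInEngine` class that returns canned text. It is not the SGLang `Engine`, and its method signatures are simplified assumptions made for illustration.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in with canned output; NOT the SGLang Engine."""

    def __init__(self, reply="Paris is the capital of France"):
        self._words = reply.split()

    def generate(self, prompt, stream=False):
        # Non-streaming returns one dict; streaming returns a generator of chunks.
        if stream:
            return ({"text": word} for word in self._words)
        return {"text": " ".join(self._words)}

    async def async_generate(self, prompt, stream=False):
        await asyncio.sleep(0)  # yield control, as a real async call would
        if stream:
            return self._async_chunks()
        return {"text": " ".join(self._words)}

    async def _async_chunks(self):
        for word in self._words:
            await asyncio.sleep(0)
            yield {"text": word}


engine = StandInEngine()
prompt = "The capital of France is"

# 1. Non-streaming synchronous: block until the full result is ready.
sync_full = engine.generate(prompt)["text"]

# 2. Streaming synchronous: iterate over chunks as they arrive.
sync_chunks = [chunk["text"] for chunk in engine.generate(prompt, stream=True)]


async def run_async_modes():
    # 3. Non-streaming asynchronous: await the full result.
    full = (await engine.async_generate(prompt))["text"]
    # 4. Streaming asynchronous: consume chunks with `async for`.
    chunks = []
    async for chunk in await engine.async_generate(prompt, stream=True):
        chunks.append(chunk["text"])
    return full, chunks


async_full, async_chunks = asyncio.run(run_async_modes())
print(sync_full)
print(sync_chunks)
```

The synchronous variants fit plain scripts, while the asynchronous variants fit code that already runs inside an event loop.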

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
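
The patch matters because notebook environments such as IPython already run an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that restriction with plain `asyncio` only (no SGLang involved); the coroutine names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a nested loop is not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.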

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Janna and I am a patient, and I am writing to you today to express my thoughts and feelings about my car. I have been driving a new car that I purchased in a new town and I feel a sense of satisfaction with it. It is clean, well-maintained, and its exterior is of a high quality. My car has a dependable engine and it makes a loud, rumbling engine sound when I turn on the car. My car's interior is also clean and stylish. I appreciate the spacious, bright windows and the spacious, bright seats.
I have been driving a new car for the past 10 years
Prompt: The president of the United States is
Generated text:  trying to decide how many military base chairs to build for the upcoming year. The base chairs are on sale for $35 each, and the president has a budget of $50,000. 

a) What is the total number of base chairs that can be purchased if the price per chair is discounted by $20? 
b) If the discounted prices were to be applied, what would be the to

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and festivals. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is also known for its rich history, including the influence of French colonialism and the impact of the French Revolution. Paris is a vibrant and dynamic city with a rich cultural heritage that continues to inspire and captivate people around the world. 

Paris is the capital of France and is a major cultural and economic center

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This could lead to more natural and intuitive interactions between humans and machines, as well as more effective decision-making.

2. Enhanced privacy and security: As AI becomes more sophisticated, there is a risk that it could be used for malicious purposes. Therefore, there is a need for increased privacy and security measures to protect the data and personal information that is generated by AI.

3. Greater automation and efficiency: AI is likely to become more efficient and effective at performing tasks, which
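
The "overlap removal" named in the banner above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new suffix is appended. The sketch below is a standalone illustration with made-up chunks. The helper names `trim_overlap_sketch` and `merge_chunks` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the end of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any prefix that repeats the merged text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A naive suffix match like this can also drop text that the model legitimately repeats, so treat it as an illustration of the idea rather than a robust deduplication routine.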



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah. I’m a 25-year-old software engineer who has been working in tech for over a decade. I’m really passionate about technology and programming, and I’m always looking for new ways to learn and improve. Outside of work, I enjoy hiking, reading, and spending time with my family and friends. What other things do you know about me? I'm a little introverted, but I love spending time with my own kind. I’m also a bit of a perfectionist, and I strive to always do my best. I always have a good laugh, but I’m not very good at socializing. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city located in the south-central part of the country. It is the largest city and the most populous metropolitan area in Europe. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, a

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am a [occupation]! I am currently [current location] and I enjoy [reason for my hobby or interest]. I love to [describe a fun activity or project]. I have a [number] degree from [university], and I have been [number] years of experience in [occupation]. I am [age], and I love [love or hobby of mine]. I am a [enjoyment or interest] person, and I am always up for [activity or project]. I am always eager to learn new things, and I am always open to trying new experiences. What brings you to this



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and Notre Dame Basilica. It is also home to the largest city in the European Union, with a population of over 7 million people. Paris is a vibrant and diverse city, known for its fashion, art, and cuisine. It is also a popular tourist destination, with millions of visitors annually. With its rich history and cultural heritage, Paris is a fascinating city that has fascinated people for centuries. Its location in the middle of the continent, near the Mediterranean, has made it a prime destination for many



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  poised to be both transformative and complex. Here are some possible future trends in AI:

1. Increased Integration of AI with Other Technologies: AI is already being integrated with various technologies, including machine learning, natural language processing, and robotics. This integration is expected to continue, with more sophisticated applications becoming available.

2. Enhanced Robustness and Adaptability: As AI becomes more integrated into various domains, it is likely to become more robust and adaptable. This could lead to the development of more sophisticated models that can handle a wide range of tasks.

3. Increased Use of Explainable AI: As AI systems become more sophisticated, they may become




```python
llm.shutdown()
```