# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
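
Whether the patch is needed depends on whether an event loop is already running when your code executes. The following is a minimal sketch of such a check using only the standard library; `in_running_loop` and `maybe_apply_nest_asyncio` are hypothetical helper names introduced here for illustration and are not part of SGLang.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside an already running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (for example, a plain Python script).
        return False
    return True


def maybe_apply_nest_asyncio() -> bool:
    """Apply nest_asyncio only when a loop is already running; return whether it was applied."""
    if not in_running_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need the package

    nest_asyncio.apply()
    return True
```

In a plain script `in_running_loop()` returns `False`, so nothing is patched. Inside a coroutine, or in an environment that executes your code on a running loop, it returns `True`.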

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Notebooks already run an event loop, so allow nested asyncio.run() calls.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex. I'm a writer. I'm from Boston. I'm from Massachusetts. I'm American. I live in New York. I was born in Brooklyn. I'm the third of three children. My dad is an attorney and my mom is a teacher. I have a younger sister and a younger brother. I have a close relationship with my sister. I'm married, but we're not married yet. I've had some traveling and have lived in Europe and Asia and other countries. I like writing stories and I've had a lot of success with my stories. I have two big awards to show for it. One is a
Prompt: The president of the United States is
Generated text:  a wealthy man who owns a luxurious penthouse with a total area of 100,000 square feet. He decides to allocate a portion of his wealth to a charity. He divides this wealth into 10 equal parts, and each part will be invested in a different charity. If the president decides to allocate $5,000 to each part and invests the remainder in a public art project, what will be 
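
For offline batch jobs it is often convenient to store each prompt together with its completion. The sketch below assumes only what the cell above shows: `generate` returns one result per prompt, in order, and each result has a `"text"` key. The helpers `to_records` and `to_jsonl` and the stand-in data are hypothetical and are not part of SGLang.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the 'text' field of the matching output."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
fake_outputs = [{"text": " Alex."}, {"text": " Paris."}]
records = to_records(["Hello, my name is", "The capital of France is"], fake_outputs)
print(to_jsonl(records))
```

With a real engine, you would pass `outputs` from the cell above in place of `fake_outputs`.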

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic hub, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination, attracting millions of visitors each year. Paris is also known for its cuisine, with its famous dishes such as croissants, beignets, and escargot. The city is also home to many other notable

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to more personalized and context-aware AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced machine learning capabilities: AI systems are likely to become even more powerful and capable, with the ability to learn from vast amounts of data and adapt to new situations. This could lead to more efficient and effective AI systems that can handle a wider range of tasks and applications.

3. Greater emphasis on ethical and
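
The banner printed above mentions overlap removal: a streamed chunk may repeat text that has already been received, and the repeated part is dropped before merging. The following is a minimal sketch of that idea, assuming each chunk may begin with a suffix of the accumulated text. `strip_overlap` and `merge_chunks` are hypothetical names used for illustration, and the actual logic inside `stream_and_merge` may differ.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing."""
    max_possible = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first and stop at the first match.
    for size in range(max_possible, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while removing overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world"]))
```

Note that this heuristic cannot tell accidental overlap from genuine repetition: if the model really emits the same text twice in a row, the second copy would be trimmed as well.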



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm a [role or profession] who specializes in [specialty or expertise]. I am [born and raised] in [city]. I live in [your current location]. I graduated from [school] with a [degree or major] and have [number of years] years of experience in this field. I am currently [status] in my [major or field of study]. I enjoy [interest or hobby]. I am [age], [height], and [weight]. I am [gender]. I have [physical attributes or personality traits]. I am [ability or personality]. I am a [professional or entrepreneur

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, a beautiful historic city with a rich history, famous for its grand boulevards, museums, and elegant palaces.
1. Paris is the capital of France.
2. It is also the country's largest city, with a population of over 
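
Because `async_generate` is awaited, the calling coroutine can do other work while a batch is in flight. The sketch below shows the general `asyncio.gather` pattern for awaiting several independent batches together. It uses `fake_async_generate`, a hypothetical stand-in that mimics only the call shape used above. Whether a real engine benefits from concurrent calls like this is an assumption that you should verify for your setup.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for llm.async_generate: returns one dict per prompt."""
    await asyncio.sleep(0)  # yield control, as a real awaitable call would
    return [{"text": f" <completion for: {prompt}>"} for prompt in prompts]


async def run_jobs(jobs):
    """Await several independent batches together; results keep the order of jobs."""
    pending = [fake_async_generate(prompts, params) for prompts, params in jobs]
    return await asyncio.gather(*pending)


jobs = [
    (["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95}),
    (["The capital of France is"], {"temperature": 0.2, "top_p": 0.9}),
]
results = asyncio.run(run_jobs(jobs))
for (prompts, _), outputs in zip(jobs, results):
    for prompt, output in zip(prompts, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```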

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [age] year old aspiring [occupation]. I believe that [reason for interest in the field] is what makes me unique and important to me. With my [strength or skill] abilities, I have the potential to make a [positive impact on the world] that I am passionate about. If you want to know more, please let me know. [Name]. [Name] is a [occupation], [reason for interest in the field]. [Name] believes that [reason for interest in the field] is what makes [Name] unique and important to [Name]. With [Name's ability



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city and the most populous urban area in France, with a population of over 2 million people. It is the cultural, educational, and economic center of France, and one of the most visited cities in the world. Paris is also known as "la Ville Fluviale" (The River City) for its historic canal system. The city is home to many of France's most famous landmarks and attractions, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is also a major hub for French culture, with the French language being one of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and has the potential to revolutionize various industries and aspects of human life. Some possible trends in AI include:

1. Increased automation and efficiency: AI is becoming increasingly efficient and can handle complex tasks that were previously done by humans. This will likely lead to increased productivity and cost savings for businesses.

2. Improved natural language processing: With the help of AI, natural language processing capabilities will become more advanced, allowing machines to understand human language more accurately and respond to queries better.

3. Enhanced personalization: AI will allow machines to personalize their interactions with users, creating more personalized experiences and a more seamless user experience.

4. Greater
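
The loop in the cell above prints each chunk as it arrives. If you also need the complete text afterwards, you can collect the chunks while iterating. This is a minimal sketch that uses `fake_stream`, a hypothetical stand-in async generator, in place of `async_stream_and_merge(llm, prompt, sampling_params)`. It assumes only that the stream yields string chunks.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for an async stream of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect(stream) -> str:
    """Consume an async stream of string chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect(fake_stream([" Paris", ".", " It", " is"])))
print(repr(text))
```

Appending to a list and joining once at the end avoids repeated string concatenation for long generations.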




In [6]:
llm.shutdown()
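
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager built on `try`/`finally`. The sketch below uses `FakeEngine`, a hypothetical stand-in that exposes only `shutdown()`, and `managed`, a hypothetical helper name. With a real engine you would pass the `llm` object instead.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call engine.shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(engine.closed)  # True: shutdown() ran despite the exception
```

`contextlib.closing` is not a drop-in alternative here because it calls `close()` rather than `shutdown()`.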