# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a server would only add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
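
To illustrate why the patch is needed, the following minimal sketch uses only the standard library (no SGLang calls) to reproduce the error Python raises when code tries to start an event loop from inside one that is already running. The names `inner` and `outer` are illustrative, and it is an assumption here that this nested-loop situation is what the note above refers to.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like code
    # executed in an environment that manages its own loop.
    coro = inner()
    try:
        # Starting a second loop from within a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed.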

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Suhail. I am a computer science major and I am interested in finding solutions for problems that can be described by an equation. I am looking for a way to find solutions for equations that are of the form Ax = b, where A is a square matrix and b is a column vector. Is there a method to find solutions for such equations? If not, what are some alternative methods that could be used? Additionally, is there a way to find solutions to equations that are not of the form Ax = b, such as Ax = c, where c is a column vector? If so, what are the differences between solving for
Prompt: The president of the United States is
Generated text:  at the center of the world’s greatest political controversy, with the possibility of his term coming to an end in less than two years. The President has 200 days to complete his term. This is the first time that the president has been in office for less than 200 days. President Obama, who has a very low level of dissat
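
In offline batch jobs the results are usually saved rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on the `output['text']` field used above; the helper name `to_jsonl`, the record keys, and the example texts are assumptions made for illustration, and the example list stands in for the value returned by `llm.generate(...)`.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(...); the texts are made up.
example_prompts = ["The capital of France is", "Hello, my name is"]
example_outputs = [{"text": " Paris."}, {"text": " Ada."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

With a real engine, `to_jsonl(prompts, llm.generate(prompts, sampling_params))` would produce one record per prompt, in the same order as the input list.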

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of French literature, art, and cuisine. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. The city is home to many museums, theaters, and restaurants, making it a popular destination for visitors from around the world. Paris is also known for its fashion industry, with many famous designers and boutiques located in the city. Overall, Paris is a vibrant and dynamic city that is a must-visit for anyone interested in French culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate many tasks, from manufacturing to customer service, and will become more efficient and accurate. This will lead to increased productivity and lower costs for businesses.

2. Improved privacy and security: As AI systems become more sophisticated, they will need to be designed with privacy and security in mind. This will require ongoing efforts to protect user data and ensure that AI systems are not used to harm individuals.

3. Enhanced human-machine collaboration: AI will continue to play a more significant role in human-machine collaboration, enabling machines to perform tasks that would be too difficult or
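
The "overlap removal" mentioned above can be pictured as follows: when consecutive streamed chunks repeat text that has already been received, only the new part of each chunk is appended. The sketch below is an illustrative stand-alone version of that idea, not the implementation inside `stream_and_merge`; the function names `strip_overlap` and `merge_chunks` are hypothetical.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello wor", "world", "ld!"])
print(merged_text)
```

One trade-off of this approach is that a chunk which legitimately starts with the same characters that end the previous text is treated as overlap, so it only suits streams that are known to repeat text at chunk boundaries.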



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [age] year old [gender] person. I'm currently living in [city or country] and I have a passion for [field or hobby]. My most notable achievement so far has been [specific achievement or accomplishment]. What can you tell me about yourself? As an AI language model, I don't have a physical appearance or a passion for a specific field or hobby. However, I can assist you in finding out more about your interests and experiences. How can I help you today? Let me know! [name] said. (As AI language model, I don't have a physical appearance

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the City of Light, the famous Eiffel Tower, and the annual Eiffel Tower tower博览. 
Paris is a cosmopolitan metropolis with a rich history dating back over 2,000 years. It is known as the “City of Love” and has 
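
Because `async_generate` is awaited, several requests can in principle be in flight at once. The sketch below shows that pattern with `asyncio.gather`, using a stub in place of the real engine so that it runs without a model. `StubEngine`, `generate_groups`, and the placeholder completions are assumptions for illustration; the only detail taken from the example above is an `async_generate(prompts, sampling_params)` coroutine that returns a list of dicts with a `text` key.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics the async_generate call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, groups, sampling_params):
    # Schedule one request per prompt group and wait for all of them.
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*tasks)


groups = [["Hello, my name is", "The future of AI is"], ["The capital of France is"]]
results = asyncio.run(generate_groups(StubEngine(), groups, {"temperature": 0.8}))

for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the tasks were passed in, so each result list lines up with its prompt group.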

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation]. I'm a [type of job] with [number of years of experience] years of experience in [field of work]. I am an [occupational or professional title] with [number of years of experience] years of experience in [occupation]. My primary skill is [a unique skill or trait]. I'm a [character type] with [number of roles or expertise] different roles or expertise that I have worked in. I am an [occupation] with [number of years of experience] years of experience in [occupation]. I am a [occupational or professional title] with



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

(A) True (B) False (A) True

The statement is true because Paris, the city of light, is the capital of France. It is known for its beautiful architecture, vibrant culture, and influential role in the French Republic and its political, social, and cultural life.

Paris is located on the Seine River in the center of the country, and the city is famous for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also home to the world's oldest university, the Sorbonne, and the headquarters of several major French



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very exciting and full of potential. Here are some possible trends that are likely to shape the field:

1. Increased focus on ethical considerations: With more people coming online and the internet becoming more pervasive, it's becoming more important than ever to consider the ethical implications of AI. This includes things like privacy, bias, and transparency, as well as issues around AI bias and its potential to perpetuate discrimination.

2. More advanced hardware and software: As AI gets more complex, there's a need for more powerful hardware and software to support it. This could include things like new types of GPUs, more powerful neural networks, and new machine learning
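
When streaming, it is often useful to keep the complete text while still displaying chunks as they arrive. The sketch below consumes any async iterator of text chunks, optionally forwards each chunk to a callback, and returns the joined result. `fake_stream` and `collect_stream` are hypothetical names, and the fake stream stands in for an iterator such as the one returned by `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming generator of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Forward each chunk to on_chunk (if given) and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " It", " is"])))
print(repr(full_text))
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` reproduces the incremental printing used in the cell above while still returning the full string at the end.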




In [6]:
llm.shutdown()
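
In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is an assumption-laden illustration: `engine_session` and `StubEngine` are hypothetical names, and the stub only mimics the `generate` and `shutdown` methods used in this document.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine and guarantee shutdown() runs, even on errors."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in so the sketch runs without a model."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


stub = StubEngine()
try:
    with engine_session(lambda: stub) as engine:
        engine.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure mid-run")
except ValueError:
    pass

print(stub.is_shut_down)  # True: shutdown ran despite the exception
```

With the real engine, the factory would be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, as used at the top of this section.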