# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
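As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a small HTTP handler built only on Python's standard library. `StubEngine`, `handle_generate`, `make_handler`, `serve`, and the `/generate` route are hypothetical names introduced here. The stub assumes that a single prompt yields a single dict with a `text` key, mirroring the per-prompt dicts shown in the batch examples in this document. A real server would pass an `sgl.Engine` instance instead, and the linked `custom_server.py` may be structured differently.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in that mimics a generate() call returning {'text': ...}."""

    def generate(self, prompt, sampling_params):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response by calling the engine."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # assumed route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve requests until the process is interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()


# Exercise the request logic directly, without opening a socket.
reply = handle_generate(StubEngine(), b'{"prompt": "Hello"}')
print(reply.decode("utf-8"))
```

Keeping the request logic in a plain function (`handle_generate`) makes it testable without starting the server; `serve` is only needed when you actually want to listen on a port.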



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
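The underlying issue is that `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython and Jupyter. The standard-library sketch below reproduces that error with illustrative function names; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, so a nested asyncio.run() fails.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```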

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only imported when the documentation runs in CI.
    import patch
else:
    # Allow nested event loops when running inside a notebook.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tom and I am a volunteer at the Children’s Home and I would like to make a call. Can you please call me?

I understand that this is a concern. I am a volunteer at the Children's Home and I will be a frequent caller. I will be calling the Children's Home on the 16th of November to make a call. 

However, I would like to inform the staff and the Children's Home management team that I need to be available to make this call to ensure that the Children's Home staff are aware of the need to arrange the call. 

The question is, can I contact the Children's
Prompt: The president of the United States is
Generated text:  trying to decide whether to use a machine to assist him in making decisions. The president decides to use a machine that has a failure probability of 0.02% on a monthly basis. 

1. If the machine fails, it costs the president $10,000.00 in lost revenue. Calculate the expected cost of using the machine over a year.
2. If the machine fail
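The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. As a minimal sketch of the usual meaning of these two knobs (not SGLang's actual sampler, whose details may differ), the code below rescales toy logits by the temperature, applies a softmax, and keeps the smallest set of tokens whose cumulative probability reaches `top_p`. The function name and the toy logits are made up for illustration.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return renormalized probabilities over the nucleus selected by top_p."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}
sharp = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
flat = top_p_filter(toy_logits, temperature=2.0, top_p=0.95)
print(list(sharp), list(flat))
```

With these toy numbers, the lower temperature leaves a single candidate, while the higher temperature spreads enough probability that three candidates stay in play.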

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive description of your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. What do you think makes you unique and special? I'm a [insert a short, positive description of your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. What do you think makes you unique and special? I'm a [insert a short, positive description of your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a bustling city with a diverse population and is home to many world-renowned museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a popular tourist destination and is home to many international companies, including the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, AI is likely to continue to be used for ethical and social purposes, such as improving access to healthcare and reducing poverty. Finally, AI is likely to continue to be used for economic and social benefits, such as creating new industries and jobs and improving the efficiency of our economies. Overall, the future of AI is likely to be one of continued innovation and growth. However,
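The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new suffix is appended. The sketch below is an independent illustration of that idea, not the implementation of `stream_and_merge`; `strip_overlap`, `merge_chunks`, and the sample chunks are hypothetical.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of chunk that duplicates the end of existing."""
    limit = min(len(existing), len(chunk))
    for size in range(limit, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "d!"])
print(merged)
```

One trade-off of this heuristic is that a genuine repetition at a chunk boundary is indistinguishable from an overlap: `merge_chunks(["ha", "ha"])` yields `"ha"`, not `"haha"`.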



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [title] at [company]. I enjoy [job role] and I am always looking for ways to [specific goal or dream]. If you're ever in need of a helpful hand, feel free to reach out to me. #JobSeeker #Helpful #Friendly
Hi there! I'm [Name] at [company]. I enjoy [job role] and always aim to help others in need. If you're looking for advice or assistance, feel free to reach out to me. #CareerAdvice #HelpfulFriends #Friendly

Does this cover all the points that a neutral self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

[Mark down]
# The Capital of France: Paris

The capital city of France is Paris. Known for its iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral, Paris is a beautiful and historic city that is easily accessible by air and train from other parts of the world. 

# 
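One reason to prefer the asynchronous call is that other coroutines can make progress while a batch is being generated. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape used above, and `heartbeat` represents any unrelated background work.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in returning one {'text': ...} dict per prompt."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated work that runs while generation is in flight."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def run_batch(engine, prompts, sampling_params):
    log = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(
    run_batch(StubAsyncEngine(), ["first", "second"], {"temperature": 0.8})
)
print(len(outputs), log)
```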

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [profession or role] who has been in the field for [number of years] years. I have a passion for [reason for your profession], and I enjoy [task or challenge you have in your role]. I am always ready to learn and to make a positive impact in the world. What's your profession, and what's your reason for being in it? [Your Name] [Your Profession] [Your reasons for being in the field]. [Your Name] is looking forward to our interview and discussing your professional background. What's next on your to-do list? [Your Name]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "The City of Light" or "The City of Love". It is the largest city in France and one of the largest in Europe, with a population of over 15 million. Paris is a cosmopolitan city with a rich history dating back over 2, 000 years. It is also a major cultural, artistic, and industrial center, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris has a rich cultural heritage, with a diverse range of museums, art galleries, and theaters that offer visitors a glimpse into the city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a complex interplay of trends and developments across various fields. Some of the most prominent trends include:

1. Enhanced AI: AI is being further improved with new algorithms and architectures, leading to even more powerful and accurate predictions. This could lead to new applications in fields such as healthcare, transportation, and cybersecurity.

2. AI ethics and regulation: As AI becomes more prevalent, there will be a need for new ethical frameworks and regulations to govern its use. This could lead to increased scrutiny and regulation of AI, as well as new challenges related to the development of AI that can't be fully understood or predicted.

3
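The streaming loop in the cell above relies on `async for` to consume chunks as they arrive and prints each one with `end=""` so the pieces join into continuous text. The sketch below isolates that consumption pattern; `stub_stream` is a hypothetical async generator standing in for a streaming engine call, and its chunks are invented.

```python
import asyncio


async def stub_stream(prompt):
    """Hypothetical async generator that yields text pieces one at a time."""
    for piece in [" Paris", ",", " the", " capital"]:
        await asyncio.sleep(0)  # hand control back to the loop between pieces
        yield piece


async def collect(prompt):
    """Print pieces as they arrive and also return the joined text."""
    pieces = []
    async for piece in stub_stream(prompt):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


text = asyncio.run(collect("The capital of France is"))
```

Collecting the pieces in a list while printing them lets the caller keep the full completion without a second request.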




In [6]:
llm.shutdown()
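In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. A minimal sketch of one way to do this with a context manager follows; `engine_session` and `StubEngine` are hypothetical helpers written for this example, and the stub only records that `shutdown()` was called.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing the shutdown() method used above."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```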