# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
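
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a standard-library HTTP handler. It is not the linked `custom_server.py`. `EchoEngine`, `build_response`, and `make_handler` are hypothetical names, and the only engine behavior assumed is that `generate(prompts, sampling_params)` takes a list of prompts and returns a list of dicts with a `text` field, as in the cells later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the list-in, list-of-dicts-out shape used in this document.
        return [{"text": f" (echo) {p}"} for p in prompts]


def build_response(engine, payload):
    """Turn a parsed JSON request into a JSON-serializable response."""
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return {"prompt": prompt, "text": outputs[0]["text"]}


def make_handler(engine):
    """Create a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps(build_response(engine, payload)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


# To serve requests, one could run (blocking):
# HTTPServer(("127.0.0.1", 8000), make_handler(EchoEngine())).serve_forever()
demo = build_response(EchoEngine(), {"prompt": "Hello"})
print(demo)
```

Keeping the request-to-response logic in a plain function makes it testable without opening a socket. A real server would also need input validation and error handling, which this sketch omits.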



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
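
For context, the standard-library snippet below reproduces the underlying restriction: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside notebook kernels. It uses neither SGLang nor `nest_asyncio`, and the function names are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, so a nested asyncio.run() fails.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.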

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.65it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.64it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  James, and I am 22 years old. I am a software developer and I am quite active on Twitter. I do not own or produce any copyrighted material, but I do post excerpts from my own Twitter posts, and I do not take credit for any of my Twitter content.

Please share with me what you think about the use of Twitter as a platform for political messaging. What are the pros and cons of the use of Twitter for political messaging? How would you evaluate the success of Twitter as a political messaging platform? When and how did Twitter become the dominant social media platform for political messaging? In what ways did Twitter influence the
Prompt: The president of the United States is
Generated text:  married to a woman who is the mother of five children. How many children does the president have? To determine how many children the president has, we need to consider the information provided in the problem:

1. The president is married to a woman.
2. The woma
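
In batch jobs the generated text is usually stored rather than printed. The sketch below writes prompt/output pairs as JSON Lines. It assumes only the output shape shown above (a list of dicts with a `text` key). `write_results_jsonl` is a hypothetical helper, and the example outputs are placeholders, not real model results.

```python
import io
import json


def write_results_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair to a text stream."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder outputs; in practice these would come from llm.generate(...).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

buffer = io.StringIO()
write_results_jsonl(example_prompts, example_outputs, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Passing an open file object instead of `io.StringIO()` writes the same records to disk.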

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also a major center for art, culture, and fashion, and is home to many world-renowned museums, theaters, and restaurants. Paris is a cultural and economic hub of France and a major tourist destination. It is home to many famous landmarks and attractions, including the Louvre, the Eiffel Tower, and the Champs

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks and biases in AI, there will be a greater emphasis on ethical AI. This will likely lead to more rigorous testing and validation of AI systems, as well as greater transparency and accountability in their development and deployment.

2. Integration of AI with other technologies: AI is likely to become more integrated with other technologies, such as blockchain, quantum computing, and biotechnology. This will allow for more complex and innovative applications of
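
The "overlap removal" mentioned above can be pictured as follows: when a new chunk starts with text that already ends the accumulated output, the repeated part is dropped before appending. The sketch below is one plausible way to do this and is not necessarily how `stream_and_merge` is implemented. `trim_overlap` and `merge_chunks` are hypothetical names, and the chunks are made up.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Accumulate streamed chunks into one string, skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each piece repeats the tail of the previous one.
example_chunks = ["Hello wor", "world", "ld!"]
merged_text = merge_chunks(example_chunks)
print(merged_text)  # Hello world!
```

A heuristic like this can also remove text that is legitimately repeated, for example two identical consecutive tokens, so it only fits streams that are known to resend overlapping text.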



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I am a/an _____________. I have __________ years of experience in ___________ and I currently work for ___________. I'm an expert in ____________. My ___________ is ___________. In my free time, I enjoy ____________. My professional goals include ___________. 

Please provide me with a list of questions I should ask the person to better understand their background. Please also list the various reasons why the person is good at their job. Include a table with a list of their achievements and a table of their skills. Lastly, include a chart with a table of their most valuable skills and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. 

This statement encapsulates the key information provided in the original prompt, including the fact that it is the capital 
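
Because `async_generate` is awaited, several calls can in principle be in flight at once. The sketch below awaits multiple prompt batches concurrently with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that mimics only the call shape used above, and whether concurrent calls help with a real engine is an assumption, since the engine already schedules batches internally.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Await several prompt batches concurrently and keep their order."""
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*calls)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_batches(FakeAsyncEngine(), batches, {"temperature": 0.8})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its input batch.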

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name] and I'm an [insert occupation or profession] who started learning to code when I was [insert number or date] of years old. I've always been fascinated by technology and always wanted to create my own, but I've always struggled with learning the basics of programming. After reading some online tutorials and experimenting with different languages, I realized that I can take it on, and I'm here to share my knowledge with you. My name is [insert first name] and I'm excited to help you learn how to code! What's your name? And what's your occupation? And what's your favorite hobby

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville de Paris." It is the largest city in Europe, located on the Seine River, and is home to the French Parliament and the iconic Eiffel Tower. The city is known for its rich history, diverse culture, and iconic landmarks such as the Louvre Museum, Notre-Dame Cathedral, and the Bois de Boulogne Park. Paris is also home to many art galleries, museums, and restaurants, and is a popular tourist destination for people from all over the world. The city is known for its romantic atmosphere, lively nightlife, and cultural events, and continues to be a bustling

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving, and there are many possibilities for what to expect in the years ahead. Here are some of the most likely trends that could shape the field:

1. Increased use of AI for healthcare: AI is already being used in a wide range of healthcare applications, from monitoring patient health and adjusting treatment plans to analyzing medical images and predicting disease outbreaks. As AI continues to improve, we can expect to see even more widespread use in healthcare.

2. AI in transportation: The demand for electric and autonomous vehicles is growing rapidly, and AI is playing a significant role in the development of these vehicles. We can expect to see more AI-powered traffic
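
The cell above prints chunks as they arrive. When the complete text is also needed afterwards, one option is to accumulate the chunks while forwarding each one to a callback. The sketch below shows this with a made-up async generator. `fake_chunk_stream` and `collect_stream` are hypothetical names and do not come from SGLang.

```python
import asyncio


async def fake_chunk_stream(chunks):
    """Hypothetical async generator standing in for a streaming engine call."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. lambda c: print(c, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    collect_stream(fake_chunk_stream([" Paris", ",", " also", " known"]), seen.append)
)
print(full_text)
```

Under the assumption that `async_stream_and_merge(llm, prompt, sampling_params)` yields plain text chunks, as the cell above suggests, it could be passed to `collect_stream` in place of the fake generator.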




In [6]:
llm.shutdown()
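
In a script, an exception raised during generation would skip a trailing `shutdown()` call. The sketch below guards against that with a small context manager. `engine_session` and `FakeEngine` are hypothetical names, and the only assumption about the engine is that it exposes a `shutdown()` method, as used above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(engine.is_shut_down)
```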