# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
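
As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core, independent of any web framework. `StandInEngine` and `handle_generate` are hypothetical names introduced here; the stand-in assumes that a single-prompt `generate` call returns a dict with a `text` field, and it is not taken from the linked script.

```python
import json


class StandInEngine:
    """Hypothetical stand-in; a real server would hold an engine instance."""

    def generate(self, prompt, sampling_params):
        # Assumption: one prompt in, one dict with a "text" field out.
        return {"text": f" echo:{prompt}"}


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into a (status_code, response_bytes) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # ValueError covers malformed JSON and undecodable bytes.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    if not isinstance(prompt, str):
        error = {"error": "'prompt' must be a string"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, response = handle_generate(StandInEngine(), b'{"prompt": "hi"}')
print(status, response.decode())
```

Keeping the handler free of framework code means the same function could be called from whichever HTTP layer you choose, and it can be unit-tested without loading a model.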



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
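
To see why the patch is needed, the following standard-library sketch reproduces the failure that occurs when `asyncio.run()` is called while an event loop is already running, which is the situation inside a notebook cell. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio`, the nested call is rejected with a `RuntimeError`; applying the patch is what lets such nested calls proceed.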

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # CI-only patch module; not needed for local runs
else:
    # Allow asyncio.run() inside an already-running notebook event loop
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# A list of prompts is processed as one batch; each output is a dict with a "text" field
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Tanya. I am a student of year 9 at a school in the United Kingdom. I have a pet cat named Fido. I love my cat very much. I have a new pet dog named Whiskers. He is a very friendly dog. He has a long fur. He is a big and strong dog. I like to ride my bicycle, go to the cinema and go to the park. I think it's a great fun to play with my cat and my dog. I hope to have a new pet when I grow up. What would be the best title for this passage? A) Pet Cats B)
Prompt: The president of the United States is
Generated text:  trying to decide whether to use a drone to capture a photograph of the president of South Korea. The president is 5 feet tall, and the drone is 20 feet tall. If the drone is 10 feet above the president's head, how much larger is the photograph taken by the drone compared to if it was taken from the president's perspective?
To determine how much larger the photograph taken by the drone is compared to if it was taken from the president'
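
The `temperature` and `top_p` values passed above control how the next token is drawn. The sketch below is a conceptual, pure-Python illustration of temperature scaling followed by nucleus (top-p) filtering; it is not SGLang's sampler, and `sample_top_p` is a hypothetical helper written for this page.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample a token index with temperature scaling and top-p filtering."""
    rng = rng or random.Random()

    # Temperature (assumed > 0) rescales logits; values below 1 sharpen
    # the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [4.0, 3.5, 0.5, -2.0]
picks = [sample_top_p(logits, rng=random.Random(seed)) for seed in range(200)]
print(sorted(set(picks)))
```

With these example logits, the two most likely tokens already cover more than 95% of the probability mass, so the low-probability tail is never sampled. Lowering `top_p` or `temperature` narrows the choice further, which is consistent with the more repetitive output seen at `temperature=0.2` later in this document.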

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population of over 10 million people. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its cuisine, fashion, and art scene, making it a popular tourist destination. The city is a cultural and economic hub of France and plays a significant role in the country's economy. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, with the potential to revolutionize the way we treat and diagnose diseases.

2. AI in finance: AI is already being used in finance to improve risk management, fraud detection, and trading algorithms. As AI technology continues to improve, we
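
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose beginnings repeat the end of the text received so far; `trim_overlap` and `merge_chunks` are illustrative helpers written for this page, and the sketch does not claim to match how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

One design caveat: a purely textual overlap check cannot tell a repeated boundary from a legitimate repetition, so a chunk that genuinely starts with the same characters that end the previous text would be shortened.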



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; each output is a dict with a "text" field
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I specialize in [What you do in the industry]. I am a [Your specialty] expert with years of experience in the industry, and I aim to provide you with the best possible solutions to your problems. How can I help you today?

Hello, my name is [Your Name], and I specialize in [What you do in the industry]. I am a [Your specialty] expert with years of experience in the industry, and I aim to provide you with the best possible solutions to your problems. How can I help you today?

---

[Your Name]

---

Your friendly assistant. 

---

**Note:**

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Pucelle".

The answer is: Paris, also known as "La Pucelle," is the capital city of France. It was established in 914 by Charles Martel as the chief city of the Franks, after conquerin
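
The practical benefit of the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with a hypothetical `StandInEngine` whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` field out); it does not load a model.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in that mimics the shape of the async call."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that runs while generation is pending.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def run_batch():
    engine = StandInEngine()
    ticks = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch())
print(len(outputs), ticks)
```

In a real application the second coroutine might serve other requests or write results to storage while the batch is being generated.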

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm an [age] year old, [gender] woman. I'm currently a [job title] who specialize in [specific skill or area of expertise]. I have a passion for [mention an area of interest or hobby that you enjoy]. I'm constantly learning and growing as a person and [fill in the blank with something that describes your personality or character traits]. I believe that I have a unique talent for [mention a positive quality or strength of yours]. I'm [insert a personal trait or attribute that makes you stand out]. And I'm always looking for opportunities to share my knowledge and inspire others,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River and known for its historical and artistic landmarks, including the Eiffel Tower and the Louvre Museum. It has a population of about 2.5 million people.

(Note: I've included an example sentence to give context: "Paris is the most populous city in France, with an estimated population of over 2.5 million people as of 2021, according to the 2020 census. ")

Please provide the correct statement if you believe it to be accurate. If the statement is not correct, please explain why it is incorrect. To answer this

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be rapidly advancing and evolving. Here are some possible trends that could shape the future of AI:

1. Increased AI automation: With the increase in the number of AI-driven tasks, there will be a significant rise in the use of AI automation. This will lead to the automation of mundane tasks and reduce the need for human intervention.

2. AI ethics and privacy concerns: As AI systems become more sophisticated, there will be increasing scrutiny of their ethical implications and potential privacy concerns. There will be a need for more stringent regulations and guidelines to ensure that AI systems are used responsibly.

3. AI-powered healthcare: AI is already revolutionizing

```python
# Shut down the engine when finished
llm.shutdown()
```
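
In a script, it can be convenient to guarantee that the shutdown call runs even if generation raises an exception. The sketch below wraps the engine lifetime in a context manager; `managed_engine` and `StandInEngine` are hypothetical names introduced here, and the stand-in only mirrors the `generate` and `shutdown` calls shown in this document.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in exposing the two calls used in this document."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": p.upper()} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always shut it down when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StandInEngine) as engine:
    outputs = engine.generate(["hello"], {"temperature": 0.8})

print(outputs, engine.closed)
```

With a real engine, the factory could be a small function that constructs it with the desired model path, so the body of the `with` block stays unchanged.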