# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
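
As a rough illustration of the idea (not the linked `custom_server.py`, which may be structured differently), the sketch below wraps an engine-like object in a tiny HTTP endpoint using only the Python standard library. `EchoEngine`, `handle_generate`, and `make_server` are hypothetical names introduced here; `EchoEngine` stands in for `sgl.Engine` and only mimics the `generate(prompts, sampling_params)` call and the `{"text": ...}` output shape used later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the output shape used in this document: one dict per prompt.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a 'prompts' list"}
        return 400, json.dumps(error).encode()
    outputs = engine.generate(prompts, payload.get("sampling_params", {}))
    return 200, json.dumps({"texts": [o["text"] for o in outputs]}).encode()


def make_server(engine, host="127.0.0.1", port=8000):
    """Build (but do not start) an HTTP server exposing POST /generate."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, data = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return HTTPServer((host, port), Handler)
```

Calling `make_server(engine).serve_forever()` would start serving requests. Keeping the request logic in `handle_generate` separates it from the transport, so the same function could sit behind a different web framework.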



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that runs inside a nested event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
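
To see the kind of failure this guards against: environments such as notebook kernels typically already run an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that error with `asyncio` alone, without SGLang; `inner` and `outer` are illustrative names, and the exact call that fails inside the engine may differ from the `asyncio.run()` used here.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested entry into the event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of re-entrant call is allowed.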

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci

# Helpers that stream chunks and merge them without overlapping text.
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John. I'm a friendly student who studies hard. I like to help people and want to make the world a better place. I can also speak a few languages and I am a good cook. I think cooking is a very important and enjoyable activity, and I love trying new recipes. I am very outgoing and like to make people happy. I am willing to learn new things and improve my skills. What do you think John is like? Let's see if we can guess what kind of person he is.
Answer this question: Is John a male or female? To answer this question, we need to consider the information we have about
Prompt: The president of the United States is
Generated text:  a very important person in the government of the country. Everyone knows that the president of the United States, Donald Trump, is an anti-establishment and progressive political leader. Although he has not been in power for many years, he has been trying to become more and more popular among people around the world.

Do
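
For larger offline jobs, one common pattern is to split the prompt list into fixed-size batches and persist results as each batch completes, so that progress is not lost if the job is interrupted. The sketch below shows one way to do that. `FakeEngine`, `batched`, and `run_offline_job` are hypothetical names, and `FakeEngine` only imitates the `generate(prompts, sampling_params)` call and the `{"text": ...}` output shape shown above so that the example runs without a model.

```python
import io
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_offline_job(engine, prompts, sampling_params, out_file, batch_size=2):
    """Generate in batches and write one JSON line per prompt to `out_file`."""
    written = 0
    for batch in batched(prompts, batch_size):
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            out_file.write(json.dumps(record) + "\n")
            written += 1
    return written


buffer = io.StringIO()  # an open file handle would work the same way
count = run_offline_job(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    buffer,
)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

The chunking here is about checkpointing and bounding the size of in-memory results; it is not meant as a throughput optimization.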

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. It is also the birthplace of the French Revolution and the home of the French language. Paris is a bustling metropolis with a rich history and a diverse population. The city is known for its fashion, art, and cuisine, and is a major tourist destination. It is also home to many famous landmarks and attractions, including the Louvre Museum and the Champs-Élysées. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is known for its love of music, art,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more natural and intuitive interactions between humans and machines.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more robust and transparent AI systems that are designed to minimize harm and maximize benefits.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes
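
The "overlap removal" mentioned in the cell above addresses streams in which a new chunk may repeat the tail of the text already received. The sketch below is one possible way to implement such merging in plain Python. It is not taken from `stream_and_merge`, whose internals may differ, and `trim_overlap` and `merge_chunks` are names chosen for this illustration.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of `new_chunk` that does not repeat the end of `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


text = merge_chunks(["The capital", " capital of France", "France is Paris."])
print(text)
```

A heuristic like this can also swallow genuine repetitions (for example, the chunks `"ha"` and `"ha"` merge to `"ha"`), so it is only appropriate when chunks are known to overlap.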



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Genre] novelist who specializes in [Specific genre] fiction. I'm excited to bring my unique voice to your story. What is your name? What's your genre of fiction? What's your writing style? How long have you been in the industry? What's your favorite book to read? What's your favorite part of writing? What's your favorite part of being a novelist? What's your favorite part of being a writer? What's your favorite part of being a novelist? What's your favorite part of being a novelist? What's your favorite part of being a novelist? What's your favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

This statement is factually correct and provides a clear understanding of the capital's name, the country it belongs to, and its official title. It avoids any potential confusion or ambiguity.
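
Because `async_generate` is awaitable, several requests can be issued concurrently from one event loop. The sketch below bounds the number of in-flight requests with `asyncio.Semaphore`. `FakeAsyncEngine` and `generate_bounded` are hypothetical names; the fake engine imitates the list-in, list-out `async_generate(prompts, sampling_params)` call shown above so that the example runs without a model.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an async_generate coroutine."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0

    async def async_generate(self, prompts, sampling_params=None):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulate generation latency
        self.in_flight -= 1
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_bounded(engine, prompts, sampling_params, limit=2):
    """Issue one request per prompt, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt):
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather() preserves the order of `prompts` in its result list.
    return await asyncio.gather(*(one(p) for p in prompts))


engine = FakeAsyncEngine()
texts = asyncio.run(
    generate_bounded(engine, ["a", "b", "c", "d"], {"temperature": 0.8})
)
```

Whether per-prompt requests or a single batched call performs better with the real engine depends on the workload; the sketch only illustrates the concurrency pattern.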

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I have been at this for a while now and I've learned a lot. I enjoy helping people and I'm always looking for new opportunities to learn and grow. I'm an active member of the community and I'm always willing to lend a helping hand to others.

This introduction should include at least one positive attribute that sets you apart from others in your field and one action or achievement that you are proud of. In your case, the positive attribute could be [positive attribute], and the action or achievement could be [example of action or achievement]. This way, the reader will get a clear picture of who you are

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in the country and home to the world's tallest building, the Eiffel Tower.

That statement is true. Paris is the capital of France, and it is home to the Eiffel Tower, the tallest man-made structure in the world. Other famous landmarks in Paris include the Louvre Museum, the Arc de Triomphe, and Notre-Dame Cathedral. Paris is a popular tourist destination, known for its vibrant culture, beautiful architecture, and French cuisine. The city has a rich history, with an ancient Roman site, medieval cathedral, and modern fashion district. The city is also home to the European

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting, and there are several trends that are likely to shape the landscape of AI in the coming years. Some of the possible future trends in AI include:

1. Autonomous vehicles: Self-driving cars will become more common in the future, and AI will play a crucial role in making them safer and more reliable.

2. Chatbots and AI assistants: AI will be able to interact with humans in a more natural and conversational way, with chatbots becoming increasingly sophisticated and user-friendly.

3. Medical diagnosis and treatment: AI will play a critical role in improving the accuracy and efficiency of medical diagnosis and treatment. This includes the ability to analyze
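
The streaming loop in this section prints chunks as they arrive but does not keep them. When the full text is also needed, the chunks can be accumulated while streaming. In this sketch, `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect_stream` is an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for an async chunk stream from the engine."""
    for chunk in [" Paris", ",", " the", " capital", "."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async stream, report each chunk if asked, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    collect_stream(fake_stream("The capital of France is"), on_chunk=seen.append)
)
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` would reproduce the incremental printing while still returning the complete string.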




In [6]:
llm.shutdown()
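
In scripts, it can help to guarantee that `shutdown()` runs even when generation raises an exception. One way to sketch this is a small context manager. `managed_engine` and `DummyEngine` are hypothetical names used for illustration, and `DummyEngine` replaces `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine via `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(DummyEngine) as llm:
    outputs = llm.generate(["Hello, my name is"])

failed = DummyEngine()
try:
    with managed_engine(lambda: failed) as engine:
        engine.generate([])  # raises, but shutdown() still runs
except ValueError:
    pass
```

With the real engine, the factory would be a callable that constructs `sgl.Engine` with the desired `model_path`.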