# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
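
The four modes differ along two axes: whether the call blocks (synchronous) or is awaited (asynchronous), and whether the result arrives all at once or chunk by chunk. The sketch below illustrates only these four call shapes in plain Python. `FakeEngine` and its method names are hypothetical stand-ins invented for this illustration and are not the SGLang `Engine` API; the real calls appear in the cells later in this document.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class FakeEngine:
    """Hypothetical stand-in used only to show the four call shapes."""

    def generate(self, prompts: List[str]) -> List[dict]:
        # Non-streaming, synchronous: one finished result per prompt.
        return [{"text": p.upper()} for p in prompts]

    def generate_stream(self, prompt: str) -> Iterator[dict]:
        # Streaming, synchronous: a generator that yields chunks.
        for word in prompt.split():
            yield {"text": word}

    async def async_generate(self, prompts: List[str]) -> List[dict]:
        # Non-streaming, asynchronous: a coroutine that must be awaited.
        await asyncio.sleep(0)
        return self.generate(prompts)

    async def async_generate_stream(self, prompt: str) -> AsyncIterator[dict]:
        # Streaming, asynchronous: an async generator consumed with `async for`.
        for chunk in self.generate_stream(prompt):
            await asyncio.sleep(0)
            yield chunk


engine = FakeEngine()
sync_full = [o["text"] for o in engine.generate(["a b", "c"])]
sync_stream = [c["text"] for c in engine.generate_stream("a b")]


async def run_async():
    full = [o["text"] for o in await engine.async_generate(["a b", "c"])]
    stream = [c["text"] async for c in engine.async_generate_stream("a b")]
    return full, stream


async_full, async_stream = asyncio.run(run_async())
print(sync_full, sync_stream, async_full, async_stream)
```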

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
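
To see what kind of failure this guards against, the sketch below uses only the standard library (no SGLang involved) to reproduce the error raised when `asyncio.run()` is called while an event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # stock asyncio rejects a nested run
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the event loop is patched so that such nested calls are allowed.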

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Dara Melvin, I am 25 years old and I am from Canada. I was a 2 year resident in the U.S. for the summer of 2008. I attended the University of Toronto, and graduated in the fall of 2008. I currently work as a data engineer at a Canadian startup, and I write about data science and machine learning for a variety of platforms. I believe in the importance of the data scientist, and I often post interviews, blogs, and articles about the field.
I attended the University of Toronto, and graduated in the fall of 2008
Prompt: The president of the United States is
Generated text:  trying to decide how many armed guards he should have in a new city. He estimates that there are 200,000 people living in the city. He calculates that if he has 3 guards per 1000 people, how many guards should he have? To determine how many armed guards the president should have in the new city, we need to follow these steps:

1. Calculate the total number of people in the city
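
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs usually mean, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It follows the common textbook definitions and is an assumption about the general technique, not SGLang's actual sampler; the function names and the example logits are made up for this illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for four candidate tokens
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 0.8)
nucleus = top_p_filter(flat, 0.9)
print(sharp, flat, nucleus)
```

In this example the two least likely tokens fall outside the nucleus and receive zero probability, while the lower temperature concentrates more mass on the top token.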

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill] who enjoys [Favorite Activity] and [Favorite Food]. I'm a [Favorite Book] lover and I love [Favorite Movie]. I'm also a [Favorite Music] lover and I love [Favorite Sport]. I'm a [Favorite Hobby] person and I enjoy [Favorite Hobby]. I'm a [Favorite Hobby] person and I enjoy [Favorite Hobby]. I'm a [Favorite Hobby] person and I enjoy [Favorite Hobby]. I'm a [Favorite Hobby] person and I enjoy [Favorite Hobby]. I'm a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also home to many world-renowned museums, including the Louvre and the Musée d'Orsay. Paris is a cultural and economic hub, with a rich history dating back to the Roman Empire and the Renaissance. It is a major transportation hub, with the Eiffel Tower serving as a symbol of the city's importance. Paris is also known for its fashion industry

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are likely to shape the way we live, work, and interact with technology. Here are some potential trends that are likely to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient monitoring.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI technology continues
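
The cell above describes its result as streaming "with overlap removal". One way to picture that idea: if a streamed chunk starts with text that the accumulated output already ends with, only the new suffix is appended. The sketch below is a guess at the general technique under that assumption. `strip_overlap` and `merge_chunks` are hypothetical helpers written for this illustration and are not the implementation behind `stream_and_merge`.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` not already present at the end of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "d!"])
print(merged)
```

A suffix/prefix heuristic like this can also drop text that is genuinely repeated (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.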



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Type] [Age] year old male who has recently started studying the art of [Element]. I am currently living and learning in [City or Country]. I have always had a passion for [Topic], but have recently started learning more about [Element]. I enjoy exploring new places, trying new foods, and making new friends. I am excited to join the [Team] and contribute to [Project]! Would you like to know more about me? [Optional, but might include some trivia about yourself!] Good luck with your studies. [Name] [Type] [Age] (Optional

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also the largest city in the country. It is known for its stunning architecture, rich history, and annual festival celebrations, including the Eiffel Tower. Paris is also home to many famous museums, including t
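
The cell above pairs each prompt with its output by position using `zip`. If you instead issue one asynchronous request per prompt, `asyncio.gather` gives the same positional pairing, because it returns results in the order the awaitables were passed in, regardless of which one finishes first. The sketch below demonstrates this with a hypothetical `fake_generate` coroutine standing in for a real request; the delays are arbitrary.

```python
import asyncio
from typing import List


async def fake_generate(prompt: str, delay: float) -> dict:
    # Hypothetical stand-in for one asynchronous generation request.
    await asyncio.sleep(delay)
    return {"text": f"<completion for: {prompt}>"}


async def run_batch(prompts: List[str]) -> List[dict]:
    # Later prompts finish first here, yet gather keeps the input order.
    delays = [0.03, 0.02, 0.01]
    tasks = [fake_generate(p, d) for p, d in zip(prompts, delays)]
    return await asyncio.gather(*tasks)


prompts = ["first", "second", "third"]
outputs = asyncio.run(run_batch(prompts))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```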

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age] years old. I have always been passionate about [Your Hobby or Interest]. I started learning about it when I was [Age], but I never gave up on it. I am always looking for new and exciting ways to learn, and I am always ready to share my knowledge with others. I have a natural talent for problem-solving, and I am always eager to learn and grow. I have a great sense of humor, and I enjoy making people laugh. I am always looking for ways to make my life more fun and exciting, and I am always willing to embrace new experiences. I hope



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known as the "City of Light" and is a cosmopolitan city with many historical and architectural landmarks. It is located in the center of the country and is the largest city in Europe by population. Paris is known for its vibrant cultural life, rich history, and beautiful parks and gardens. It is also a major hub for business, commerce, and government. The city's skyline is dotted with skyscrapers, and its metropolitan area is one of the largest in Europe. Paris has a rich history, including many significant contributions to art, literature, and science, and it continues to be a cultural and intellectual center of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several factors, including advances in machine learning, natural language processing, and computer vision. In the next few years, we can expect the following trends in AI:

1. AI will continue to become more intelligent: As we learn more about how AI works, we will continue to discover ways to make it more intelligent. This could involve developing more complex algorithms, enabling AI to learn from experience, and improving its ability to understand and interpret natural language.

2. AI will become more ubiquitous: As AI becomes more intelligent and ubiquitous, we can expect it to play an increasingly important role in our daily lives. This could involve




```python
llm.shutdown()
```
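
In a script, an exception raised before the final `shutdown()` call would skip the cleanup. One possible pattern, sketched below, wraps the engine in a small context manager so that `shutdown()` always runs. `engine_session` is a hypothetical helper written for this example, and `FakeEngine` is a stand-in that mimics only the `shutdown()` method; neither is part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only shutdown()."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    # Yield the engine to the caller and always shut it down afterwards.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```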