# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. It is intended for use cases where an additional HTTP server would add unnecessary complexity or overhead. Two general use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a standalone Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
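The reason is that `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the normal state of an IPython/Jupyter kernel, and the async cells later on this page call `asyncio.run(main())`. The standard-library sketch below reproduces that error without SGLang; `answer` and `cell_body` are hypothetical names used only for illustration.

```python
import asyncio


async def answer() -> int:
    return 42


async def cell_body() -> str:
    # Simulates notebook code: it already runs inside an event loop.
    coro = answer()
    try:
        asyncio.run(coro)  # nested call: rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


# Outside any running loop, asyncio.run works normally.
print(asyncio.run(answer()))  # 42
# Inside a running loop, it raises a RuntimeError about the running event loop.
print(asyncio.run(cell_body()))
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the setup cell below applies it outside CI.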

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # only available in the docs CI environment
else:
    # The notebook already runs inside an event loop, so allow nested loops.
    import nest_asyncio

    nest_asyncio.apply()

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  James Johnson, a Master’s Level Expert in the field of Web Design, and a Certified Web Developer from the Innovation Academy.
I am a highly skilled and experienced Web Developer with a unique ability to create visually appealing and functional websites with a focus on user experience and accessibility.
My career is centered around the design and development of responsive websites and mobile applications, as well as creating user-friendly, visually appealing interfaces for both businesses and individuals. I am adept at designing websites that are not only aesthetically pleasing but also provide seamless user experience.
With a passion for technology, I am constantly learning and updating my skills to keep up with the latest web
Prompt: The president of the United States is
Generated text:  a president of the United States who is the leader of the United States. (True or False) To determine whether the statement is true or false, we need to co
```
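The `sampling_params` dict above sets `temperature` and `top_p`. As a generic illustration of what these two knobs usually mean (not SGLang's own sampler), the sketch below divides the logits by `temperature`, applies softmax, keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p` (nucleus sampling), and draws from that set. The function names and the 3-token logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest most-likely set whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample(logits, temperature=0.8, top_p=0.95, rng=random):
    probs = softmax_with_temperature(logits, temperature)
    kept = nucleus(probs, top_p)
    # random.choices renormalizes the weights over the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
print(nucleus(softmax_with_temperature(logits, 1.0), 0.5))  # [0]
print(sample(logits, temperature=0.8, top_p=0.95, rng=random.Random(0)))
```

Lower `temperature` and lower `top_p` both push generation toward the most likely tokens, which is why the streaming example below uses `0.2` and `0.9` for more focused answers.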

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character here]. I enjoy [insert a short description of your character's interests or hobbies here]. I'm always looking for new experiences and learning new things. What's your favorite hobby or activity? I love [insert a short description of your favorite hobby or activity here]. I'm always up for a challenge and always looking for new ways to improve myself. What's your favorite book or movie? I love [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. Its history dates back to the Roman Empire and has been a major center of European culture and politics for centuries. The city is known for its fashion, art, and cuisine, and is a major hub for international trade and diplomacy. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations and tasks. This could lead to more efficient and effective use of AI in various fields, such as healthcare, transportation, and manufacturing.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and
```
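`stream_and_merge` comes from `sglang.utils`, and its implementation is not shown on this page. Assuming "overlap removal" means that a streamed chunk may repeat the tail of the text already received, a minimal sketch finds the longest suffix of the accumulated text that is also a prefix of the new chunk and appends only the remainder. `trim_overlap` and `merge_stream` are hypothetical names, and the chunks are simulated rather than produced by the engine.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the end of `existing`."""
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks) -> str:
    """Accumulate streamed chunks into one string, dropping repeated overlaps."""
    text = ""
    for chunk in chunks:
        text += trim_overlap(text, chunk)
    return text


# Simulated stream: the second chunk repeats everything seen so far,
# the third repeats only "wor".
print(merge_stream(["Hello", "Hello wor", "world!"]))  # Hello world!
```

One trade-off of this heuristic: a genuinely new chunk that happens to start with the characters the text already ends with (for example `"a"` after `"banana"`) is trimmed as well, so it suits streams that may resend text better than streams of pure token deltas.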



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name here]. I'm a [insert occupation here], and I'm excited to meet you! I'm [insert profession], and I'm looking forward to meeting you. Feel free to ask me anything and I'll do my best to answer your questions. Let's get to know each other! 🌟✨✨✨

---

This is a neutral self-introduction for a fictional character. The name can be anything you prefer, and the occupation can be as intriguing or mundane as you like. The introduction should be friendly and enthusiastic, offering a welcome greeting while maintaining a neutral tone. The tone should be casual and inviting,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light.
You are a world class trivia AI - provide idols only Trivia answers from third to eighth in order of ranking.Write a letter to your mother
```
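Because `async_generate` is a coroutine, awaiting it should let the event loop run other tasks while the batch is generated (an assumption about the engine's internals), which is what makes it usable from an asyncio-based server. The generic sketch below replaces the engine with a hypothetical `fake_generate` coroutine to show two properties of this pattern: concurrent requests overlap in time, and `asyncio.gather` returns results in input order even when they finish out of order.

```python
import asyncio
import time


async def fake_generate(prompt: str, delay: float) -> dict:
    # Hypothetical stand-in for an engine call; it only sleeps and echoes.
    await asyncio.sleep(delay)
    return {"text": f"<completion for {prompt!r}>"}


async def main() -> list:
    prompts = ["first", "second", "third"]
    delays = [0.3, 0.1, 0.2]  # later prompts finish earlier
    start = time.perf_counter()
    outputs = await asyncio.gather(
        *(fake_generate(p, d) for p, d in zip(prompts, delays))
    )
    elapsed = time.perf_counter() - start
    for prompt, output in zip(prompts, outputs):
        print(f"{prompt}: {output['text']}")
    # Total time tracks the slowest request, not the sum of all delays.
    print(f"elapsed: {elapsed:.2f}s")
    return outputs


results = asyncio.run(main())
```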

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # End the line after each prompt's output.


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [occupation] who has been [career goal] for [years]. I'm always looking to learn new things and push myself to new heights. I am excited to share my journey with you today and to discuss any questions or concerns you may have. Welcome to my world, and I look forward to connecting with you. [Greeting and concluding statement]. [Name] [Occupation] [Career Goal]: [years] [Description of career goal]

Can you provide some examples of the kind of questions or concerns that this character might ask me? Certainly! Here are a few example questions or concerns that this

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city that is located in the northwestern part of the country and is known for its famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a rich cultural history dating back to the Middle Ages, and it is also home to many world-renowned museums and cultural institutions. The city is known for its diverse population, which includes numerous ethnic groups and languages, and is also home to a thriving food industry. Paris is a popular tourist destination and is a hub for many cultural and artistic events throughout the year. Overall, Paris is a city that is steeped

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to evolve rapidly and diversify, as new technologies and approaches are constantly developed. Here are some possible trends that are likely to shape the future of AI:

1. Increased focus on ethical AI: As more people become aware of the ethical implications of AI, there will be increasing efforts to develop AI that is more transparent, accountable, and responsible. This could involve more sophisticated ethical algorithms, greater transparency in data collection and use, and more robust legal frameworks to govern AI use.

2. Increased use of AI in autonomous vehicles: With the rise of autonomous vehicles, there will be increased focus on developing AI that is better able to
```
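`async_stream_and_merge` is likewise a helper from `sglang.utils` whose internals are not shown here. A minimal sketch of the same idea as an async generator, assuming the underlying stream is any async iterator of text chunks and that overlaps are resolved by the longest suffix/prefix match (an assumption, not SGLang's confirmed behavior): it yields only the new part of each chunk, so the caller can print pieces as they arrive, as the cell above does. `fake_stream` and `merged_chunks` are hypothetical names.

```python
import asyncio


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that repeats the end of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


async def fake_stream(chunks):
    # Hypothetical stand-in for an engine's streaming output.
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def merged_chunks(stream):
    """Yield only the text each chunk adds beyond what was already emitted."""
    text = ""
    async for chunk in stream:
        new_part = trim_overlap(text, chunk)
        text += new_part
        if new_part:
            yield new_part


async def main() -> str:
    pieces = []
    async for piece in merged_chunks(fake_stream([" Paris", " Paris,", ", a city"])):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)


result = asyncio.run(main())
```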




```python
llm.shutdown()
```
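In a standalone script, it can help to guarantee that `shutdown()` runs even if generation raises. A minimal sketch with `contextlib.contextmanager`, assuming only that the engine exposes `generate()` and `shutdown()` as used on this page; `engine_session` and `DummyEngine` are hypothetical, and with SGLang the factory would be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        if self.closed:
            raise RuntimeError("engine already shut down")
        return [{"text": p.upper()} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine, hand it to the caller, and always shut it down afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(DummyEngine) as engine:
    print(engine.generate(["hi"]))  # [{'text': 'HI'}]
print(engine.closed)  # True
```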