# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.31s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kurt and I have been a loyal fan of this site for many years. I just wanted to say that I have recently taken up the hobby of photography and I am really enjoying it. I have been experimenting with different types of photography such as landscape, portrait and still life, but I am particularly interested in learning more about astrophotography.
I would love to know if there are any specific tips or techniques that you would recommend for beginners in astrophotography, especially when it comes to capturing images of the moon, stars and planets.
I am using a DSLR camera and a tripod, and I am eager to get started.
Prompt: The president of the United States is
Generated text:  often seen as the embodiment of American values, but the perception of him is highly subjective. His policies and actions can either reinforce or challenge the principles of equality, justice, and freedom that are deeply ingrained in the American psyche.
President Donald Tr
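The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings usually mean, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It is not SGLang's implementation; the function name `filter_top_p` and the toy logits are made up for this example.

```python
import math


def filter_top_p(logits, temperature, top_p):
    """Return {token: probability} after temperature scaling and top-p filtering.

    Hypothetical helper for illustration only; not part of SGLang.
    """
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy logits, invented for the example.
toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
candidates = filter_top_p(toy_logits, temperature=0.8, top_p=0.95)
print(candidates)
```

With these toy numbers, only the top token survives at `temperature=0.8`, while a higher temperature such as `2.0` flattens the distribution and keeps more candidates in the sampling pool.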

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell. What do you think? Is it a good introduction?
This introduction is good because it's brief and to the point. It gives a sense of who Kaida is and what she's interested in, without revealing too much about her personality or background. The fact that she's a freelance writer and working on a novel suggests that she's creative and ambitious, but it's not too boast

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a starting point for further discussion or exploration of the topic. The statement is also accurate and reliable, as it is a widely accepted fact about France’s capital city. Overall, this statement is a good example of a concise factual statement. The statement is also neutral and does not express any opinion or bias. It simply states a fact, without any emotional or persuasive language. This makes

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, helping to improve patient outcomes and reduce healthcare costs.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance, transportation,
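The cell above relies on `stream_and_merge` from `sglang.utils` to assemble streamed chunks into one string. The sketch below shows one way such overlap-aware merging can work, using a plain list of strings in place of a real engine stream. The helper names `trim_overlap_sketch` and `merge_chunks` are made up for illustration, and the premise that consecutive chunks may repeat the tail of earlier text is an assumption; consult the SGLang source for the actual implementation.

```python
def trim_overlap_sketch(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Stand-in for a stream of partial outputs; a real stream would come from the engine.
fake_stream = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(fake_stream))
```

One design caveat: a purely textual overlap check like this also swallows genuine repetition at a chunk boundary (for example, the chunks `"ha"` and `"ha"` merge to `"ha"`), so it is only appropriate when overlaps are known to be artifacts of the stream.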



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sage Wynter. I work as a librarian at the local library in Willow Creek. I enjoy reading, learning, and helping others find the information they need.
Introduction to the Character
Sage Wynter is a quiet and reserved individual who values knowledge and literature. She takes pride in her work as a librarian, where she can share her passion for reading and learning with the community. Sage is approachable and willing to lend a helping hand, making her a valuable resource for those seeking information.
Physical Description
Sage is a slender woman in her mid-30s with long, dark brown hair and expressive green eyes. She

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country, in the Île-de-France region.
Paris is a major political, economic, and cultural center i
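Because `async_generate` is awaited inside a coroutine, generation can share the event loop with other asynchronous work. The sketch below illustrates that pattern with a stand-in class. `FakeEngine`, `load_metadata`, and the simulated delays are made up so the example runs without a model; only the `async_generate(prompts, sampling_params)` call shape and the `output['text']` key are taken from the cell above.

```python
import asyncio


class FakeEngine:
    """Stand-in for the engine so the sketch runs without a GPU or model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Simulate time spent generating.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def load_metadata():
    """Hypothetical unrelated async task, e.g. reading records from a database."""
    await asyncio.sleep(0.01)
    return {"source": "demo"}


async def run_batch(engine, prompts, sampling_params):
    # Run generation and the unrelated task concurrently on one event loop.
    outputs, metadata = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        load_metadata(),
    )
    # Pair each prompt with its output and attach the shared metadata.
    return [
        {"prompt": p, "text": o["text"], **metadata}
        for p, o in zip(prompts, outputs)
    ]


results = asyncio.run(
    run_batch(FakeEngine(), ["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95})
)
print(results)
```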

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Hiro. I'm 25 years old. I work as a park ranger at the local national park. I enjoy hiking and photography. I've been living in the area for about 5 years. I'm just trying to live a simple life. That's me in a nutshell.
Word Count: 45

Under 50 words, this self-introduction:
My name is Hiro, and I'm a 25-year-old park ranger. I've been living in this area for 5 years and enjoy hiking and photography. I'm content with my simple life. I work at the local national park and try to stay active outdoors



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. ## Step 1: Identify the capital of France

The capital of France is a well-known fact that is widely recognized.

## Step 2: Verify the information

Verifying the information, we confirm that Paris is indeed the capital of France.

## Step 3: Provide a concise statement

We can now provide a concise statement about the capital of France.

The final answer is: Paris. ## Step 1: Identify the capital of France

The capital of France is a well-known fact that is widely recognized.

## Step 2: Verify the information

Verifying the information, we confirm that Paris is indeed the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by various factors including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased focus on explainability and transparency: As AI becomes more prevalent in decision-making processes, there will be a growing need for AI systems to be transparent and explainable. This will involve developing AI models that provide clear and understandable explanations for their decisions and actions.
2. Integration with the Internet of Things (IoT): AI will continue to play a key role in the development of the IoT, enabling devices to interact and communicate with each other in real-time. This will lead to a more seamless
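`async_stream_and_merge` is consumed with `async for` and yields cleaned chunks one at a time. A minimal sketch of such an async generator, assuming that incoming chunks may repeat text that was already emitted, could look like the following. `fake_async_stream` and `merge_async` are made-up names, and this is an illustration under those assumptions rather than the code behind the SGLang helper.

```python
import asyncio


async def fake_async_stream():
    """Stand-in for an engine stream; yields chunks with overlapping text."""
    for chunk in ["The future", " future of AI", " AI is open."]:
        await asyncio.sleep(0)  # Yield control, as a real stream would while waiting.
        yield chunk


async def merge_async(stream):
    """Yield only the part of each chunk that has not been emitted yet."""
    emitted = ""
    async for chunk in stream:
        # Find the longest prefix of the chunk that already ends the emitted text.
        overlap = 0
        for size in range(min(len(emitted), len(chunk)), 0, -1):
            if emitted.endswith(chunk[:size]):
                overlap = size
                break
        fresh = chunk[overlap:]
        emitted += fresh
        if fresh:
            yield fresh


async def collect():
    return [piece async for piece in merge_async(fake_async_stream())]


pieces = asyncio.run(collect())
print("".join(pieces))
```

Yielding only the fresh part of each chunk lets the caller print with `end=""` and `flush=True`, as in the cell above, without repeating text.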




In [6]:
llm.shutdown()
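
In a standalone script, one way to make sure `shutdown()` is called even when generation raises an exception is a small context manager. This is a sketch under stated assumptions: `managed_engine` and `FakeEngine` are made-up names, and the only engine method relied on is `shutdown()`, as used in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call shutdown() afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Stand-in with the same shutdown() method so the sketch runs anywhere."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


with managed_engine(FakeEngine) as engine:
    pass  # Call engine.generate(...) here when using a real engine.

print(engine.is_shut_down)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.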