# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
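
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a request handler built only on the Python standard library. `StubEngine`, `handle_generate`, `make_handler`, `serve`, and the JSON field names are hypothetical and not part of SGLang. The only assumption taken from this document is that the engine exposes `generate(prompt, sampling_params)` and returns a dict with a `"text"` key. The linked `custom_server.py` remains the reference example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Mirrors the output shape used in this document: a dict with a "text" key.
        return {"text": f" [stub completion for: {prompt}]"}


def handle_generate(engine, payload):
    """Validate a request payload and call the engine once."""
    if not isinstance(payload, dict):
        return 400, {"error": "request body must be a JSON object"}
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return 400, {"error": "field 'prompt' must be a non-empty string"}
    sampling_params = payload.get("sampling_params") or {}
    output = engine.generate(prompt, sampling_params)
    return 200, {"text": output["text"]}


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                status, body = 400, {"error": "invalid JSON"}
            else:
                status, body = handle_generate(engine, payload)
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve requests until the process is interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping the validation in `handle_generate` separate from the HTTP plumbing means the request logic can be exercised without opening a socket. In real use, the stub would be swapped for an `sgl.Engine` instance created once at startup.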



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI-only helper module; it is not imported for local runs.
    import patch


# Load the model and start the engine (no HTTP server is launched).
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.62it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Yvonne (pronounced Ee-Von) and I am a 45-year-old woman who has been living in the beautiful city of Winnipeg, Manitoba, Canada for most of my life. I am a wife, mother of one amazing child, and a proud animal lover. My family and I share our home with a playful dog named Duke and a lovable cat named Whiskers.
In my free time, you can find me trying out new recipes in the kitchen, practicing yoga, or reading a good book. I am passionate about living a healthy and balanced lifestyle, and I enjoy sharing my experiences and tips with others.

Prompt: The president of the United States is
Generated text:  a public servant, elected to serve the people of the country. The president's primary responsibility is to enforce the laws and the Constitution of the United States. The president is also responsible for executing the foreign policy of the United States, appointing federal judges, and serving as the commander in chief of the armed forces.
The pr
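
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough mental model only, the sketch below shows how these two knobs are commonly applied to a next-token distribution: temperature rescales the scores, and top-p (nucleus) filtering keeps the smallest set of likeliest tokens whose cumulative probability reaches `top_p`. The function names are hypothetical, the code assumes `temperature > 0`, and it is not SGLang's sampler.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]  # assumes temperature > 0
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the likeliest tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature, top_p, rng):
    """Draw one token index using temperature scaling followed by top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
    print(sample_token(toy_logits, temperature=0.8, top_p=0.95, rng=rng))
```

Under this model, a lower temperature concentrates probability on the likeliest tokens, and a lower `top_p` removes more of the unlikely tail before sampling.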

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new coffee shops. I'm a bit of a introvert and prefer to spend my free time alone, but I'm always up for a good conversation with someone who shares my interests. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply provides

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris.
The statement is already concise and factual. It simply states the capital of France, which is Paris. There is no need for further elaboration or analysis. The statement is a straightforward assertion of a well-known fact. Therefore, the response is complete and does not require any additional information. The final answer is: The capital of France is Paris. This response is complete and does not require any additional information. The final answer is: The capital of France is Paris. This response is complete and does not require any additional information. The final answer

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and explainability in AI decision-making. Explainable AI (XAI) will become increasingly important to ensure that AI systems are fair
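
The cell above relies on `stream_and_merge` for overlap removal. The helper itself lives in `sglang.utils`; the sketch below is an independent illustration of what overlap removal can look like, assuming each streamed chunk may begin with text that was already received. `trim_overlap` and `merge_stream` as written here are illustrative and should not be read as the library's implementation.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk unchanged


def merge_stream(chunks):
    """Accumulate streamed chunks into one string, skipping repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    # Each chunk repeats the tail of the text that came before it.
    print(merge_stream(["Hello", "lo wor", "world"]))
```

This is a heuristic: a chunk that legitimately starts with the same characters that end the accumulated text would also be trimmed, so the approach fits streams where chunks are known to overlap.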



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Marcus Thompson. I work as a software developer in the city. I enjoy reading science fiction novels and playing guitar in my free time. I live in a small apartment in the suburbs. I'm a bit of a introvert, but I'm trying to be more social. I'm not particularly interested in sports or politics, but I'm open to learning more about different topics. I'm a bit of a nerd, and I like to spend time alone with my thoughts. I'm also a bit of a perfectionist, which can be a challenge in my work and personal life. That's me in a nutshell. How to Improve It

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located on the Seine River and is known for its iconic landmarks like the Eiffel Tower and Notre Dame Cathedral. The city is one of the most visited in the world and is famous for its art, fashion, and cuisine.
France, o
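
The cell above passes the whole prompt list to a single `async_generate` call. When requests arrive independently, for example inside a custom server, one possible pattern is to issue one awaitable per prompt and cap how many run at once. The sketch below uses a hypothetical `StubAsyncEngine` in place of `sgl.Engine`; it assumes, for illustration, that `async_generate` called with a single prompt resolves to a dict with a `"text"` key. `generate_concurrently` and `max_in_flight` are also hypothetical names.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for an engine exposing an awaitable async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" [stub completion for: {prompt}]"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, capping how many run at the same time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather preserves input order, so outputs line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


if __name__ == "__main__":
    prompts = ["The capital of France is", "The future of AI is"]
    outputs = asyncio.run(
        generate_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8})
    )
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the `zip(prompts, outputs)` pairing used throughout this document still holds.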

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily Wilson, I’m 27 years old. I work as a nurse at a local hospital. I enjoy hiking and trying new restaurants in my free time. I like to think I’m a friendly and approachable person, and I try to be kind to everyone I meet.
Emily is a kind and compassionate nurse who always puts the needs of her patients first. She's a bit of a perfectionist, which can sometimes make her feel overwhelmed, but she's working on finding a better balance between her work and personal life. Emily loves spending time outdoors and trying new foods, and she's always up for an adventure. She's a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the northern part of the country.
Paris is famous for its beautiful architecture, art museums, fashion industry, and cuisine. The city has a rich history dating back to the Roman era and has been the capital of France since the 12th century. Paris is home to some of the world’s most famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
What are the top 3 reasons why tourists visit Paris?
The top 3 reasons why tourists visit Paris are:
1. Historical Landmarks: Paris is home to some of the world’s most famous historical

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  now upon us, and it is changing the world. From self-driving cars to medical diagnostics, artificial intelligence is increasingly being used in various industries. Here are some possible future trends in artificial intelligence: 1. Increased use of AI in Healthcare: AI is being increasingly used in healthcare to diagnose diseases, develop personalized treatment plans, and predict patient outcomes. 2. Rise of Explainable AI: As AI becomes more widespread, there is a growing need to understand how AI systems make decisions. Explainable AI (XAI) is a new field that focuses on making AI systems transparent and accountable. 3. Development of Autonomous Systems: Autonomous

In [6]:
llm.shutdown()
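
In a script, as opposed to a notebook, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. The sketch below uses a hypothetical `StubEngine` and a hypothetical `running_engine` helper; the only behavior assumed from this document is that the engine has a `shutdown()` method.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        return {"text": f" [stub completion for: {prompt}]"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def running_engine(factory):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    with running_engine(StubEngine) as engine:
        print(engine.generate("Hello, my name is")["text"])
    print("shut down:", engine.is_shut_down)
```

With a real engine, the factory could be a zero-argument callable such as `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`.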