# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rachel, and I'm a freelance writer living in the beautiful city of Asheville, North Carolina. I'm thrilled to be joining the Pen and Paper team as a contributor.
I've been writing for years, starting with short stories and poetry in high school. I eventually honed my skills by writing for my college newspaper, and later landed a job as a writer at a local magazine. From there, I transitioned to freelance writing, which has allowed me to work with a variety of clients and explore a range of topics.
As a writer, I'm passionate about storytelling and connecting with people through words. I believe that writing is not just
Prompt: The president of the United States is
Generated text:  not the only public official who has recently used the word "fail" to describe a policy or an action. The public is also using the word to describe a variety of things, from a failed relationship to a failed business.
In fact, the word "fail" has become so ubiquitous

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm currently working on a novel and trying to learn more about the world around me. That's me in a nutshell.
This is a good example of a neutral self-introduction because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply provides a brief overview of who she is and what she does

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about the population of France’s capital city. The population of Paris is approximately 2.1 million people.
Provide a concise factual statement about the location of France’s capital city. Paris is located in the northern part of France, near the Seine River.
Provide a concise factual statement about the economy of France’s capital city. Paris is a major economic center, with a diverse economy that includes finance, fashion, and tourism.
Provide a concise factual statement about the culture of France’s capital city. Paris is known for its rich cultural heritage, including art, literature, and cuisine.
Provide a concise factual

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the way we learn, with the ability to personalize education and provide real-time feedback to students. In the future, AI is likely to become a ubiquitous part of the
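
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. This is not necessarily how `stream_and_merge` is implemented; it only shows one plausible approach, assuming that a streamed chunk may repeat the tail of the text received so far. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are hypothetical and chosen for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this kind of suffix/prefix matching is that genuinely repeated text at a chunk boundary (for example `"ha"` followed by `"ha"`) would also be trimmed, so it is best suited to streams that are known to resend overlapping text.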



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  River. I work as a freelance writer and spend most of my time in my small cottage near the forest. I'm not particularly fond of company, but I enjoy the quiet and the freedom that comes with being on my own.
Would you like me to make any changes? Perhaps a more detailed description of the character, their background, and their motivations? I can add some details about River's personality, interests, or skills. Just let me know what you need. To give you a better idea, here's a more detailed character sketch:
River is a 30-year-old woman who grew up in a big city. She's always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The previous response was correct. The capital of France is Paris. Is there anything else you would like to know about Paris? It has a rich history, famous landmarks such as the Eiff
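
Because `async_generate` is awaitable, independent requests can in principle be issued concurrently with `asyncio.gather`. The sketch below illustrates the pattern only: `StubEngine` is a hypothetical stand-in that mimics the `{"text": ...}` output shape seen above so the example runs without a GPU, and it assumes the real engine tolerates concurrent `async_generate` calls.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine exposing `async_generate`."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them together.
    # asyncio.gather returns results in the same order as its arguments.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(
    generate_concurrently(StubEngine(), prompts, sampling_params)
)

for prompt, output in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

With the real engine, passing the whole prompt list to a single `async_generate` call, as in the cell above, is the simpler option; the gather pattern is mainly relevant when requests arrive independently, for example in a custom server.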

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aurora "Rory" Blackwood, and I'm a 22-year-old student at the local university. I'm studying environmental science with a focus on ecology and conservation. In my free time, I enjoy hiking, reading, and playing guitar. That's me in a nutshell. How would you describe Aurora as a person? What are her personality traits, interests, and values?
Aurora, or Rory as she's known to friends, is a bright and curious individual with a passion for learning and exploring the natural world. She's a bit of a bookworm, often getting lost in the pages of her favorite novels or scientific

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the north-central part of the country. The Seine River runs through the city and is divided into two parts by the river: the Left Bank and the Right Bank. Paris is known for its iconic landmarks, such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also famous for its fashion industry, cuisine, and romantic atmosphere.
Be the first to review “France” Cancel reply
Siam - Thailand $0.00
Afghanistan $0.00
Yemen $0.00
Peru $0.00
Ireland $0.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  going to be bigger than we can imagine. Here are some possible future trends that we can look forward to: 1. Increased use of AI in healthcare: AI will be used to diagnose diseases, predict patient outcomes, and personalize medicine. 2. AI-powered robots that can perform surgery: Robots will be able to perform complex surgeries with precision and accuracy, reducing the risk of human error. 3. AI-assisted decision-making in business: AI will be used to analyze data, identify patterns, and make recommendations to businesses, helping them make better decisions. 4. AI-powered transportation systems: AI will be used to manage traffic




In [6]:
llm.shutdown()