# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
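
To make the custom-server idea concrete, here is a minimal sketch of the request-handling core such a server might contain. It is not the code from the linked `custom_server.py`: `EchoEngine`, `handle_generate`, and the JSON field names `prompt` and `sampling_params` are all hypothetical, and the stub only mimics the `output["text"]` shape used elsewhere in this document so that the sketch runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for the real engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # Assumed shape: a dict with a "text" field, mirroring the
        # output["text"] access used in this document.
        return {"text": f" echo: {prompt}"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object payload, or a missing "prompt" field.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, body = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(status, body.decode())
```

Keeping the handler a plain function of `(engine, bytes)` means it can be wired into whichever web framework you prefer, with the stub swapped for a real engine instance.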



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Karen. I'm a 44-year-old mom of two teenagers. I live in a cozy little house in a quiet neighborhood. My husband and I have been married for 20 years, and we're still going strong. We love to travel, try new restaurants, and spend time outdoors.

I work as a part-time librarian at our local library, which I adore. I get to help people find books, teach children about reading, and be surrounded by books all day. It's a dream job, really.

In my free time, I enjoy reading (of course!), gardening, and cooking. I'm a bit of a food
Prompt: The president of the United States is
Generated text:  not a king, but the head of state and the head of government of the country. The president is elected by the people through the Electoral College and serves a four-year term. The president is responsible for the execution of the laws, the appointment of federal judges and other high-ranking officials, and the command of the armed forces.
The president is also
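
Offline batch jobs usually need to persist their results. The following sketch pairs each prompt with its completion and serializes the pairs as JSON Lines. The helper `results_to_jsonl` and the record field names are hypothetical; the only assumption about the engine is that each output is a dict with a `"text"` entry, as in the cell above.

```python
import json


def results_to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Hypothetical outputs, shaped like the dicts used above (only "text" is read).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(results_to_jsonl(example_prompts, example_outputs))
```

With a real engine, `example_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`.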

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy writing about technology and social issues, and I'm currently working on a book about the intersection of artificial intelligence and human relationships. When I'm not writing, you can find me exploring the city's hidden cafes and trying out new foods. I'm a bit of a introvert, but I'm always up for a good conversation about the latest trends in tech and culture. What do you think? Is this a good self-introduction for a fictional character? Why or why not?
This self-introduction is a good start, but it could be improved

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about the population of France’s capital city. The population of Paris is approximately 2.1 million people.
Provide a concise factual statement about the location of France’s capital city. Paris is located in the northern part of France, in the Île-de-France region.
Provide a concise factual statement about the economy of France’s capital city. Paris is a major economic center, with a diverse economy that includes finance, fashion, and tourism.
Provide a concise factual statement about the culture of France’s capital city. Paris is known for its rich cultural heritage, including art, literature, and music, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the development of AI-powered robots that can assist with surgeries and other medical procedures.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely to be adopted in many more industries, including manufacturing
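
The cell above relies on `stream_and_merge` to combine streamed chunks "with overlap removal". As a rough illustration of what overlap removal can mean, the sketch below drops the prefix of each new chunk that repeats the tail of the text merged so far. This is an assumed approach written for this document, not necessarily how `sglang.utils.stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen here.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, dropping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

One caveat of this heuristic is that it cannot tell an accidental repeat from an intended one: a chunk that legitimately begins with the same characters the text ends with would also be trimmed.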



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kalee McCool. I'm a 25-year-old artist, living in Portland, Oregon. I've been working as a freelance illustrator for about five years. Most of my work involves creating digital art for clients in the publishing and advertising industries. I'm also an avid hiker and enjoy spending time outdoors. When I'm not working, I like to explore the city, trying new restaurants and visiting local galleries. That's me in a nutshell.
Most of the writing in the self-introduction is in the third person. This can be effective for creating a sense of detachment and objectivity. However, there is no clear indication of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next, provide a detailed description of the city, including notable landmarks and historical significance.
Paris, the capital of France, is a city steeped in
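
The cell above awaits a single `async_generate` call for the whole prompt list. When requests originate from different places in an application, another common asyncio pattern is to issue one call per prompt and wait for all of them with `asyncio.gather`. The sketch below shows that pattern against a stand-in: `FakeAsyncEngine` is hypothetical, and whether per-prompt calls are preferable to one batched call with the real engine is not something this sketch establishes.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" [completion for: {prompt}]"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them;
    # gather returns results in the same order as the prompts.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_all(FakeAsyncEngine(), ["first", "second"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```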

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Yamato and I'm a 25-year-old who's trying to figure things out. I like taking long walks, reading, and listening to music. I'm pretty laid-back and value simplicity. That's me.
Write a short, neutral self-introduction for a fictional character. My name is Caelum, and I'm a 22-year-old artist. I'm passionate about creating and expressing myself through various mediums. I enjoy spending time outdoors and trying new things. I'm a bit of a dreamer, always looking for inspiration and new ideas. That's me.
Write a short, neutral self-introduction for



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Founded in the 3rd century BC by the Celtic tribe known as the Parisii, the city became a major power center in the Roman Empire and eventually the capital of the Kingdom of France. Paris is located at the heart of the Île-de-France region in northern France. It is famous for its rich history, stunning architecture, world-class museums, and romantic atmosphere, making it one of the world’s most visited and beloved cities. Many landmarks include the Eiffel Tower, Notre Dame Cathedral, the Louvre Museum, and the Champs-Élysées. Paris is a center for international politics, economy,


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving and transforming industries, and this trend is expected to continue in the coming years. Here are some possible future trends in artificial intelligence:
1. Increased use of edge AI: As more devices become connected to the internet, the need for edge AI will increase. Edge AI refers to the ability of devices to perform AI tasks locally, without relying on cloud computing.
2. Improved explainability and transparency: As AI becomes more prevalent, there is a growing need for explainability and transparency. This means that AI systems will need to provide clear explanations for their decisions and actions.
3. Increased use of natural language processing: Natural language processing (
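
The streaming cell prints each chunk with `end=""`, so the chunks form continuous text on screen but are not kept anywhere. If the complete text is also needed afterwards, the chunks can be accumulated while they are printed. The sketch below shows this with a stand-in async generator: `fake_token_stream` and `collect_stream` are hypothetical names, and a real run would pass the generator returned by `async_stream_and_merge(llm, prompt, sampling_params)` instead.

```python
import asyncio


async def fake_token_stream(text):
    """Hypothetical stand-in for a streaming generator: yields word-sized chunks."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect_stream(chunks):
    """Print chunks as they arrive and return the full text."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_token_stream("Paris is the capital")))
```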




In [6]:
llm.shutdown()
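
In a script, an exception raised during generation would skip the final `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down on exit. The sketch below uses a stand-in object: `FakeEngine` and `engine_session` are hypothetical, and the only behavior assumed of the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only shutdown()."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as llm_like:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```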