# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Joseph Ahn, and I am a Ph.D. student in the Department of Computer Science and Engineering at the University of Minnesota. I am working under the guidance of Prof. Cyrus Shahabi. My research focus is on developing novel algorithms and techniques for solving large-scale problems in geospatial data management and analysis.
I am particularly interested in the areas of:
1. Spatial Join and Intersection: Efficiently computing spatial joins and intersections is crucial for various applications such as geographic information systems (GIS), location-based services, and spatial data analysis.
2. Spatial Indexing: Designing and optimizing spatial indexes is essential for efficient query processing in ge
Prompt: The president of the United States is
Generated text:  responsible for the security and defense of the country. The president has the authority to order the military to go to war and to negotiate treaties with other countries. The president also 
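
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The sketch below illustrates the textbook definitions of temperature scaling and nucleus (top-p) filtering on a toy distribution. It is not SGLang's sampler; the function names and the example logits are hypothetical and exist only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Toy logits for a four-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.8)
filtered = top_p_filter(probs, top_p=0.95)
print(filtered)
```

With these toy values, the least likely token falls outside the 0.95 probability mass and can no longer be sampled, while the remaining probabilities are rescaled to sum to 1.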

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 22-year-old student at a local university, studying environmental science. I'm originally from a small town in the countryside, but I've been living in the city for a few years now. I enjoy hiking and reading in my free time. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal any personal opinions, emotions, or biases. It simply states the character's name, age, occupation, and interests in a straightforward and factual way. This type of introduction can be useful for a character who is trying to make a good impression or who is still getting to know

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. This is a factual statement that provides a concise piece of information about France’s capital city. It is a simple and direct statement that answers the question about the capital of France. The statement is also accurate and up-to-date, as Paris has been the capital of France since the 12th century. This type of statement is often used in educational or informational contexts to provide a quick and easy-to-understand fact about a particular topic. In this case, it provides a basic piece of information about France’s capital city. The statement is also neutral and does not express any opinion or bias, which

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the development of AI-powered robots that can assist with surgeries and other medical procedures.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with many industries adopting AI
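
The cell above mentions "overlap removal". As a rough sketch of what that can mean, assume that successive streamed chunks may repeat text that was already received. The helper below drops the repeated prefix before appending. The names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative; this is not presented as the implementation of `stream_and_merge`.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris is", " is the capital"]
print(repr(merge_chunks(chunks)))
```

Checking the longest overlap first means that a chunk which fully repeats earlier text contributes nothing, and a chunk with no overlap is appended unchanged.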



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rennick Ellsworth. I'm a 35-year-old freelance writer living in the Pacific Northwest. I work on a variety of projects, from copywriting to literary fiction. When not writing, you can find me exploring the local wilderness or practicing my favorite instrument, the harmonica. I'm open to new experiences and collaborations.
This is a neutral self-introduction. It doesn't reveal too much about Rennick's personality, values, or motivations, so it's a good starting point for creating a character. You can add more details and traits as you continue to develop Rennick's character.
Here are some possible aspects

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This is a factual statement that contains only one sentence.
Provide a concise factual statement about the population of France’s capital city. The popul
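
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below shows this with `asyncio.gather`. Because a real engine needs a GPU and model weights, it uses a hypothetical `StubEngine` whose `async_generate` only sleeps briefly and echoes the prompts; the stub and the `heartbeat` coroutine are assumptions made for illustration.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it echoes prompts after a short delay."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated work that keeps running while generation is awaited."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def run_demo():
    engine = StubEngine()
    log = []
    # Both coroutines share the event loop; neither blocks the other.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(run_demo())
print(outputs)
print(log)
```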

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ethan Blackwood, and I'm a 30-year-old man living in the small town of Willow Creek. I work as an accountant at a local firm and enjoy spending my free time hiking in the nearby woods. I'm not really a big fan of loud music or crowded places, but I do appreciate a good book and a quiet cup of coffee. That's me in a nutshell.
This is a neutral self-introduction because it doesn't reveal any personal preferences or opinions that might sway the reader one way or the other. It simply presents the character's background and interests in a straightforward way. Here are a few key points to highlight

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
The Eiffel Tower is an iron lattice tower in Paris, France. It was built for the 1889 World's Fair and stands 324 meters (1,063 ft) tall. The tower was constructed using over 18,000 pieces of wrought iron and took nearly two years to complete. It was designed by Gustave Eiffel and was intended to be a temporary structure, but it has become one of the most iconic and enduring symbols of Paris and France. Visitors can ascend the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  far more diverse and unpredictable than most people think. It will have far-reaching impacts on all aspects of society. AI is advancing at a rapid pace, and we can expect to see new technologies and applications emerge in the coming years. Here are some possible future trends in artificial intelligence:
1. AI in Education:
In the future, AI will play a crucial role in education. AI-powered adaptive learning systems will be able to personalize learning experiences for students, tailoring the curriculum to their individual needs and abilities. AI will also help teachers with grading, administrative tasks, and providing feedback to students.
2. AI in Healthcare:
AI will revolutionize

In [6]:
llm.shutdown()
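
In a standalone script, an exception raised mid-batch could skip the final `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is a minimal illustration using a hypothetical `StubEngine` stand-in, since a real engine needs a GPU and model weights. It makes no assumption about whether the real engine supports `with` directly; `managed_engine` is an illustrative helper, not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards, even on error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm_stub:
        llm_stub.generate(["Hello"])
        raise RuntimeError("simulated failure mid-batch")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

The same `try`/`finally` pattern works without the helper: create the engine, run the batch inside `try`, and call `shutdown()` in `finally`.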