# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
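
To make the custom-server idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the approach taken in the linked `custom_server.py`. It assumes only that the engine exposes `generate(prompts, sampling_params)` and returns one dict with a `"text"` key per prompt. `FakeEngine`, `handle_generate`, `make_handler`, and the request fields `prompt` and `sampling_params` are hypothetical names chosen for illustration; `FakeEngine` stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json
from http.server import BaseHTTPRequestHandler


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only the generate() call shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body (no networking)."""
    request = json.loads(body)
    # The field names "prompt" and "sampling_params" are assumptions of this sketch.
    outputs = engine.generate([request["prompt"]], request.get("sampling_params", {}))
    return json.dumps({"text": outputs[0]["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


# Exercise the request logic directly, without opening a socket.
response = handle_generate(FakeEngine(), b'{"prompt": "The capital of France is"}')
print(response.decode("utf-8"))
```

To serve requests, the handler class could be passed to `http.server.HTTPServer(("127.0.0.1", 8000), make_handler(engine))` followed by `serve_forever()`. Keeping the request logic in a plain function makes it testable independently of the web framework.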



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.09it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.02s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.15it/s]


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kenji Nishida, and I am a freelance writer and editor specializing in science and technology. I have a strong background in biology, chemistry, and physics, and I enjoy helping my clients communicate complex scientific ideas to a wide audience.

I am skilled in:

* Writing and editing scientific articles, press releases, and other types of technical content
* Creating engaging content for websites, blogs, and social media
* Developing and editing educational materials, such as textbooks, workbooks, and online courses
* Conducting research and interviews to gather information and quotes from experts
* Collaborating with designers and developers to create multimedia content
*
Prompt: The president of the United States is
Generated text:  one of the most powerful people on the planet. Here are the top 10 things you need to know about the presidency, and the president's role in the U.S. government.
1. The President serves as both the head of state
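
In an offline batch job, the generated texts are usually stored, not printed. The sketch below pairs each prompt with its output and serializes the result as JSON Lines. It assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt, in input order. `FakeEngine`, `run_batch`, and `to_jsonl` are hypothetical names; `FakeEngine` stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only the generate() call shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def run_batch(engine, prompts, sampling_params):
    """Generate once for the whole batch and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = run_batch(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```

With a real engine, `FakeEngine()` would be replaced by the `llm` object created earlier.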

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I live in a small town in the Pacific Northwest with my family. I enjoy hiking and reading in my free time. I'm a bit of a introvert and prefer to keep to myself, but I'm always up for a good conversation when the time is right. I'm a bit of a daydreamer and often find myself lost in thought, but I'm working on being more present in the moment. That's me in a nutshell. What do you think? Is this a good starting point for a character introduction?
This is a good starting point for a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is located in the northern part of the country, near the Seine River. Paris is known for its rich history, art, fashion, and cuisine. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. Paris is a popular tourist destination and a major hub for international business and culture. It is also the seat of the French government and the country’s largest city. The city has a population of over 2.1 million people and is a major center for education, science, and technology. Paris is also known for its romantic atmosphere and is often

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Widespread adoption of AI in industries: AI is expected to be adopted in various industries, including finance, transportation, and education. AI-powered systems will be able to automate tasks, improve efficiency, and make
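
The "overlap removal" mentioned above matters when a streamed chunk may repeat text that was already received. The following is a minimal sketch of one way to merge such chunks, assuming each chunk is a dict with a `"text"` key. It is an illustration of the idea and not necessarily how `stream_and_merge` is implemented in `sglang.utils`. The function names and `fake_stream` are chosen for this example only.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


def fake_stream():
    """Hypothetical stream whose chunks partially repeat earlier text."""
    yield {"text": " Paris"}
    yield {"text": " Paris. It is"}
    yield {"text": " It is located in France."}


merged = merge_stream(fake_stream())
print(merged)
```

This heuristic cannot distinguish a resent fragment from text the model legitimately repeats at a chunk boundary, so genuinely repeated text would also be trimmed.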



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I’m a 24-year-old astrobiologist, currently residing in a small, coastal town. My research focuses on the possibility of life beyond Earth, and I have a strong interest in exoplanetary systems and astrochemical processes. I’m fascinated by the mysteries of the universe and enjoy exploring new ideas and theories. In my free time, I enjoy hiking, reading, and playing the guitar.
What is the purpose of the self-introduction?
The purpose of the self-introduction is to provide a brief overview of who Elianore Quasar is, what she does, and her interests. This is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country and is home to over 2.1 million people. The city is famous for its beautiful architecture, art museums, and historic landmarks like the
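
Because `async_generate` is awaited, a generation call can share the event loop with other coroutines. The sketch below illustrates that pattern with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that mimics only the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out), and `heartbeat` is an illustrative placeholder for unrelated work.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only async_generate()."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # simulate generation latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks, interval):
    """Unrelated work that keeps running while generation is awaited."""
    seen = []
    for i in range(ticks):
        await asyncio.sleep(interval)
        seen.append(i)
    return seen


async def generate_with_heartbeat(engine, prompts, sampling_params):
    # gather() schedules both coroutines on the same loop and returns
    # their results in the order the coroutines were passed in.
    outputs, ticks = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks=3, interval=0.01),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(
    generate_with_heartbeat(
        FakeAsyncEngine(), ["The future of AI is"], {"temperature": 0.8}
    )
)
print(outputs[0]["text"], ticks)
```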

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kael. I'm a 25-year-old freelance writer living in a small town surrounded by dense forest. I enjoy spending time outdoors, hiking and camping whenever I can. I also have a strong interest in local folklore and mythology, which often influences my writing. I'm always eager to meet new people and explore the world around me.
What can you infer from the self-introduction?
The character, Kael, is likely an outdoorsy person who values independence and a connection with nature. He may be a bit of a introvert, as he mentions enjoying time alone in the forest. His interest in local folklore and mythology suggests that

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located in the Île-de-France region and serves as the country’s largest city.
France’s capital city is a global center of culture, finance, fashion, and politics.
The city has a rich history, with the Romans establishing a settlement called Lutetia in the 1st century BC. The city became a major center of learning during the Middle Ages and was a hub of artistic and cultural innovation during the Renaissance.
Today, Paris is known for its iconic landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa. The city is also famous for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of ongoing debate, but there are some trends that are likely to shape the field in the coming years.
Artificial intelligence (AI) has come a long way since its inception, and its future holds tremendous promise and potential risks. As we move forward, several trends are likely to shape the field of AI and influence its applications in various industries and aspects of life. Some of these trends include:
1. AI for Social Good: As AI technology continues to advance, it is expected to be used more extensively for social good, such as in healthcare, education, and environmental conservation. AI can help analyze complex data, identify patterns,


In [6]:
llm.shutdown()
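
If an exception is raised before `shutdown()` is reached, the engine may be left running. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below assumes only that the engine has a `shutdown()` method, as used above. `managed_engine` and `FakeEngine` are hypothetical names, and `FakeEngine` stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() afterwards, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as active_engine:
        active_engine.generate(["Hello"], {"temperature": 0.0})
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass  # the failure is expected here; cleanup has already run

print(engine.is_shut_down)
```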