# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine from a Python script, the launch code must be placed under an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
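
The guard pattern itself can be sketched with the standard library alone. The example below is a minimal illustration, not SGLang code: `main` is a hypothetical entry point, and the comment only marks where an engine would be created. With the `spawn` start method, each child process starts a fresh interpreter and re-imports the main module, so unguarded module-level code would run again in every child.

```python
import multiprocessing as mp


def main() -> str:
    # Hypothetical entry point. In a real script the engine would be created
    # here (for example, llm = sgl.Engine(model_path=...)), so that only the
    # parent process constructs it.
    ctx = mp.get_context("spawn")  # the start method the warning refers to
    return ctx.get_start_method()


if __name__ == "__main__":
    # Children started with "spawn" re-import this module under a different
    # __name__, so they skip this block instead of launching again.
    print(main())
```

Keeping all launch logic inside `main` also makes the script importable from other modules without side effects.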

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Marina and I'm an artist and a scientist. I love exploring the intersection of art and science, and how they can inform and inspire each other.

When I'm not in the studio, you can find me in the lab, experimenting with new materials and techniques to create innovative art pieces. I'm particularly interested in using biotechnology, nanotechnology, and materials science to create unique and thought-provoking art.

My art practice is focused on exploring the relationship between living systems and the built environment. I'm fascinated by how our daily lives are shaped by the interactions between living organisms, technology, and the physical world. I use a variety
Prompt: The president of the United States is
Generated text:  not a career politician. He or she is an outsider who has a vision for the country and is willing to challenge the status quo.
The media is not a neutral source of information. It is a powerful tool that can shape public op
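
To give a feel for what the `temperature` and `top_p` entries in `sampling_params` control, here is a small, self-contained sketch of temperature scaling followed by nucleus (top-p) filtering. It is a textbook-style illustration under stated assumptions, not SGLang's sampler; `top_p_filter` is a hypothetical helper and the logits in the usage lines are made up.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Made-up logits for a four-token vocabulary.
nucleus = top_p_filter([2.0, 1.0, 0.0, -5.0], temperature=1.0, top_p=0.9)
print(sorted(nucleus))
```

A sampler would then draw the next token from the returned distribution; lowering `top_p` or `temperature` makes the output more deterministic.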

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm a bit of a introvert and prefer to spend my free time alone or with a few close friends. I'm currently working on a novel and trying to build my writing career. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
I think your self-introduction is great! It's concise, informative, and gives a good sense of who Kaida is. Here are a few

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a fact or a piece of trivia. The statement is also grammatically correct and easy to understand, making it suitable for a variety of contexts, such as educational materials, trivia games, or general knowledge questions. Overall, the statement is a clear and concise expression of a factual piece of information about France’s capital city. The statement is also neutral

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by the convergence of multiple factors, including technological advancements, societal needs, and economic pressures. Here are some possible future trends in artificial intelligence:
1. Increased focus on Explainability and Transparency: As AI becomes more pervasive in various industries, there will be a growing need for explainability and transparency in AI decision-making processes. This will involve developing techniques to provide insights into how AI models arrive at their conclusions, making them more trustworthy and accountable.
2. Rise of Edge AI: With the proliferation of IoT devices and the need for real-time processing, Edge AI will become increasingly important. This involves processing data closer to the source
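
The "overlap removal" mentioned in the cell above can be illustrated without the engine. The sketch below is a stand-alone illustration and should not be read as the implementation of `stream_and_merge`; the helper names are illustrative. It assumes that a streamed chunk may repeat text that was already received, and drops the longest repeated prefix before appending.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", "capital of France", " of France is Paris."]))
```

Scanning from the longest candidate overlap down means the first match is the largest one, so no repeated text survives at a chunk boundary.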



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar, and I'm a 23-year-old astrophysicist living in a research station on the edge of the galaxy. When I'm not working, you can find me reading about the history of the cosmos or attempting to cook something edible. I'm a bit of a skeptic, and I prefer to rely on empirical evidence rather than superstition or intuition.
You can add more details as you like, but try to keep your introduction brief and neutral. Avoid using slang, jargon, or overly technical terms that might confuse readers. For a more engaging introduction, try to reveal something unexpected or intriguing about your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France is a country located in Western Europe with a rich history, art and architecture. France is the third most popular tourist destination in the world. Paris is 
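
The pattern of awaiting an asynchronous call from synchronous code can be tried without a model. In this sketch, `fake_generate` is a hypothetical stand-in for an engine call: it only sleeps briefly and returns a dictionary with a `text` key, mirroring the shape of the outputs printed above. `asyncio.gather` runs the requests concurrently and returns the results in input order.

```python
import asyncio


async def fake_generate(prompt: str) -> dict:
    # Hypothetical stand-in for a real asynchronous engine call.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts):
    # Schedule every request at once and wait for all of them to finish.
    return await asyncio.gather(*(fake_generate(p) for p in prompts))


def run(prompts):
    # Synchronous entry point: start an event loop, run to completion, return.
    return asyncio.run(generate_all(prompts))
```

Calling `run(["a", "b"])` from ordinary synchronous code returns one result dictionary per prompt, in the same order as the prompts.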

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaelin Darkshadow. I’m a 27-year-old mixed-media artist and recent transplant to the city of Ravenshire. I work out of a small studio in the north end, where I spend most of my time honing my craft and experimenting with new techniques. When I’m not creating, you can find me exploring the city’s hidden corners or scribbling in my journal.
A short, neutral self-introduction for a fictional character, highlighting their occupation and interests.
Kaelin Darkshadow is a 27-year-old artist who has recently moved to the city of Ravenshire. He spends most of his time working out

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city has an area of 1,182 km² and a population of approximately 2.1 million people. Paris is located at the heart of the Île-de-France region and is a major center for politics, economy, education, and culture. The city is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
Be the first to review “Describe Paris, France’s Capital City” Cancel reply

Categories: Descriptive Essays , History Essays Tags: a descriptive essay about Paris , descriptive essay about Paris France , describe Paris France , France's capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a popular topic of discussion. Experts have different views on how AI will evolve and the impact it will have on our lives. Here are some possible future trends in artificial intelligence:
  1. AI becoming more human-like: As AI technology advances, it is expected to become more human-like in its behavior and decision-making. AI systems will be able to learn from experiences, understand human emotions, and interact with humans in a more natural way.
  2. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose diseases and develop personalized treatment plans. In the future, AI will play an even larger role in



In [6]:
llm.shutdown()