# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
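
To make the idea of a custom server more concrete, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py` example, which may use a different framework. The names `handle_generate`, `make_handler`, `serve`, the JSON request shape `{"prompt": ...}`, and the port are assumptions chosen for illustration, and `EchoEngine` is a hypothetical stand-in that mimics only the `generate(prompts, sampling_params)` call shown later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict


def handle_generate(engine, raw_body: bytes, sampling_params: Dict[str, Any]) -> bytes:
    """Turn a JSON request body such as {"prompt": "..."} into a JSON response body."""
    payload = json.loads(raw_body or b"{}")
    prompt = payload.get("prompt", "")
    # Batch call with a single prompt; each output is a dict with a "text" field.
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine, sampling_params: Dict[str, Any]):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = handle_generate(engine, self.rfile.read(length), sampling_params)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, sampling_params, host="127.0.0.1", port=8000):
    """Block forever and answer POST requests (hypothetical host and port)."""
    HTTPServer((host, port), make_handler(engine, sampling_params)).serve_forever()


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <continuation of: {p}>"} for p in prompts]


response = handle_generate(EchoEngine(), b'{"prompt": "Hi"}', {"temperature": 0.8})
print(response.decode("utf-8"))
```

With a real engine you would call `serve(llm, sampling_params)` instead of the stand-in. Keeping the request handling in a plain function makes it testable without opening a socket.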



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.18it/s]




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Joanne and I am a massage therapist. I offer a range of massage techniques including Swedish massage, deep tissue massage, sports massage, and aromatherapy massage. I am based in Birmingham and cover areas such as Edgbaston, Harborne, and Moseley.
Massage can help to relieve stress and tension, improve circulation, and increase flexibility. It can also be used to treat a range of musculoskeletal injuries and conditions, such as lower back pain, sciatica, and tennis elbow. I work with clients to tailor the massage to their specific needs, and I also provide advice on stretching and exercise to help prevent future problems
Prompt: The president of the United States is
Generated text:  by far the most powerful office in the world, with an unparalleled level of authority and influence. The president is both the head of state and the head of government, with the power to make key decisions on a wide range of issues, from foreign policy to domestic 
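
The cell above relies on the batch call returning one result per prompt, which is why it can `zip` prompts with outputs. A minimal sketch of collecting those pairs into records for later processing follows. `generate_records` is a helper name chosen here, and `EchoEngine` is a hypothetical stand-in (not part of SGLang) that mimics only the `generate(prompts, sampling_params)` call and the `output['text']` field used above, so the snippet runs without a GPU.

```python
from typing import Any, Dict, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the call shape used above."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
        # A real engine would run the model; here we return a fixed continuation.
        return [{"text": f" <continuation of: {p}>"} for p in prompts]


def generate_records(engine, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
    """Run one batch call and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [{"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)]


records = generate_records(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record["prompt"], "->", record["completion"])
```

With the real engine, pass `llm` in place of `EchoEngine()`. The length check guards against silently misaligned pairs, which `zip` alone would hide.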

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation.
This is a good example of a neutral self-introduction because it doesn't reveal too much about the character's personality, background, or motivations. It simply provides some basic information about who Kaida is and what she likes to do. This can be a good way to introduce a character in a story or in real life, especially if you want

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country, along the Seine River. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. The city is home to many famous museums, including the Louvre, which houses the Mona Lisa. Paris is also famous for its fashion, cuisine, and architecture, including the iconic Eiffel Tower. The city has a population of over 2.1 million people and is a major hub for business, education, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists each

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Advancements in natural language processing: Natural language processing (NLP) is a key area of AI research, and future advancements in NLP may enable AI systems to understand and generate human-like language, leading to more effective communication
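
The "overlap removal" mentioned in the cell above matters when successive streamed chunks repeat text that was already emitted. The sketch below shows one way such merging could work. It is an illustration under that assumption, not necessarily how `stream_and_merge` in `sglang.utils` is implemented, and `trim_overlap` and `merge_chunks` are names chosen here for illustration.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that the previous chunks already covered."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

One caveat of this heuristic: a chunk that legitimately repeats the preceding text (for example `"ha"` following `"ha"`) would be trimmed as well, so it only suits streams where repeats signal overlap.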



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 25-year-old historian with a passion for exploring the intricate relationships between science, technology, and society. I'm currently working on a book about the history of artificial intelligence development. When I'm not researching, you can find me practicing my aerial photography skills or enjoying a good sci-fi novel. I'm curious about the world and its complexities, and I'm always eager to learn more.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Kaia Luna. I'm a 28-year-old software engineer with a background in computer science. I've been

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and is located in the northern part of the country. It is situated in the Île-de-France region and is known for 
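
Because `async_generate` is awaited, other coroutines can make progress while the batch is being generated. The following sketch illustrates this with `asyncio.gather`. `SlowEchoEngine` is a hypothetical stand-in that mimics only the `await engine.async_generate(prompts, sampling_params)` call used above, and `heartbeat` and `generate_with_heartbeat` are names invented for this example.

```python
import asyncio
from typing import List


class SlowEchoEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only the awaitable call used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # Pretend the model is busy.
        return [{"text": f" <continuation of: {p}>"} for p in prompts]


async def heartbeat(ticks: List[int]) -> None:
    """Unrelated work that keeps running while generation is awaited."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def generate_with_heartbeat(engine, prompts, sampling_params):
    """Run generation and the heartbeat concurrently on one event loop."""
    ticks: List[int] = []
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(
    generate_with_heartbeat(
        SlowEchoEngine(), ["The capital of France is"], {"temperature": 0.8}
    )
)
print(outputs[0]["text"], ticks)
```

The heartbeat records all of its ticks even though generation takes longer, which is the practical benefit of the asynchronous interface over the blocking `generate` call.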

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Katsuragi. I'm a 20-year-old university student who majors in sociology. When I'm not studying, I enjoy reading about history and playing the guitar.
Note the use of simple sentences and the neutral tone.
For a more detailed self-introduction, describe the person's personality, interests, and goals. Hello, my name is Kaida Katsuragi. I'm a 20-year-old university student who majors in sociology. I'm a bit of a bookworm, always eager to learn and share my knowledge with others. When I'm not studying, you can find me reading about history or playing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Next, provide an analysis of the statement. The analysis should cover what makes this statement significant, what inferences can be drawn from it, and any implications for the reader or researcher. The significance of the statement lies in its simplicity and universality, conveying a basic fact about France that is widely known. However, this statement also highlights the importance of Paris in French identity, culture, and history. The city is often seen as a symbol of French elegance, sophistication, and artistry, reflecting the nation's rich heritage. This statement can be used as a starting point for exploring the broader themes of French culture, politics, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  still developing, but some trends that may emerge in the coming years include:
1.  Increased focus on human-AI collaboration: As AI becomes more prevalent in various industries, there will be a growing need for humans and AI systems to work together effectively. This may involve developing new interfaces and tools that enable seamless collaboration between humans and AI.
2.  More emphasis on explainability and transparency: As AI becomes more integrated into decision-making processes, there will be a growing need to understand how AI systems arrive at their decisions. This may involve developing new techniques for explaining AI decisions and making them more transparent.
3.  Greater

In [6]:
llm.shutdown()
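
If an exception is raised anywhere between creating the engine and this final cell, `llm.shutdown()` is never reached. In a script, one way to guard against that is a small context manager, sketched below. `engine_session` is a helper name invented for this example (the source shows only the explicit `shutdown()` call), and `DummyEngine` is a hypothetical stand-in that records whether `shutdown()` ran.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in; records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


dummy = DummyEngine()
try:
    with engine_session(dummy):
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

print(dummy.closed)
```

With a real engine the pattern would read `with engine_session(sgl.Engine(model_path=...)) as llm:` followed by the generation calls.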