# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete, working example in a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine in a Python script, put the launch code under an** `if __name__ == "__main__":` **guard, because SGLang uses the** `spawn` **start method to create subprocesses.** See this minimal example:

[launch_engine.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py)

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only imported when running in SGLang's CI
    import patch

# In a standalone Python script, create the engine inside an
# `if __name__ == "__main__":` block (see the warning above).
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# generate() takes the whole batch and returns one result dict per prompt;
# the generated string is stored under the "text" key
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Tyler and I am a 23 year old male. I have been diagnosed with ADHD, anxiety, and depression. I have been taking medication to help manage my symptoms but I feel like I am not getting the full benefits of the medication. I have tried different medications and combinations of medications but I still feel like I am struggling. I am looking for some advice on how to better manage my symptoms and find the right treatment plan.
ADHD (Attention Deficit Hyperactivity Disorder) is a neurological disorder that affects how a person behaves, attention, and ability to sit still and control behavior. ADHD is characterized by symptoms such as inattention,
Prompt: The president of the United States is
Generated text:  not required to be a U.S. citizen.
That's right, folks. Although I know what you're thinking - "but isn't that a fundamental requirement?" - the answer is actually no. Under the U.S. Constitution, the president is required to be a natural-born c
```
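The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough, generic illustration of what these two knobs usually mean in LLM decoding (not SGLang's sampling code), the hypothetical function `apply_temperature_top_p` below divides the logits by the temperature, applies a softmax, and then keeps only the smallest set of highest-probability tokens whose cumulative probability reaches `top_p` (nucleus sampling). This sketch simply rejects a non-positive temperature.

```python
import math


def apply_temperature_top_p(logits: list[float], temperature: float, top_p: float) -> list[float]:
    """Hypothetical helper: temperature scaling followed by nucleus (top-p) filtering.

    Returns a renormalized distribution over the same indices; tokens outside
    the nucleus get probability 0.0.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")

    # Temperature: lower values sharpen the distribution, higher values flatten it
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: add tokens from most to least likely until their mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens so the result sums to 1
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


dist = apply_temperature_top_p([2.0, 1.0, 0.1, -1.0], temperature=0.8, top_p=0.95)
print([round(p, 3) for p in dist])
```

With these example logits, the least likely token falls outside the 0.95 nucleus and gets probability zero, while the remaining three are renormalized.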

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge (from sglang.utils) streams the completion and
    # returns the merged text with overlapping chunks removed
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a 25-year-old astrophysicist with a passion for studying black holes. I'm currently working on a research project to better understand the properties of event horizons. When I'm not in the lab, I enjoy hiking and reading about the history of space exploration. I'm looking forward to meeting new people and sharing my knowledge with others.
This self-introduction is neutral because it doesn't reveal any personal opinions or biases. It simply states the character's name, profession, and interests. The tone is professional and friendly, making it suitable for a variety of social situations.
Here are

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris. This is a concise and factual statement about France’s capital city. It does not include any additional information or opinions, making it a clear and direct statement of fact. This type of statement is often used in encyclopedias, dictionaries, and other reference materials where accuracy and brevity are essential. It provides a quick and easy way to identify the capital of France, which is a fundamental piece of information about the country. The statement is also neutral and does not express any opinion or bias, making it suitable for a wide range of audiences and purposes. Overall, the statement is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of Explainable AI (XAI): As AI becomes more pervasive in decision-making, there will be a growing need to understand how AI systems arrive at their conclusions. XAI will become increasingly important to build trust in AI systems and ensure transparency in decision-making.
2. Rise of Edge AI: With the proliferation of IoT devices, there will be a growing need for AI to be deployed at the edge, closer to the data source. Edge AI will enable faster processing, reduced
```
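The `stream_and_merge` helper comes from `sglang.utils`, and the printed banner says it performs overlap removal. The sketch below shows one way such merging can work, assuming streamed chunks may repeat text that was already received: find the longest suffix of the accumulated text that is also a prefix of the new chunk, and append only the remainder. `trim_overlap_sketch` and `merge_stream` are hypothetical names used for illustration; this is not SGLang's source.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Hypothetical helper: return the part of `chunk` not already at the end of `existing`."""
    # Try the longest possible overlap first, then shrink toward 1 character
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap: the whole chunk is new text


def merge_stream(chunks: Iterable[str]) -> str:
    """Hypothetical helper: accumulate streamed chunks, dropping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Cumulative snapshots: each chunk repeats everything received before it
print(merge_stream([" Paris", " Paris.", " Paris. It is"]))
```

This heuristic has a caveat: if chunks are pure deltas, a delta that happens to begin with the characters the text already ends with is trimmed incorrectly (for example, `merge_stream(["Hel", "lo"])` produces `"Helo"`), so it only makes sense when repeated text is expected.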



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate accepts the whole batch and is awaited as a coroutine
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara. I'm a 25-year-old freelance writer and artist from a small town in the Midwest. I've got a quirky sense of humor and a passion for storytelling.
I'd love to hear your feedback and suggestions for improving this self-introduction. What could be added or changed to make it more engaging and memorable?
The introduction is short and to the point, which is great for a quick hello, but you might consider adding a bit more depth or personality to make it more engaging. Here are a few suggestions:
1. Add a specific detail that reveals your character's personality or interests. For example, "I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  located in the north-central part of the country.
The Eiffel Tower is an iconic landmark in the capital city of France. It was built for the 1889 World's Fair.
Answer: Th
```
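Because `async_generate` is a coroutine, an application that already runs an event loop (for example, a web request handler) can await it without blocking other tasks. The sketch below illustrates that pattern with a stand-in: `fake_async_generate` is hypothetical and only sleeps and echoes, so the example runs without a GPU or a model. `asyncio.gather` schedules all requests at once and returns their results in input order.

```python
import asyncio
import time


async def fake_async_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an engine call: waits, then echoes the prompt."""
    await asyncio.sleep(delay)
    return {"text": f"<completion for {prompt!r}>"}


async def run_concurrently(prompts: list[str], delay: float = 0.1) -> list[dict]:
    # All coroutines are scheduled together; gather preserves input order
    return await asyncio.gather(*(fake_async_generate(p, delay) for p in prompts))


start = time.perf_counter()
results = asyncio.run(run_concurrently(["a", "b", "c"]))
elapsed = time.perf_counter() - start

for r in results:
    print(r["text"])
# The three 0.1 s waits overlap, so elapsed is roughly 0.1 s rather than 0.3 s
print(f"elapsed: {elapsed:.2f}s")
```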

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware async_stream_and_merge helper (from sglang.utils)
        # instead of calling async_generate directly; it yields only new text
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia Grey, and I'm a 28-year-old freelance writer and bookkeeper. I'm currently residing in a small apartment in the city, and I enjoy spending my free time reading and exploring local coffee shops. I'm a bit of an introvert, but I'm slowly working on stepping out of my comfort zone and trying new things. That's me in a nutshell.
Emilia's description is concise and doesn't reveal any biases or personal opinions. It provides a brief overview of her personality, occupation, interests, and living situation. The tone is neutral, avoiding any emotive language that could give away her character's traits

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in the country, with a population of around 12 million people in the metropolitan area and over 2.1 million people living within the city limits. Paris is situated in the northern part of the country on the River Seine, which runs through the heart of the city. Paris has been the capital of France since the 12th century and is known for its stunning architecture, rich history, art, fashion, and cuisine, making it one of the world’s most popular tourist destinations.
A) Provide a concise factual statement about France’s capital city.
B) Provide a concise factual statement about France’s largest

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by the intersection of several key factors, including advances in machine learning algorithms, increases in computational power, and the availability of large datasets. Here are some possible future trends in artificial intelligence:

1.  **Increased Adoption in Everyday Life**: AI is likely to become more ubiquitous in our daily lives, with applications in areas such as smart homes, transportation, and healthcare. This increased adoption will be driven by the development of more user-friendly and accessible AI technologies.

2.  **Advancements in Explainability and Transparency**: As AI becomes more pervasive, there will be a growing need for AI systems to be explainable and transparent
```
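The loop above prints each `cleaned_chunk` as soon as it arrives. If a stream is known to deliver cumulative snapshots (each chunk containing the full text generated so far), an async generator can turn them into deltas by slicing off the already-seen prefix, which is exact and needs no overlap search. This is a sketch under that assumption only: `fake_snapshot_stream` and `to_deltas` are hypothetical and do not describe what SGLang's stream actually returns.

```python
import asyncio
from typing import AsyncIterator


async def fake_snapshot_stream() -> AsyncIterator[str]:
    """Hypothetical stream that yields cumulative text snapshots."""
    for snapshot in [" Paris", " Paris,", " Paris, the capital"]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield snapshot


async def to_deltas(snapshots: AsyncIterator[str]) -> AsyncIterator[str]:
    """Convert cumulative snapshots into incremental deltas by prefix slicing."""
    seen = ""
    async for snapshot in snapshots:
        if not snapshot.startswith(seen):
            raise ValueError("stream is not cumulative; prefix slicing does not apply")
        delta = snapshot[len(seen):]
        seen = snapshot
        if delta:  # skip empty deltas (repeated identical snapshots)
            yield delta


async def main() -> list[str]:
    pieces = []
    async for delta in to_deltas(fake_snapshot_stream()):
        print(delta, end="", flush=True)
        pieces.append(delta)
    print()
    return pieces


pieces = asyncio.run(main())
```

Unlike overlap trimming, prefix slicing fails loudly instead of silently dropping characters when its assumption does not hold.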




```python
# Shut down the engine and its subprocesses when you are done
llm.shutdown()
```
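In a standalone script, it can help to guarantee that `shutdown()` runs even when generation raises. A minimal sketch, assuming only that the engine object exposes the `shutdown()` method used above: a `contextlib.contextmanager` wrapper built on `try`/`finally`. `engine_session` and `FakeEngine` are hypothetical; `FakeEngine` stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def engine_session(factory: Callable[[], Any]) -> Iterator[Any]:
    """Hypothetical wrapper: create an engine and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and when an exception propagates


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine with generate() and shutdown()."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" echo: {p}"} for p in prompts]

    def shutdown(self) -> None:
        self.closed = True


engine_ref = None
try:
    with engine_session(FakeEngine) as engine:
        engine_ref = engine
        print(engine.generate(["Hi"], {"temperature": 0.8}))
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass

print("closed:", engine_ref.closed)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, with the `with` block placed inside the `__main__` guard described in the warning above.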