# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, guard the entry point with `if __name__ == "__main__":`, because SGLang uses the `spawn` start method to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.** With `spawn`, each subprocess re-imports the main module, so unguarded top-level code would run again in every subprocess.

You can also build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I am a 22-year-old nurse. I am originally from a small town in the Midwest, but I moved to the East Coast to pursue my nursing career. I love my job and the people I work with, but I have to admit that I miss the comforts of home and the familiarity of my hometown.
In my free time, I enjoy hiking and trying new restaurants. I'm a bit of a foodie and love learning about different cuisines and cooking techniques. I'm also a big fan of true crime podcasts and have a fascination with psychology and human behavior.
I'm excited to meet new people and make some connections
Prompt: The president of the United States is
Generated text:  often referred to as the most powerful person in the world. While this is an exaggeration, the president does have significant powers and responsibilities. The president serves as both the head of state and the head of government. As head of state, the president represents the United States at home and abroad,

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
Here are a few suggestions to make your self-introduction more engaging and effective:
1.  Add a personal touch: While your introduction is neutral, it's a good idea to add a personal touch

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country, in the Île-de-France region. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. It is home to many famous museums, such as the Louvre and the Orsay, as well as iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral. Paris is also a major business and financial center, and is home to many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO). The city has a population of over 2.1 million people and is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as:
a. Predictive analytics: AI will be used to predict patient outcomes, identify high-risk patients, and prevent hospital readmissions.
b. Personalized medicine: AI will be used to develop personalized treatment plans based on a patient's genetic profile, medical history, and lifestyle.
c. Virtual nursing
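
The cell above relies on `stream_and_merge` to combine streamed chunks "with overlap removal". The following is a minimal sketch of that idea only, not SGLang's implementation: the helper names `drop_overlap` and `merge_chunks` and the chunk contents are hypothetical, and the chunks are assumed to be dictionaries with a `"text"` key, as in the outputs printed in this document.

```python
def drop_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so that the first match is the maximal one.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Accumulate streamed chunks into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk["text"])
    return merged


# Made-up chunks: the second repeats the first, the third overlaps on "the".
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris is the"},
    {"text": "the capital."},
]
print(merge_chunks(fake_chunks))
```

One caveat of this suffix/prefix heuristic is that it cannot tell a re-sent fragment from text the model genuinely repeated, so a legitimately repeated word at a chunk boundary would also be trimmed.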



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zephyr Wilder. I'm a 22-year-old adventurer and explorer. I've traveled to many places around the world, from the scorching deserts of Egypt to the lush rainforests of South America. I'm a bit of a thrill-seeker and enjoy taking risks. When I'm not exploring, I can be found practicing my parkour skills or studying ancient languages. I'm always up for a new challenge and enjoy meeting new people.
This self-introduction is neutral in that it does not reveal too much about Zephyr's personality, motivations, or backstory. It does, however, give a sense

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The statement includes a basic fact about a city in France, which is the capital. It is concise, stating only the basic fact that Paris is the capital of France. It is also factual, making sure to get the corre
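
One reason to prefer an awaitable call such as `async_generate` is that the event loop can make progress on other coroutines while generation is pending. The sketch below illustrates this with the standard library only; `fake_async_generate` and `heartbeat` are hypothetical stand-ins, and no SGLang engine is involved.

```python
import asyncio


async def fake_async_generate(prompts, delay=0.05):
    """Hypothetical stand-in for an awaitable batch generation call."""
    await asyncio.sleep(delay)  # simulates time spent waiting for the engine
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(events, ticks=3, interval=0.01):
    """Unrelated work that keeps running while generation is awaited."""
    for i in range(ticks):
        events.append(f"tick {i}")
        await asyncio.sleep(interval)


async def run_demo():
    events = []
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() schedules both coroutines and waits until both have finished.
    outputs, _ = await asyncio.gather(
        fake_async_generate(prompts),
        heartbeat(events),
    )
    events.append("generation done")
    return outputs, events


outputs, events = asyncio.run(run_demo())
for output in outputs:
    print(output["text"])
print(events)
```

In a real script the same structure would apply with the engine call in place of the stand-in, and the `asyncio.run(...)` call would sit under the `__main__` guard described at the top of this document.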

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara. I'm 26 years old, and I work as a marketing manager for a small startup. I enjoy trying new foods, learning about history, and practicing yoga. That's me in a nutshell. How might this introduction be seen as lacking in depth or nuance? What are some potential issues with this introduction?
The introduction is brief, neutral, and gives a sense of the character's background and interests. However, it might be seen as lacking in depth or nuance in several ways: 
1.  **Lack of personal details**: The introduction focuses mainly on Zara's professional life and a few hobbies.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northern region of the country. Paris is known for its rich history, art museums, fashion industry, and beautiful architecture, including the iconic Eiffel Tower. It is a major center for international business, finance, and culture. Geography: Paris is situated on the Seine River and is the second-largest metropolitan area in the European Union. The city has a population of over 2.1 million people within its limits and a metropolitan area of over 12.2 million people. History: Paris has been the capital of France since the 5th century. The city has played a significant role in European history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advancements in machine learning, natural language processing, and robotics. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI will be used to analyze medical data, diagnose diseases, and develop personalized treatment plans. AI-powered robots will assist surgeons during operations, and AI-driven chatbots will help patients navigate healthcare systems.
2. Rise of autonomous vehicles: Self-driving cars and trucks will become more common, reducing the number of accidents caused by human error. AI will also be used to optimize traffic flow and reduce congestion.
3. Increased use of AI in education: AI-powered adaptive learning systems
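
The streaming loop above prints each piece as soon as it arrives. As a rough illustration of that pattern, the sketch below wraps an async stream so that the consumer only sees newly added text. It assumes, purely for illustration, that each chunk carries the cumulative text generated so far; `fake_stream` and `stream_deltas` are hypothetical names and do not come from SGLang.

```python
import asyncio


async def fake_stream(full_text, step=4):
    """Hypothetical stand-in for a streaming call: yields the cumulative text so far."""
    for end in range(step, len(full_text) + step, step):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": full_text[:end]}


async def stream_deltas(stream):
    """Yield only the text that was not present in the previous chunk."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        delta = text[seen:]
        seen = len(text)
        if delta:
            yield delta


async def collect(full_text):
    pieces = []
    async for piece in stream_deltas(fake_stream(full_text)):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces


sample_text = " Paris is the capital of France."
pieces = asyncio.run(collect(sample_text))
```

Tracking a single offset is cheaper than searching for overlaps, but it is only valid under the cumulative-text assumption stated above.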




In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)


In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in meta['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Megan and I'm a 17-year-old student
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of the executive branch of the U.S
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  famous for its beautiful and historical architecture, art galleries
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), 
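
In the complete outputs above, the first hidden-state entry has one row per prompt token, and each later entry is a single vector of size 4096. The sketch below summarizes such a list of shapes using plain tuples in place of `torch.Size`. The helper name `summarize_hidden_state_shapes` is hypothetical, and reading the first entry as the prompt pass and the later entries as decoding steps is an inference from the printed shapes rather than something this document states.

```python
def summarize_hidden_state_shapes(shapes):
    """Summarize shapes such as [(6, 4096), (4096,), (4096,), ...]."""
    if not shapes:
        raise ValueError("expected at least one shape")
    first, *rest = [tuple(shape) for shape in shapes]
    if len(first) != 2:
        raise ValueError("expected the first entry to be 2-D: (positions, hidden_size)")
    prompt_positions, hidden_size = first
    for shape in rest:
        if shape != (hidden_size,):
            raise ValueError(f"unexpected shape after the first entry: {shape}")
    return {
        "prompt_positions": prompt_positions,
        "decode_steps": len(rest),
        "total_positions": prompt_positions + len(rest),
        "hidden_size": hidden_size,
    }


# Shapes copied from the first prompt's output ("Hello, my name is").
shapes = [(6, 4096)] + [(4096,)] * 9
summary = summarize_hidden_state_shapes(shapes)
print(summary)
```

For that prompt there are 6 prompt tokens and 10 completion tokens but only 9 single-vector entries, which would be consistent with the last prompt position producing the first generated token.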

In [9]:
llm.shutdown()