# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
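To give a rough idea of what "a custom server on top of the engine" can mean, here is a minimal sketch of a request handler that wraps an engine object. It is not the linked `custom_server.py` example. `EchoEngine`, `make_handler`, and the JSON request fields are hypothetical names introduced for illustration, and the stand-in engine only mimics the `output["text"]` shape used elsewhere in this document so the sketch runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a dict with a "text" field.
        return {"text": f" [generated continuation of: {prompt}]"}


def make_handler(engine):
    """Build a function that maps a JSON request body to a JSON response body."""

    def handle(body: str) -> str:
        request = json.loads(body)
        prompt = request["prompt"]
        sampling_params = request.get("sampling_params", {})
        output = engine.generate(prompt, sampling_params)
        return json.dumps({"text": output["text"]})

    return handle


handle = make_handler(EchoEngine())
response = json.loads(handle(json.dumps({"prompt": "The capital of France is"})))
print(response["text"])
```

Because the handler only depends on an object with a `generate` method, the same function could be attached to whatever web framework or transport you prefer, with a real engine passed in place of the stand-in.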

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Christopher, and I'm a third-year Ph.D. student in the Department of Political Science at the University of California, Berkeley. My research focuses on the intersection of international relations, political economy, and global governance. I'm particularly interested in how international institutions, such as the International Monetary Fund (IMF) and the World Trade Organization (WTO), influence the economic behavior of states and shape the global economic order.

My research has been driven by a desire to understand the complex relationships between economic policy, international cooperation, and political decision-making. I've explored topics such as the impact of IMF programs on economic outcomes in developing countries, the
Prompt: The president of the United States is
Generated text:  at it again.
Trump has tweeted that he has an amazing new hairdo, which he claims is going to make America great again.
“Just got the most fantastic hairdo,
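The `temperature` and `top_p` entries in `sampling_params` control how random the sampled continuations are, which is why the outputs above vary between runs. The following is a conceptual sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It is not SGLang's sampler; `top_p_filter` and the toy logits are made up for illustration.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling: dividing logits by a value below 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax over the scaled logits.
    max_logit = max(scaled.values())
    exp = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exp.values())
    probs = {tok: value / total for tok, value in exp.items()}

    # Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy logits (hypothetical values) for the next token after "The capital of France is".
toy_logits = {" Paris": 2.0, " Lyon": 1.0, " located": 0.5, " banana": -3.0}
filtered = top_p_filter(toy_logits)
print(sorted(filtered, key=filtered.get, reverse=True))
```

With these toy values the least likely token falls outside the nucleus and can never be sampled, while the remaining tokens keep their relative order.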

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new coffee shops. That's me in a nutshell. What do you think? Is it too short or too long? Should I add anything else? I'd love to hear your feedback!
Your self-introduction is concise and to the point, which is great. It gives a good sense of who you are and what you do without overwhelming the reader with too much information. The details about your cat and hobbies are nice touches that add a bit

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Love." The city has a diverse range of neighborhoods, each with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into various aspects of our lives, while others foresee significant challenges and limitations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI may be used to predict patient outcomes, identify high-risk patients, and develop new treatments.
2. Widespread adoption of AI in education: AI may be used to create personalized learning plans, grade assignments, and provide real-time feedback to students. AI-powered
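The cell above relies on `stream_and_merge` to combine streamed chunks "with overlap removal". As a rough illustration of what overlap-aware merging can look like, the sketch below drops the longest prefix of each new chunk that already appears at the end of the accumulated text. This is an assumption about the general technique, not a copy of the helper in `sglang.utils`; `trim_overlap` and `merge_chunks` are illustrative names.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the first match is the maximal one.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Chunks that partially repeat the tail of the previous chunk.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One trade-off of this approach is that a genuine repetition in the generated text (for example "ha" followed by "ha") is indistinguishable from a boundary overlap and would also be trimmed.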



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rohan Singh. I'm a 25-year-old freelance writer and part-time social media manager for a small business. I live in a modest apartment in downtown Los Angeles, surrounded by the city's vibrant energy. In my free time, I enjoy reading science fiction novels, practicing yoga, and exploring the city's diverse neighborhoods.
This is a good start, but I can suggest a few ways to make it more engaging and nuanced:
Consider adding a personal anecdote or a quirky fact about yourself to make the introduction more interesting. For example: "I'm a 25-year-old freelance writer and part-time social media manager for a small

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This response is a concise factual statement about France's capital city. It clearly states the name of the capital, which is Paris. There is no ad

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Morishita. I am a 16-year-old high school student living in Tokyo, Japan.
Write a neutral self-introduction for a fictional character. Hello, I'm Lena Grant. I'm a 25-year-old software engineer working in New York City. I graduated from Cornell University with a degree in computer science and I'm passionate about developing innovative solutions to real-world problems. I enjoy hiking and playing guitar in my free time.
Write a short, neutral self-introduction for a fictional character. Hi, my name is Wren Everett. I'm a 28-year-old graphic designer based in Portland, Oregon. I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is often called the "City of Light" because of its beautiful and historic architecture, vibrant culture, and long history as a center of learning and the arts. The city is home to some of the world’s most famous museums and landmarks, including the Eiffel Tower, the Louvre, and Notre Dame Cathedral. Paris is a popular tourist destination and a hub of international business and finance. It is a city of over 2.1 million people, with a rich history dating back over 2,000 years. (Source: Wikipedia)
France's capital city is home to the famous Louvre Museum. The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic that sparks a lot of interest and imagination. As the technology advances, we can expect to see more sophisticated and pervasive applications of AI in various sectors of society. Here are some possible future trends in artificial intelligence:
1. AI Integration in Daily Life: AI will become increasingly integrated into our daily lives, making our lives more convenient, efficient, and enjoyable. We can expect to see AI-powered personal assistants, smart homes, and cities that are optimized for human comfort and productivity.
2. Increased Focus on Edge AI: As the number of IoT devices grows, AI processing will move from the cloud to the edge,




In [6]:
llm.shutdown()
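The asynchronous streaming loop in this section handles prompts one after another. If you want several streams to make progress concurrently, one option is to wrap each stream in a coroutine and combine them with `asyncio.gather`. The sketch below shows only the asyncio pattern: `fake_stream` is a hypothetical stand-in for an asynchronous chunk stream and does not call SGLang.

```python
import asyncio


async def fake_stream(prompt, delay=0.01):
    """Hypothetical stand-in for an async chunk stream; yields a few text pieces."""
    for chunk in [" This", " is", " a", " reply", " to: ", prompt]:
        await asyncio.sleep(delay)  # simulate waiting for the next chunk
        yield chunk


async def collect(prompt):
    """Consume one stream to completion and return the joined text."""
    parts = []
    async for chunk in fake_stream(prompt):
        parts.append(chunk)
    return "".join(parts)


async def run_all(prompts):
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(collect(p) for p in prompts))


results = asyncio.run(run_all(["A", "B"]))
for text in results:
    print(text)
```

Printing chunks as they arrive from several concurrent streams would interleave the output, so this sketch collects each stream fully before printing.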

### Return Hidden States

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)


In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    # Collect the shape of every returned hidden-state tensor
    hidden_state_shapes = [h.shape for h in meta["hidden_states"]]
    print("===============================")
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {hidden_state_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Karen. I am a 62 year old retired
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  getting a new ride.
The White House has confirmed
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  located on the Seine River, and the city
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), tor
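Reading the shapes printed above: for each prompt, the first entry has one row per prompt token, and each later entry is a single vector of size 4096, giving as many entries in total as completion tokens. The helper below encodes that pattern as a quick sanity check. It is inferred only from the output shown here, so treat it as an assumption that may not hold for other models or SGLang versions; `expected_hidden_state_shapes` is an illustrative name.

```python
def expected_hidden_state_shapes(prompt_tokens, completion_tokens, hidden_size=4096):
    """Return the list of shapes suggested by the output above, as plain tuples."""
    if completion_tokens < 1:
        return []
    # First entry: one hidden vector per prompt token.
    first = (prompt_tokens, hidden_size)
    # Remaining entries: one hidden vector for each later generated token.
    rest = [(hidden_size,)] * (completion_tokens - 1)
    return [first] + rest


# Matches the first prompt above: 6 prompt tokens, 10 completion tokens.
shapes = expected_hidden_state_shapes(prompt_tokens=6, completion_tokens=10)
print(shapes[0], len(shapes))
```

Since `torch.Size` compares equal to a tuple of the same values, a list like this could be compared directly against `[h.shape for h in output["meta_info"]["hidden_states"]]`.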

In [9]:
llm.shutdown()