# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
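
To make the custom-server idea concrete, here is a minimal sketch of the request-handling core such a server might wrap around the engine. It is not the linked `custom_server.py` example. `EchoEngine` and `handle_generate` are hypothetical names, and the stub assumes that calling `generate` with a single prompt returns a dict with a `"text"` key, mirroring the per-prompt dicts shown later in this document.

```python
import json
from typing import Tuple


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a single dict with a "text" key.
        return {"text": f" [completion for: {prompt}]"}


def handle_generate(engine, raw_body: str) -> Tuple[int, str]:
    """Map a JSON request body to a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})

    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})

    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    # Fall back to empty sampling parameters when the client sends none.
    sampling_params = payload.get("sampling_params") or {}
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})
```

Any web framework can call a function like this from its route handler. Keeping the validation separate from the HTTP layer makes it testable without starting a server.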

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Xochitl!
My name is Xochitl, which means "flower" in the Nahuatl language, spoken by the Aztecs in Mexico. I'm a passionate artist and educator with a love for all things creative and colorful. I have a degree in Fine Arts and have been teaching art to children and adults for over 10 years. I specialize in painting, drawing, pottery, and printmaking.
As an artist, I find inspiration in the beauty of nature, the vibrant colors of Mexican culture, and the warmth of human connection. My art is a reflection of my love for life, my heritage, and
Prompt: The president of the United States is
Generated text:  both the head of state and the head of government of the United States. The president is elected to a four-year term through the Electoral College. The president is responsible for a wide range of powers and duties, including serving as the commander-in-chief of the armed forces, conducting foreign policy, and appointing federal judges, includin
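
When the prompt list is very large, it can be convenient to submit it in slices and consume results as they come back. The following is a minimal sketch of that pattern, not part of SGLang. `FakeEngine` and `generate_in_chunks` are hypothetical names, and the stub only imitates the "list of dicts with a `text` key" return shape seen above. The engine already schedules a batch internally, so the slicing here only bounds how many results are held at once.

```python
from typing import Dict, Iterator, List, Tuple


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict = None):
        return [{"text": f" <completion {len(p)}>"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, calling generate once per slice."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]
```

With a real engine, `FakeEngine()` would be replaced by the `llm` object created earlier. The rest of the function relies only on the `generate(prompts, sampling_params)` call used in the cell above.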

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm a bit of a introvert, but I'm working on being more outgoing. I'm a junior, and I'm trying to figure out what I want to do with my life after graduation. I'm a bit of a perfectionist, which can sometimes make things difficult for me, but I'm trying to learn to be more relaxed and go with the flow. I'm a bit of a bookworm, and I love getting lost in a good story. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is the largest city in France and is located in the northern part of the country. It is situated on the Seine River and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a rich cultural heritage and is home to many museums

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, from diagnosing diseases to developing personalized treatment plans.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for AI systems to be transparent and explainable, so that users can understand how they arrive at their decisions.
3. Growing importance of human-AI collaboration: As AI becomes more capable, humans and AI systems will need to work together more
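
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a sketch of one plausible approach, not necessarily how `sglang.utils.stream_and_merge` is implemented. `trim_overlap` and `merge_chunks` are illustrative names. The idea is that if a streamed chunk starts with text that already ends the accumulated output, only the new part is appended.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Accumulate streamed chunks into one string without repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

A caveat of this heuristic is that it cannot tell a re-sent overlap from a genuine repetition. If the model really emits the same text twice in a row, the second copy is trimmed as well.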



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily Lee. I'm a software engineer by trade, with a strong background in coding and IT. I'm currently working on a project to develop a new language processing algorithm. Outside of work, I enjoy reading science fiction and hiking in the mountains. I'm originally from a small town in the Midwest, but I've lived in the city for over 5 years now. I'm a bit of a coffee snob and have a weakness for dark chocolate. That's me in a nutshell! I'm excited to get to know you and hear about your interests. How about you? What brings you here today? Good conversation can be a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.  Paris is situated in the northern part of the country. It is the country's largest city and a major centre for culture, fashion, and tourism. Paris is known for its iconic landmarks such as the
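
If requests arrive independently rather than as one list, the asynchronous call can also be issued once per prompt with a cap on concurrency. Below is a minimal sketch of that pattern using only `asyncio`. `FakeAsyncEngine` and `generate_concurrently` are hypothetical names, and the stub assumes that `async_generate` with a single prompt returns one dict with a `"text"` key.

```python
import asyncio
from typing import Dict, List, Tuple


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable generate call."""

    async def async_generate(self, prompt: str, sampling_params: Dict = None):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: Dict, max_in_flight: int = 2
) -> List[Tuple[str, str]]:
    """Run one request per prompt, with at most max_in_flight active at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Tuple[str, str]:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return prompt, output["text"]

    # asyncio.gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Passing the whole list in a single `async_generate` call, as in the cell above, is simpler when all prompts are known up front. The semaphore variant is mainly useful when prompts trickle in from different callers.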

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia. I'm 25 years old. I work as a software developer in a small startup. I enjoy learning about new technologies and reading science fiction novels in my free time. I'm a bit of a introvert, but I appreciate the quiet, focused atmosphere of my work environment.
The tone of this self-introduction is neutral and to the point. It doesn't reveal too much about the character's personality or personal life, but it gives a sense of who they are and what they do.
In a few sentences, what is Emilia's profession and personality like?
Emilia is a software developer who works in a small

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the most visited city in the world. 2. What is the significance of the Eiffel Tower? The Eiffel Tower, built in 1889, was the world's tallest man-made structure and is now an iconic symbol of Paris and France. It is a source of national pride and is one of the most recognizable landmarks in the world. 3. What is the significance of the Louvre Museum? The Louvre is the world's largest art museum, housing over 550,000 works of art, including the Mona Lisa. It is a premier cultural institution and a must-visit destination for art lovers

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be exciting and potentially transformative.
The future of artificial intelligence (AI) is expected to be exciting and potentially transformative, with numerous trends that will shape the industry. Here are some possible future trends in AI:
1. Increased Adoption in Various Industries: AI will continue to be adopted across various industries, including healthcare, finance, education, and transportation. This will lead to increased efficiency, productivity, and innovation.
2. Advancements in Deep Learning: Deep learning, a subset of machine learning, will continue to advance, leading to more accurate and robust AI systems. This will enable AI to tackle complex tasks, such as image and speech

In [6]:
llm.shutdown()
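
The asynchronous streaming loop in this section handles one prompt at a time. When several streams should advance together and their text is not printed live, each stream can be collected in its own task. The sketch below shows that pattern with plain `asyncio`. `fake_stream`, `collect_stream`, and `stream_all` are hypothetical names, and `fake_stream` merely stands in for an async generator such as `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Tuple


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async generator of cleaned text chunks."""
    for piece in (" one", " two", " three"):
        await asyncio.sleep(0)  # let other streams make progress
        yield piece


async def collect_stream(
    prompt: str, stream_factory: Callable[[str], AsyncIterator[str]]
) -> Tuple[str, str]:
    """Consume one stream and return the prompt with its joined text."""
    parts = []
    async for chunk in stream_factory(prompt):
        parts.append(chunk)
    return prompt, "".join(parts)


async def stream_all(
    prompts: List[str], stream_factory: Callable[[str], AsyncIterator[str]]
) -> List[Tuple[str, str]]:
    """Run all streams concurrently; results keep the order of prompts."""
    return await asyncio.gather(
        *(collect_stream(p, stream_factory) for p in prompts)
    )
```

With a live engine, one would presumably pass something like `lambda p: async_stream_and_merge(llm, p, sampling_params)` as `stream_factory`, before the engine is shut down.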

### Return Hidden States

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)


In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
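    # Collect the shape of each returned hidden-state tensor for display.
    hidden_state_shapes = [h.shape for h in output["meta_info"]["hidden_states"]]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {output['meta_info']['prompt_tokens']}\t"
        f"Completion_tokens: {output['meta_info']['completion_tokens']}\n"
        f"Hidden states: {hidden_state_shapes}"
    )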
    print()

Prompt: Hello, my name is
Generated text:  Kevin and I have been a professional dog walker for
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  not a ceremonial figurehead. The president has many
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris, the most beautiful city in the world.
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]),
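
Judging from the printed shapes, the first hidden-state entry is two-dimensional (one row per prompt token) and each later entry is a single vector. Reading the later entries as one vector per decoding step is an interpretation of that output, not a documented guarantee. The sketch below flattens such a mixed list into one list of rows. It uses plain nested lists as a stand-in for tensors so that it runs without PyTorch, and `stack_hidden_states` is a hypothetical helper name.

```python
from typing import List, Sequence


def stack_hidden_states(hidden_states: Sequence) -> List[List[float]]:
    """Flatten a mix of 2-D blocks and 1-D vectors into a list of rows."""
    rows: List[List[float]] = []
    for entry in hidden_states:
        if len(entry) > 0 and isinstance(entry[0], (list, tuple)):
            # 2-D block, e.g. one row per prompt token.
            rows.extend(list(row) for row in entry)
        else:
            # 1-D vector, e.g. a single decoding step.
            rows.append(list(entry))

    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent hidden sizes: {sorted(widths)}")
    return rows
```

With real tensors, the same effect could likely be achieved by giving each one-dimensional entry a leading axis and concatenating along dimension 0, for example with `torch.cat`.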

In [9]:
llm.shutdown()