# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
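As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a small HTTP endpoint using only the Python standard library. It is not the linked `custom_server.py` example: the `/generate` route, the JSON field names, and the helpers `handle_generate`, `make_handler`, and `serve` are all hypothetical. The only assumption about the engine is that `generate(prompts, sampling_params)` accepts a list of prompts and returns a list of dicts with a `text` key, as the cells in this document show.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body (hypothetical schema)."""
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    # Pass a one-element batch and unwrap it, mirroring the list-in/list-out
    # usage shown in this document.
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical route
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host: str = "127.0.0.1", port: int = 8000):
    """Block and serve requests until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

This server is single-threaded and handles one request at a time, so it is only meant to show where the engine call sits. A server that needs to handle concurrent requests would more likely be built on the asynchronous API.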

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Preeya, I am a qualified BSc (Hons) in Midwifery. I have been a midwife for over 10 years and have extensive experience working in both community and hospital settings. I am passionate about empowering women and their families to make informed choices and decisions about their care.
I am registered with the Nursing and Midwifery Council (NMC) and have up to date knowledge of best practice and guidelines. I have also completed additional training in areas such as fetal monitoring, postnatal depression and baby first aid.
I understand that every woman's journey through pregnancy, birth and motherhood is unique and
Prompt: The president of the United States is
Generated text:  a hard person to get a read on. This is not just because of the secrecy and obfuscation that surrounds the office, but also because of the president’s unique position in the country’s political hierarchy. The president is simultaneously the leader of the executive branch, a
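Because `generate` returns one output per prompt in the same order (the loop above relies on this through `zip`), batch results are easy to collect into records for later analysis. The following is a minimal sketch: `chunked` and `run_batch` are hypothetical helpers, and splitting the prompts into fixed-size chunks is an illustrative choice rather than something the engine requires.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def run_batch(engine, prompts: List[str], sampling_params: dict, chunk_size: int = 64):
    """Generate completions chunk by chunk and pair each with its prompt."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        # One call per chunk; outputs are assumed to align with the chunk order.
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "completion": output["text"]})
    return records
```

With the engine created earlier, this could be called as `run_batch(llm, prompts, sampling_params)`.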

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and love reading about history and science. I'm a bit shy, but I'm working on being more outgoing. I'm a junior at Springdale High School. That's me in a nutshell. What do you think? Is it a good self-introduction?
Your self-introduction is clear and concise. It provides a good overview of your character's interests and personality. However, it may benefit from a bit more depth and personality. Here are some suggestions to consider:
*  

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city. It clearly and accurately states that the capital of France is Paris. This statement is a good example of a concise factual statement because it is brief, to the point, and provides a clear and accurate piece of information. It does not include any unnecessary details or opinions, making it a good choice for a concise factual statement. 
Note: This response is a direct answer to the prompt and does not require any additional information or context. It is a simple and straightforward statement that

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much debate and speculation, and it's difficult to predict exactly what will happen. However, based on current trends and the pace of innovation, here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with applications such as:
a. Predictive analytics: AI will be used to analyze large amounts of medical data to predict patient outcomes and identify high-risk patients.
b. Personalized medicine: AI will
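The cell above mentions overlap removal. One way such merging could work is sketched below, assuming that consecutive streamed chunks may repeat text the previous chunk already ended with. `trim_overlap` and `merge_chunks` are illustrative names, and this is not presented as the implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing_text` already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed text chunks, skipping text repeated at each seam."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

A heuristic like this cannot tell a true overlap from a legitimate repetition (for example, a chunk that starts with the same letter the previous one ended on), so it is only appropriate when chunks are known to overlap.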



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Remi Blackwood. I'm a 27-year-old freelance writer and artist from Portland, Oregon. I enjoy hiking and reading in my free time.
What can you infer about Remi from this introduction?
Remi is likely to be a creative and independent person, given that they are a freelance writer and artist. They may value their freedom and flexibility, as freelance work often allows for a non-traditional schedule. The mention of hiking and reading as free-time activities suggests that Remi enjoys the outdoors and values intellectual pursuits. The fact that they are from Portland, Oregon, may imply that they have a somewhat bohemian or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide an example of a famous building in Paris. The Eiffel Tower is an iconic symbol of the city.
What is the population of Paris? The popu
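The asynchronous API also makes it possible to issue several requests from one event loop and await them together. The sketch below is a hypothetical helper (`generate_concurrently` is not part of SGLang) that only assumes `async_generate(prompts, sampling_params)` behaves as in the cell above. It makes no claim about whether this is faster than a single batched call on a real engine.

```python
import asyncio
from typing import List


async def generate_concurrently(engine, prompts: List[str], sampling_params: dict):
    """Issue one async request per prompt and return (prompt, text) pairs in order."""
    tasks = [engine.async_generate([prompt], sampling_params) for prompt in prompts]
    # gather preserves the order of the awaitables it was given.
    results = await asyncio.gather(*tasks)
    return [(prompt, outputs[0]["text"]) for prompt, outputs in zip(prompts, results)]
```

Inside a coroutine this could be used as `pairs = await generate_concurrently(llm, prompts, sampling_params)`.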

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Eve Archer. I work as a sales representative for a tech firm, and I enjoy spending time in nature, reading, and trying out new recipes in my free time.
Eve Archer, a sales representative for a tech firm, here. I'm a creative problem-solver, analytical thinker, and nature enthusiast. When I'm not in the office, I enjoy reading, experimenting with new recipes, and exploring the outdoors.
Eve Archer, sales representative at a tech firm. In my free time, I like to hike, cook, and read. I'm always up for a challenge and looking to learn more about the people and world



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is home to the famous Eiffel Tower, which was originally intended for Barcelona but was eventually built in Paris. The tower was designed by Gustave Eiffel, a French engineer, and was completed in 1889 for the World’s Fair. At 324 meters (1,063 feet) tall, it was the tallest man-made structure in the world for over 40 years. Today, it is one of the most iconic landmarks in the world and attracts millions of visitors each year.
Paris is also known for its beautiful art and architecture, including the Louvre Museum, which houses a vast collection of artwork and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advances in machine learning, natural language processing, and computer vision. These technologies have the potential to revolutionize numerous industries, from healthcare and finance to education and transportation.
Possible future trends in AI include:
1. Increased use of deep learning: Deep learning, a subset of machine learning, has already shown significant improvements in various AI applications. As the field continues to evolve, we can expect to see more widespread adoption of deep learning techniques, leading to even more accurate and efficient AI systems.
2. Rise of natural language processing: As AI becomes more integrated into daily life, natural language processing will play a crucial role in
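When the streamed text is needed as a value rather than printed as it arrives, the chunks can be collected from the async iterator. The sketch below uses a hypothetical helper, `collect_stream`, that works with any async iterator of strings and also records how long the first chunk took to arrive.

```python
import asyncio
import time


async def collect_stream(chunks):
    """Consume an async iterator of text chunks; return the text and timing info."""
    start = time.perf_counter()
    first_chunk_latency = None
    parts = []
    async for chunk in chunks:
        if first_chunk_latency is None:
            # Seconds elapsed before the first chunk was received.
            first_chunk_latency = time.perf_counter() - start
        parts.append(chunk)
    return "".join(parts), first_chunk_latency, len(parts)
```

Inside a coroutine, and while the engine is still running, this could be used as `text, latency, n = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params))`.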




In [6]:
llm.shutdown()

### Return Hidden States

Creating the engine with `return_hidden_states=True` makes each output carry its hidden states under `output['meta_info']['hidden_states']`.

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)


In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(
        f"Prompt: {prompt}\nGenerated text: {output['text']}\nPrompt_Tokens: {output['meta_info']['prompt_tokens']}\tCompletion_tokens: {output['meta_info']['completion_tokens']}\nHidden states: {[i.shape for i in output['meta_info']['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Eli and I'm an artist. My work is
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of state and head of government of the
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris. France is a country in Europe.
France
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), t
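
The shapes printed above suggest that the first entry holds one row per prompt token and that each later entry holds a single vector for one decoding step. Under that assumption, the sketch below flattens such a list into one row per position. `flatten_hidden_states` is a hypothetical helper, and it operates on nested Python lists (for example after calling `.tolist()` on each tensor) so that it runs without PyTorch.

```python
def flatten_hidden_states(hidden_states):
    """Return a list of rows, one hidden vector per position."""
    rows = []
    for entry in hidden_states:
        if entry and isinstance(entry[0], (list, tuple)):
            # 2-D entry: one row per token, as in the first (prompt) entry above.
            rows.extend(list(row) for row in entry)
        else:
            # 1-D entry: a single vector for one decoding step.
            rows.append(list(entry))
    return rows
```

For the first prompt above, this would give 6 prompt rows followed by 9 single-step rows, each of length 4096.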

In [9]:
llm.shutdown()