# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chris!
I'm a 23 year old freelance writer and editor based in Melbourne, Australia. I'm passionate about writing about music, culture, and technology, and I've written for a variety of publications and websites over the years.
When I'm not writing, you can find me at live music shows, eating food from a variety of cuisines, or trying to keep up with the latest developments in the world of tech.
I'm excited to be here and look forward to connecting with you! What brings you to my corner of the internet? Want to chat about something specific or just shoot the breeze? Let's talk!
Prompt: The president of the United States is
Generated text:  a big deal. He has a lot of power, and his words and actions have a significant impact on the world. It's no wonder that many people pay close attention to what he says and does. But have you ever thought about the president's personal life? What's he like outside of the Oval Office? How does he spend his fre

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. When I'm not studying or debating, I can usually be found playing guitar or listening to music. I'm a bit of a introvert, but I'm working on being more outgoing. I'm a bit of a perfectionist, which can sometimes make me come across as stubborn or critical. I'm still figuring out who I am and where I fit in, but I'm excited to see what

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about the country of France. France is a country located in Western Europe.
Provide a concise factual statement about the country of France’s population. The population of France is approximately 67 million people.
Provide a concise factual statement about the country of France’s official language. The official language of France is French.
Provide a concise factual statement about the country of France’s currency. The official currency of France is the Euro.
Provide a concise factual statement about the country of France’s government. France is a republic with a semi-presidential system of government.
Provide a concise factual statement about the country of France’s history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, with applications such as personalized medicine, disease diagnosis, and patient care.
2. Advancements in natural language processing: AI systems will become more proficient in understanding and generating human language, enabling more effective communication between humans and machines.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions, leading to the development of Explain
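
The banner in the cell above mentions "overlap removal". As a rough illustration of that idea, the sketch below merges streamed chunks while dropping any text at the start of a chunk that was already emitted. It is a minimal sketch under assumptions, not necessarily how `sglang.utils.stream_and_merge` is implemented: `merge_chunks` and `demo_chunks` are hypothetical names, and the chunks are hand-written strings rather than real engine output.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks; a real stream would come from the engine.
demo_chunks = [" Paris", " Paris is", " is the capital", " of France."]
print(merge_chunks(demo_chunks))
```

One trade-off of this approach: a chunk that legitimately repeats the tail of the text so far (for example `"ha"` followed by `"ha"`) is treated as an overlap and dropped, so it only fits streams where repeated prefixes really are duplicates.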



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Nova Spacewalker and I'm a 22-year-old botanist.
I'm currently a graduate student at the university studying plant ecology. My hobbies include hiking, rock climbing, and playing the guitar. I'm a bit of a curious person and love learning about the natural world and the impact humans have on it. I'm a self-proclaimed optimist and try to see the good in every situation. I'm excited to meet new people and make new connections in the field of botany.
I'm a bit of a homebody and prefer spending time alone or with a small group of close friends. I'm not a big fan

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located at the heart of the Île-de-France region in northern France. It is situated in the north-central part of the country. Paris is often called the “City of Light” (La Ville Lumière) due t
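
Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below shows one way to await several independent batches concurrently with `asyncio.gather`. `StubEngine` and `run_batches` are hypothetical: the stub only mirrors the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out) so the example runs without a GPU, and it assumes the real engine tolerates concurrent calls.

```python
import asyncio
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompts."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(
    engine, batches: List[List[str]], sampling_params: Dict
) -> List[List[Dict]]:
    """Await several batches concurrently; results keep the order of `batches`."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch even if the batches finish out of order.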

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alexander. I'm a 25-year-old programmer with a passion for problem-solving. I enjoy playing chess and reading about history. Outside of work, you can find me hiking or practicing yoga. I'm currently living in a small apartment in downtown New York City. I value efficiency, honesty, and personal growth. I'm looking for opportunities to learn and collaborate with like-minded individuals. My strengths include analytical thinking, attention to detail, and effective communication. My weaknesses are my tendency to overanalyze situations and my occasional difficulty in delegating tasks. I'm excited to meet new people and explore new ideas.
Hello, my name is Lena.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The city of Paris is located in the northern part of France and is the country’s largest city.
Some of the famous landmarks in Paris include the Eiffel Tower, the Arc de Triomphe, the Louvre Museum, and Notre Dame Cathedral.
The language spoken in Paris is French, but English is widely spoken and understood.
Paris is a global center for fashion, cuisine, art, and entertainment.
France is a leader in international relations and global governance, with Paris hosting many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the Organization for Economic Cooperation and Development (OECD).

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  not set in stone, but several trends are emerging that are likely to shape the field in the coming years. Here are some possible future trends in artificial intelligence:
1. Increased use of machine learning: Machine learning is a subset of AI that enables computers to learn from data without being explicitly programmed. As data becomes increasingly available, machine learning is likely to become even more prevalent, leading to more accurate and efficient AI systems.
2. Rise of explainable AI: As AI becomes more ubiquitous, there is a growing need to understand how AI systems make decisions. Explainable AI (XAI) aims to provide insights into the decision-making process of


In [6]:
llm.shutdown()
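
The asynchronous streaming example above prints each chunk as it arrives. The sketch below shows the async-generator pattern behind that kind of loop: it wraps a chunk source and yields only text that has not been printed yet. It is a minimal sketch under a loudly stated assumption, namely that every chunk carries the full text generated so far; `fake_cumulative_stream`, `stream_deltas`, and `collect` are hypothetical names, and the source is a stub rather than a real engine stream.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_cumulative_stream(pieces: List[str]) -> AsyncIterator[Dict]:
    """Hypothetical source: every chunk repeats all text generated so far."""
    text = ""
    for piece in pieces:
        text += piece
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_deltas(source: AsyncIterator[Dict]) -> AsyncIterator[str]:
    """Yield only the text that is new since the previous chunk."""
    emitted = 0  # number of characters already passed on to the caller
    async for chunk in source:
        text = chunk["text"]
        if len(text) > emitted:
            yield text[emitted:]
            emitted = len(text)


async def collect(pieces: List[str]) -> List[str]:
    """Gather all deltas into a list (handy for inspection and testing)."""
    return [delta async for delta in stream_deltas(fake_cumulative_stream(pieces))]


print(asyncio.run(collect([" Paris", ".", "", " It"])))
```

Tracking a character offset avoids string comparisons entirely, but it is only valid when chunks are cumulative. If chunks are already incremental, they can be printed as they arrive.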

### Return Hidden States

In [7]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    # In the output below, the first entry is 2-D (prompt tokens) and the rest are 1-D.
    hidden_shapes = [h.shape for h in meta["hidden_states"]]
    print("===============================")
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {hidden_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Alex, and I am a digital nomad.
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the head of state and head of government of the
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a city like no other. Paris, the City
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size

In [9]:
llm.shutdown()
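
In the output above, the first `hidden_states` entry is two-dimensional (one row per prompt token) and the remaining entries are one-dimensional. Downstream code often wants a single matrix with one row per position. The sketch below shows that stacking step on plain nested lists so it runs without PyTorch; `stack_hidden_states` and the toy data are hypothetical and use a hidden size of 2 instead of 4096.

```python
from typing import List, Sequence, Union

Vector = List[float]
Matrix = List[Vector]


def stack_hidden_states(entries: Sequence[Union[Matrix, Vector]]) -> Matrix:
    """Flatten a mixed list of 2-D and 1-D entries into one row-per-position matrix."""
    rows: Matrix = []
    for entry in entries:
        if entry and isinstance(entry[0], (list, tuple)):
            # 2-D entry: contributes one row per token.
            rows.extend(list(row) for row in entry)
        else:
            # 1-D entry: contributes a single row.
            rows.append(list(entry))
    return rows


# Toy data with hidden size 2: a 3-token prompt entry, then two 1-D entries.
toy = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0.7, 0.8], [0.9, 1.0]]
stacked = stack_hidden_states(toy)
print(len(stacked), len(stacked[0]))
```

With real tensors, the same idea can be expressed by calling `unsqueeze(0)` on each 1-D entry and then `torch.cat` over the list, assuming all entries share the same hidden size.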