# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine in a Python script, you must guard the entry point with `if __name__ == "__main__":`, because SGLang uses the `spawn` start method to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**
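
The script layout this implies can be sketched as follows. `EchoEngine` is a hypothetical stand-in (not part of SGLang) so that the sketch runs without a GPU or a model download; in a real script the factory would construct `sgl.Engine(model_path=...)` instead. The point is that the engine is created only from inside the `if __name__ == "__main__":` block, so a subprocess started with `spawn`, which re-imports the script, does not try to launch another engine.

```python
class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics generate() and shutdown()."""

    def generate(self, prompts, sampling_params=None):
        # Return one dict per prompt, shaped like the outputs shown in this document.
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        pass


def main(engine_factory=EchoEngine):
    # Everything that creates the engine lives inside a function, not at module level.
    llm = engine_factory()
    try:
        prompts = ["Hello, my name is", "The capital of France is"]
        outputs = llm.generate(prompts, {"temperature": 0.8, "top_p": 0.95})
        return [(p, o["text"]) for p, o in zip(prompts, outputs)]
    finally:
        # Release the engine even if generation raises.
        llm.shutdown()


if __name__ == "__main__":
    # With the spawn start method, child processes re-import this file under a
    # different module name, so this block runs only in the parent process.
    for prompt, text in main():
        print(f"Prompt: {prompt}\nGenerated text: {text}")
```

To run against the real engine, pass a factory such as `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")` to `main`.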

You can also build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Soren. I'm a young photographer living in Copenhagen, Denmark. I shoot a lot of street photography, capturing the everyday moments and the character of the city. I also enjoy experimenting with new techniques and styles, and I'm always looking to push my creativity and skills to the next level. When I'm not behind the lens, you can find me exploring the city's hidden gems or sipping coffee at a local café. I'm excited to share my photography with you and connect with other creatives! Follow me for a glimpse into the vibrant world of Copenhagen! #streetphotography #copenhagen #denmark #photography #
Prompt: The president of the United States is
Generated text:  trying to deceive you. It’s not a conspiracy theory; it’s a fact.  You might have noticed that his words often don’t align with reality.  He often uses language that’s intentionally misleading or outright false.
As a nation, we should be deeply concerned about this phenomenon.  It’s not 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm currently working on a novel and trying to get my writing career off the ground. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm always up for a good conversation and a cup of coffee.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states facts about Kaida's life and interests. It also doesn't try to impress or manipulate the reader,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. This is a concise and factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a starting point for further discussion or exploration of the topic. The statement is also accurate and reliable, as it is a widely accepted fact about France’s capital city. Overall, this statement meets the requirements of a concise and factual statement about France’s capital city. The statement is also easy to understand and remember, making it a useful piece of information

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption in Everyday Life: AI is likely to become more ubiquitous in everyday life, with applications in areas such as healthcare, finance, transportation, and education.
2. Advancements in Natural Language Processing: NLP will continue to improve, enabling AI systems to better understand and generate human language, leading to more effective communication between humans and machines.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need for transparency and explainability in AI decision
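
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. It assumes that a streamed chunk may repeat text that was already emitted; the helper names `trim_overlap` and `merge_stream` and the sample chunks are illustrative, and whether `stream_and_merge` is implemented this way is an assumption.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the prefix of new_chunk that already appears at the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Merge an iterable of {'text': ...} chunks into one string without repeats."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks in which each one repeats part of what was already emitted.
chunks = [{"text": " Paris"}, {"text": " Paris."}, {"text": ".\nThe capital"}]
print(repr(merge_stream(chunks)))
```

One caveat of this approach: a chunk that legitimately begins with the same characters that end the text so far (for example "ha" followed by "ha") is trimmed as well, so it suits streams that are known to overlap.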



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ava Stone. I'm a 25-year-old graphic designer living in Portland, Oregon. I love the eclectic vibe of my city and enjoy exploring its many neighborhoods and coffee shops. In my free time, I enjoy practicing yoga and hiking in the nearby forests. I'm a creative and curious person who is always looking for new inspiration and challenges. How can I help you?
The response should be in the first person and 3-4 sentences long, with a focus on the character's personality and interests.
I'm Eliana, a freelance journalist and avid traveler. I'm passionate about storytelling and capturing the beauty of the world around me

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Although this response is brief and to the point, it may not provide the reader with enough information. You might want to consider adding more d

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ember. I'm a quiet, observational person who spends a lot of time alone, observing the world around me. I have a passion for collecting rare books and learning about the past. I'm not much of a people person, but I'm always eager to learn and discover new things.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It also gives a glimpse into the character's personality and interests. It's short and to the point, making it easy to read and understand.
Here are a few ways to make this self-introduction more engaging:
* Add a few personal details to make the character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital city of France is Paris. Paris is a city that is located in the north-central part of the country. The Eiffel Tower is a famous landmark in Paris.
Read more: What is the capital of France?
What is the capital of the United States?
What is the capital of France? Paris.
What is the capital of Australia? Canberra.
What is the capital of Germany? Berlin.
What is the capital of Canada? Ottawa.
What is the capital of India? New Delhi.
What is the capital of Japan? Tokyo.
What is the capital of South Africa? Pretoria.
What is the capital of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and it has a lot of potential. Here are some possible trends in AI that could shape the future of technology and society.
One of the key trends in AI is the increasing use of edge AI, which involves running AI models on devices at the edge of the network, rather than in the cloud. This can improve performance and reduce latency, making AI more accessible and useful in a wider range of applications.
Another trend is the growing use of transfer learning, which involves training AI models on one task and then adapting them to another task. This can make AI more efficient and cost-effective, and enable it to learn from a wider range of

In [6]:
llm.shutdown()
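
The asynchronous streaming loop in this section consumes an async generator of cleaned chunks. The sketch below shows one simple way such a generator could be written, assuming each streamed chunk carries the full text generated so far (an assumption about the stream, not something this document states). `fake_async_stream`, `merge_async_stream`, and `collect` are hypothetical names, and the fake stream replaces the engine so the sketch runs anywhere.

```python
import asyncio


async def fake_async_stream(texts):
    """Hypothetical stand-in for an engine's async stream of {'text': ...} chunks."""
    for text in texts:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": text}


async def merge_async_stream(chunks):
    """Yield only the new part of each chunk, assuming chunks carry cumulative text."""
    emitted = ""
    async for chunk in chunks:
        text = chunk["text"]
        if text.startswith(emitted):
            delta = text[len(emitted):]  # cumulative chunk: keep the unseen suffix
        else:
            delta = text  # not cumulative: treat the whole chunk as new
        emitted += delta
        if delta:
            yield delta


async def collect(texts):
    # Gather the cleaned chunks into a list so they are easy to inspect.
    return [d async for d in merge_async_stream(fake_async_stream(texts))]


if __name__ == "__main__":
    print(asyncio.run(collect([" Paris", " Paris.", " Paris.\nThe"])))
```

Because only the unseen suffix is yielded, printing each piece with `end=""` reproduces the generated text exactly once.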

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.38it/s]



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    hidden_shapes = [h.shape for h in meta["hidden_states"]]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {hidden_shapes}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Christy and I am a tour guide for I
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  set to visit India for the first time in 
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a city of romance, love, art, fashion
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([
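
In the output above, the first hidden-state entry has shape `(prompt_tokens, 4096)` and each later entry has shape `(4096,)`. The sketch below checks that layout and reports what shape a single stacked array would have. It works on plain shape tuples so it runs without PyTorch; `describe_hidden_states` is a hypothetical helper, and the layout itself is inferred from the printed output rather than from a documented guarantee.

```python
def describe_hidden_states(shapes):
    """Split a list of hidden-state shapes into prefill and decode parts.

    Assumes the layout seen in the output above: the first entry is
    (prompt_tokens, hidden_size) and every later entry is (hidden_size,).
    """
    prefill, *decode = [tuple(s) for s in shapes]
    if len(prefill) != 2:
        raise ValueError(f"expected a 2-D prefill entry, got shape {prefill}")
    prompt_tokens, hidden_size = prefill
    for shape in decode:
        if shape != (hidden_size,):
            raise ValueError(f"unexpected decode entry shape {shape}")
    return {
        "prompt_tokens": prompt_tokens,
        "decode_steps": len(decode),
        "hidden_size": hidden_size,
        # Shape of the single 2-D array you would get by stacking everything.
        "stacked_shape": (prompt_tokens + len(decode), hidden_size),
    }


# Shapes copied from the first prompt's output above.
shapes = [(6, 4096)] + [(4096,)] * 9
print(describe_hidden_states(shapes))
```

With real outputs, the shapes can be taken from `[h.shape for h in output["meta_info"]["hidden_states"]]`, as in the cell above.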

In [9]:
llm.shutdown()