# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
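As a rough illustration of the idea (this is not the linked `custom_server.py`), the sketch below wraps an engine in Python's standard-library `http.server`. The names `handle_generate`, `make_handler`, and `serve` are hypothetical, and the code assumes that `engine.generate` called with a single prompt string returns a dict with a `"text"` key, mirroring the per-prompt outputs printed in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, run the engine once, and encode a JSON reply."""
    request = json.loads(body)
    # Assumption: a single prompt string yields a single dict with a "text" key.
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request-handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Serve POST requests until interrupted (blocking call)."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

`HTTPServer` handles one request at a time, so this sketch does not batch concurrent requests; a production server would more likely use an asynchronous framework together with the engine's asynchronous API.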

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.
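When the prompt list is very large, it can be convenient to submit it in slices so that results become available incrementally. The helper below is a minimal sketch under that assumption; `generate_in_chunks` and `chunk_size` are hypothetical names, and the only engine call used is `engine.generate(prompts, sampling_params)` with a list of prompts, as in the cells of this document.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (prompt, output) pairs, submitting at most chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start : start + chunk_size])
        # One batched call per slice; the engine schedules the prompts inside it.
        outputs = engine.generate(chunk, sampling_params)
        yield from zip(chunk, outputs)
```

A caller would iterate over it, for example `for prompt, output in generate_in_chunks(llm, prompts, sampling_params): ...`. Passing the whole list in one call, as the cells in this document do, remains the simpler option when everything fits comfortably in memory.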

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.22it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lindsay and I am an owner of a 3 year old cat named Lola. Lola is my baby and I love her to pieces. She is a beautiful calico and has such a sweet disposition.
I have a question for the community, I have noticed that Lola has been drinking more water lately and I am not sure why. She is eating her regular food and I have not changed anything in her environment. Is this a sign of a medical issue? Should I be concerned?
I would love to hear your thoughts on this matter and any suggestions you may have.

### Step 1: Assessing Lola's Behavior
Lola's increased
Prompt: The president of the United States is
Generated text:  not above the law and cannot be sued for official acts committed while in office, the Supreme Court ruled Monday in a decision that undermines Donald Trump's legal strategy in a defamation lawsuit.
But the court said a president can be sued for actions taken while in office that are not related to his official duties.
The 8-1 deci
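For offline jobs the outputs are usually persisted rather than printed. The following sketch pairs each prompt with its output and writes one JSON object per line. The helper names `to_records` and `write_jsonl` are hypothetical; the `text` key is the one used above, and the `meta_info` keys `prompt_tokens` and `completion_tokens` are the ones printed in the hidden-states example of this document. Missing metadata is tolerated and stored as `null`.

```python
import json
from typing import Any, Dict, List, Sequence


def to_records(
    prompts: Sequence[str], outputs: Sequence[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Combine prompts and engine outputs into flat, JSON-serializable records."""
    records = []
    for prompt, output in zip(prompts, outputs):
        meta = output.get("meta_info", {})
        records.append(
            {
                "prompt": prompt,
                "text": output["text"],
                "prompt_tokens": meta.get("prompt_tokens"),
                "completion_tokens": meta.get("completion_tokens"),
            }
        )
    return records


def write_jsonl(records: Sequence[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With the variables from the cell above, this would be called as `write_jsonl(to_records(prompts, outputs), "outputs.jsonl")`, where the file name is only an example.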

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm currently working on a novel about a young woman's journey through the Japanese countryside. When I'm not writing, you can find me practicing yoga or browsing through used bookstores. I'm a bit of a introvert, but I'm always up for a good conversation. What do you think? Is there anything you'd like to add or change?
Here are a few suggestions to make your self-introduction more engaging and effective:
1.  Add a personal touch: While your introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. The city is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also famous for its fashion, cuisine, and romantic atmosphere. It is a popular tourist destination and a hub for international business and culture. The city has a rich history dating back to the Middle Ages and has been a center of power and influence for centuries. Today, Paris is a vibrant and diverse city with a population of over 2.1 million people. It is a city that seamlessly blends tradition and modernity,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and accountability. Explainable AI (XAI) aims to provide insights into how AI systems make decisions, enabling humans to understand and trust
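The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk begins with text that has already been received, only the unseen remainder is appended. The code below is an illustrative sketch of that idea, not necessarily how `stream_and_merge` is implemented. The names `strip_overlap` and `merge_stream` are hypothetical, and chunks are assumed to be dicts with a `"text"` key.

```python
from typing import Any, Dict, Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing."""
    max_possible = min(len(existing), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Accumulate streamed chunks into one string without repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Works for cumulative chunks ("Hello" then "Hello wor") and partial overlaps.
demo = [{"text": "Hello"}, {"text": "Hello wor"}, {"text": "world!"}]
print(merge_stream(demo))  # prints "Hello world!"
```

One caveat of this approach is that it cannot distinguish an accidental overlap from text the model genuinely repeats, so a chunk such as `"ha"` following `"ha"` would be dropped.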



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida, and I am a 17-year-old high school student. I enjoy spending my free time reading and hiking in the mountains. I am a bit of an introvert, but I am always willing to lend a helping hand when needed. I am looking forward to meeting new people and making new friends.
I think the most important thing to include in a neutral self-introduction is to provide basic information about the character, such as their name, age, and interests. You should also aim to create a positive and friendly tone to make a good impression on others. Here are some tips to help you write a neutral self-introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital city of France, located in the northern region of the country. It is situated along the Seine River and is known for its rich history, art, fa
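The asynchronous API is also useful when prompts arrive independently instead of as one ready-made list. The sketch below issues one `async_generate` call per prompt and caps the number of in-flight calls with a semaphore. `generate_concurrently` and `max_concurrency` are hypothetical names, and the code assumes that `async_generate` accepts a single prompt string and returns a single output dict.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_concurrently(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_concurrency: int = 8,
) -> List[Dict[str, Any]]:
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> Dict[str, Any]:
        async with semaphore:
            # Assumption: a single prompt string yields a single output dict.
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the prompts were given.
    return list(await asyncio.gather(*(run_one(p) for p in prompts)))
```

When all prompts are known up front, the single batched call shown in the cell above is simpler and leaves scheduling entirely to the engine.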

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaelyn Waverly. I'm 25 years old and I'm a software engineer working at a mid-sized tech company in the city. I like hiking and reading in my free time, but I'm also quite competitive when it comes to video games and board games. I currently live alone in a cozy apartment in the suburbs with my cat, Luna. I'm easy-going and friendly, and I'm not one to shy away from a challenge.

## Step 1: Determine the key elements to include in the self-introduction.
To write a neutral self-introduction for Kaelyn Waverly, we need to include essential details



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Is this a direct answer to the prompt? Yes, the statement directly answers the prompt. It is a concise factual statement that names the capital of France. If you were to put this response into a list, it would be the first item. 1. The capital of France is Paris. Provide a second concise factual statement about Paris, the capital of France. The Eiffel Tower is a famous landmark in Paris. Is this a direct answer to the prompt? Yes, this statement is a direct answer to the prompt. It provides a concise factual statement about Paris. If you were to put this response into a list,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advancements in machine learning, natural language processing, and computer vision.
Artificial Intelligence (AI) is a rapidly evolving field that is transforming the way we live, work, and interact with technology. As AI continues to advance, it is likely to have a significant impact on various aspects of our lives. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of Edge AI: Edge AI refers to the processing of AI tasks at the edge of the network, closer to the data source, rather than in the cloud. This trend is expected to continue as edge AI enables faster and more efficient processing of data
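The cell above streams the prompts one after another. As a sketch of an alternative, several streams can be consumed concurrently and each collected into its final text. `collect_stream` and `stream_all` are hypothetical helpers; `make_stream` is any callable that returns an async iterator of text chunks for a prompt, for example `lambda p: async_stream_and_merge(llm, p, sampling_params)`. Whether concurrent streams suit a given workload is an assumption to verify.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, Sequence


async def collect_stream(chunks: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


async def stream_all(
    make_stream: Callable[[str], AsyncIterator[str]],
    prompts: Sequence[str],
) -> Dict[str, str]:
    """Consume one stream per prompt concurrently; map each prompt to its text."""
    texts = await asyncio.gather(*(collect_stream(make_stream(p)) for p in prompts))
    return dict(zip(prompts, texts))
```

Because the chunks from different prompts interleave in time, this variant collects text instead of printing it as it arrives; printing live output is easier to read when streaming a single prompt at a time.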




In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.34it/s]



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {output['meta_info']['prompt_tokens']}\t"
        f"Completion_tokens: {output['meta_info']['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in output['meta_info']['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Dan and I am a freelance writer and editor.
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the leader of the executive branch of the federal government
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a city like no other, full of romance,
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torc
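In the output above, each `hidden_states` list starts with one two-dimensional entry covering the whole prompt (for example `[6, 4096]`), followed by one-dimensional entries of size 4096. For downstream use it is often handier to have a single sequence with one row per position. The helper below is a sketch of that reshaping in plain Python. `hidden_state_rows` is a hypothetical name, and entries are assumed to be either nested lists or objects with a `tolist()` method, such as tensors.

```python
from typing import Any, List, Sequence


def hidden_state_rows(hidden_states: Sequence[Any]) -> List[List[float]]:
    """Flatten mixed 2-D and 1-D hidden-state entries into one row per position."""
    rows: List[List[float]] = []
    for entry in hidden_states:
        # Tensors and arrays are converted to nested Python lists first.
        if hasattr(entry, "tolist"):
            entry = entry.tolist()
        if entry and isinstance(entry[0], (list, tuple)):
            # 2-D entry: one row per position it covers.
            rows.extend(list(row) for row in entry)
        else:
            # 1-D entry: a single position.
            rows.append(list(entry))
    return rows
```

Applied to the first output above (one `[6, 4096]` entry plus nine `[4096]` entries), this would give 15 rows of 4096 values each.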

In [9]:
llm.shutdown()