# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")








### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Liza and I'm a travel enthusiast. I'm a bit of a wanderlust, always on the move, and exploring new destinations. I love trying new foods, drinks, and cultural experiences. I'm also passionate about photography and capturing the beauty of the world around me.
I'm a proud foodie, always on the lookout for the best local cuisine and hidden gems. I'm also a bit of a thrill-seeker, always up for trying new adventures and activities, from hiking to skydiving.
I'm a firm believer that travel is one of the best ways to learn about the world and its people. It broadens
Prompt: The president of the United States is
Generated text:  the head of state and the head of government of the United States. The president serves a four-year term and is responsible for executing the laws passed by Congress. The president is also the commander-in-chief of the armed forces and has the power to negotiate treaties, appoint federal judges and other officials, and grant 
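
For offline batch jobs it is often more convenient to persist results than to print them. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on the `output['text']` field used in the cell above; the helper name `to_jsonl` and the sample data are hypothetical stand-ins so the snippet runs without a GPU.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize (prompt, completion) pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        # Only the 'text' field is used; it is the field printed in the cell above.
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical stand-in for the list returned by llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
print(to_jsonl(sample_prompts, sample_outputs))
```

Writing one record per line keeps the file appendable, which helps when a long batch is processed in several calls.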

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist and can get pretty stressed out when things don't go according to plan. I'm a bit of a introvert and prefer to spend time alone or with close friends. I'm not really sure what I want to do with my life yet, but I'm hoping to figure that out in college. I'm a bit of a curious person and love learning new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its romantic atmosphere, beautiful parks, and vibrant nightlife. Paris is a popular tourist destination and is considered one of the most beautiful and culturally rich cities in the world. The city has a population of over 2.1 million people and is a major hub for business

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in customer service: AI-powered chatbots and virtual assistants are becoming increasingly common in customer service, and this trend is expected to continue. AI-powered systems may be able to handle complex customer
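
The banner printed above mentions overlap removal. As a rough illustration of that idea (not necessarily how `stream_and_merge` is implemented), the sketch below appends each new chunk to the accumulated text after dropping the longest prefix of the chunk that already appears as a suffix of the text. The function names and sample chunks are hypothetical.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries repeat a few characters.
print(merge_chunks(["Paris is", " is the capital", "capital of France."]))
```

A suffix/prefix heuristic like this can also drop text that is legitimately repeated (for example two consecutive identical chunks), so it only suits streams that are known to overlap.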



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lena. I work as a freelance writer and part-time librarian in a small town. I enjoy reading, hiking, and trying out new recipes in my free time.
The given text is a neutral self-introduction for a fictional character named Lena. The introduction provides a brief overview of her profession, hobbies, and interests. The tone of the text is calm and matter-of-fact, conveying no strong emotions or biases. It aims to give a straightforward impression of Lena, allowing the reader to form their own opinion about her.

To write a neutral self-introduction, we can follow these steps:

1.  **Start with a greeting**:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. There are 20 arrondissements (districts) in Paris. Paris is the largest city in France, with a population of over 2.1 million people. The city has a ric
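
One reason to use the asynchronous API is to overlap generation with other coroutines. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU it uses a hypothetical `StubEngine` whose `async_generate` mimics the call shape used above (a list of prompts in, a list of dicts with a `text` field out); whether the real engine behaves well under this exact pattern is an assumption that is not verified here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Simulate generation latency.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Submit several prompt batches concurrently and keep results in batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(generate_batches(StubEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```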

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I'm a freelance journalist and a bit of a wanderer, always looking for the next big story. I've written for various publications, from the Galactic Times to the Andromedian Chronicle. My articles often focus on the intersection of technology and society, but I'm not afraid to tackle other topics when they catch my eye. I'm based in a small, orbiting city on the edge of the galaxy, but I'm not tied down. I've been known to travel to some...interesting places in search of a good scoop. That's me in a nutshell. Now, what's your story?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France has a land area of 643,801 square kilometers and a population of 66.6 million people. The largest French overseas department is French Guiana, which is located on the northern coast of South America. The largest island in French Guiana is the Cayenne Island. France has a total of 13 overseas departments and territories, including French Guiana, Martinique, Guadeloupe, Réunion, and New Caledonia. The population of France is projected to reach 67.6 million by 2020. The average lifespan in France is 83.4 years. The official language of France

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  already being written, and it will likely change the world. In the last decade, artificial intelligence (AI) has become a hot topic. AI has made tremendous progress in various fields, from computer vision to natural language processing, and from robotic process automation (RPA) to machine learning (ML).
The future of AI is already being written, and it will likely change the world. In this article, we will explore possible future trends in artificial intelligence.
Artificial Intelligence Trends

Here are some possible future trends in AI that you should be aware of:
1. Edge AI: Edge AI refers to the processing of AI workloads at the



In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta_info = output["meta_info"]
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta_info['prompt_tokens']}\t"
        f"Completion_tokens: {meta_info['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in meta_info['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Devyn. I am a freelance writer and editor
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  the chief executive of the federal government, and it
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  famous for its iconic landmarks, art museums, fashion
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])
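
In the output above, the first hidden-state entry is two-dimensional (one row per prompt token) and each later entry is a single vector. The sketch below flattens such a list into one row per token position. It works on plain nested lists, for example after calling `.tolist()` on each tensor, which is an assumption about how the values are converted; the helper name and the tiny sample data are hypothetical.

```python
def flatten_hidden_states(hidden_states):
    """Flatten a mix of 2-D (prompt) and 1-D (decode step) entries into a list of rows."""
    rows = []
    for entry in hidden_states:
        if entry and isinstance(entry[0], list):
            # 2-D entry: one vector per prompt token.
            rows.extend(entry)
        else:
            # 1-D entry: a single vector for one decode step.
            rows.append(entry)
    return rows


# Hypothetical sample: 3 prompt tokens and 2 decode steps, hidden size 2.
sample = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0.7, 0.8], [0.9, 1.0]]
rows = flatten_hidden_states(sample)
print(len(rows), len(rows[0]))
```

With the real tensors, the same result could likely be reached by stacking them with a tensor library; the pure-Python version is used here only to keep the example dependency-free.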

In [9]:
llm.shutdown()