# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

**To launch the offline engine from a Python script, wrap the entry point in an `if __name__ == "__main__":` guard, because the engine uses `spawn` mode to create subprocesses. See this [simple example](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py) for details.**

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # only imported when the docs run in CI

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Wendy and I am a 30-something-year-old stay-at-home-mom, wife, and entrepreneur. I was born and raised in New York City, but now live in a small town in the Midwest with my husband, young daughter, and two mischievous cats. I love everything about life - the good, the bad, and the ugly. I am passionate about self-improvement, trying new things, and learning as much as I can.
My life has been a bit unconventional, to say the least. I started my career in the non-profit sector, but after a few years, I decided to pursue my passion
Prompt: The president of the United States is
Generated text:  one of the most powerful and influential positions in the world, and it is a highly sought-after job. Many people dream of becoming the president, but few have the ability and drive to achieve that goal. In this article, we will explore the qualifications, skills, and experiences that are necessary to become a successful president.
To become a successful pr
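
The `temperature` and `top_p` entries in `sampling_params` control how the next token is sampled. The sketch below illustrates the textbook definitions of temperature scaling and nucleus (top-p) filtering in plain Python. It is not SGLang's sampler, and the function names `softmax_with_temperature` and `top_p_filter` are hypothetical names chosen for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}
```

Under these definitions, a higher `temperature` flattens the distribution and a higher `top_p` keeps more candidate tokens, so both make the output more varied.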

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team. I like to think I'm pretty laid-back and easy-going, but I can get pretty passionate about the topics I care about. I'm not really sure what I want to do with my life yet, but I'm excited to explore my options and see where they take me. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
I think your self-introduction is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for business, culture, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a diverse population and is home to people from all over the world. Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the development of AI-powered robots that can assist with surgeries and other medical procedures.
2. Widespread adoption of AI in education: AI has the potential to revolutionize the education sector by providing personalized learning experiences, automating grading, and enabling real-time feedback
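
The cell above relies on `stream_and_merge` to merge streamed chunks with overlap removal. As a rough illustration of that idea, here is a minimal sketch written from scratch. The helpers `drop_overlap` and `merge_chunks` are hypothetical names for this example and may differ from what `sglang.utils` does internally.

```python
def drop_overlap(existing, chunk):
    """Remove the longest prefix of `chunk` that repeats a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found, keep the whole chunk


def merge_chunks(chunks):
    """Concatenate chunks in order, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged
```

For example, `merge_chunks([" Paris", " Paris is", " is the capital"])` gives `" Paris is the capital"`. One caveat of this simple approach is that text which legitimately repeats across a chunk boundary would also be trimmed.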



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Eira. I'm a twenty-something freelance journalist living in the city. I've been covering local news and events for a few years now. I like to think of myself as a curious and driven individual who is always looking for the next story. I've had my fair share of experiences, both good and bad, and I'm always eager to learn and grow.
Write a short, neutral self-introduction for a fictional character. Hello, my name is Eira. I'm a twenty-something freelance journalist living in the city. I've been covering local news and events for a few years now. I like to think of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in France and serves as the political, economic, and cultural center of the country. Paris is known for its stunning architecture, rich history, and iconic landmarks 
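
Because `async_generate` is awaited, several batches can in principle be submitted from one event loop. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in so the example runs without a GPU, and whether a real engine handles concurrent calls efficiently is an assumption to verify for your version.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in exposing an `async_generate` coroutine."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control to the event loop
        return [{"text": f" <stub completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Submit every group of prompts concurrently; results keep submission order."""
    pending = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    return await asyncio.gather(*pending)
```

Calling `asyncio.run(generate_groups(engine, groups, sampling_params))` returns one list of outputs per group, in the same order as `groups`.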

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge yields each new piece of text with overlaps removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Luna Nightshade. I'm a private investigator, specializing in cases that involve the supernatural. I'm based in New Orleans, where the veil between worlds is thin. I have a background in the occult and I'm familiar with the city's various factions. I'm not easily intimidated, but I do have a soft spot for cats.
What do you think? Is it a good introduction?
It seems a good start, but I think you can make it more interesting. Here are a few suggestions:
You can add a bit more personality to the introduction. For example, you could mention a distinctive feature of your character, such as their appearance

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The post Provide a concise factual statement about France’s capital city appeared first on onlinetutorhelp.com.
The post Provide a concise factual statement about France’s capital city appeared first on onlinetutorhelp.com. https://onlinetutorhelp.com/2022/06/24/provide-a-concise-factual-statement-about-frances-capital-city/ https://onlinetutorhelp.com/wp-content/uploads/2022/06/Capture-4.png https://onlinetutorhelp.com/ https://onlinetutorhelp.com/ https://onlinetutorhelp.com/ https://on

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of great interest and speculation. Here are some possible future trends in AI:
1. AI Advancements in Healthcare:
AI is expected to improve healthcare outcomes and save lives by:
Identifying high-risk patients and predicting disease progression

Developing personalized treatment plans and medication regimens

Assisting surgeons with robotic surgery and reducing surgical complications

2. Increased Adoption of AI in Education:
AI will likely play a significant role in the education sector by:
Developing adaptive learning systems that tailor to individual student needs

Automating grading and assessment to reduce teacher workload

Providing AI-powered virtual teaching assistants to supplement human instructors

3. Growing Use
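
The streaming loop above prints only the new text from each chunk. The following minimal sketch shows one way to get that behavior, assuming each streamed chunk repeats the full text generated so far. `fake_cumulative_stream` and `stream_deltas` are hypothetical helpers for illustration, not part of SGLang, and the chunk format of a real engine may differ by version.

```python
import asyncio


async def fake_cumulative_stream(pieces):
    """Hypothetical stream: every chunk carries all text produced so far."""
    text = ""
    for piece in pieces:
        text += piece
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the part of each chunk that has not been seen before."""
    seen = ""
    async for chunk in chunks:
        text = chunk["text"]
        if text.startswith(seen):
            delta, seen = text[len(seen):], text
        else:
            # Chunk is not cumulative: pass it through and extend the record.
            delta, seen = text, seen + text
        if delta:
            yield delta


async def collect(pieces):
    """Gather all deltas into a list, mainly for inspection."""
    return [d async for d in stream_deltas(fake_cumulative_stream(pieces))]
```

In a script, `asyncio.run(collect([" Luna", " Night", "shade"]))` returns the three pieces unchanged, even though the underlying stream repeated earlier text in every chunk.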




In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.44it/s]



In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    meta = output["meta_info"]
    print("===============================")
    print(
        f"Prompt: {prompt}\n"
        f"Generated text: {output['text']}\n"
        f"Prompt_Tokens: {meta['prompt_tokens']}\t"
        f"Completion_tokens: {meta['completion_tokens']}\n"
        f"Hidden states: {[i.shape for i in meta['hidden_states']]}"
    )
    print()

Prompt: Hello, my name is
Generated text:  Thembelihle, which means happiness in
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  not a president but an employee of the corporate state
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  a city of grandeur and beauty, steeped
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096
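
In the output above, the first hidden-state entry has one row per prompt token, and each later entry is a single vector. The sketch below encodes that pattern as observed in this run. The helper `expected_hidden_state_shapes` is a hypothetical name, and the layout is inferred from the printed shapes rather than from a documented guarantee, so it may differ in other versions.

```python
def expected_hidden_state_shapes(prompt_tokens, completion_tokens, hidden_size):
    """Return the shapes seen above: one prefill block, then one vector per decode step."""
    if completion_tokens < 1:
        raise ValueError("completion_tokens must be at least 1")
    prefill = (prompt_tokens, hidden_size)
    # The printed list has `completion_tokens` entries in total, so
    # completion_tokens - 1 single-vector entries follow the prefill block.
    decode = [(hidden_size,)] * (completion_tokens - 1)
    return [prefill] + decode
```

For the first prompt (6 prompt tokens, 10 completion tokens, hidden size 4096), this yields `(6, 4096)` followed by nine `(4096,)` entries, matching the printed list.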

In [9]:
llm.shutdown()