# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
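
As a rough illustration of what such a server does, the sketch below keeps request handling separate from any web framework. Everything in it is hypothetical: `handle_generate`, the JSON field names, and `StubEngine` are stand-ins invented for this example. It assumes only that the engine exposes `generate(prompt, sampling_params)` and that a single-prompt call returns a dict with a `text` key. The linked script remains the reference example.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params):
        return {"text": f" (echo of {len(prompt)} characters)"}


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into a (status code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    if not isinstance(prompt, str):
        error = {"error": "'prompt' must be a string"}
        return 400, json.dumps(error).encode()

    # The field name "sampling_params" is an assumption made for this sketch.
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, response = handle_generate(StubEngine(), b'{"prompt": "Hello"}')
print(status, response.decode())
```

A handler written this way can be called from whichever HTTP framework you prefer, and it can be tested without starting a server.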



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Max and I am a software engineer and a programmer. I have been programming for many years and have experience with various programming languages including C++, Java, Python, PHP, and JavaScript. I also have experience with various operating systems including Windows, Linux, and macOS. I am familiar with database design and development, web development, and mobile app development. I have a strong passion for technology and innovation, and I am always looking for new challenges and opportunities to learn and grow. In my free time, I enjoy reading books on computer science and programming, as well as playing video games and watching movies.

I can help with a wide range of
Prompt: The president of the United States is
Generated text:  a title reserved for the head of the executive branch of the federal government. It is the highest office in the U.S. government, and the president serves as both the head of state and the head of government.
As hea
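
The loop above relies on `outputs` coming back in the same order as `prompts`, with each item exposing its completion under the `text` key. A small helper can make that pairing explicit and fail loudly on a length mismatch. This is only a sketch: `collect_results` and the hand-written `sample_outputs` are hypothetical and stand in for real engine results.

```python
def collect_results(prompts, outputs):
    """Pair each prompt with its completion, assuming matching order."""
    if len(prompts) != len(outputs):
        raise ValueError(
            f"got {len(outputs)} outputs for {len(prompts)} prompts"
        )
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Hand-written stand-ins shaped like the dicts printed above.
sample_prompts = ["Hello, my name is", "The capital of France is"]
sample_outputs = [{"text": " Max."}, {"text": " Paris."}]

for record in collect_results(sample_prompts, sample_outputs):
    print(record["prompt"] + record["completion"])
```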

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking and reading in my free time. I'm currently working on a novel, and I'm excited to see where my writing takes me. That's me in a nutshell. What do you think? Is it too short or too long? Should I add anything else?
Your self-introduction is concise and to the point. It provides a brief overview of who you are, what you do, and what you're interested in. It's a good starting point. However, it's a bit too short and lacks some

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and largest city of France, with a population of over 2.1 million people within its city limits. It is situated in the northern part of the country, along the Seine River. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa. The city is also famous for its fashion, cuisine, and romantic atmosphere, making it a popular destination for tourists and a hub for international business and culture. Paris has a rich history dating back to the 3rd century

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future holds, there are several trends that are likely to shape the development and impact of artificial intelligence in the coming years. Here are some possible future trends in AI:
1. Increased Adoption in Everyday Life: AI is becoming increasingly integrated into our daily lives, from virtual assistants like Siri and Alexa to self-driving cars and smart home devices. As AI technology improves, we can expect to see even more widespread adoption in areas such as healthcare, finance, and education.
2. Advancements in Machine Learning: Machine learning is a key component of AI, and we
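
The cell above mentions overlap removal. One plausible way to do this, not necessarily what `stream_and_merge` does internally, is to drop the longest prefix of each new chunk that already appears at the end of the text merged so far. The sketch below assumes each chunk is a dict with a `text` key; `trim_overlap`, `merge_stream`, and `fake_chunks` are names chosen for this illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate chunk texts while skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris is"},
    {"text": " is the capital."},
]
print(merge_stream(fake_chunks))
```

This is a heuristic: if a model legitimately repeats a string at a chunk boundary, the repeated part would be trimmed as well.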



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Marcus Thompson. I am a 25-year-old artist who currently resides in Seattle, Washington. I enjoy painting, drawing, and experimenting with various mediums to express my creativity. My work often explores themes of nature, identity, and the human experience.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It provides basic information about the character and his interests, but doesn't give any insight into his personality or motivations. A neutral self-introduction is useful for a character who is still being developed or for a character who is meant to be enigmatic.
Here are some suggestions for refining the self-introduction:


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country. It is situated on the Seine Ri
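
Because `async_generate` is awaited, several requests can be scheduled together with `asyncio.gather`. The sketch below shows the pattern with a hypothetical `StubAsyncEngine` so that it runs without a model. It assumes only the `async_generate(prompts, sampling_params)` call shape used above, and makes no claim about how the real engine schedules concurrent requests.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics the async call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # hand control back to the event loop once
        return [{"text": f" reply to: {prompt}"} for prompt in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Submit every batch at once and wait for all of them to finish."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


batches = [["a", "b"], ["c"]]
results = asyncio.run(
    generate_batches(StubAsyncEngine(), batches, {"temperature": 0.8})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

Note that `asyncio.run` raises `RuntimeError` when an event loop is already running, as is the case inside a Jupyter notebook; there, `await generate_batches(...)` can be used directly in a cell.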

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Clarissa. I am a 20-year-old student living in a small town in the United States. I am a psychology major at the local college and have a part-time job at a local diner. I work with my best friend, Alex. I am an only child and moved here from a larger city when I was 16. I enjoy playing music and hiking in my free time.
Next, write a short, personal self-introduction for Clarissa. Hi, I'm Clarissa. I'm a creative and driven person who's always looking for new experiences. I've moved around a lot in my life, and I've



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city of over 2 million people. Located in the north-central part of the country, Paris is situated along the Seine River, which flows through the heart of the city. With a rich history dating back to the Middle Ages, Paris is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Known for its fashion, cuisine, and art, Paris is one of the world’s most popular tourist destinations.
A key aspect of French culture is its emphasis on cuisine. French cuisine is known for its sophistication and elegance, with a focus on using high-quality ingredients



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation, but there are several potential trends that could shape the field in the coming years.
Artificial intelligence is rapidly evolving, and several trends are expected to shape the field in the coming years. Here are some possible future trends in AI:
1. Increased adoption in industries: AI is being used in various industries, including healthcare, finance, and transportation. As the technology improves, we can expect to see increased adoption in more industries, leading to improved efficiency and productivity.
2. Edge AI: With the proliferation of IoT devices, edge AI is becoming increasingly important. Edge AI refers to the processing of AI algorithms at the
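
The async loop above prints each cleaned chunk as it arrives. The sketch below shows one way an async generator could yield only unseen text, under the assumption that a stream delivers cumulative text (each chunk contains everything generated so far). `fake_stream` and `new_text_only` are hypothetical names, and this is not presented as the implementation of `async_stream_and_merge`.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in for a streaming call; yields cumulative text."""
    for text in [" Paris", " Paris is", " Paris is the capital."]:
        await asyncio.sleep(0)  # hand control back to the event loop once
        yield {"text": text}


async def new_text_only(stream):
    """Yield only the part of each chunk that has not been emitted yet."""
    emitted = ""
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(emitted):
            # Cumulative chunk: keep what follows the already-emitted prefix.
            delta = text[len(emitted):]
        else:
            # Otherwise treat the chunk as purely incremental text.
            delta = text
        emitted += delta
        if delta:
            yield delta


async def collect():
    return [piece async for piece in new_text_only(fake_stream())]


pieces = asyncio.run(collect())
print("".join(pieces))
```

The prefix check is a heuristic. If a stream is known to be strictly incremental, the chunks can be printed as they are.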




In [6]:
llm.shutdown()
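
The final cell calls `llm.shutdown()` explicitly. In a script, an exception raised earlier would skip that call, so it can help to tie the shutdown to a `try`/`finally` block. The sketch below wraps that in a context manager. `managed_engine` and `StubEngine` are hypothetical helpers written for this example, not part of SGLang; the only assumption about the engine is that it has a `shutdown()` method, as shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on the way out."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass

print(engine.is_shut_down)
```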