# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

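As a rough illustration of the custom-server idea, the sketch below separates request handling from the web framework: a plain function parses a JSON body, calls the engine, and returns a status code and a response body. `StubEngine`, `handle_generate`, and the `prompt` / `sampling_params` field names are hypothetical and exist only so the example runs without a GPU. It also assumes that a single-prompt `generate` call returns one dict with a `"text"` key, as in the outputs shown on this page. It is not the linked `custom_server` script.

```python
import json
from typing import Tuple


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Mimic the {"text": ...} output shape used in the examples on this page.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, raw_body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # ValueError covers malformed JSON and undecodable bytes;
        # KeyError/TypeError cover a missing field or a non-object payload.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


engine = StubEngine()
status, body = handle_generate(engine, b'{"prompt": "The capital of France is"}')
print(status, json.loads(body)["text"])
```

Keeping the handler free of framework code means it could be wrapped by whichever HTTP library you prefer, with the stub swapped for a real engine instance.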


## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ollie and I'm the lead singer and songwriter of the band Vulture Whale. I'm from a small town in southern England and started playing music at a young age. I've been writing songs for years but it wasn't until I met my bandmates that we decided to form Vulture Whale.
Our music is a blend of folk, rock and pop, with catchy melodies and lyrics that explore the human experience. We've been compared to artists such as Mumford & Sons and The Lumineers, but I think our sound is a bit more unique and experimental.
We've released two EPs so far, 'Stormy
Prompt: The president of the United States is
Generated text:  not a permanent position; he serves a four-year term, and his powers and responsibilities are limited by the Constitution. However, the president's role has expanded over time, and he has become one of the most influential and powerful people in the world. Here are some of the key aspects of the president's role:
1. Head of State and Govern

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell. What do you think? Is it too long or too short? Should I add or remove anything?
Your self-introduction is concise and to the point. It provides a good overview of who you are and what you do. However

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major hub for international business, culture, and tourism.
The best answer is: The capital of France is Paris. Paris is located in the northern part of the country and is situated on the Seine River. It

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a significant role in healthcare, from diagnosis and treatment to personalized medicine and patient care. AI-powered systems will analyze medical data, identify patterns, and make predictions, leading to more accurate diagnoses and better patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI will focus on developing AI systems that provide transparent and interpretable

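The "overlap removal" mentioned in the cell above can be pictured with a small, engine-free sketch. It assumes a stream in which a chunk may repeat text that was already received, and it keeps only the part of each chunk that does not duplicate the end of the merged text so far. The names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative; this is not presented as the implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that overlaps the merged text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which every chunk repeats all text produced so far.
cumulative = [" Paris", " Paris is", " Paris is the capital"]
print(repr(merge_chunks(cumulative)))

# Hypothetical stream of pure deltas; nothing overlaps, so nothing is trimmed.
deltas = [" Paris", " is", " the capital"]
print(repr(merge_chunks(deltas)))
```

One limitation of this heuristic is that a chunk which legitimately begins with the text that ends the previous output (for example `"ha"` followed by `"ha"`) is treated as an overlap and trimmed.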


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Olivia Mae Saunders. I'm a 23-year-old graphic designer currently based in New York City. I enjoy trying out new restaurants, taking long walks in Central Park, and practicing yoga to manage stress. I'm a bit of a creative introvert, preferring quieter nights spent reading or sketching. I'm looking forward to connecting with like-minded individuals and collaborating on projects that bring value to my community. Here is a more detailed, personal introduction to the character of Olivia Mae Saunders:
Olivia Mae Saunders is a 23-year-old graphic designer with a passion for creativity and community engagement. Born and raised in a small town in upstate

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. What does Paris represent in the world? Paris represents one of the most iconic cities in the world. It is kn

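Because `async_generate` is awaited, it can be combined with other coroutines in the usual `asyncio` ways. The sketch below shows one such pattern, issuing one request per prompt and collecting the results with `asyncio.gather`. `StubAsyncEngine` and `generate_many` are hypothetical and only mimic the awaitable call shape so the example runs without a model. Whether separate concurrent calls are preferable to a single batched call with the real engine is not something this sketch establishes.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics the awaitable call shape."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return {"text": f" <completion for: {prompt}>"}


async def generate_many(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them.
    # gather() returns results in the same order as its arguments.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(
    generate_many(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Since `gather` preserves argument order, zipping the prompts with the outputs pairs each prompt with its own result even if the requests finish in a different order.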
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Liang Chen. I’m a 25-year-old anthropologist. I work as a research assistant at a university and am studying the cultural and social implications of urbanization in Asia. What would you like to know? “Okay, so you’re studying cities. That’s really interesting,” they say with a skeptical tone. “Not really,” I reply dryly. “It’s just what I’m good at. So, what about you?”
At this point, I can ask a question, continue describing Liang Chen's character, or end the introduction.
ask a question
continue describing Liang Chen's character
end the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. (1 sentence)
What are the most notable characteristics of the city?
Paris, the capital of France, is the most populous city in the country and a global hub for art, fashion, cuisine, and culture. It is known for its beautiful architecture, museums, and historical landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also famous for its romantic atmosphere, picturesque streets, and parks.
What are the key sectors that contribute to the city’s economy?
The economy of Paris is driven by various sectors, including:
Tourism: Paris is one of the most visited

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and encompasses numerous areas. According to different sources, some trends that are expected in AI are:
Increased Adoption of AI in Various Industries
Growing Importance of Explainability and Transparency in AI
Advancements in Edge AI and IoT
Rise of Hybrid Intelligence
Shift from Rule-Based to Explainable AI
Growing Use of Machine Learning and Deep Learning
Use of AI for Healthcare
AI-Powered Cybersecurity
Ethics and Fairness in AI
Increased Adoption of AI in Various Industries
As AI technology continues to advance and becomes more affordable, its adoption across various industries is expected to grow. We can see the effects of AI already being


In [6]:
llm.shutdown()