# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script is available at [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine in a Python script, you must guard the entry point with** `if __name__ == "__main__":` **because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
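
The reason for the guard can be illustrated with the standard library alone. The sketch below is not SGLang code: `worker` and `launch` are hypothetical names, and `multiprocessing` stands in for whatever the engine does internally. It only assumes what the warning states, namely that subprocesses are created in `spawn` mode, in which each child starts a fresh interpreter and re-imports the main module.

```python
import multiprocessing as mp


def worker(name: str) -> str:
    """Hypothetical stand-in for work done in a child process."""
    message = f"hello from {name}"
    print(message)
    return message


def launch() -> int:
    # With the "spawn" start method the child re-imports this module,
    # so module-level code must not start processes unconditionally.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=worker, args=("child",))
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    # Only the original script reaches this branch; a spawned child
    # imports the module under a different name and skips it.
    print("child exit code:", launch())
```

In a real script, the call that constructs `sgl.Engine(...)` would take the place of `launch()` inside the guarded block.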

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Katie and I am thrilled to be joining the community here at Healthy Fit!
I am a wife, mother of three beautiful children, and a passionate health coach. My personal journey with food and fitness started over 10 years ago when I realized the impact it had on my overall well-being. I made the decision to change my lifestyle and start prioritizing my health, which led me to become a certified health coach.
Throughout the years, I have experienced firsthand the struggles of finding balance and creating sustainable habits that work for busy families. I believe that every individual deserves the opportunity to feel confident, energetic, and empowered to take control of their health
Prompt: The president of the United States is
Generated text:  a highly influential position that affects the lives of millions of people worldwide. However, the role of the president has changed significantly over time, with some arguing that it has become too powerful a
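
For an offline batch job, the generated text usually needs to be stored next to its prompt. The following is a minimal sketch, assuming only what the cell above shows: each element of `outputs` is a dict with a `"text"` key. The helper name `save_results`, the JSON Lines layout, and the stand-in data are illustrative choices, not part of SGLang.

```python
import json
from pathlib import Path


def save_results(prompts, outputs, path):
    """Write one JSON object per prompt/completion pair (JSON Lines)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in data shaped like the result of llm.generate(...) above.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
```

With a running engine, this would be called as `save_results(prompts, outputs, "results.jsonl")` right after `llm.generate`.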

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student living in a small town in Japan. I enjoy reading, playing the guitar, and spending time with my friends. I'm a bit of a daydreamer and often get lost in my own thoughts. I'm not really sure what I want to do with my life yet, but I'm trying to figure it out.
This is a good start, but it's a bit too long and could be more concise. Here's a revised version: Hi, I'm Kaida. I'm a 17-year-old high school student from a small town in Japan. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also a major center for business, finance, and tourism. Paris is a popular destination for visitors from around the world, attracting over 23 million tourists each year. The city has a population of over 2.1 million people and is a hub for international relations, education, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, with applications in medical diagnosis, personalized medicine, and patient care.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need for transparency and explainability in AI decision-making. XAI will become increasingly important to build trust in AI systems.
3. Growing importance of human-AI collaboration: As AI becomes more capable, humans and
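
The phrase "overlap removal" refers to merging streamed chunks so that text repeated at a chunk boundary appears only once. The sketch below shows one way to do this with plain string operations. It is an illustration of the idea, not necessarily how `stream_and_merge` is implemented, and `merge_chunk` and `merge_stream` are hypothetical names.

```python
def merge_chunk(text_so_far: str, new_chunk: str) -> str:
    """Append new_chunk, dropping any prefix of it that already
    appears at the end of text_so_far."""
    max_overlap = min(len(text_so_far), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if text_so_far.endswith(new_chunk[:size]):
            return text_so_far + new_chunk[size:]
    return text_so_far + new_chunk


def merge_stream(chunks) -> str:
    """Fold a sequence of possibly overlapping chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged = merge_chunk(merged, chunk)
    return merged


print(merge_stream(["The capital", " capital of", " of France"]))
# prints: The capital of France
```

A purely textual heuristic like this cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it fits best when the stream is known to resend trailing text.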



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jaxon. I'm a 22-year-old computer science major at a local university. I spend most of my free time coding and working on personal projects. When I'm not studying or coding, I like to read and listen to music.
Here are a few things that I would change:
* "local university" is vague - I would replace it with the name of the university, or with a descriptive phrase that tells us a bit more about the school (e.g. "a small liberal arts college", "a mid-sized public university", etc.)
* "I spend most of my free time" is a bit of a clich

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Located in the north-central part of the country, Paris is often called the City of Light due to its historical association with the Enlightenment and its cultural and artistic significance. As the country’s largest city, Paris
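
When requests arrive independently, for example in a custom server, the same coroutine style can be used to issue one call per prompt and await them together. The sketch below uses a hypothetical `FakeEngine` in place of a real engine so that it runs without a model. It assumes only that `async_generate` is awaitable and returns a dict with a `"text"` key, as in the cell above. Whether this is preferable to passing the whole list in one call depends on the workload.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine exposing async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_all(FakeEngine(), ["a", "b"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```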

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astra Slade. I’m a skilled astrophysicist and explorer who has traveled extensively throughout the galaxy. My work focuses on understanding the mysteries of the cosmos, particularly black holes and dark matter. When I’m not studying the stars, I enjoy piloting my custom-built spaceship, the Celestial Quest. I’m passionate about uncovering new discoveries and pushing the boundaries of human knowledge.

This is a neutral self-introduction because it:

* Provides basic information about the character
* Avoids expressing personal opinions or biases
* Fails to mention any emotional or sentimental aspects
* Focuses on professional and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the north-central part of the country. Paris is known for its rich history and is home to many famous landmarks, such as the Eiffel Tower and Notre Dame Cathedral. It is a major cultural and economic center in Europe. Paris is also known for its fashion, art, and cuisine. It has a population of over 2.1 million people, but the greater metropolitan area has a population of over 12 million people. Paris is a popular tourist destination and is often referred to as the “City of Light.” Key Facts: Capital of France: Paris Location: North-central France Population: Over

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but experts have made several predictions and observations that offer insight into what might be in store. Here are some possible future trends in AI:
1. Integration with the Internet of Things (IoT): AI will be deeply integrated with the Internet of Things, enabling machines to interact with each other and the physical world in a seamless way.
2. Increased use of Explainable AI (XAI): As AI systems become more complex, there will be a growing need to explain their decisions and actions, leading to the development of more transparent and explainable AI.
3. Rise of Edge AI: With the proliferation



In [6]:
llm.shutdown()
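
In a script, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps the call in a context manager. `managed_engine` and `DummyEngine` are hypothetical names used only to demonstrate the pattern; the only assumption about the real engine is that it has the `shutdown()` method called above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in used only to demonstrate the pattern."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


dummy = DummyEngine()
with managed_engine(dummy) as llm_like:
    pass  # with a real engine, call llm_like.generate(...) here
print(dummy.is_shut_down)
```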