# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the** `if __name__ == "__main__":` **guard is required, because we use** `spawn` **mode to create subprocesses. Please refer to this simple example:**

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

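The reason for the guard can be seen with the standard library alone. The sketch below is not SGLang code: it uses Python's `multiprocessing` module with the `spawn` start method, and `launch_child` is a hypothetical helper introduced only for illustration. Under `spawn`, the child process re-imports the launching script, so process creation left at module level would run again inside the child and fail.

```python
import multiprocessing as mp
from typing import Optional


def launch_child() -> Optional[int]:
    """Start one child process with the spawn start method and return its exit code."""
    ctx = mp.get_context("spawn")
    # The built-in `print` is used as the target so the sketch needs no helper module.
    proc = ctx.Process(target=print, args=("hello from the spawned child",))
    proc.start()
    proc.join(timeout=60)
    return proc.exitcode  # None if the child has not finished within the timeout


# With spawn, the child re-imports this file. Without the guard below, that
# import would call launch_child() again inside the child and raise RuntimeError.
if __name__ == "__main__":
    print("child exit code:", launch_child())
```

The same placement applies to SGLang scripts: construct `sgl.Engine(...)` inside the guarded block, or in a function called from it, rather than at module level.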
## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
# Only the imports used by the cells in this document are kept.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.02s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.49it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.20it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Gaurav and I am a sixth grader in a local school here. I am a member of the school's science club, and I am also part of the school's robotics team. I enjoy coding, robotics, and science in general. I have also been learning about sustainability and environmental conservation for some time now.
I have always been passionate about making a difference in my community and the world at large. I have been organizing small campaigns and initiatives to promote sustainability and environmental conservation in my school and local community.
Now, I would like to take this passion to the next level and make a larger impact. I am planning to start
Prompt: The president of the United States is
Generated text:  the most powerful person in the world. For better or for worse, the president is the leader of the free world and is responsible for making many of the most important decisions that affect the lives of hundreds of millions of people.
But have you eve

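For batch jobs it is often convenient to store each prompt next to its completion. The following is a minimal sketch, assuming (as the cell above suggests) that the engine returns one dict per prompt, in prompt order, with the generated text under the `"text"` key. The helper name `to_jsonl` and the record field names are hypothetical, and the demo data is a stand-in rather than real engine output.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict]) -> str:
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]
print(to_jsonl(demo_prompts, demo_outputs))
```

With a real engine, `demo_outputs` would be replaced by the list returned from `llm.generate(prompts, sampling_params)`.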
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm looking forward to meeting new people and learning more about their experiences. That's me in a nutshell. How would you rate this self-introduction? Here's my rating: 8/10. I'd give it a 9/10 if it included a bit more personality and a more specific interest or hobby. What do you think? Do you have any suggestions for improvement? I think the introduction is clear and concise,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a rich history dating back to the 3rd

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for AI systems to be transparent and explainable. This will involve developing AI systems that can provide clear explanations for their decisions and actions.
3

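The idea of "overlap removal" mentioned in the cell above can be illustrated without the engine. This is a sketch of one possible approach and not SGLang's implementation of `stream_and_merge`: it assumes that a new chunk may begin by repeating the tail of the text received so far, and it drops that repeated prefix. The names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping any prefix already present at the end."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


demo_chunks = ["The capital", "capital of France", " is Paris."]
print(merge_chunks(demo_chunks))
```

A purely textual check like this cannot tell an accidental overlap from genuinely repeated output (for example `"ha"` followed by `"ha"`), so it is only suitable when chunks are known to overlap.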


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lee Sparrow. I'm a 25-year-old artist, currently living in a small apartment in the city. I enjoy sketching, painting, and playing music in my free time. I'm just trying to make a name for myself in the art world. That's all for now. 1 0
That's a great start! To take your self-introduction to the next level, consider adding a bit more personality and flair. Here are some suggestions:

*   Add a few more details about your personality, interests, or hobbies that make you unique. For example, you might mention that you're a bit of a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
It is the capital and most populous city of France, with an area of 1,130.6 km2 and a population of approximately 2,161,000.
The name “Paris” is derived from the ancient Celtic name “Lutetia Parisiorum”, which was later modified

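An awaitable generation call can be combined with other coroutines, because the event loop is free to run other work while the call is pending. The sketch below shows the general `asyncio.gather` pattern with a stand-in coroutine. `fake_async_generate` is hypothetical and only imitates the shape of the result (one dict with a `"text"` key per prompt); it says nothing about how the real engine schedules concurrent calls.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompts: List[str]) -> List[Dict]:
    """Stand-in for an awaitable batch call; returns one dict per prompt."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(batches: List[List[str]]) -> List[List[Dict]]:
    """Await several batch calls concurrently; results keep the order of `batches`."""
    return await asyncio.gather(*(fake_async_generate(b) for b in batches))


demo_batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
demo_results = asyncio.run(run_batches(demo_batches))
print([len(result) for result in demo_results])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list can be matched back to its batch by position.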
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elara Vex. I'm a skilled engineer and inventor from the planet Xeridia. I have a passion for creating innovative solutions to complex problems and a keen eye for detail. Outside of work, I enjoy tinkering with gadgets and exploring the surrounding environment. What do you think? Here's a revised version: Hello, I'm Elara Vex, a Xeridian engineer and inventor. I design and build innovative solutions to challenging problems, driven by a passion for precision and a love of tinkering. In my free time, I enjoy exploring new territories and experimenting with new technologies. I think this revised version is a bit

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is situated in the north-central part of the country, near the River Seine. Paris is known for its iconic landmarks like the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city has a population of approximately 2.1 million people. Paris is a major cultural and economic center, attracting tourists and businesses from around the world. The city is also home to many educational institutions, including the Sorbonne University and the École Polytechnique. Overall, Paris is a unique blend of history, culture, and modernity. Paris is known for its fashion, cuisine, and romantic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advances in machine learning, natural language processing, and computer vision.
Artificial intelligence (AI) is a rapidly evolving field that is expected to continue shaping our lives in the coming years. As AI technology advances, we can expect to see significant changes in various industries and aspects of our society. Here are some possible future trends in AI:
1. Increased Use of Machine Learning:
Machine learning is a subset of AI that enables machines to learn from data and improve their performance over time. As AI technology advances, we can expect to see increased use of machine learning in various applications, such as:
Automated decision-making:

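The `async for` loop in the cell above consumes chunks as they arrive. The following sketch shows the same consumption pattern with a stand-in async generator, so it runs without a model. `fake_stream` and `collect` are hypothetical names, and the way the text is split into chunks here is arbitrary and not how the engine produces them.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Stand-in async generator that yields `text` one word at a time."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        # Re-attach the separating space so the chunks concatenate to `text`.
        yield word if index == 0 else " " + word


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    pieces: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # newline after the stream ends
    return "".join(pieces)


demo_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Keeping the chunks in a list while printing them gives both the incremental display and the full text at the end, without a second generation call.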

In [6]:
llm.shutdown()