# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
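As a rough illustration of the custom-server idea (not the linked `custom_server` example), the sketch below wraps an engine-like object in a standard-library HTTP handler. `EchoEngine` is a hypothetical stand-in for `sgl.Engine` so the code runs without a model, and the `/generate` route and the JSON body shape (`{"text": ..., "sampling_params": ...}`) are assumptions chosen for the sketch.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine: it echoes the prompt back."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, raw_body: bytes) -> dict:
    # Parse the JSON request and forward the prompt to the engine.
    payload = json.loads(raw_body)
    output = engine.generate(payload["text"], payload.get("sampling_params", {}))
    return {"text": output["text"]}


def make_handler(engine):
    # Build a handler class that closes over a single shared engine instance.
    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            result = handle_generate(engine, self.rfile.read(length))
            body = json.dumps(result).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    # Blocks until interrupted.
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

With a real engine you would call something like `serve(sgl.Engine(model_path=...))` inside an `if __name__ == "__main__":` block. Keeping the request logic in `handle_generate` makes it testable without opening a socket; a production server would also need error handling and concurrency that this sketch omits.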

## Special Warning

**To launch the offline engine in your Python scripts, put the launch code under an** `if __name__ == "__main__":` **guard, because the engine creates its subprocesses in** `spawn` **mode.** Please refer to this simple example:

[launch_engine.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py)
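The guard matters because `spawn`-mode child processes re-import the main module, so any module-level code, including engine construction, would run again in each child. A minimal sketch of the recommended script layout follows. `StubEngine` is a hypothetical stand-in for `sgl.Engine` so the sketch runs without a GPU; in a real script you would pass `sgl.Engine` as `engine_cls`.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


def main(engine_cls=StubEngine):
    # In a real script: import sglang as sgl, then call main(sgl.Engine).
    llm = engine_cls(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        return llm.generate(["The capital of France is"], {"temperature": 0.8})
    finally:
        llm.shutdown()  # always release the engine, even if generation fails


# Engine construction lives inside main(), so a child process that
# re-imports this module does not launch another engine.
if __name__ == "__main__":
    print(main())
```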

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # only imported when running in CI

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

```text
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.22it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Submit the whole batch in one call; the engine returns one output dict per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Hannah, and I'm a Christian who is passionate about making a difference in the lives of others. I'm excited to share with you my journey and the ways in which I'm striving to make a positive impact in my community.
I'm a bit of a hopeless romantic, and I believe that everyone deserves to experience the love and kindness of God. As a Christian, I strive to live out my faith in practical ways, whether that's through serving others, volunteering my time, or simply being a good friend.
One of my greatest passions is empowering others to live their best lives. I believe that every person has a unique purpose and calling
Prompt: The president of the United States is
Generated text:  proposing a tax on imported solar panels, which has raised eyebrows among those who support renewable energy and the development of domestic solar manufacturing.
The proposed tax, which would be part of a broader package of tariffs on foreign imports, would target solar
```
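The `temperature` and `top_p` entries in `sampling_params` follow the usual conventions for these knobs. As a generic illustration only (this is not SGLang's sampler, and `temperature_top_p` is a hypothetical name), the sketch below applies temperature scaling and then nucleus (top-p) filtering to a toy list of logits: it keeps the smallest set of most-likely tokens whose cumulative probability reaches `top_p` and renormalizes over that set.

```python
import math


def temperature_top_p(logits, temperature=0.8, top_p=0.95):
    """Return a renormalized {token_index: probability} dict after temperature + top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filtering: take tokens from most to least likely until the
    # cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

With logits `[2.0, 1.0, 0.0, -5.0]`, `temperature=1.0`, and `top_p=0.9`, the two most likely tokens already cover about 91% of the mass, so only indices 0 and 1 survive.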

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge (from sglang.utils) streams the completion and returns
    # the merged text with overlapping chunks removed
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist and can get stressed out when things don't go according to plan. I'm a bit of a introvert and prefer to spend time alone or with close friends. I'm not really sure what I want to do with my life yet, but I'm hoping to figure that out in college. That's me in a nutshell. What do you think? Is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris. This is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a starting point for further discussion or exploration of the topic. The statement is also accurate and reliable, as it is a widely accepted and verifiable fact. Overall, this statement is a good example of a concise factual statement about France’s capital city. The statement is also neutral and objective, as it does not express any opinion or bias. It simply presents a factual piece of information,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of autonomous vehicles: Autonomous vehicles are expected to become more common, with AI playing a key role in navigation, safety, and decision-making. Self-driving cars will be able to navigate complex road networks, avoid
```
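`stream_and_merge` is imported from `sglang.utils`; the sketch below shows one way such overlap removal can work and is not necessarily the library's exact code. `trim_overlap` finds the longest suffix of the text emitted so far that is also a prefix of the new chunk and drops it, and `merge_stream` (a hypothetical name) folds a sequence of chunks into one string. A plain list of strings stands in for the engine's stream so the sketch runs without a model; with a real engine you would feed it the `"text"` field of each streamed output, assuming `llm.generate(prompt, sampling_params, stream=True)` yields such dicts.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks: Iterable[str]) -> str:
    # Hypothetical helper: append only the genuinely new part of each chunk.
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_stream(["Hello", "Hello world", " world!"]))  # Hello world!
```

One trade-off of this heuristic: a chunk that legitimately starts by repeating the tail of the text is trimmed too (`["ha", "ha"]` merges to `"ha"`), so it suits streams that resend overlapping text better than streams of pure deltas.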



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; async_generate returns one output dict per prompt.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaitlyn Jae Jackson. I'm 22 years old and a psychology major at California State University, Long Beach. I'm from Bakersfield, California, and I'm working part-time as a waitress while I'm in college. I like to read, watch movies, and try new restaurants. I'm looking for someone to share my life experiences with. I'm a bit of a hopeless romantic. I value honesty and loyalty above all else. I'm excited to meet new people and make friends.
Jenette
Hello, I'm Jenette. I'm 25, a graduate student in creative writing at the University of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Short Answer Question: What is the capital of France? The capital of France is Paris.
Multiple Choice Question: What is the capital of France? A) London, B) Paris, C) Rome, D) Berlin. The correct answer is B) Paris. Multiple Ch
```
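Because `async_generate` is awaited, it can be combined with other awaitables, for example several batches awaited together with `asyncio.gather`. The sketch below shows that pattern, assuming the engine accepts concurrent requests; `FakeAsyncEngine` is a hypothetical stand-in whose `async_generate` only sleeps briefly and echoes its prompts, and `generate_batches` is likewise a hypothetical helper.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine's async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the engine
        return [{"text": f" <{p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    # Start every batch at once; gather returns results in input order.
    results = await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )
    return [out["text"] for batch_out in results for out in batch_out]


texts = asyncio.run(
    generate_batches(FakeAsyncEngine(), [["a", "b"], ["c"]], {"temperature": 0.8})
)
print(texts)  # [' <a>', ' <b>', ' <c>']
```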

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Lila Vex, and I'm a 20-year-old student at the University of California, Berkeley.
What type of person is Lila Vex?
From the neutral self-introduction, we can gather that Lila Vex is:
1. Young (20 years old)
2. Educated (student at the University of California, Berkeley)
3. A student at a prestigious university

4. Possibly introverted or reserved ( neutral tone)
5. Focused on her studies (emphasis on her education)
Answer: Lila Vex is a young, educated, and possibly introverted student who is focused on her studies.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern-central part of the country. Paris is a global center for art, fashion, and cuisine, and is home to the famous Eiffel Tower. The city is divided into 20 arrondissements (districts) and has a population of around 2.1 million people. Paris is a major tourist destination and is known for its iconic landmarks, museums, and cultural events. The city is also home to many educational institutions, including the University of Paris and the École Polytechnique. Paris has a diverse economy, with major industries in finance, manufacturing, and tourism. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving, and there are several trends that are likely to shape the field in the coming years. Here are some possible future trends in artificial intelligence:
1. Increased Adoption of Edge AI: With the proliferation of IoT devices, there will be a growing need for AI to be deployed at the edge, closer to the source of the data. This will enable faster processing, reduced latency, and improved security.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions. Explainable AI will become increasingly important to build trust and ensure accountability in AI decision
```
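`async_stream_and_merge` is imported from `sglang.utils`. A minimal sketch of an async generator applying the same overlap-trimming idea follows; it is an assumption about the approach, not a copy of the library. `fake_stream` is a hypothetical async source standing in for the engine's streaming output, and `cleaned_chunks` yields only text that has not been emitted yet, which is what lets the caller print each piece as it arrives.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(chunks) -> AsyncIterator[str]:
    # Hypothetical source that yields text chunks the way a streaming engine might.
    for chunk in chunks:
        await asyncio.sleep(0)
        yield chunk


async def cleaned_chunks(stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the part of each chunk that was not already emitted."""
    emitted = ""
    async for chunk in stream:
        # Longest suffix of `emitted` that is also a prefix of `chunk`.
        overlap = 0
        for size in range(min(len(emitted), len(chunk)), 0, -1):
            if emitted.endswith(chunk[:size]):
                overlap = size
                break
        new_text = chunk[overlap:]
        if new_text:
            emitted += new_text
            yield new_text


async def collect(chunks):
    return [piece async for piece in cleaned_chunks(fake_stream(chunks))]


print(asyncio.run(collect(["Paris", "Paris is", " is the capital."])))
# ['Paris', ' is', ' the capital.']
```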




```python
# Shut down the engine when finished
llm.shutdown()
```