# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
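
To make the idea of a custom server concrete, here is a minimal sketch that uses only the Python standard library. It is not the approach taken in the linked `custom_server` example, which remains the reference. `EchoEngine`, `make_handler`, `start_server`, and the `/generate` route are hypothetical names introduced for illustration; `EchoEngine` stands in for `sgl.Engine` so the sketch runs without a GPU, and it assumes the engine returns a dict with a `"text"` key for a single prompt.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed output shape: a dict with a "text" key, as used in this document.
        return {"text": f" [completion for: {prompt}]"}


def make_handler(engine):
    """Build a request handler class bound to one shared engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404, "unknown route")
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            result = engine.generate(
                payload.get("prompt", ""), payload.get("sampling_params", {})
            )
            body = json.dumps({"text": result["text"]}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # Silence access logs; delete this override to restore them.

    return GenerateHandler


def start_server(engine, host="127.0.0.1", port=0):
    """Serve requests on a background thread; port=0 lets the OS pick a free port."""
    # HTTPServer handles one request at a time, so the engine is never called
    # from two threads at once.
    server = HTTPServer((host, port), make_handler(engine))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server(EchoEngine())` returns the server object; its bound port is `server.server_address[1]`, and `server.shutdown()` stops it. Swapping in a real engine would require checking that its `generate` call is safe to invoke from the server thread.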



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.15it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.03it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.16it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily and I am a 22-year-old recent graduate with a degree in Psychology. I have always been passionate about helping others and working with children. After completing my degree, I decided to pursue a career in early childhood education to make a positive impact on young lives. I am excited to be joining the Elanora team and look forward to working with the children and families at this wonderful centre.
Outside of work, I enjoy spending time with my family and friends, trying out new recipes in the kitchen, and practicing yoga. I am a strong advocate for children's rights and education, and I am committed to providing a safe, engaging
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves a four-year term and is responsible for executing the laws of the land. The president is also the commander-in-chief of the armed forces and has the power to veto la
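
Offline batch runs usually end by saving the results. The following is a small sketch, assuming only the output shape shown above (a list of dicts with a `"text"` key, in the same order as the prompts). `write_results_jsonl` is a hypothetical helper, and the outputs here are hand-written stand-ins rather than real model output.

```python
import io
import json


def write_results_jsonl(prompts, outputs, fp):
    """Write one JSON object per line pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the list returned by llm.generate(prompts, ...) above.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
write_results_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```

In a real run, `fp` would be a file opened in text mode and the arguments would be the `prompts` list and the `outputs` returned by the engine.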

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. That's me in a nutshell. What do you think? Is there anything you'd like to add or change? I think it's a good start, but it's a bit too straightforward and lacks a bit of personality. Let's try to add a bit more flair to it. Here are a few suggestions: 
1. Add a bit of humor: "I'm Kaida, a 25-year-old freelance writer who's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris. The city is located in the northern part of the country and is situated on the Seine River. Paris is known for its rich history, cultural landmarks, and romantic atmosphere. It is home to many famous museums, such as the Louvre and the Orsay, as well as iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral. The city is also a major hub for fashion, art, and cuisine, and is a popular destination for tourists from around the world. Paris is a city that embodies the spirit of France, with its stunning architecture, vibrant culture, and charming streets

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in industries: AI is likely to become more prevalent in various industries, including finance, transportation, and education. AI-powered systems may be able to automate tasks, improve efficiency, and enhance decision-making.
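
The "overlap removal" mentioned above can be pictured as follows. This is a sketch of the general idea, not necessarily how `stream_and_merge` is implemented: when a streamed chunk begins with text that the merged result already ends with, only the new suffix is appended. `strip_overlap`, `merge_stream`, and the fake chunks are hypothetical names and data for illustration.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Merge an iterable of {"text": ...} chunks into one string without repeats."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Fake chunks that partially repeat what was already received.
fake_stream = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital."}]
print(repr(merge_stream(fake_stream)))  # ' Paris is the capital.'
```

One limitation of this heuristic is that genuinely repeated output (for example, a chunk `"ha"` arriving after text that already ends in `"ha"`) is indistinguishable from overlap and would be dropped.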




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Asteria Frost. I'm a... (wait, don't tell me I have to decide on a specific job or role just yet, can I just get to know you first?) I'm a fairly curious and somewhat reclusive person who enjoys the quiet, peaceful moments in life. I love the snow and the stars, and I'm often found lost in thought, trying to make sense of the world around me. I'm easy to talk to once you get to know me, though, and I'm here to listen and learn from you.
What's your name, and what brings you to this place? Where are you from,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The sentence is grammatically correct and follows the standard sentence structure of subject-verb-object (SVO). The word order is: the capital (S) of France (O) is (V) Paris (O). The capital is in the nominative case, France is in the genitive case, i
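
The example above passes the whole prompt list in a single awaited call. When requests arrive independently instead (for example, inside a custom server), one option is to issue one awaited call per prompt and cap how many are pending at once. The sketch below illustrates that pattern with `asyncio.Semaphore` and `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in so the code runs without a GPU; it assumes an `async_generate` coroutine that accepts a single prompt and returns a dict with a `"text"` key.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # Simulate inference latency.
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, keeping at most max_in_flight pending."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with gate:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
demo_results = asyncio.run(
    generate_all(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, demo_results):
    print(prompt, "->", result["text"])
```

Because the engine performs its own scheduling, a cap like `max_in_flight` is mainly useful for bounding memory or applying backpressure in the calling application.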

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaelin Darkhaven. I'm a 25-year-old former soldier, currently working as a freelance mercenary. I've got skills in hand-to-hand combat, marksmanship, and tactical strategy. I'm available for hire and willing to take on a variety of jobs. That's about it.
Kaelin Darkhaven is a 25-year-old former soldier turned freelance mercenary. He has skills in hand-to-hand combat, marksmanship, and tactical strategy. He is available for hire and willing to take on a variety of jobs. That's about it. Kaelin Darkhaven is a skilled mercenary with



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about France’s climate. France has a temperate maritime climate, with mild winters and cool summers.
Provide a concise factual statement about France’s cuisine. French cuisine is known for its rich flavors, elaborate preparation methods, and high-quality ingredients, with popular dishes including escargots, Coq au Vin, and Bouillabaisse.
Provide a concise factual statement about France’s culture. French culture is known for its rich artistic and literary heritage, with famous artists such as Claude Monet, Pierre-Auguste Renoir, and Paul Cézanne, and famous writers such as Victor Hugo, Gust



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  one of the most discussed and anticipated topics in the tech industry. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a significant role in healthcare in the future, particularly in diagnosis, treatment, and patient care. AI algorithms can help analyze medical images, identify patterns, and make predictions about patient outcomes.
2. Advancements in natural language processing: NLP is a subset of AI that deals with the interaction between computers and humans using natural language. Future advancements in NLP could enable AI systems to have more human-like conversations, understand nuances of language, and learn from feedback
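
The loop above streams one prompt at a time. If several streams need to be consumed concurrently, printing chunks directly would interleave text from different prompts, so a common pattern is to accumulate each stream separately. The sketch below shows that pattern with `asyncio.gather`. `fake_stream` is a hypothetical stand-in for a per-prompt async chunk stream and does not call the engine.

```python
import asyncio


async def fake_stream(pieces, delay=0.01):
    """Hypothetical stand-in for a per-prompt async stream of text chunks."""
    for piece in pieces:
        await asyncio.sleep(delay)
        yield piece


async def collect(prompt, stream):
    """Accumulate one stream's chunks into the full text for that prompt."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return prompt, "".join(parts)


async def consume_all(streams_by_prompt):
    """Drain all streams concurrently and return {prompt: merged_text}."""
    tasks = [collect(prompt, stream) for prompt, stream in streams_by_prompt.items()]
    return dict(await asyncio.gather(*tasks))


demo_streams = {
    "The capital of France is": fake_stream([" Paris", "."]),
    "Hello, my name is": fake_stream([" K", "ael", "in"]),
}
merged_by_prompt = asyncio.run(consume_all(demo_streams))
print(merged_by_prompt)
```

With a real engine, each value in `demo_streams` would be replaced by whatever async iterator yields that prompt's chunks, such as the one returned by `async_stream_and_merge` in the cell above.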




In [6]:
llm.shutdown()