# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

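## Special Warning

**When you launch the offline engine from a Python script, the** `if __name__ == "__main__":` **guard is required.** SGLang uses the `spawn` start method to create subprocesses, and `spawn` re-imports the main module in every child process, so unguarded top-level code would run again in each child. Please refer to this simple example: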

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
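
The linked script is the authoritative example. As a minimal sketch of the same guard, assuming a hypothetical script named `launch_sketch.py` that takes the model path from the command line (the `main` function and the argument handling are illustrative, not part of SGLang), engine creation lives inside a function that only runs when the file is executed directly:

```python
import sys


def main(model_path: str) -> None:
    # Imported inside the function so that importing this file stays cheap.
    import sglang as sgl

    llm = sgl.Engine(model_path=model_path)
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        for output in outputs:
            print(output["text"])
    finally:
        # Stop the engine even if generation raises.
        llm.shutdown()


# Child processes started with `spawn` re-import this file. The guard keeps
# them from running the engine launch code again.
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python launch_sketch.py <model_path>")
    else:
        main(sys.argv[1])
```

Running `python launch_sketch.py meta-llama/Meta-Llama-3.1-8B-Instruct` would then launch the engine once, in the parent process only.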

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine. Only the imports used in this document are kept.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.20it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kaito Mabe, but you can call me Kaito. I'm a composer and producer, and I'm from Japan. My music style is a mix of electronic, house, and pop, with a strong emphasis on emotional and melodic elements.
I've been composing music for several years, and I've released several songs on various platforms such as SoundCloud, Bandcamp, and YouTube. My music has been featured in several anime and video game projects, and I've also collaborated with other artists and producers on various projects.
I'm currently working on a new album, and I'm excited to share it with the world
Prompt: The president of the United States is
Generated text:  essentially the chief executive of the country, responsible for overseeing and implementing the policies of the government. With the responsibility of shaping the country's future comes the power to significantly impact the lives of millions of people. President-elect Donald Trump has been vocal about his plans to refor
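
For offline jobs it is often convenient to keep each prompt together with its generated text. The sketch below wraps the `generate` call shown above in a small helper and serializes the result as JSON Lines. `run_batch`, `to_jsonl`, and `EchoEngine` are hypothetical names introduced for illustration. The stand-in engine only mimics the list-of-dicts-with-`"text"` shape seen in this section, so the example runs without a model.

```python
import json
from typing import Any


def run_batch(engine: Any, prompts: list[str], sampling_params: dict) -> list[dict]:
    """Generate for all prompts in one call and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


class EchoEngine:
    """Stand-in used only for this demo; it is not part of SGLang."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


records = run_batch(EchoEngine(), ["Hello, my name is"], {"temperature": 0.8})
print(to_jsonl(records))
```

In the notebook, passing `llm` instead of `EchoEngine()` should work the same way, assuming the output shape shown above.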

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell.
Kaida is a 25-year-old freelance writer living in Tokyo. She enjoys reading, trying new foods, and practicing yoga. She is currently working on a novel and trying to learn more about the Japanese culture.
Kaida is a 25-year-old freelance writer living in Tokyo. She enjoys reading, trying new foods, and practicing yoga. She is currently working on a novel

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, along the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city has a population of over 2.1 million people and is a major center for business, education, and culture. Paris is also known for its romantic atmosphere and is often referred to as the "City of Light." The city has a rich history dating back to the 3rd century BC and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for transparency and explainability in AI decision-making. Explainable AI (XAI) aims to provide insights into how AI models make decisions, which can help
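
The "overlap removal" mentioned above can be pictured as follows: if a streamed chunk repeats the tail of the text collected so far, only the new part is appended. This is an illustrative sketch of that idea and not the implementation of `stream_and_merge`. The names `trim_overlap_sketch` and `merge_chunks` are hypothetical.

```python
def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the end of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any prefix that duplicates the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


print(repr(merge_chunks([" Paris", " Paris is", " is the capital"])))
```

One caveat of this simple approach: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would be trimmed too, so it only suits streams where overlaps are known to be duplicates.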



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Evelyn P. Bottomsworth, and I'm a 35-year-old clerk at the local library. I'm a bit of a bookworm, which is why I end up working here, but I also enjoy spending my free time tending to my small herb garden and practicing yoga. That's me in a nutshell. What do you think? It's neutral, yet gives a glimpse into her personality and interests. The goal is to create a character with a sense of relatability and normalcy. Is there anything you'd change or add? I think it's good as is, but I'm always open to suggestions. 

The

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the country's largest city and is located in the northern part of France. Paris is a global center for art, fashion, cuisine, and science, and is one of the world's leading tourist destinations. The city is known for its iconic land
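
Because `async_generate` is awaited, several independent batches can be scheduled together with `asyncio.gather`. The sketch below shows the pattern with a stand-in engine. `generate_groups` and `FakeAsyncEngine` are hypothetical names, and whether concurrent calls on a real engine behave well for your workload is an assumption to verify.

```python
import asyncio


async def generate_groups(engine, prompt_groups, sampling_params):
    """Await one async_generate call per prompt group, all scheduled together."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


class FakeAsyncEngine:
    """Stand-in used only for this demo; it is not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```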

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emma Taylor. I'm a 22-year-old college student studying psychology. I'm an outgoing and curious person who enjoys trying new things and learning about different cultures.
Emma is a college student studying psychology, which may indicate that she is interested in the human mind and behavior. She is also an outgoing and curious person who enjoys trying new things and learning about different cultures. This suggests that she is adventurous and open-minded.
She is a 22-year-old, which is a relatively young age. This may indicate that she is still in a phase of self-discovery and is exploring her interests and passions. The fact that she is in college

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise statement that describes the significance of the city of Paris. Paris is a major city in Europe and the capital of France.
The Eiffel Tower is one of the most famous landmarks in Paris. The Eiffel Tower is the most visited paid monument in the world.
Provide a statement that describes the popular cultural activities in Paris. Visitors can enjoy a variety of cultural activities such as art museums, fashion shows, and live music performances.
Paris is a world-renowned city for fashion. Paris is the birthplace of haute couture.
The Louvre Museum is one of the most famous museums in Paris. The Lou

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by several trends, including the increasing adoption of edge AI, the growth of Explainable AI (XAI), and the development of more advanced natural language processing (NLP) capabilities.
Edge AI refers to the practice of deploying AI models and algorithms at the edge of a network, closer to where data is being generated, rather than in a centralized data center or cloud. This approach can provide faster response times and lower latency, making it particularly well-suited for applications such as real-time video analysis and autonomous vehicles.
Explainable AI (XAI) is a subfield of AI that focuses on developing methods and techniques
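
When the streamed chunks should also be kept as a single string (for logging or post-processing), the `async for` loop can be wrapped in a small collector. This is a sketch under the assumption that the stream yields plain string chunks, as in the loop above. `collect_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a real engine.

```python
import asyncio


async def collect_stream(chunks) -> str:
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


async def fake_stream():
    """Stand-in async generator used only for this demo."""
    for chunk in [" Paris", ".", "\n", "The", " E", "iff", "el", " Tower"]:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


text = asyncio.run(collect_stream(fake_stream()))
print(repr(text))
```

In the notebook, the equivalent call would be `await collect_stream(async_stream_and_merge(llm, prompt, sampling_params))` inside an `async` function.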




In [6]:
llm.shutdown()