# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

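**To launch the offline engine from a Python script, the launch code must sit under an** `if __name__ == "__main__":` **guard.** SGLang uses the `spawn` start method to create subprocesses, and under `spawn` each subprocess re-imports the main module, so unguarded top-level code would run again in every child. Please refer to this simple example:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py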

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Brian and I am a web developer with a passion for creating beautiful, user-friendly, and scalable web applications. I have been working with web technologies for over 5 years now and I am excited to be a part of this community.

I primarily work with HTML, CSS, JavaScript, and various frameworks such as React, Angular, and Vue.js. I also have experience with back-end technologies like Node.js, Ruby on Rails, and Django.

I am a big believer in the importance of accessibility, SEO, and performance optimization in web development. I strive to create websites that are not only visually appealing but also easy to use and navigate for
Prompt: The president of the United States is
Generated text:  the head of state and government, and the commander-in-chief of the armed forces. The president is elected by the people through the Electoral College. The president serves a four-year term and is limited to two terms.
The president is responsible for enfo
```

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading, playing video games, and spending time with friends. I'm a bit of a introvert, but I'm working on being more outgoing. I'm a junior in high school, and I'm trying to figure out my future plans. I'm a bit of a perfectionist, which can sometimes make things difficult for me, but I'm learning to balance that with being more relaxed and spontaneous. I'm a bit of a bookworm, and I love getting lost in a good story. I'm also a bit of a tech enthusiast, and I enjoy

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and most populous city of France, with a population of approximately 2.1 million people within its city limits. It is one of the most famous and iconic cities in the world, known for its stunning architecture, art museums, fashion, cuisine, and romantic atmosphere. Paris is situated in the northern part of France, along the Seine River, and is a major hub for business, culture, and tourism. The city is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Widespread adoption of AI in industries: AI is expected to be adopted in various industries, including finance, transportation, and education. AI-powered systems will be able to automate tasks, improve efficiency, and make
```
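The "overlap removal" mentioned above means merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below illustrates the idea in plain Python, without a model. `strip_overlap` and `merge_chunks` are hypothetical helpers written for this example, and the sample chunks are invented; the actual logic behind `stream_and_merge` in `sglang.utils` may differ.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` minus its longest prefix that repeats the end of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text duplicated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Invented chunks whose boundaries overlap (an assumption for illustration).
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))  # prints: The capital of France is Paris.
```

A purely textual check like this cannot tell an overlap from a genuine repetition, so a model that legitimately repeats a word at a chunk boundary would lose it under this sketch.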



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Axel Stone, but most people just call me Axel. I'm a 25-year-old information broker with a passion for uncovering hidden secrets. I'm well-versed in cryptography, surveillance, and infiltration techniques, which often comes in handy in my line of work. I'm not one for flashy introductions, but I'm always looking for new challenges and opportunities to make a name for myself.
Axel Stone is a 25-year-old information broker who uses his skills in cryptography, surveillance, and infiltration to uncover hidden secrets. He is a professional who prefers to keep a low profile, but is always looking for new challenges and opportunities

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
What is the capital city of France?
The capital of France is Paris.
What is the meaning of the word Paris?
The word Paris is of Gr
```
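Because `async_generate` is awaited, the generation call can share an event loop with other coroutines. The sketch below shows that pattern with a stub so it runs without a model. `fake_async_generate` is a hypothetical stand-in for the engine call; its delay and return shape (a list of dicts with a `text` key, mirroring the output above) are assumptions for illustration.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_async_generate(prompts: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in for `llm.async_generate`; echoes each prompt."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks: int) -> int:
    """Other work that makes progress while generation is pending."""
    for _ in range(ticks):
        await asyncio.sleep(0.001)
    return ticks


async def run() -> Tuple[List[Dict[str, str]], int]:
    # gather schedules both coroutines on the same event loop.
    outputs, ticks = await asyncio.gather(
        fake_async_generate(["Hello, my name is", "The capital of France is"]),
        heartbeat(3),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run())
for out in outputs:
    print(out["text"])
```

With a real engine, the stub would be replaced by the awaited `llm.async_generate(prompts, sampling_params)` call shown in the cell above.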

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Nova Ryzhikov. I’m a 17-year-old high school student who lives in a small town in Russia. I enjoy reading, playing chess, and listening to music. I’m not really sure what I want to do with my life yet, but I’m trying to figure it out. That’s me.
I am Aurora "Rory" Thompson, a 25-year-old freelance writer living in Portland, Oregon. I love hiking, trying new restaurants, and practicing yoga. My writing often focuses on social justice and the environment.
I'm 32-year-old Julian Styles, a psychologist who specializes in anxiety disorders.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France is divided into 13 administrative regions. 27 metropolitan departments and 5 overseas departments. France is a country located in Western Europe. France is bordered by the English Channel, the Atlantic Ocean, and the Mediterranean Sea. France is one of the most visited countries in the world. France has a diverse economy with a strong focus on service industries. The official language is French. France is home to the European Union. France is a member of NATO and other international organizations. France is a secular country with a strong tradition of secularism. France has a diverse culture influenced by its history and geography. France is home to many UNESCO

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by the convergence of various technologies, including natural language processing, computer vision, and machine learning.
The future of artificial intelligence (AI) is expected to be shaped by the convergence of various technologies, including natural language processing, computer vision, and machine learning. Here are some possible future trends in AI:
1. **Increased use of machine learning**: Machine learning algorithms will continue to improve, allowing AI systems to learn from data and adapt to new situations. This will lead to more accurate and efficient decision-making in various applications, such as healthcare, finance, and transportation.
2. **Rise of explainable AI**: As
```


```python
llm.shutdown()
```
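A `shutdown()` call placed only at the end of a script is skipped if an earlier step raises. One way to guard against that is a small context manager, sketched here with a stub engine so the example runs without a model. `FakeEngine` and `managed_engine` are hypothetical names for this illustration and are not part of SGLang; the only assumption about the real engine is that it exposes `shutdown()` as used above.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class FakeEngine:
    """Hypothetical stub that mimics only `generate` and `shutdown`."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str]) -> List[Dict[str, str]]:
        if self.is_shut_down:
            raise RuntimeError("engine has been shut down")
        return [{"text": p.upper()} for p in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine: FakeEngine) -> Iterator[FakeEngine]:
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as stub_llm:
        stub_llm.generate(["hello"])
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.is_shut_down)  # prints: True
```

With a real engine, the same wrapper would take the object returned by `sgl.Engine(...)` in place of the stub.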