# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, an** `if __name__ == "__main__":` **guard is necessary, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
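
As a minimal sketch of that pattern (not the linked script itself), the engine can be created inside a function that is called only under the `__main__` guard, so that subprocesses started in `spawn` mode can re-import the module without launching another engine. The model path, prompt, and sampling parameters are reused from the cells in this document; the `find_spec` check and the lazy import are illustrative additions so that the file can be loaded on a machine without SGLang.

```python
import importlib.util


def main():
    # Imported lazily so this sketch can be loaded even where sglang
    # is not installed (an illustrative choice, not a requirement).
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    outputs = llm.generate(
        ["The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
    for output in outputs:
        print(output["text"])
    llm.shutdown()


# Without this guard, each spawned subprocess would re-run the module's
# top-level code when it imports the script.
if __name__ == "__main__":
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping engine launch")
    else:
        main()
```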

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Josie and I am a certified Professional Organizer (CPO). I am excited to introduce you to my business, Simply Organized. My goal is to help you achieve a more organized, peaceful, and productive life. I understand that organizing can be overwhelming and time-consuming, which is why I am here to provide personalized support and guidance to help you get your space and life organized.
I offer a range of services to fit your needs and budget, including:
Virtual Organizing Sessions
In-Home Organizing
Decluttering and Downsizing
Paper Management and Digital Organization
Space Planning and Design
Hoarder’s Support and Ref
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States. The president serves a four-year term and is limited to two terms by the 22nd Amendment to the Constitution. The president is elected through the Electoral College system, where each state is allocated a cert

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student. I'm a bit of a bookworm and enjoy reading about history and science. I'm also a member of the school's debate team and enjoy arguing about current events. I'm a bit of a perfectionist, which can sometimes make me come across as stubborn or overly critical. I'm working on balancing my desire for precision with being more open-minded and flexible. I'm looking forward to meeting new people and making friends. I'm a bit shy at first, but once you get to know me, I'm pretty outgoing and enjoy trying new things. I'm excited

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The capital of France is Paris.
The capital of France is Paris. This is a concise factual statement about France’s capital city. It is a simple and direct statement that provides accurate information. There is no need for additional context or explanation, as the statement is clear and unambiguous. This type of statement is often used in encyclopedias, dictionaries, and other reference materials where brevity and accuracy are essential. It is also a good example of a declarative sentence, which is a type of sentence that makes a statement or assertion. In this case, the statement is a simple declaration of fact. Overall,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future holds, here are some possible trends that could shape the development and impact of artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI could be used to develop new treatments and cures for diseases, and to improve patient outcomes.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI could be used in
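
The "overlap removal" mentioned in the cell above guards against a streamed chunk repeating text that has already been received. One way this could work is sketched below: before appending a chunk, drop its longest prefix that is also a suffix of the text accumulated so far. The names `trim_overlap` and `merge_chunks` are illustrative, and this is not necessarily how `stream_and_merge` is implemented in `sglang.utils`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without its longest prefix that is a suffix of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the largest candidate overlap first and shrink until one matches.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, trimming any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris is", " is the capital"]
print(merge_chunks(chunks))  # prints " Paris is the capital"
```

Note that this heuristic cannot tell an accidental overlap from a genuine repetition: merging `"ha"` followed by `"ha"` yields `"ha"`, so it fits best when chunks are known to overlap.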



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Astra, and I'm a skilled programmer working in the city of New Eden. I'm a bit of a perfectionist and enjoy solving complex problems. In my free time, I enjoy reading science fiction novels and attending concerts. I'm currently looking for a new challenge in my career.

This self-introduction is neutral and professional, and it provides a brief glimpse into the character's personality and interests. It doesn't reveal too much about Astra's background or motivations, leaving room for further development and exploration.

Here are some key elements to consider when writing a neutral self-introduction for a fictional character:

1.  **Professional tone

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a list of 5 interesting facts about France’s capital city.
1. Paris is built on an island in the Se
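
Because `async_generate` is awaitable, independent requests could also be issued concurrently with `asyncio.gather`, for example when each request needs to be handled separately. The sketch below uses a hypothetical `FakeEngine` stand-in so that it runs without a GPU or SGLang installed; with a real engine, `llm` would be passed instead. It assumes that `async_generate` accepts a single prompt string and returns a dict with a `"text"` key.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control to the event loop once
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in the order of `prompts`.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    outputs = await asyncio.gather(*tasks)
    return [output["text"] for output in outputs]


texts = asyncio.run(
    generate_concurrently(
        FakeEngine(),
        ["The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for text in texts:
    print(text)
```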

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Avery Martin and I'm a 25-year-old freelance writer from San Francisco. I'm currently living in a small studio apartment in the Mission District. In my free time, I enjoy hiking, reading, and trying out new restaurants in the city.
This is a great start because it gives a sense of who Avery is and where she lives, but it could use some depth and character details. Let's revise it.
Here's a revised introduction that adds more depth and character to Avery: Hello, I'm Avery Martin, a 25-year-old freelance writer with a passion for storytelling. I've called San Francisco home for most of my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Source: CIA World Factbook. ...more

Identify the two major rivers that flow through the city of Paris. The Seine and the Marne. ...more

What is the name of the famous bridge in Paris that was originally a medieval stone bridge and was rebuilt after a catastrophic flood in 1770? Pont Neuf. ...more

What is the name of the famous square in Paris where artists and performers gather to entertain crowds of tourists and locals alike? Place du Tertre. ...more

Which of the following Parisian landmarks is not a famous building? The Louvre Museum is famous for housing the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a rapidly evolving and complex field that is constantly shifting and adapting to new technologies and discoveries. While it is difficult to predict with certainty, here are some possible future trends in artificial intelligence:
1. Increased focus on human-AI collaboration: As AI becomes more prevalent, there will be a greater emphasis on how humans and AI systems can work together effectively. This could lead to new forms of human-AI collaboration, such as AI-assisted decision-making and AI-enabled creativity.
2. Advancements in natural language processing (NLP): NLP is a critical component of AI, enabling machines to understand and generate human language. Future advancements in N

In [6]:
llm.shutdown()
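
In a script, it can help to make sure `shutdown()` runs even when generation raises an exception. A minimal sketch of one way to do that with a context manager follows; `engine_session` and `FakeEngine` are hypothetical names introduced here, and the factory argument stands in for a callable that builds a real `sgl.Engine`.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine_factory):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = engine_factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


with engine_session(FakeEngine) as llm:
    outputs = llm.generate(
        ["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95}
    )

print(llm.closed)  # prints True: shutdown() ran when the block exited
```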