# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. The two main use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Special Warning

**To launch the offline engine from a Python script, the** `__main__` **guard is required, because the engine uses** `spawn` **mode to create subprocesses. See this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
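
The reason for the guard can be illustrated with the standard library alone. The sketch below is not SGLang code: `square` and `run_workers` are hypothetical names, and a `multiprocessing` pool stands in for the engine's subprocesses. It only shows how `spawn` interacts with module-level code.

```python
import multiprocessing as mp


def square(x):
    """Work done in a child process. It lives at module top level so that
    `spawn` children can import it by name."""
    return x * x


def run_workers(values):
    # Explicitly request the `spawn` start method, the mode the text above
    # says the engine uses for its subprocesses.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Under `spawn`, every child re-imports this file. Without the guard,
    # that re-import would attempt to start workers again, which Python
    # rejects with a RuntimeError.
    print(run_workers([1, 2, 3]))
```

In a script that uses SGLang, the call that constructs `sgl.Engine(...)` would sit where `run_workers(...)` is called here, inside the guarded block or in a function called from it.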

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

# Only relevant when this notebook runs in the project's CI.
if is_in_ci():
    import patch


# Load the model once; the cells below reuse this engine.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Anne, and I am a librarian at a public library in a small town. I have been working here for about 6 months now, and I am still learning the ropes. My main responsibility is to assist patrons with their information needs, which can range from finding a good book to doing research for a school project. I also help with programs and events for kids and adults, such as author readings and craft workshops. I love my job because I get to meet so many interesting people and help them discover new things.

In my free time, I enjoy reading (of course!), hiking, and cooking. I am a bit of a food
Prompt: The president of the United States is
Generated text:  the chief executive of the federal government. The president is elected by the people through the Electoral College. The president serves a four-year term and is limited to two terms. The president's main responsibilities include appointing federal judges, including Supreme Court justices, and execu
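
Offline batch jobs usually need to persist their results. The following is a minimal sketch, assuming only what the cell above shows: that each output is a dict with a `text` key. The helper `write_jsonl` and the demo data are hypothetical and not part of SGLang.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/output pair to a text file object."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hand-written stand-ins for the list returned by llm.generate(...).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
write_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```

With a real engine, `buffer` would typically be replaced by a file opened with `open(path, "w", encoding="utf-8")`.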

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new coffee shops. That's me in a nutshell. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and trying to get my writing career off the ground. I'm excited to see where life takes me.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations. It simply states her name, occupation

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, along the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. The city is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular tourist destination and is considered one of the most romantic cities in the world. The city has a population of over 2.1 million people and is a major hub for business, education, and entertainment. Paris is also known for its cuisine, which includes famous dishes such as esc

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of explainable AI: As AI becomes more pervasive, there is a growing need for AI systems to be transparent and explainable. This will involve developing techniques to interpret and understand the decisions made by AI systems.
3.
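
The cell above mentions "overlap removal". One way to picture that idea is sketched below, under the assumption that a new chunk may repeat text already at the end of what has been accumulated. `strip_overlap` and `merge_chunks` are hypothetical helpers written for illustration; the text above does not say how `stream_and_merge` is implemented.

```python
def strip_overlap(accumulated, chunk):
    """Return the part of `chunk` not already present at the end of `accumulated`."""
    max_len = min(len(accumulated), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", "capital of", " of France"]))
# prints: The capital of France
```

A caveat of this heuristic: it cannot tell an accidental overlap from a genuine repetition, so the chunks `"ha"` and `"ha"` would merge to `"ha"` rather than `"haha"`.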



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 20-year-old college student majoring in environmental science. When I'm not studying, you can find me hiking, playing guitar, or attempting to cook a new recipe. I'm a bit of a bookworm, always eager to learn and explore new ideas. What do you think of this introduction? Does it provide enough information, or is it too vague?
This introduction provides a good balance of information about Kaida's personal and professional life. Here are a few suggestions to improve it:
1. Consider adding a bit more detail about Kaida's interests: While it's nice to know that Kaida

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country, in the heart of the Île-de-France region. It is situated on the River Seine. The city is one of the most visited destinations in
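
Because the asynchronous call is awaitable, it can be combined with other coroutines. The sketch below shows the general `asyncio.gather` pattern with a hypothetical stand-in, `fake_generate`, in place of a real engine call; it makes no claim about how the engine schedules requests internally.

```python
import asyncio


async def fake_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" echo: {prompt}"}


async def generate_all(prompts):
    n = len(prompts)
    # Give earlier prompts longer delays so they finish last; gather still
    # returns the results in the order the awaitables were passed in.
    coros = [fake_generate(p, delay=0.01 * (n - i)) for i, p in enumerate(prompts)]
    return await asyncio.gather(*coros)


if __name__ == "__main__":
    for result in asyncio.run(generate_all(["A", "B", "C"])):
        print(result["text"])
```

The same pattern would let one coroutine await generation while others perform unrelated I/O, such as reading the next batch of prompts.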

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaelin Darkhaven. I'm a young woman in my early twenties, with an unremarkable appearance and a quiet demeanor. I work as a junior researcher at a small, local university, where I spend most of my time studying the effects of environmental pollution on local ecosystems. I'm not particularly outgoing, but I enjoy learning and appreciate the beauty of the natural world. I'm currently living in a small apartment above my family's old bookstore, which is a cozy and familiar place for me. What do you think? Is this a good starting point for your character? Feelings? Bio? Any other details you'd like to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the north-central part of the country. It is situated on the Seine River.
The River Seine is an important geographical feature of Paris. The river divides the city into two parts: the left bank (Rive Gauche) and the right bank (Rive Droite). This division is also reflected in the city's historical and cultural landscape, with many famous landmarks and attractions situated along the riverbanks. The Seine has played a significant role in the city's development, providing a source of water, transportation, and inspiration for artists and writers.
Paris is known for its stunning architecture, rich history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation, and various trends are predicted to shape the field in the coming years. Some possible future trends in AI include:
Increased use of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI models make decisions. Explainable AI (XAI) aims to provide insights into the reasoning behind AI-driven decisions, enhancing trust and transparency.
Advancements in Natural Language Processing (NLP): NLP has made significant progress in recent years, enabling more accurate and efficient human-AI interaction. Future advancements in NLP could lead to more sophisticated chatbots, voice assistants, and

In [6]:
llm.shutdown()
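
In a longer script, an exception raised between engine creation and `shutdown()` would skip the cleanup. A small context manager is one way to guard against that. This is a sketch, assuming only that the engine object exposes `shutdown()` as used above; `engine_session` and `DummyEngine` are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


class DummyEngine:
    """Stand-in with the same shutdown() method, for demonstration only."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


with engine_session(DummyEngine) as engine:
    print("engine closed inside block:", engine.closed)
print("engine closed after block:", engine.closed)
```

With SGLang, the factory could be something like `lambda: sgl.Engine(model_path=...)`, passed in place of `DummyEngine`.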