# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine from a Python script, you must wrap the launch code in an** `if __name__ == "__main__":` **guard, because SGLang uses the** `spawn` **start method to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Fiona, and I am a client of Bellewood Homecare. I was referred to Bellewood by a family friend who highly recommended their services. I had been struggling to manage my daily tasks on my own, and it was becoming increasingly difficult for me to maintain my independence.
I was paired with a wonderful caregiver, Rachel, who has been a godsend to me. She is kind, compassionate, and has a heart of gold. She assists me with everything from personal care to light housekeeping, and has even helped me to stay organized and on top of my medications.
Rachel has become a trusted member of my support team, and I
Prompt: The president of the United States is
Generated text:  an extraordinary figure in world politics. The office is a unique blend of politics, power, and personality. The 45th president of the United States has stirred up controversy, even in the first year of his term. The man's presidency, and his leadership style, is an interesting and com
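
For offline batch jobs it is often convenient to store each prompt next to its completion. The sketch below assumes only what the cell above shows, namely that each element of `outputs` is a dict with a `"text"` key. The helper name `to_jsonl` and the stand-in data are hypothetical and exist only so the snippet runs without a loaded model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        # Only the "text" field is read; any other fields in `output` are ignored.
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]
print(to_jsonl(demo_prompts, demo_outputs))
```

With a real engine, the same helper could be called as `to_jsonl(prompts, outputs)` right after `llm.generate(...)` and the result written to a file.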

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my tiny kitchen. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm currently working on a novel and a series of short stories, and I'm excited to see where my creative projects take me. I'm looking forward to getting to know you and sharing my work with you.
This is a good example of a neutral self-introduction because it:
* Provides basic information about the character, such as their

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also a major hub for business, education, and culture, attracting millions of tourists and visitors each year. Paris is a city that is steeped in history and culture, and it is a must-visit destination for anyone interested in exploring the best of France. The city is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it is difficult to predict exactly what the future will hold, here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI in industries: AI is already being used in various industries such as finance, transportation, and customer service. In the future, AI is likely
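
The "overlap removal" mentioned in the cell above can be pictured as follows: if a new chunk begins by repeating the tail of the text accumulated so far, the repeated part is dropped before the chunk is appended. This is a minimal sketch of that idea, not necessarily how `stream_and_merge` is implemented; `drop_repeated_prefix` and `merge_chunks` are illustrative names, and the chunk list is made-up data.

```python
def drop_repeated_prefix(existing, chunk):
    """Return `chunk` without the longest prefix that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


print(merge_chunks([" Paris is", " is the capital", " capital of France."]))
```

Note that a purely textual heuristic like this cannot distinguish an accidental overlap from text the model legitimately repeats, so it is best suited to streams that are known to resend their tail.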



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Isabella Thompson. I am a 26-year-old art teacher at a local elementary school. I enjoy spending time with friends and family, reading, and trying out new recipes in my kitchen. I am a bit of a perfectionist and value honesty above all else. Isabella is a fictional character in a story that you are writing.
To write a self-introduction for Isabella, first, you need to understand the context of her character and the tone of the story. Since Isabella is a fictional character in a story, the introduction should be concise and to the point.
Here are some tips to consider when writing a self-int

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Recognize that a factual statement is clear and concise, without including unnecessary information. The capital of France is Paris. 
Identify the main topic of the sen

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Katsuragi. I'm a 19-year-old college student who's studying engineering. I like playing video games and reading science fiction novels in my free time. That's pretty much it for me. What do you think? Does it sound like a good starting point for a character? You can suggest any additional details you'd like to add or change to make it more interesting.
self-introduction character development

Your self-introduction is clear and concise, which is great for a starting point. It provides a good foundation for developing Kaida's personality and background. To make it more interesting, you might consider adding some

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This response is a concise factual statement about France’s capital city. It provides the name of the city, which is Paris, and is stated in a straightforward and accurate manner. There is no additional information or commentary included, which meets the request for a factual statement. However, it is worth noting that a more complete answer could include additional details about Paris, such as its population, location, or notable landmarks. Nevertheless, the given response satisfies the initial request for a concise factual statement. The final answer is: Paris. ## Step 1: Identify the question about France's capital city.
The question asks for a concise factual statement

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  predicted to be shaped by advances in several areas.
Explain possible future trends in artificial intelligence.
The future of AI is predicted to be shaped by advances in several areas. Some of the possible future trends in artificial intelligence include:
1. Increased emphasis on Explainability and Transparency: As AI becomes more ubiquitous, there will be a growing need for explainability and transparency in AI decision-making processes. This means that AI systems will need to provide clear and understandable explanations for their decisions, which will require significant advances in areas such as interpretability and causality.
2. Rise of Edge AI: With the increasing demand for real-time processing and low-lat
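
If the full text is also needed after streaming, the chunks can be accumulated while they are printed. The sketch below uses a stand-in async generator, `fake_stream`, in place of `async_stream_and_merge(llm, prompt, sampling_params)` so that it runs without a model; `collect_stream` is likewise a hypothetical helper. It assumes, as in the cell above, that each yielded chunk is a plain string.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Stand-in for a streaming call: yields string chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream, echo=True):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


async def demo():
    # Made-up chunks; a real stream would come from the engine.
    return await collect_stream(fake_stream([" Paris", ".", " It", " is"]))


full_text = asyncio.run(demo())
```

Because `collect_stream` accepts any async iterator of strings, the stand-in could presumably be swapped for the real streaming call inside `main()` without changing the helper.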




In [6]:
llm.shutdown()