# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the** `if __name__ == "__main__":` **guard is necessary, because SGLang uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
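
A minimal sketch of a script laid out this way, assuming the `sgl.Engine`, `generate`, and `shutdown` calls used later in this document. The `main` function name and the early return when `sglang` is not installed are illustrative choices, not taken from the linked example.

```python
import importlib.util


def main():
    # Illustrative safeguard (an assumption of this sketch): skip the
    # launch entirely if sglang is not installed in this environment.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; nothing to launch.")
        return None

    import sglang as sgl

    # Creating the engine starts subprocesses in spawn mode, so this call
    # should only be reached through the guarded entry point below.
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        for output in outputs:
            print(output["text"])
        return outputs
    finally:
        llm.shutdown()


# Spawned subprocesses re-import this module. Without the guard, that
# re-import would execute the launch code again.
if __name__ == "__main__":
    main()
```

Keeping all engine work inside a function that is only called under the guard means importing the module has no side effects.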

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

The following error message 'operation scheduled before its operands' can be ignored.





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tom. I am an American who lives in a small town in Japan. I have a 12 year old son named Alex. We are an Anglo-Japanese family, and I am a translator/interpreter by profession. I enjoy learning about and sharing my experiences in Japan with others, and I am also passionate about helping my fellow expats navigate the cultural and practical challenges of living in Japan.
When I'm not working or spending time with my family, you can find me in the mountains hiking, practicing martial arts, or writing short stories and poetry. I love to travel and explore new places, and I'm always looking for new adventures
Prompt: The president of the United States is
Generated text:  the head of state and head of government of the United States, and is the highest-ranking official in the federal government. The president serves as commander-in-chief of the armed forces and is responsible for executing the laws and policies of the federal government. The preside
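
The loop above pairs each prompt with its result by position and reads the generated text from the `text` field. A small sketch of collecting those pairs into records for later processing follows. `FakeEngine` is a hypothetical stand-in that mimics only that return shape so the snippet runs without a model, and `collect_generations` is an illustrative helper, not part of SGLang.

```python
from typing import Dict, List


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def collect_generations(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_generations(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(f"Prompt: {record['prompt']}\nGenerated text: {record['text']}")
```

With a real engine, the same helper could be called with `llm` in place of `FakeEngine()`.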

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm a bit of a introvert and prefer quieter environments, but I'm always up for a good conversation. I'm currently working on a novel and trying to get my writing career off the ground. That's me in a nutshell.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states facts about Kaida's life and interests. It also doesn't try to impress or manipulate the reader, which is a good

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a popular tourist destination and is considered one of the most beautiful and culturally rich cities in the world. The city has a population of over 2.1 million people and is a major

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future holds, here are some possible future trends in artificial intelligence:
1. Increased use of AI in everyday life: As AI technology becomes more advanced and affordable, we can expect to see its use in more aspects of our daily lives. This could include AI-powered personal assistants, smart homes, and self-driving cars.
2. Advancements in natural language processing: Natural language processing (NLP) is a key area of AI research, and we can expect to see significant advancements in this area in the coming years. This could include more accurate and natural-s
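
The "overlap removal" mentioned above can be pictured as follows: before appending a streamed chunk, drop any prefix of the chunk that repeats the end of the text collected so far. This is a sketch of one way to do that, under the assumption that chunks may repeat previously emitted text. `trim_overlap` and `merge_chunks` are illustrative names, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that overlaps the running result."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello wor", "world", "!"]))
```

One limitation of this approach: a chunk that legitimately begins with text matching the tail of the merged output is trimmed as well, so it only suits streams where such repeats are known to be duplicates.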



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Mayuko. I'm a 19-year-old student who attends a local community college. I'm studying business administration, with a focus on marketing. I'm currently working part-time as a barista at a small coffee shop downtown. That's about it. I enjoy reading and trying out new coffee flavors in my free time. That's me. ~Mayuko
Mayuko's introduction is neutral and straightforward, providing basic information about her life without expressing any strong opinions or emotions. She mentions her education, work experience, and hobbies, which helps to give a sense of who she is without going into too much detail. The tone is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Identify the mayor of Paris since 2001. Bertrand Delanoë (as of 2021) served as the mayor of Paris from 2001 to 2014. He was succeeded by Anne Hidalg
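
Because `async_generate` is awaited, other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that mimics only the awaitable call and the `text` field shown above, and `generate_alongside_other_work` is an illustrative helper.

```python
import asyncio
from typing import Dict, List, Tuple


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine's awaitable generation call."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_alongside_other_work(
    engine, prompts: List[str], sampling_params: Dict
) -> Tuple[List[str], str]:
    async def other_work() -> str:
        await asyncio.sleep(0)  # runs while generation is still pending
        return "other work done"

    # gather schedules both coroutines and waits for both results.
    outputs, status = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        other_work(),
    )
    return [output["text"] for output in outputs], status


texts, status = asyncio.run(
    generate_alongside_other_work(FakeAsyncEngine(), ["a", "b"], {"temperature": 0.8})
)
print(texts, status)
```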

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Marcus Thompson, and I'm a 25-year-old graphic designer living in the bustling city of New York. I enjoy hiking and exploring new places in my free time. I'm a creative problem-solver and a detail-oriented individual who values simplicity and efficiency in my work. I'm excited to meet new people and collaborate on exciting projects.
This self-introduction highlights key points about Marcus, such as his profession, hobbies, and personal qualities. It aims to create a positive and professional impression without being overly boastful or promotional. It also includes a friendly and approachable tone, suggesting that Marcus is open to meeting new people and working together



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
a) In what region is Paris located?
Answer: Île-de-France

b) What is the approximate population of the city of Paris?
Answer: Approximately 2.1 million people

c) What are some notable landmarks in Paris?
Answer: The Eiffel Tower, The Louvre, Notre Dame Cathedral, Arc de Triomphe, Montmartre

d) What is the official language spoken in Paris?
Answer: French

e) What is the climate like in Paris?
Answer: Oceanic climate, with mild winters and warm summers

f) Is Paris a culturally significant city?
Answer: Yes



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but experts predict several trends to shape the field.
The future of artificial intelligence (AI) is uncertain, but experts predict several trends that will shape the field. Here are some possible future trends in AI:
1. **Increased Adoption in Industries**: AI will become increasingly adopted across various industries, including healthcare, finance, education, and manufacturing. This will lead to improved efficiency, accuracy, and decision-making.
2. **Advancements in Natural Language Processing (NLP)**: NLP will continue to improve, enabling AI systems to better understand human language and communicate more effectively with humans. This will lead to more conversational interfaces and
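
The streamed output arrives as small chunks that are printed as they come in and together form the full text. A minimal sketch of consuming such a stream with `async for` while also keeping the complete result follows. `fake_chunk_stream` is a hypothetical stand-in for an engine's asynchronous chunk stream, and `print_and_collect` is an illustrative helper.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_chunk_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an engine's asynchronous chunk stream."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the joined text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


final_text = asyncio.run(
    print_and_collect(fake_chunk_stream([" Paris", ".", "\n", "Answer", ":", " French"]))
)
```

Printing with `end=""` keeps the chunks on one line, so the text reads continuously instead of one token per line.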




In [6]:
llm.shutdown()
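
In a script, it can help to make sure `shutdown()` runs even when generation raises an exception. One possible pattern is a small context manager, sketched below. `engine_session` is an illustrative helper, not part of SGLang, and `FakeEngine` is a hypothetical stand-in so the snippet runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated generation failure")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() when the block exits."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print("shut down:", engine.is_shut_down)
```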