# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in your Python scripts, the launch code must sit under an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
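
To illustrate why the guard matters, here is a minimal sketch that uses only Python's standard `multiprocessing` module rather than SGLang itself. The names `worker` and `run_in_spawned_process` are hypothetical and merely stand in for the engine's subprocess machinery, which this document does not show. With `spawn`, the child starts a fresh interpreter and re-imports the launching script, so unguarded top-level launch code would run again inside the child.

```python
import multiprocessing as mp


def worker(conn, text):
    # Runs in the child process; a stand-in for work done by an engine subprocess.
    conn.send(text.upper())
    conn.close()


def run_in_spawned_process(text):
    # "spawn" starts a fresh interpreter that re-imports this module,
    # so everything at module top level is executed again in the child.
    ctx = mp.get_context("spawn")
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn, text))
    proc.start()
    child_conn.close()  # the parent keeps only its own end of the pipe
    result = parent_conn.recv()
    proc.join()
    return result


if __name__ == "__main__":
    # Only the original script enters this branch; the re-imported copy
    # in the child does not, so no second launch is attempted there.
    print(run_in_spawned_process("hello"))
```

If the call were moved out of the guard to module top level, the `spawn` start method would typically fail with a `RuntimeError` about starting a new process before the current process has finished its bootstrapping phase.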

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jay and I'm a math enthusiast! I've been fascinated by math since I was a child, and I've been fortunate enough to have had some amazing math mentors throughout my life. I'm excited to share my passion for math with you and help you with any math questions or problems you may have.

In my free time, I enjoy solving math problems, reading math books and articles, and exploring the connections between math and other subjects like science, history, and philosophy. I'm always looking for new ways to apply math to real-world problems and to make math more accessible and fun for everyone.

Some of the math topics I'm most
Prompt: The president of the United States is
Generated text:  not a monarch, but an elected official who serves as the head of state and head of government. The president has a multitude of responsibilities, including serving as commander-in-chief of the armed forces, negotiating treaties, and appointing federal judges. The presid

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new recipes in my free time. I'm a bit of a introvert, but I'm always up for a good conversation when I'm feeling energized. I'm currently working on a novel and trying to build my writing portfolio. That's me in a nutshell.
This introduction is neutral because it doesn't reveal any personal secrets or biases. It simply presents Kaida's basic information and interests in a straightforward way. It also doesn't try

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a starting point for further discussion or research. The statement is also accurate, as Paris is widely recognized as the capital of France. Overall, this statement meets the requirements of a concise factual statement about France’s capital city. 
Note: This response is a simple and direct answer to the question, and it does not provide any

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future holds, here are some possible trends that could shape the development and impact of artificial intelligence:
1. Increased Adoption in Everyday Life: AI will become increasingly integrated into our daily lives, from virtual assistants like Siri and Alexa to more sophisticated applications in healthcare, finance, and education.
2. Advancements in Machine Learning: Machine learning algorithms will continue to improve, enabling AI systems to learn from experience, adapt to new situations, and make more accurate predictions.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need
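
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. Its implementation is not shown in this document, so the following is only a minimal sketch of what overlap-aware merging could look like, assuming each new chunk may repeat a suffix of the text received so far. The names `trim_overlap` and `merge_chunks` are hypothetical and used here for illustration only.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing already ends with."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello, my", " my name", " name is Kaida."]))
```

One limitation of this approach is that it cannot tell a repeated chunk from a legitimate repetition: after `"ha"`, a second `"ha"` chunk would be trimmed to an empty string.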



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Esteban Javier. I'm a 25-year-old historian who specializes in the late 19th-century history of Latin America. I've worked in several archives across the continent, researching topics ranging from the Mexican Revolution to the rise of the Argentine economy. My latest project is a book on the lives of women during the Chilean copper strike of 1907. I'm based in Buenos Aires for now, but I'm open to traveling for research and collaboration opportunities.
I'm a writer, a teacher, and a researcher. My work focuses on the intersection of history and culture, particularly in the context of Latin America. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. France is a country with a rich history and culture. Paris is known for its beautiful architecture, famous landmarks, and romantic atmosphere. It is a popu
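
An asynchronous interface also makes it possible to issue several requests as separate coroutines and wait for all of them together. The sketch below shows that pattern with `asyncio.gather`, using a hypothetical `FakeEngine` class as a stand-in so that it runs without a model. It is not the SGLang engine, and its `async_generate` signature is an assumption made for illustration.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine with an async generate call."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt and await them concurrently.
    # asyncio.gather returns results in the same order as its inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    prompts = ["The capital of France is", "The future of AI is"]
    outputs = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `gather` preserves input order, the results can be paired with the prompts using `zip`, just as in the batch call above.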

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elara Vex, and I'm a skilled programmer and hacker from the city of New Haven.
I've worked on various projects for both private and public clients, and I've built a reputation for being reliable and discreet. When I'm not working, I enjoy reading about computer history and trying out new coding languages.
I'm a bit of a introvert, and I prefer to keep to myself, but I'm not averse to collaborating with others when the project requires it. I'm a quick learner, and I'm always looking for ways to improve my skills and expand my knowledge.
What makes Elara Vex stand out from

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The statement should be a clear and direct statement of fact. The sentence should be short and easy to understand.
The statement should not include any opinion or bias. The statement should be a neutral statement of fact.
The statement should be written in a clear and concise manner. The statement should be easy to understand for a general audience.
The statement should be accurate and true. The statement should not contain any errors or misinformation.
The statement should be written in a professional tone. The statement should be suitable for a variety of audiences and purposes.
The statement should be written in a way that is easy to verify. The statement should be supported

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but here are some possible trends we can expect:
AI will continue to improve and become more integrated into various aspects of life, including healthcare, finance, transportation, and education. AI will also become more accessible and affordable, making it a valuable tool for individuals and businesses.
We will see advancements in areas such as:
Edge AI: AI will become more decentralized, with AI processing happening at the edge of the network, closer to the user. This will improve the speed and efficiency of AI applications.
Explainable AI: As AI becomes more pervasive, there will be a growing need to understand how AI decisions are made. Explainable


In [6]:
llm.shutdown()
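
In a script, an exception raised before the final cell would skip the `shutdown()` call. One way to guard against that is a small context manager built on `try`/`finally`. The sketch below is an assumption-laden illustration: `managed_engine` is a hypothetical helper, and `FakeEngine` is a stand-in that exposes only a `shutdown()` method so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that only tracks whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    # factory is any zero-argument callable that returns an object
    # with a shutdown() method.
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the with-body raises


if __name__ == "__main__":
    with managed_engine(FakeEngine) as engine:
        print("engine closed inside block:", engine.closed)
    print("engine closed after block:", engine.closed)
```

With a real engine, the factory could be a `lambda` that constructs it with the desired arguments, for example the `sgl.Engine(model_path=...)` call used at the top of this document.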