# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example in the form of a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
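As a rough illustration of the custom-server idea, the sketch below shows only the request-handling step that a server would wrap around the engine. It is not the linked `custom_server.py` script. `EchoEngine` and `handle_generate` are hypothetical names introduced here, and the stub only mimics the `generate(prompts, sampling_params)` call and the `output["text"]` result shape used later in this document, so the block runs without a model.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the result shape used in this document: one dict with "text" per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using an engine-like object."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "request body must be valid JSON"})

    if not isinstance(payload, dict):
        return json.dumps({"error": "request body must be a JSON object"})

    prompts = payload.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return json.dumps({"error": "'prompts' must be a list of strings"})

    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [o["text"] for o in outputs]})


request_body = json.dumps(
    {
        "prompts": ["The capital of France is"],
        "sampling_params": {"temperature": 0.8, "top_p": 0.95},
    }
)
print(handle_generate(EchoEngine(), request_body))
```

Keeping the handler a plain function of `(engine, body)` means it can be plugged into whichever web framework you prefer and tested without starting a server.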



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Elena and I'm a Russian student from St. Petersburg, studying business administration in the USA. I'm here on a scholarship to explore and learn about American culture, and I'm excited to experience all that this beautiful country has to offer!
I have to admit, when I first arrived in the States, everything was quite overwhelming – the food, the customs, the language (I'm sure you know how hard it can be to pronounce 'th'!). However, with every passing day, I'm becoming more and more comfortable, and my language skills are improving rapidly.
My host family has been incredibly welcoming and supportive, and I'm loving
Prompt: The president of the United States is
Generated text:  going to make a big speech to Congress about America's plans for space exploration. With this speech, he will announce his vision for a new space program. He plans to send astronauts to the Moon and Mars, and even to establish colonies on those planets.
I. Preparation
A

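The `temperature` and `top_p` entries in `sampling_params` shape how each next token is drawn. The block below is a textbook sketch of temperature scaling followed by nucleus (top-p) filtering over made-up logits. It is meant to build intuition only and is not SGLang's internal sampler. `top_p_filter` and `toy_logits` are hypothetical names, and the sketch assumes `temperature > 0`.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: probability} for the smallest token set whose mass reaches top_p."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax over the scaled logits.
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}
print(top_p_filter(toy_logits, temperature=0.8, top_p=0.95))
print(top_p_filter(toy_logits, temperature=0.8, top_p=0.98))
```

With these toy values, the first call keeps only the single most likely token, while raising `top_p` to 0.98 admits a second candidate. A lower `temperature` concentrates probability on the top token, so fewer candidates survive the cutoff.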
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and trying to learn more about the world around me. That's me in a nutshell. What do you think? Is it too long or too short? Is there anything you'd change?
I think it's a good start! It's concise and gives a sense of who Kaida is. However, it's a bit on the safe side. To make it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
The Eiffel Tower is a famous landmark in Paris, France. It was built for the 1889 World’s Fair and was the tallest man-made structure in the world at the time. The tower stands at 324 meters (1,063 feet) tall and is made of iron. It was designed by Gustave Eiffel and took approximately two years to build. The Eiffel Tower is now one of the most recognizable landmarks in the world and is a popular tourist destination.
The Louvre Museum is a famous museum in Paris, France. It was originally a royal palace built in the 12th century and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Widespread adoption of AI in industries: AI is expected to be adopted in various industries, including finance, transportation, and education. AI-powered systems will be able to automate tasks, improve efficiency, and enhance


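The cell above mentions "overlap removal". One way to picture that idea, assuming streamed chunks may repeat the tail of text that has already been received, is the sketch below. It is not the implementation of `stream_and_merge`. `trim_overlap` and `merge_chunks` are hypothetical helpers, and the chunk list is invented for illustration.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first so the maximal repeat is removed.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text that overlaps the merged result so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris is the", " the capital", " of France."]
print(repr(merge_chunks(chunks)))
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from a deliberate one (for example, "ha" followed by "ha"), so it only makes sense when the stream is known to resend overlapping text.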

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar. I am a software engineer and data analyst by profession. I enjoy learning about different cultures and trying out new foods from around the world. I am currently based in a city called Nova Haven, which is known for its vibrant tech industry and diverse community. I like to spend my free time reading science fiction novels and attending local coding meetups.
Hello, my name is Elianore Quasar. I am a software engineer and data analyst by profession. I enjoy learning about different cultures and trying out new foods from around the world. I am currently based in a city called Nova Haven, which is known for its

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide a concise factual statement about Germany’s capital city. The capital of Germany is Berlin.
Provide a concise factual statem

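When requests arrive independently rather than as one batch, a common asyncio pattern is to issue one call per prompt and cap how many are in flight. The sketch below uses a hypothetical `FakeAsyncEngine` so it runs without a model, and `generate_all` and `max_concurrency` are names introduced here. It also assumes that passing a single prompt string to `async_generate` yields a single result dict, which this document does not show.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics an async_generate call for one prompt."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # Simulate inference latency.
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params, max_concurrency=2):
    """Issue one request per prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(FakeAsyncEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, results):
    print(f"{prompt!r} -> {output['text']!r}")
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the `zip(prompts, results)` pairing stays correct even when requests finish out of order.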
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zachary Woodson and I'm a twenty-eight-year-old freelance writer living in New York City. I enjoy writing about history, science, and culture, and I'm always looking to learn more about the world around me.
Zachary Woodson is a freelance writer living in New York City. He specializes in writing about history, science, and culture. He is twenty-eight years old and is always looking to expand his knowledge of the world.
I am a freelance writer, specializing in history, science, and culture. I reside in New York City and am always looking to expand my knowledge of the world.
Zachary Woodson



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Provide an analysis of the statement about the capital of France. This analysis should not only focus on the basic fact that the capital is Paris, but also include more in-depth information about the city and its significance. Paris is not just the capital of France, but also a symbol of French culture, history, and art. The city is famous for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum, which houses some of the world’s most famous artworks, including the Mona Lisa. Paris is also known for its romantic atmosphere, beautiful parks and gardens, and a vibrant cultural scene that



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be influenced by various factors, including advancements in technology, changes in society, and the emergence of new applications. Based on current trends and the work of experts in the field, here are some possible future trends in artificial intelligence:
1. Increased Autonomy: AI systems will become more autonomous, making decisions without human intervention. This could lead to increased efficiency and productivity, but also raises concerns about accountability and transparency.
2. AI-Powered Creativity: AI will be used to generate creative content, such as art, music, and writing. This could revolutionize the creative industries, but also raises questions about authorship and ownership.



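The streaming loop above prints each chunk as it arrives. If you also need the complete text afterwards, one option is to accumulate chunks while reporting them. The sketch below shows that pattern with a hypothetical `fake_stream` async generator standing in for a real streaming call. `collect_stream` and `on_chunk` are names introduced here, not part of SGLang.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical async generator standing in for a streaming engine call."""
    for piece in [" Paris", ".", "\n", "It", " is", " in", " France", "."]:
        await asyncio.sleep(0)  # Yield control, as a real stream would between chunks.
        yield piece


async def collect_stream(stream, on_chunk=None):
    """Consume an async stream, optionally report each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(
    collect_stream(
        fake_stream("The capital of France is"),
        on_chunk=lambda c: print(c, end="", flush=True),
    )
)
print()  # Finish the line after the streamed output.
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation and leaves `text` available for logging or post-processing.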


In [6]:
llm.shutdown()
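In a script, an exception raised before `llm.shutdown()` would skip the cleanup. A small context manager is one way to guarantee the call. The sketch below is an assumption-laden illustration: `managed_engine` is a hypothetical helper, and `FakeEngine` only mimics the `shutdown()` method so the block runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing a shutdown() method like the engine above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine) as engine:
    print("engine running:", not engine.is_shut_down)
print("engine shut down:", engine.is_shut_down)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, which matches the constructor call used at the start of this document.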