# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
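To illustrate why this patch is needed, the sketch below uses only the standard library to reproduce the underlying failure: calling `asyncio.run` while a loop is already running (which is the situation inside a notebook cell) raises `RuntimeError`. The function names `inner` and `outer` are purely illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, nested calls of this kind are allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.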

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Helper module that is only imported in the CI environment.
    import patch
else:
    # Notebooks already run an event loop, so allow nested asyncio.run() calls.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.37it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Calvin. I'm a software developer and I love solving puzzles. As I like solving puzzles, I'll try to solve one puzzle every day. Here are a few puzzles for you to solve: Puzzling.com has a fun puzzle that asks you to find the value of the following expression: a + b + c - d + e + f + g + h. The puzzle has a twist that the values of a, b, c, d, e, f, g, and h are all different and can be any number between 1 and 10. Can you help me solve the puzzle? Additionally, can you
Prompt: The president of the United States is
Generated text:  an elected official who serves a 4-year term and must be a candidate of the Democratic Party. Barack Obama won the election in 2008, and he served as the 46th President of the United States from 2009 to 2017. His inauguration took place on January 20, 2011.

What is the earliest year he could have been elected and still have a term of office?
To determine the earliest year Barack Obama could have been elected and sti

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] and I'm always looking for ways to [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the world by population. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major center for art, music, and literature, and is a popular tourist destination. The city is known for its annual festivals and events, including the Eiffel Tower Parade and the World Cup of Lights. Paris is a city of contrasts, with its modern architecture and historical landmarks

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As the technology continues to advance, we can expect to see even more sophisticated AI systems being used in healthcare, such as personalized medicine, drug discovery, and patient monitoring.

2. AI in finance: AI is already being used in finance to improve fraud detection, risk management, and investment decision-making. As the technology continues to evolve, we can expect to see even more sophisticated AI
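The example above relies on `stream_and_merge` to merge streamed chunks "with overlap removal". As a rough illustration of what overlap-aware merging can look like, here is a minimal sketch. It is not the implementation in `sglang.utils`; the names `trim_overlap_sketch` and `merge_chunks` are hypothetical, and it assumes that a new chunk may repeat the tail of the text received so far.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


demo = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(repr(demo))
```

A suffix/prefix match like this is a heuristic: it would also drop a legitimate repetition (for example `"ha"` followed by `"ha"`), so it only makes sense when the stream is known to resend overlapping text.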



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm 30 years old. I've always been fascinated by the concept of data analysis, and I've always been interested in understanding how data can be used to solve real-world problems. I have a strong background in mathematics and have taken courses in data analysis, statistics, and programming. I'm always looking for new ways to apply my skills and have a passion for keeping up with the latest trends in data science. What's your story? I'm currently a [job title] with [company name], where I work on [job description]. How do you feel about [job title]? I'm excited to continue

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known as the City of Love for its romantic architecture and artistic culture. It is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also 
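One reason to prefer the asynchronous API is that the event loop stays free for other work while generation is in flight. The sketch below shows that pattern with a stand-in `FakeEngine` class (hypothetical, not part of SGLang) whose `async_generate` only mimics the call shape used above. The `heartbeat` coroutine and `run_batch` helper are likewise illustrative.

```python
import asyncio
from typing import Dict, List, Tuple


class FakeEngine:
    """Hypothetical stand-in that mimics the call shape of `llm.async_generate`."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # pretend that generation takes time
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(log: List[str]) -> None:
    """Other work that runs while the batch is being generated."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def run_batch(
    engine, prompts: List[str], sampling_params: Dict
) -> Tuple[List[Dict], List[str]]:
    log: List[str] = []
    # Run generation and the heartbeat concurrently on the same loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(
    run_batch(FakeEngine(), ["Hello", "World"], {"temperature": 0.8})
)
for output in outputs:
    print(output["text"])
print(log)
```

With a real engine, `llm` would take the place of `FakeEngine()`. Inside a notebook, the final `asyncio.run` call additionally relies on `nest_asyncio.apply()` as described earlier.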

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [job title] at [company name]. I have a passion for [myself], and I am always looking for opportunities to learn and grow. I enjoy working with [tools or technology] and I am always up-to-date with the latest trends and developments. I am a [yourself] and I believe in [myself's personal philosophy or values]. I am excited to be a part of your team and I am eager to make a difference. What's your name? What do you do at the company? What do you like about working here? What do you like about your own career

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

1. **Paris (2023)**: Home to the seat of the French government, the French Parliament and numerous public buildings, including the Eiffel Tower and Notre-Dame Cathedral.
2. **City**: Located on the Seine River, near the charming 17th-century city center.
3. **Literature**: Known as the "City of Love" and the birthplace of French literature, this ancient city hosts numerous historical monuments and museums.
4. **Cuisine**: Characterized by its rich culinary traditions, Parisian cuisine offers a diverse range of regional specialties

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and likely to continue to evolve in ways that are both transformative and challenging. Here are some potential trends that could shape the AI landscape in the coming years:

1. Increased focus on ethical considerations: AI is already deeply connected to many of our everyday lives, from our personal devices to the healthcare systems we rely on. As we continue to advance AI, there will be a need to address ethical concerns related to data privacy, bias, and transparency. This could lead to new regulations, standards, and guidelines to ensure that AI systems are developed and used in ways that are fair and responsible.

2. More integration with human behavior: As AI

In [6]:
llm.shutdown()
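If code between creating the engine and calling `llm.shutdown()` raises an exception, the shutdown call is skipped. One way to make cleanup unconditional is a small context manager. The sketch below is an assumption-laden illustration: `engine_session` is a hypothetical helper, not an SGLang API, and `DummyEngine` is a stand-in that exposes only the `shutdown` method used above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator, List


class DummyEngine:
    """Hypothetical stand-in exposing only the `shutdown` method used above."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine, yield it, and always call `shutdown()` afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


# Cleanup still runs even though the body raises.
captured: List[DummyEngine] = []
try:
    with engine_session(DummyEngine) as engine:
        captured.append(engine)
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(captured[0].is_shut_down)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call from the first cell.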