# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other environment where an event loop is already running, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
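
The reason for this patch can be seen without SGLang at all. The following is a minimal sketch using only the standard library: it calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is the situation a notebook cell is in. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like code
    # executed in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by the standard library
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio.apply()`, the nested call raises a `RuntimeError`; applying the patch is what allows the `asyncio.run(main())` cells later in this document to work inside a notebook.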

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow asyncio.run() inside an already running event loop (e.g. Jupyter)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Elizabeth and I'm a professional model. I'm 31 years old and I'm a model. Please make a resume for me. Sure! Here's a basic template for a resume you can adapt to your specific needs:

**[Your Full Name]**  
**[Your Address]**  
**[City, State, ZIP Code]**  
**[Email Address]**  
**[Phone Number]**  

**Objective:**
A clear, concise statement of why you would be a good candidate for this position.

**[Your Job Title]**  
[Your Company]  
[Company Address]  
[City, State, ZIP Code]  

Prompt: The president of the United States is
Generated text:  a popular figure in American life. He is the leader of the country, the one who is responsible for making decisions, and the one who answers for all the other elected officials. President Obama has been in office since January 2009. He was inaugurated by the national anthem playing as he arrived at the White House. He serves four-year terms and is the highest elected official in the country. How does h
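
In offline batch jobs the results are usually stored rather than printed. The sketch below shows one way to pair prompts with completions and serialize them as JSON Lines. It does not call the engine: `outputs` is a hypothetical list shaped like the dictionaries printed above, and only the `"text"` key is assumed. The record field names `prompt` and `completion` are illustrative choices.

```python
import json

prompts = ["Hello, my name is", "The capital of France is"]

# Hypothetical stand-in for the list returned by llm.generate(...);
# only the "text" key seen in the cell above is assumed.
outputs = [{"text": " Ada."}, {"text": " Paris."}]

# A batch call is expected to return one output per prompt.
assert len(prompts) == len(outputs)

records = [
    {"prompt": prompt, "completion": output["text"].strip()}
    for prompt, output in zip(prompts, outputs)
]

# One JSON object per line, convenient for later processing.
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
print(jsonl)
```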

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill] who has always been [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and I'm [Positive Trait]. I'm [Positive Trait] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in the country and the second-largest city in Europe. It is also the seat of the French government and the country's cultural and political center. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also a popular tourist destination, with millions of visitors each year. The city is known for its fashion industry, art scene, and food culture. It is a major economic and financial center, with many multinational companies headquartered there. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks and biases in AI systems, there is likely to be a greater emphasis on ethical AI. This could lead to more stringent regulations and standards for AI development and deployment.

2. Integration of AI with other technologies: AI is becoming increasingly integrated with other technologies, such as machine learning, natural language processing, and computer vision. This integration could lead to new applications and opportunities for AI, such as more personalized healthcare, more
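
The "overlap removal" mentioned in the cell above can be illustrated independently of SGLang. The following is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat the end of the text received so far. The helper names `trim_overlap` and `merge_chunks` are illustrative, and this is not necessarily how `stream_and_merge` is implemented in `sglang.utils`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks with overlapping boundaries.
chunks = ["The capital", " capital of France", " of France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix-prefix match like this cannot distinguish a resent fragment from text that genuinely repeats at a chunk boundary, so it is a heuristic and not a lossless protocol.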



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Role] with [Role Title] at [Company Name]. I have been in the [Field/Industry] for [Number of Years] years, and I have a passion for [Your Passion/Interest/Challenge]. I believe in the [Cultures, Beliefs, and Traditions] that make [Industry/Field] unique, and I enjoy learning from and being a part of those communities. I have a unique style, and I embrace the challenges that come with it. I'm constantly pushing myself to grow and improve, and I'm here to inspire others to do the same.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city was founded in the 12th century by Notre Dame de Paris, a Benedictine monastery. The city is famous for its beautiful architecture, including the iconic Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is a cultural and economic hu
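
An asynchronous interface is mostly useful when requests arrive independently, as they would in a custom server. The sketch below shows the general `asyncio.gather` pattern for awaiting several requests concurrently while keeping results in input order. It does not call SGLang: `fake_async_generate` is a hypothetical stand-in for an awaitable generation call, and the delay only simulates work.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulate time spent generating
    return {"text": f"<completion for {prompt!r}>"}


async def main():
    prompts = ["first prompt", "second prompt", "third prompt"]
    pending = [fake_async_generate(prompt, delay=0.05) for prompt in prompts]
    # gather runs the awaitables concurrently and returns results in input order.
    outputs = await asyncio.gather(*pending)
    return list(zip(prompts, outputs))


results = asyncio.run(main())
for prompt, output in results:
    print(prompt, "->", output["text"])
```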

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge wraps async_generate and yields each chunk with overlapping text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Your occupation or hobby]. I enjoy [Your hobby or activity], and I like to [Your hobby or activity]. What do you do for a living?

I look forward to meeting you! How can I help you today?

Thank you for your time. What is the best way to introduce yourself? Maybe something like, "My name is [Your Name], and I am a [Your occupation or hobby]. I enjoy [Your hobby or activity], and I like to [Your hobby or activity]. What do you do for a living? I look forward to meeting you! " or "My name



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located on the Seine River and is the largest city in Europe by population, with a population of over 2.5 million people. Paris is known for its beautiful architecture, rich history, and vibrant culture. It is also one of the most visited cities in the world. The city is home to many famous museums, including the Louvre and the Musée d'Orsay, as well as numerous attractions such as the Eiffel Tower, Notre Dame Cathedral, and the Palace of Versailles. Paris has a rich history dating back to the Roman Empire and has been a hub for European trade and culture


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting and rapidly developing possibilities. Here are some possible trends in AI that could occur in the coming years:

1. Increased focus on ethical AI: As we move towards more advanced AI, we are likely to see an increase in ethical concerns around AI. This could include issues around bias, transparency, fairness, and accountability. As AI becomes more integrated into everyday life, there will be a greater focus on ensuring that AI systems are fair, transparent, and accountable.

2. Faster development of AI: As we have seen with advancements in machine learning, the development of AI is becoming more rapid and efficient. This could mean that we can
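
The "no repeats" behaviour of the streaming loop can be sketched with a plain async generator. The example below assumes, purely for illustration, that each streamed chunk carries the full text generated so far; `fake_cumulative_stream` is a hypothetical stand-in for the engine, and `stream_deltas` is an illustrative helper, not the `async_stream_and_merge` function from `sglang.utils`.

```python
import asyncio


async def fake_cumulative_stream():
    """Hypothetical stand-in for a streaming call: yields the text so far."""
    for text in ["Paris", "Paris is", "Paris is the capital."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_deltas(stream):
    """Yield only the newly generated part of each chunk."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(seen):
            # Cumulative chunk: emit just the new suffix.
            delta = text[len(seen):]
            seen = text
        else:
            # Incremental chunk: emit it unchanged.
            delta = text
            seen += text
        yield delta


async def collect():
    return [delta async for delta in stream_deltas(fake_cumulative_stream())]


deltas = asyncio.run(collect())
print(deltas)
```

Printing each delta with `end=""` and `flush=True`, as in the cell above, then produces continuous text as it is generated.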




In [6]:
llm.shutdown()
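
In a script, as opposed to a notebook, it can be helpful to guarantee that `shutdown()` runs even when generation fails. The following is a minimal sketch of the `try`/`finally` pattern. `StandInEngine` is a hypothetical class that only mimics the `generate` and `shutdown` method names used in this document, and it deliberately raises an error to show that cleanup still runs.

```python
class StandInEngine:
    """Hypothetical stand-in exposing the two method names used in this document."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        # Simulate a failure in the middle of a batch job.
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = StandInEngine()
error_message = None
try:
    engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    error_message = str(exc)
finally:
    # Runs whether or not generation succeeded.
    engine.shutdown()

print(engine.is_shut_down, error_message)
```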