# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython, or in any other code that already runs an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
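
To see why this is needed, the following minimal sketch (standard library only, no SGLang) reproduces the underlying error: `asyncio.run`, which the asynchronous examples later in this document call, refuses to start when an event loop is already running. That is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative and not part of any library.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like a
    # notebook cell is. Starting a second loop with asyncio.run fails.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted, which is why it is applied before the engine is used in a notebook.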

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.15it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Peter. I have a question on the topic of education and I need help with it. Please provide me with a detailed answer. 

Question: What are some effective teaching strategies that can help students improve their academic performance and achieve their goals? 

Thank you for considering my question. I am eager to hear your suggestions on how to improve the academic performance of my students. I appreciate your help and hope to have a productive interaction with you in the near future.

Best regards, [Your Name] Peter
The answer should be based on the following structure:  
- Introduction: Briefly explain the topic of the question.  
- Main body
Prompt: The president of the United States is
Generated text:  a very important person. He or she is in charge of the country, and he or she makes important decisions. This makes the president of the United States very popular with people. But some people believe that the president is too powerful. They sa

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, music, and fashion, and is home to many famous museums and theaters. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. Its status as the world's most populous city is due to its large population and diverse population of immigrants and refugees. The city is also home to many international organizations and institutions, including the French Academy of Sciences and the French Academy of Fine Arts. Paris is a city of contrasts, with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there will be an increased focus on ethical AI. This will include issues such as bias, transparency, accountability, and the potential for AI to be used for malicious purposes.

2. Integration of AI with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This will allow for more complex and sophisticated AI systems to be developed
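
The "overlap removal" mentioned above can be illustrated with a small, self-contained sketch. It assumes that successive streamed chunks may repeat text that has already been emitted, and it strips the longest suffix of the text so far that is also a prefix of the new chunk. The helpers `strip_overlap` and `merge_chunks` are hypothetical names written for this illustration; this is not the SGLang implementation of `stream_and_merge`.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the part of new_chunk that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while removing repeated boundary text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats some earlier text.
chunks = [" Paris", " Paris is", " is the capital"]
merged = merge_chunks(chunks)
print(repr(merged))
```

A heuristic like this cannot distinguish a repeated boundary from text that the model genuinely repeats (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.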



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [short description of your character, such as "writer", "biographer", "photographer", etc.]. I have been writing for [length of time] years and I specialize in [specific subject area, such as [specific genre, like "fiction", "non-fiction", etc.]]. I enjoy [why you enjoy writing, such as "exploring the mysteries of the human mind", "shining a light on the hidden truths", "transforming the way we perceive the world", etc.]. I believe my work is important because it helps to [add reason why it's important

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, Notre Dame Cathedral, and vibrant cultural scene. The French language and the French culture are highly valued and celebrated in the city. Paris also has a rich history dating back to the Roman and H
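
One reason to prefer the asynchronous call is that other coroutines can make progress while a batch is being generated. The sketch below shows this with the standard library only. `StubEngine` is a hypothetical stand-in that mimics the `async_generate(prompts, sampling_params)` call shape used above so the example runs without a model; it is not part of SGLang, and `heartbeat` is likewise illustrative.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend that generation takes time
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that should keep running during generation.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def run_batch(engine, prompts):
    ticks = []
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch(StubEngine(), ["prompt A", "prompt B"]))
for output in outputs:
    print(output["text"])
```

With a real engine, `StubEngine()` would be replaced by the `llm` object created earlier; whether the surrounding application benefits depends on how much other asynchronous work it has to do.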

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a friendly and charismatic individual who is always ready to lend a helping hand. Whether I am helping someone navigate through a complicated situation, offering a suggestion, or simply chatting with a friend, I'm always ready to join in and provide a positive impact. I am always looking for ways to inspire and motivate others, and I'm always eager to learn new things and broaden my horizons. My character is friendly, enthusiastic, and always seeks to connect with others. I am someone who is always ready to help and have a warm, inviting demeanor. I am a proactive individual who values teamwork and working with others to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is: Paris is the capital of France.

Is the following statement true or false: The capital of France is located on the Mediterranean Sea? False. The capital of France is located on the Loire River, which runs through the center of the country.

True or False: The French government and the French people have a strong cultural influence on the country's urban culture. True. The French government and French people have a significant influence on the country's urban culture through policies, traditions, and lifestyle that are deeply rooted in French cultural heritage. This influence can be seen in the French urban landscape, the cuisine, the


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and highly complex. However, here are some potential trends that are likely to shape the field in the coming years:

1. Increasing emphasis on ethical AI: As the AI industry becomes more advanced, there is a growing emphasis on creating AI that is ethical and transparent. This means that developers will need to create AI that is designed to be responsible and accountable for its actions.

2. More personalized and context-aware AI: As more data is collected and analyzed, AI will become more personalized and context-aware. This means that AI will be able to understand the context of a situation and make better decisions than ever before.

3. Greater use of



In [6]:
llm.shutdown()
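
In a script, it can be useful to make sure `shutdown()` runs even if inference raises an exception. The following is a minimal sketch of one way to do that with a context manager. `managed_engine` is a hypothetical helper written for this illustration, and `StubEngine` is a stand-in that only mimics the `generate` and `shutdown` calls used in this document so the example runs without a model; neither is part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure after inference")
except ValueError:
    pass

print("shut down:", engine.is_shut_down)
```

With a real engine, the object passed to `managed_engine` would be the one returned by `sgl.Engine(...)`.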