# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
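
By design, `asyncio` does not allow an event loop to be started while another one is already running in the same thread, which is the situation inside IPython and Jupyter. As a minimal sketch using only the standard library, the helper below reports whether a loop is currently running. The names `has_running_loop` and `check_inside_loop` are hypothetical and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True if an asyncio event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # Called from a coroutine, so a loop is necessarily running here.
    return has_running_loop()


if __name__ == "__main__":
    print("running loop at top level:", has_running_loop())
    print("running loop inside a coroutine:", asyncio.run(check_inside_loop()))
```

In a plain script the top-level check finds no loop, so the patch is not needed there; in a notebook cell the same check would typically find one.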

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Sarah. I'm a high school student in Hangzhou, Zhejiang Province, China. My interest in computer science is growing day by day. My mother is a computer science teacher at a high school. She loves teaching us about computer science. We have fun learning new things every day. My family members all agree that computers are very important and that I should learn about computers. I hope to learn more about them. My parents always take me to visit the technology department at school to see how computers work and what they do. The more I see, the more I enjoy learning about them. Besides learning about computers, I also enjoy
Prompt: The president of the United States is
Generated text:  seeking to determine if a certain state has illegally shut down a bridge in order to make it more visible and thereby making it more attractive to construction companies. The president is not only interested in the legality of the action itself, but also in the potent
```

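For large prompt sets it can be convenient to submit prompts in fixed-size chunks and collect the results in order, for example to write partial results as you go. This is a client-side convenience, not something the engine requires. The sketch below assumes only what the example above shows: a `generate(prompts, sampling_params)` callable that returns one dict with a `"text"` key per prompt. The names `generate_in_chunks`, `chunk_size`, and `fake_generate` are hypothetical, and `fake_generate` is a stand-in so the snippet runs without a model.

```python
from typing import Callable, Dict, Iterator, List


def generate_in_chunks(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int = 2,
) -> Iterator[Dict]:
    """Yield {"prompt", "text"} records, submitting chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # Stand-in for llm.generate (an assumption for illustration only).
    return [{"text": f" <completion for: {p}>"} for p in prompts]


records = list(
    generate_in_chunks(
        fake_generate,
        ["Hello, my name is", "The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for record in records:
    print(record["prompt"], "->", record["text"])
```

With a real engine you would pass `llm.generate` in place of `fake_generate`.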
### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is home to many notable French artists, writers, and musicians, and is known for its rich cultural heritage and historical significance. Paris is a vibrant and dynamic city with a rich history and a strong sense of community. The city is also known for its diverse cuisine and its role in the French culinary scene. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the impact of AI on society.

2. Development of more advanced AI: As AI technology continues to advance, there will be a need for more advanced AI that can perform tasks that are currently beyond the capabilities of current AI systems.
```

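"Overlap removal" here refers to avoiding duplicated text when a newly streamed chunk repeats the tail of what has already been received. The following is a minimal sketch of that idea, not necessarily how `stream_and_merge` in `sglang.utils` is implemented. The names `trim_overlap` and `merge_chunks` are hypothetical, and the chunks are hard-coded so the snippet runs without a model.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hard-coded chunks standing in for a stream; the second one repeats " Paris".
chunks = [" Paris", " Paris is", " the capital", " of France."]
print(merge_chunks(chunks))
```

A suffix/prefix heuristic like this can also remove text that legitimately repeats at a chunk boundary, so it is only appropriate when chunks are known to overlap.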

### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Occupation] with over [Number] years of experience in the [Industry/Field]. [Name] has a [Role/Status] within our team. I am excited to be here and I look forward to sharing my knowledge and experiences with everyone.
Remember to keep the tone of the introduction friendly and informative. I want to make my team feel welcome and comfortable when they first meet me. Let me know if you have any questions or would like me to elaborate on any part of my introduction. [Name] [Phone Number] [Email Address] [LinkedIn Profile] [Social Media Handles]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River, which flows through the heart of the city. It is the largest city in France and is also the oldest city in Europe. The city has a rich history, with its origins dating back
```

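In a custom application, asynchronous generation is mainly useful when several independent requests should be in flight at once. The sketch below shows that pattern with `asyncio.gather` and a semaphore that caps concurrency. It does not call SGLang: `fake_async_generate` is a hypothetical stand-in with the same call shape as the `await llm.async_generate(prompts, sampling_params)` call above, and `generate_concurrently` and `max_concurrency` are illustrative names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

AsyncGenerate = Callable[[List[str], Dict], Awaitable[List[Dict]]]


async def generate_concurrently(
    generate: AsyncGenerate,
    batches: List[List[str]],
    sampling_params: Dict,
    max_concurrency: int = 2,
) -> List[List[Dict]]:
    """Run one generate() call per batch, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch: List[str]) -> List[Dict]:
        async with semaphore:
            return await generate(batch, sampling_params)

    # gather() returns results in the same order as the input batches.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


async def fake_async_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # Stand-in for llm.async_generate (an assumption for illustration only).
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_concurrently(fake_async_generate, batches, {"temperature": 0.8})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

Inside a notebook, the `asyncio.run` call relies on the `nest_asyncio` patch described earlier.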
### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah. I am a recent graduate from the University of California, Berkeley. I have a passion for writing and I have always been fascinated by the mysteries and intrigue of the world around me. I enjoy exploring new places, meeting new people, and immersing myself in stories. I am a strong learner and have a love for learning new things. I am excited to meet you and discuss the world around me.

In a nutshell, my personality is quiet and introspective. I am deeply curious, passionate about learning and exploring the world, and I am always eager to learn more. I have a knack for storytelling, and I enjoy

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the south-central region of France. It is the largest and most populous city in France. Paris is a UNESCO World Heritage site and home to numerous museums, art galleries, and landmarks. The city is known for its rich culture, fashion, and gastronomy, including the famous Eiffel Tower. Paris is also a major transportation hub and a center for business, education, and entertainment. It is home to the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Champs-Elysées. Paris is known for its iconic architecture, including the Arc de Triomphe, Eiffel

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, and it's likely to continue to evolve at a rapid pace. Here are some possible trends that could shape the AI landscape in the coming years:

1. Increasing reliance on machine learning and deep learning: As AI technology becomes more capable of solving complex problems, we're likely to see a greater emphasis on machine learning and deep learning. This could lead to even more advanced algorithms that can learn from vast amounts of data to make predictions and decisions.

2. Improved precision and accuracy: AI is increasingly being used to automate tasks and improve the accuracy of predictions, but there's still room for improvement. We may see further developments in
```

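When streaming, an application often wants to display chunks immediately and still keep the complete text afterwards. The sketch below shows that pattern with a plain `async for` loop. It does not call SGLang: `fake_stream` is a hypothetical stand-in for a chunk stream such as the one `async_stream_and_merge` yields above, and `collect_stream` is an illustrative name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    # Stand-in for an async chunk stream (an assumption for illustration only).
    # Every word after the first is emitted with its leading space.
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)
        yield word if index == 0 else " " + word


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the full text once the stream ends."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # Finish the line after the last chunk
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("The capital of France is Paris.")))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation for long outputs.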

```python
llm.shutdown()
```