# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (nested-loop code), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
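To see why the patch is needed, the sketch below reproduces the underlying failure with the standard library only: calling `asyncio.run()` from code that is already running inside an event loop raises `RuntimeError`. It illustrates general asyncio behavior rather than SGLang internals, and the function names `inner` and `outer` are made up for this example.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # stock asyncio rejects a nested run
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that this kind of nested call is permitted.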

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Notebooks already run an event loop; allow nested asyncio.run() calls.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ben. I come from a different part of the world and I was born and raised in America. Now, I come from Japan and I'm here to learn English. I'm so excited and look forward to being able to speak English and improve my language skills. I hope to travel to Japan someday and meet some Japanese people. I would like to improve my conversation skills and to be able to ask questions in Japanese. I'm looking for a teaching method that can help me with my Japanese language learning and I believe that a native Japanese speaker would be the most suitable as he/she would have a wealth of knowledge and experience. My goal is
Prompt: The president of the United States is
Generated text:  seeking to increase voter participation in elections, and a certain percentage of eligible voters will vote for him. His campaign is spending $200,000,000 on political advertisements, and the expenditure per eligible voter is $250,000. How many eligible voters does he need t
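The `sampling_params` dictionary controls how each next token is drawn. As a rough mental model, the sketch below applies temperature scaling and top-p (nucleus) filtering to a toy distribution in plain Python. The function name `top_p_filter` and the toy logits are invented for illustration; SGLang's actual sampler may differ in its details.

```python
import math


def top_p_filter(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Return renormalized probabilities of the smallest token set whose mass reaches top_p."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Hypothetical logits for the token following "The capital of France is".
toy_logits = {" Paris": 3.0, " Lyon": 2.0, " a": 1.0, " the": -2.0}
nucleus = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(nucleus)
```

With these toy numbers the least likely token falls outside the nucleus and can never be sampled, while the remaining three keep their relative order.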

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry]. I'm always looking for ways to [action or goal]. I'm a [reason for interest in the industry] and I'm always eager to learn and grow. I'm [reason for interest in the industry] and I'm always eager to learn and grow. I'm [reason for interest in the industry] and I'm always eager to learn and grow. I'm [reason for interest in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, art, and cuisine, and is home to many notable French artists and writers. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known for its diverse population, with many French-speaking residents and visitors. Overall, Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration with other technologies: AI is likely to become more integrated with other technologies such as blockchain, IoT, and quantum computing, creating new possibilities for new applications and services.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be increased concerns about privacy and security. There will be efforts to develop more secure and transparent AI systems that can be trusted to operate in a safe and ethical manner.

3. Greater focus on ethical AI: As AI systems become more complex and
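The banner above mentions overlap removal. The idea can be sketched without an engine: if a streamed chunk begins with text that was already emitted, only the new suffix is appended. The helper names and the fake chunk list below are assumptions made for illustration; the real `stream_and_merge` in `sglang.utils` may be implemented differently.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which some chunks repeat the tail of the previous text.
fake_stream = [" Paris", " Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

One trade-off of this heuristic is that it cannot tell a repeated overlap from a model that genuinely emits the same text twice in a row, so such repeats would also be dropped.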



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________. I'm a/an _____________. What's your role in the company?
Sure, here's a short, neutral self-introduction for a fictional character:

"Hello, my name is [insert name here]. I'm a/an [insert job title here]. I've been working at [insert company here] for [insert number of years here]. My role in the company has been [insert job description here]. Thank you for the opportunity to meet you, [insert name here]." 

Feel free to replace the placeholders with your actual character's name, job title, and experience level, if you'd like.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a major international city, known for its history, architecture, and cuisine. It is also a popular tourist destination and a cultural hub. Paris is home to many museums, galleries, and cultural institutions. The c
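One reason to reach for the asynchronous API is that several independent requests can be awaited together from a single event loop. The sketch below shows that pattern with `asyncio.gather`. Here `fake_async_generate` is a stand-in coroutine invented for this example; it only mimics the `{'text': ...}` result shape seen above, so the snippet runs without a model.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    """Stand-in for an engine call; sleeps briefly instead of running a model."""
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt!r}>"}


async def generate_all(prompts: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(generate_all(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Whether issuing separate requests like this is preferable to passing the whole prompt list in one call depends on the workload; no measurement is implied here.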

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Gender] [Your Gender] with the unique ability to communicate with a vast array of languages and dialects. My natural language is English, but I'm fluent in several languages such as Mandarin Chinese, Spanish, and Arabic. I enjoy exploring and learning about different cultures and languages. I'm dedicated to helping others understand the nuances of their language and culture. My ultimate goal is to become a fluent speaker of multiple languages and to create a bridge between different cultures and people. I'm excited to share my knowledge and knowledge with everyone. Hello, my name is [Your Name]. I'm a [Gender



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country, with a population of over 2 million people, making it one of the world's most populous cities. Paris is famous for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It also has a rich history, with Paris being the birthplace of many famous figures such as Marie Antoinette, Napoleon Bonaparte, and Louis XVI. The city is known for its world-class architecture, culture, and food, and is a popular tourist destination for visitors from all over the world. Paris is a symbol of France's rich



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, and many different trends are possible. Here are some potential future trends in AI:

1. Increased integration of AI with human intelligence: As AI becomes more sophisticated, it may be integrated with human intelligence to enhance its capabilities. For example, AI may be used to analyze human decision-making processes, enabling them to make more informed decisions.

2. AI becoming more autonomous: As AI technology advances, it may become more capable of making autonomous decisions and actions, reducing the need for human oversight. This could lead to increased efficiency, as machines can take over tasks that were once done by humans.

3. AI becoming more versatile: As AI
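The consumption pattern used in this cell, `async for` over a stream of text chunks, can be tried without a model. In the sketch below `fake_token_stream` is a hypothetical async generator that yields pre-split pieces; it stands in for `async_stream_and_merge` and does not reproduce its overlap handling.

```python
import asyncio
from typing import AsyncIterator


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Yield space-delimited pieces one at a time, keeping each leading space."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect(text: str) -> str:
    pieces = []
    async for piece in fake_token_stream(text):
        print(piece, end="", flush=True)  # show partial output as it arrives
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect("Paris is the capital of France."))
```

Accumulating the pieces alongside printing them keeps the complete response available once the stream ends.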




In [6]:
llm.shutdown()
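If generation raises before `shutdown()` is reached, the engine would be left running. One way to guard against that is a small context manager, sketched below with `contextlib.contextmanager`. The names `engine_session` and `DummyEngine` are hypothetical and exist only so the example runs without a GPU; with SGLang you would presumably pass a factory such as `lambda: sgl.Engine(model_path=...)` instead.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class DummyEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def engine_session(factory: Callable[[], DummyEngine]) -> Iterator[DummyEngine]:
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the with-body raises.
        engine.shutdown()


try:
    with engine_session(DummyEngine) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```

A plain `try`/`finally` around the generation calls achieves the same effect when a context manager feels like too much ceremony.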