# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
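
To see why this patch is needed, the following standard-library sketch reproduces the underlying restriction: starting a second event loop from inside a running one raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative only and do not come from SGLang; `outer` plays the role of a notebook cell that is already executing inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as a notebook cell would be.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted, which is why the notebook cells in this document can call `asyncio.run(...)` directly.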

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.37s/it]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sanne Oorlinden and I'm a web developer based in the Netherlands. I've been working in web development for more than a decade, working on front-end and back-end web projects, developing web apps, web services and API's. I have extensive experience with JavaScript, HTML/CSS, and Node.js. My expertise includes web applications, mobile applications, and web services and APIs.
As a developer, I have taken on a wide variety of projects, ranging from simple web apps to complex web services and APIs. I am particularly interested in web development for mobile applications and I have worked on various platforms such as iOS and Android
Prompt: The president of the United States is
Generated text:  5 feet 3 inches tall, which is taller than the average height of adult males. The average height of adult males in the United States is 5 feet 6 inches. How much taller is the president compared to the average height of adult males? To determine how much talle

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. Paris is known for its rich history, art, and cuisine. It is also a major financial center and a major tourist destination. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular destination for international tourists and has a rich cultural and artistic heritage. The city is also known for its fashion industry, which is one of the largest in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending together to create

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration of AI into everyday life: AI is already becoming more integrated into our daily lives, from smart home devices to self-driving cars. As AI continues to improve, we can expect to see even more integration into our daily routines, such as voice assistants, virtual assistants, and other AI-powered tools.

2. AI will become more autonomous: As AI technology continues to improve, we can expect to see more autonomous vehicles on the road
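
The cell above describes its streaming as using "overlap removal". The sketch below illustrates what that can mean, under the assumption that a streamed chunk may repeat text that has already been received. The names `trim_overlap_sketch`, `merge_stream_sketch`, and `fake_stream` are hypothetical and written for this illustration; this is not presented as the implementation of `stream_and_merge`, and the fake stream stands in for a real engine.

```python
def trim_overlap_sketch(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream_sketch(chunks):
    """Concatenate chunk texts, skipping any part that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk["text"])
    return merged


# Hypothetical stream: each chunk repeats the tail of the previous one.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris is"},
    {"text": " is the capital"},
]
merged_text = merge_stream_sketch(fake_stream)
print(repr(merged_text))
```

One trade-off of this suffix-matching approach is that it cannot tell a repeated chunk from text the model legitimately repeats, so it only makes sense when chunks are known to overlap.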



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a/an [occupation] [Name]. I'm [age], [gender]. I have been writing and editing my novel for [number of years], and I currently work as a/an [job title]. I'm [job role], and I really enjoy [why I love writing]. I'm [any personal qualities or interests that make me special]. I hope you enjoy [any stories, projects, or hobbies] that I may have. Welcome to my world of creative writing! [Addressing the character, express any emotion or perspective that you want to convey]. In any case, thanks for taking the time

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a historical and cultural center that is home to many world-renowned landmarks, including Notre Dame Cathedral, the Eiffel Tower, and the Louvre Museum. Paris is known for its unique architecture, 
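
A practical reason to prefer the asynchronous call is that other coroutines can make progress while the batch is being awaited. The sketch below shows this with the standard library only. `fake_async_generate` is a hypothetical stand-in that mimics the return shape used above (a list of dicts with a `"text"` key), and `heartbeat` represents unrelated work; neither is part of SGLang.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for an awaitable batch generation call."""
    await asyncio.sleep(0.01)  # pretend the engine is busy
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    """Unrelated work that should keep running while generation is awaited."""
    for _ in range(3):
        await asyncio.sleep(0.001)
        ticks.append("tick")


async def main():
    ticks = []
    # gather() runs both coroutines concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print(len(outputs), len(ticks))
```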

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I am a software engineer with a focus on automation and data analysis. I have been working in the tech industry for over 10 years and have experience in a variety of software development tools, including Python, JavaScript, and Ruby. I have a keen interest in emerging technologies such as blockchain and artificial intelligence and strive to stay up-to-date with the latest trends in the industry. In my free time, I enjoy playing video games, reading books, and spending time with my family and pets. Emily is a confident and organized individual who is always looking for new challenges and opportunities to grow as a professional. I am always eager to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, the city of light and shadows, is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Notre Dame Des Pins Basilica, making it a must-visit destination for visitors to explore Paris. It is also a center of art, culture, and politics, hosting the Paris Opera, the Notre Dame Cathedral, and the French National Opera, among other important cultural institutions. Paris is also renowned for its vibrant nightlife and shopping, and is a popular tourist destination for its world-famous museums, museums, and art galleries, as well as its numerous outdoor

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities and potential challenges. Some potential trends in AI include:

1. AI will continue to advance and become more accurate and efficient at various tasks.

2. AI will become more integrated with everyday technology, such as voice assistants and smart home devices.

3. AI will become more personalized and context-aware, allowing for more sophisticated natural language processing and computer vision.

4. AI will be used to improve healthcare outcomes, such as through more accurate diagnosis and treatment of diseases.

5. AI will be used to reduce the need for human labor and increase efficiency in industries such as manufacturing and transportation.

6. AI will be used to create
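
The asynchronous streaming loop consumes an async generator that yields only the new part of each chunk. The sketch below shows that pattern with the standard library, assuming chunks may repeat already-received text. `fake_async_stream` and `async_merge_sketch` are hypothetical names written for this illustration and are not presented as the implementation of `async_stream_and_merge`.

```python
import asyncio


async def fake_async_stream():
    """Hypothetical stand-in for an async streaming call with overlapping chunks."""
    for text in [" full", " full of", " of exciting"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def async_merge_sketch(stream):
    """Yield only the part of each chunk that has not been seen yet."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        overlap = 0
        # Find the longest prefix of this chunk that already ends the seen text.
        for size in range(min(len(seen), len(text)), 0, -1):
            if seen.endswith(text[:size]):
                overlap = size
                break
        new_part = text[overlap:]
        seen += new_part
        yield new_part


async def collect():
    return [piece async for piece in async_merge_sketch(fake_async_stream())]


pieces = asyncio.run(collect())
print(repr("".join(pieces)))
```

Because each yielded piece is printed with `end=""` in the cell above, the pieces appear as one continuous line of text on the console.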




```python
llm.shutdown()
```
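
In a script, as opposed to a notebook, it can help to make sure the shutdown call runs even when generation fails. The sketch below shows a `try`/`finally` pattern for that. `StubEngine` and `run_batch` are hypothetical and exist only so the example runs without a model; the stub exposes just the `generate` and `shutdown` methods used in this document.

```python
class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_batch(engine, prompts):
    try:
        return engine.generate(prompts, {"temperature": 0.8})
    finally:
        engine.shutdown()  # runs on success and on error alike


engine = StubEngine()
error_message = None
try:
    run_batch(engine, [])  # empty batch triggers the error path
except ValueError as exc:
    error_message = str(exc)
print(engine.is_shut_down, error_message)
```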