# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
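
To see why this matters: standard `asyncio` refuses to start a second event loop while one is already running, which is the situation inside IPython. The following minimal sketch uses only the standard library to reproduce the error that `nest_asyncio.apply()` is meant to work around. The function names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Run as a plain script without `nest_asyncio`, the nested call fails and `message` holds the error text.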

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Shinji. I am a 16-year-old college student living in Tokyo, Japan. I am now using the internet, and I have noticed that some people are using a method of sleeping with their phone in their pocket while they sleep. I have heard of this method, and I have been wondering what makes it so popular. In addition, I am curious about the effects of this behavior on the body and the mind. Can you provide me with some information about this behavior and the effects it could have on me? Please try to provide accurate and detailed information that a 16-year-old would find useful.
Certainly! I'd
Prompt: The president of the United States is
Generated text:  traveling on a plane. The plane can travel at a speed of 1000 km/h for 1 hour, 1500 km/h for 2 hours, and 2500 km/h for 3 hours. If the plane starts from rest and travels at a constant speed, what is the total distance it will cover? To determine the total distance the plane will cover, we need to calcul
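
For offline batch jobs it is often convenient to persist the results. The sketch below relies only on what the example above shows: `generate` yields one dict per prompt with a `text` key, paired with the prompts via `zip`. The sample data is hard-coded stand-in data, and `to_jsonl` is a hypothetical helper, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines (one record per line)."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data with the same shape as the engine output used above.
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

With a real engine, the same helper could be called with the `prompts` and `outputs` from the cell above, and the returned string written to a file.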

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your profession or role]. I enjoy [insert a brief description of your hobbies or interests]. I'm [insert a brief description of your personality or character]. I'm always looking for new challenges and opportunities to grow and learn. What do you think makes you unique? I'm [insert a brief description of your unique trait or characteristic]. I'm always eager to learn and grow, and I'm always looking for ways

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the seat of the French government and the largest city in the European Union. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is also a major financial center and a major tourist destination. The city is known for its fashion, art, and cuisine, and is home to many famous museums, theaters, and restaurants. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased focus on ethical AI: As more people become aware of the potential risks and biases in AI systems, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment, as well as increased investment in research and development to address ethical concerns.

2. Greater use of AI in healthcare: AI is already being used in a variety of healthcare applications, from personalized medicine to disease diagnosis
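
The phrase "overlap removal" suggests that a streamed chunk may repeat text that was already received. As a rough illustration of the idea, and not necessarily how `stream_and_merge` is implemented, the sketch below drops the longest prefix of each new chunk that duplicates the end of the text accumulated so far. The helper names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
def strip_overlap(text_so_far, chunk):
    """Return chunk without the longest prefix that text_so_far already ends with."""
    for size in range(min(len(text_so_far), len(chunk)), 0, -1):
        if text_so_far.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats part of the previous one.
sample_chunks = ["The capital", " capital of France", " France is Paris."]
result = merge_chunks(sample_chunks)
print(result)
```

A suffix/prefix heuristic like this can also trim a genuine repetition at a chunk boundary, so it trades some exactness for simplicity.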



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [career goal] at [Company]. I'm a [excuse me, but I can think of a few possibilities], and I'm currently [the next step in my career path]. My goal is to [explain what drives me, such as "to improve [company's product/service] or [improve [company's reputation or community]]"]. How can I get started on this journey? Let me know, and I'll get started. You can call me [Name], or [any other name you'd prefer]. [Write a few sentences that highlight the core elements of your character, such

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest and most populous city in France, with a population of over 1 million inhabitants. The city is located on the Seine river and is the cultural, economic, and political center of France. It is known for its beautiful architecture, rich h
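
One reason to reach for an asynchronous API is that several requests can be awaited together. The sketch below illustrates that pattern with `asyncio.gather`. To stay runnable without a GPU or model, it uses `StubEngine`, a hypothetical stand-in whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `text` key out). How a real engine schedules such concurrent calls is not covered here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Submit every batch at once and wait for all of them to finish.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


sample_batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubEngine(), sample_batches, {"temperature": 0.8}))
for batch, outputs in zip(sample_batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.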

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge instead of calling
        # async_generate directly, so overlapping text between chunks is removed.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________. I am a/an ________________________________. I am ___________________________.

This introduction gives the reader a clear sense of the character's identity and purpose without being overly drawn out or filled with too much detail. It's neutral and doesn't come across as a prelude to a conversation or introduction. It's easy to read and understand, and doesn't distract from the character's purpose or abilities.

I'd love to hear more about the character's background and how their skills or abilities relate to their role. What would you like to know about this character? (e.g., "Describe your skills or abilities as a/the character."

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and one of the most populous in Europe. It is known for its classical architecture, rich history, and vibrant culture. Paris is home to iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city also hosts various festivals, fashion shows, and annual events throughout the year. Paris is a cultural center for Europe and is the seat of the French government and one of the most important cities in the world. It is a popular tourist destination for those who want to experience the city's unique culture and history. As of 2021

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  incredibly bright and diverse, with a wide array of technologies that are poised to change the world in profound ways. Here are some possible future trends in AI:

1. Increased integration with everyday devices: As AI technology continues to advance, we can expect to see more integration between AI and everyday devices like smartphones, wearables, and home automation systems. This could lead to more personalized and efficient services, as well as a wider range of applications for AI.

2. Greater reliance on AI for tasks previously done by humans: AI will likely become more involved in many tasks that were previously done by humans, such as diagnosing diseases, analyzing large amounts




```python
llm.shutdown()
```