# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
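To make the custom-server idea a little more concrete, here is a minimal sketch of only the request-handling core that such a server might wrap in a web framework. It assumes the engine exposes `generate(prompts, sampling_params)` and returns a list of dicts with a `"text"` key, as in the cells later in this document. The function `handle_generate`, the JSON field names, and `StubEngine` are hypothetical and exist only for illustration; the linked `custom_server.py` remains the reference example.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the real engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body.

    The field names "prompts", "sampling_params", and "texts" are assumptions
    made for this sketch, not part of any SGLang protocol.
    """
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"texts": [out["text"] for out in outputs]}
    return json.dumps(response).encode("utf-8")


if __name__ == "__main__":
    payload = {"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}
    print(handle_generate(StubEngine(), json.dumps(payload).encode()).decode())
```

Keeping the handler a plain function of `(engine, bytes)` means it can be unit-tested with a stub and later mounted behind whichever HTTP layer you prefer.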



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (nested-loop code), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
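If the same script may run both as a plain program and inside a notebook, one possible approach is to patch the loop only when one is already running. The sketch below uses `asyncio.get_running_loop()` from the standard library for the check; the helper names `loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from code that is already inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Apply nest_asyncio only in nested-loop environments such as IPython."""
    if not loop_is_running():
        return False
    # Imported lazily so plain scripts do not need the package installed.
    import nest_asyncio

    nest_asyncio.apply()
    return True
```

In a plain script the helper returns `False` and leaves the loop untouched, so `asyncio.run(...)` behaves as usual.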

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.68it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kolya. I want to learn how to code. I am looking for a good course to start learning Python. Any suggestions?

1. Beginner
2. Intermediate
3. Advanced
4. Any level
4. Intermediate

To start with, what are some good Python courses available?

The course I want to take is:

https://www.udemy.com/course/python-for-beginners/

Can you suggest some other Python courses similar to this one? Sure! Python is a versatile language with a wide range of applications and a vast number of resources. Here are some excellent Python courses available that cater to various levels:

### Beginner

Prompt: The president of the United States is
Generated text:  a very important person. The president is a member of Congress. Congress is a group of members of the United States. So the president and the Congress are related. 

a) What does the term "member of" refer to?

b) How many people are in Congress?

c) How many people are members of Congress?

d) Why is the pr
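For larger prompt sets it can be convenient to submit prompts in chunks and keep each prompt next to its output, for example to checkpoint progress between chunks. The following is a minimal sketch under that assumption. It relies only on the `generate(prompts, sampling_params)` call and the `output["text"]` field shown above; `StubEngine`, `chunked`, `generate_records`, and the `chunk_size` parameter are hypothetical names introduced for illustration.

```python
from typing import Iterator, List, Sequence


class StubEngine:
    """Hypothetical stand-in for the engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" [{len(p)} chars in prompt]"} for p in prompts]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_records(engine, prompts, sampling_params, chunk_size=2) -> List[dict]:
    """Run batch generation chunk by chunk and pair prompts with outputs."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(list(chunk), sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


if __name__ == "__main__":
    for record in generate_records(StubEngine(), ["a", "bb", "ccc"], {"temperature": 0.8}):
        print(record)
```

Whether chunking helps depends on the workload: the engine already schedules a batch internally, so this is mainly useful for bookkeeping rather than throughput.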

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As an AI language model, I don't have a physical presence, but I can certainly provide you with some information about myself. I'm a large language model created by Alibaba Cloud, and I'm designed to assist users in generating human-like text. I'm programmed to understand and respond to a wide range of prompts, and I can provide information on a variety of topics, including technology, science, history, and more. Additionally, I can help with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city of light and art. It is a bustling metropolis with a rich history and a vibrant culture. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. Paris is also home to many world-renowned museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. The city is also known for its delicious cuisine, including French cuisine, and its fashion scene. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. It is a city that is both a cultural and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to more personalized and context-aware AI systems that can better understand and respond to the needs of individuals.

2. Enhanced machine learning capabilities: AI is likely to continue to improve its ability to learn from data and make more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as more accurate predictions of future events.

3. Increased focus on ethical and social implications: As AI becomes
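The "overlap removal" mentioned in the cell above can be pictured as follows: when a newly streamed chunk repeats the tail of the text collected so far, only the unseen part is appended. This is a minimal sketch of that idea and is not claimed to be how `stream_and_merge` is implemented. It assumes each streamed item is a dict with a `"text"` key; `trim_overlap` and `merge_chunks` are hypothetical helper names.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[dict]) -> str:
    """Accumulate streamed chunks into one string without repeated overlaps."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


if __name__ == "__main__":
    stream = [{"text": "Hello"}, {"text": "lo wor"}, {"text": "world"}]
    print(merge_chunks(stream))
```

The search runs from the longest possible overlap down to one character, so the largest repeated span is always the one removed.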



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I’m a 35-year-old IT specialist with a master's degree in software engineering. I have been working in IT for over 10 years and have a passion for learning and constantly improving my skills. My hobbies include reading, hiking, and trying new foods. I am always looking for ways to stay up-to-date with the latest technology and trends in the industry. I am confident and approachable, and I enjoy working with people of all backgrounds and experiences. How can you describe your career and interests outside of work? Can you give me an example of a project you worked on that involved using a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.ToolBox is a game where players work as chefs who must select ingredients and then mix them into a dish. The game is designed to develop their spatial awar
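When several independent prompt groups need to be processed, one option is to issue multiple `async_generate` calls and cap how many are in flight at once. The sketch below illustrates this with `asyncio.Semaphore` and `asyncio.gather`. It assumes that concurrent `async_generate` calls on one engine are acceptable for your setup; `StubAsyncEngine`, `generate_groups`, and `max_in_flight` are hypothetical names used only here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing the async_generate call shown above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real call would while waiting
        return [{"text": p.upper()} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params, max_in_flight=2):
    """Run one async_generate call per group, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_group(group):
        async with semaphore:
            return await engine.async_generate(group, sampling_params)

    # gather() returns results in the same order as the input groups.
    return await asyncio.gather(*(run_group(g) for g in prompt_groups))


if __name__ == "__main__":
    groups = [["a"], ["b", "c"], ["d"]]
    results = asyncio.run(generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
    for group, outputs in zip(groups, results):
        print(group, [o["text"] for o in outputs])
```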

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am a [age]. I've always been an avid reader and I enjoy exploring new genres. I've been writing short stories for about 5 years now, and I've found that my writing has the ability to bring characters to life in a unique way. My writing style is straightforward and I like to use dialogue to bring the characters to life. I have a passion for storytelling and I believe in the power of words to convey emotion and ideas. I'm excited to get started with you and see where this journey of writing will take us. What would you like to learn more about?

This self-introduction



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Explain why you believe this statement is factually correct. I believe this statement is factually correct because Paris, the capital of France, is the largest and most important city in the country, and it is known for its rich history, culture, and architecture. The city is also home to many important institutions, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Additionally, Paris is known for its beautiful gardens, such as the Parc des Fantômes, and its annual fashion and music festivals. All of these factors make Paris one of the most important cities in Europe, and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve rapidly, with many exciting developments and trends shaping the way it is used. Some potential future trends in AI include:

1. Advancements in machine learning and neural networks: AI is already becoming increasingly powerful and accurate, and new breakthroughs are expected to further increase this success. These will include faster and more powerful GPUs, more sophisticated algorithms, and a growing variety of data types.

2. More autonomous and self-driving vehicles: As AI is integrated into more and more vehicles, we may see more autonomous and self-driving vehicles on the roads. This will require the development of more advanced AI systems and techniques, such as
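The cell above streams one prompt at a time. If printing chunks as they arrive is not required, the streams could instead be consumed concurrently and each buffered into a final string. The following is a minimal sketch of that pattern using plain `asyncio`. Here `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge`, and `collect` and `collect_all` are hypothetical helper names.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical async generator that yields text chunks for one prompt."""
    for piece in (" echo", ":", f" {prompt}"):
        await asyncio.sleep(0)  # let other streams make progress
        yield piece


async def collect(stream) -> str:
    """Consume one async stream and join its chunks."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


async def collect_all(prompts):
    """Consume one stream per prompt concurrently; results keep prompt order."""
    return await asyncio.gather(*(collect(fake_stream(p)) for p in prompts))


if __name__ == "__main__":
    for text in asyncio.run(collect_all(["first", "second"])):
        print(repr(text))
```

Buffering trades incremental display for simpler output handling, since interleaved prints from several streams would be hard to read.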




In [6]:
llm.shutdown()
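In a script, as opposed to a notebook, an exception between engine creation and `llm.shutdown()` would skip the shutdown call. One way to guard against that is a small context manager, sketched below. It relies only on the `shutdown()` method shown above; `engine_session` and `StubEngine` are hypothetical names, and the stub merely records whether it was shut down.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    engine = StubEngine()
    with engine_session(engine) as llm:
        print(llm.generate(["Hello, my name is"], {"temperature": 0.8}))
    print("shut down:", engine.is_shut_down)
```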