# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
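
To make the custom-server idea concrete, here is a minimal sketch using only the Python standard library. It is not the linked `custom_server.py` example, which may use a different framework. `EchoEngine`, `handle_generate`, `make_handler`, and `serve` are hypothetical names, and the sketch assumes that calling `generate` with a single prompt returns a dict with a `"text"` key.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a single dict with a "text" key.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'prompt' field"}
    output = engine.generate(prompt, payload.get("sampling_params", {}))
    return 200, {"text": output["text"]}


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_generate(engine, self.rfile.read(length))
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve POST requests until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Keeping the request logic in `handle_generate` separate from the HTTP plumbing makes it easy to test without opening a socket. To try it with a real engine, `serve(llm)` would be called in place of `serve(EchoEngine())`.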



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
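
If the same script has to run both inside and outside a notebook, one option is to apply the patch only when an event loop is already running. The following is a sketch under that assumption; `loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical helper names, not part of SGLang.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running."""
    if not loop_is_running():
        return False
    import nest_asyncio  # third-party; imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain `python script.py` run no loop is active, so the helper does nothing. In IPython or Jupyter a loop is typically already running and the patch is applied.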

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.86it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Baozuo. In the year 2032, my name was "Baozuo". In the year 1996, my name was "Qingguo". In the year 2000, my name was "Dongyi". If I want to get my name back, which year did I have to be born in? 1. **Identify the pattern in the names:**
   - 2032: "Baozuo"
   - 1996: "Qingguo"
   - 2000: "D
Prompt: The president of the United States is
Generated text:  30 years older than the president of Brazil, and the president of Brazil is twice as old as the president of Russia. The president of the United States is 50 years old. What is the average age of the three presidents? Let's denote the age of the president of Brazil as \( B \), the president of Russia as \( R \), and the president of the United States as \( U \). According to the information given:

1. The president of the United States is 50 years old.
2. The president of Brazil is twice as old as the president of Russia, so \( U = 2
Prompt: The capital of France is
Generated text:  Paris. It 
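
For offline batch jobs it is often convenient to persist each prompt together with its completion. The sketch below assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. `to_jsonl` is a hypothetical helper, and the example data is a stand-in for real engine output.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a running engine, `to_jsonl(prompts, outputs)` could be called directly on the variables from the cell above and the result written to a file.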

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the third-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is also home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is a popular tourist destination and a major economic center in France. It is also known for its fashion industry and its role in the French Revolution. The city is home to many important institutions such as the French Academy of Sciences and the French National Library. Paris is a city of contrasts, with its modern architecture and historical landmarks blending together to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This will likely lead to increased efficiency and productivity, but it will also create new jobs and challenges for workers.

2. AI ethics and privacy: As AI technology becomes more advanced, we will need to address the ethical implications of its use. This will likely involve developing new ethical guidelines and standards for AI, as well as
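
The "overlap removal" mentioned in the cell above can be illustrated as follows. This is a sketch of the general idea, not necessarily how `stream_and_merge` is implemented: when a streamed chunk repeats text that has already been received, the repeated prefix is dropped before appending. `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(accumulated: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `accumulated`."""
    for size in range(min(len(accumulated), len(chunk)), 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Each chunk partly repeats the text that came before it.
print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

One caveat of this approach is that it cannot tell an accidental overlap from a genuine repetition: if the model really emits the same characters twice in a row across a chunk boundary, the second copy is trimmed as well.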



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a software engineer who has been working in the field of technology for [number] years. I am always up-to-date on the latest technological developments and use my knowledge of programming to create innovative software solutions. I am passionate about the intersection of technology and creativity and love to stay up-to-date on the latest trends and methodologies in software development. I am a team player, comfortable with working in a fast-paced environment and enjoy collaborating with other developers and engineers to achieve our goals. Thank you for asking about me.
As an artificial intelligence, I don't have personal experiences or memories like humans do. However, I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as Louvain. It is the largest city in France by population 
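
When several independent batches have to be processed from the same program, the awaitable interface lets them run concurrently. The sketch below bounds the number of in-flight calls with a semaphore. `FakeAsyncEngine` is a hypothetical stand-in that mimics only the `async_generate(prompts, sampling_params)` call shape used above, and `generate_groups` is an illustrative helper, not an SGLang API.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing async_generate like the cell above."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params, max_concurrency=2):
    """Run several batches concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompts):
        async with semaphore:
            return await engine.async_generate(prompts, sampling_params)

    # gather returns results in the same order as prompt_groups
    return await asyncio.gather(*(run_one(group) for group in prompt_groups))


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(prompt, "->", output["text"])
```

The engine already schedules the requests inside a batch, so a bound like this is mainly useful for limiting how much work the calling program submits at once.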

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: 

 [

Name

]

 and

 I

'm

 a

 [

job

 title

]

 at

 [

company

 name

].

 I

'm

 excited

 to

 meet

 you

 and

 help

 you

 with

 any

 questions

 you

 may

 have

.

 Looking

 forward

 to

 the

 conversation

!

 [

Name

]

 [

Company

 Name

]

 CEO

 [

Name

]

 CEO

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name

]

 President

 [

Name



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the "City of Light." It is located on the Seine River and is home to the European Parliament and the headquarters of numerous multinational corporations. The city is known for its picturesque architecture, vibrant culture, and festive festivals. Paris is the cultural and economic center of France and is a popular tourist destination. Its historic landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum, are significant tourist attractions. The city is also a major hub for international diplomacy and trade, and its influence extends far beyond France's borders. As a major global city, Paris plays a crucial role in Europe



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be more complex, innovative, and interdisciplinary than ever before. Here are some possible trends that are expected to shape the AI landscape in the coming years:

1. Increased focus on ethical AI: With the rapid growth of AI applications, there is a growing need to address ethical concerns and ensure that AI systems are used in a responsible manner. This will likely lead to increased focus on developing AI systems that are transparent, accountable, and accountable to the people who rely on them.

2. Growth of AI-based technologies: AI is already having a significant impact on industries such as healthcare, finance, and transportation. With continued advancements in AI technology


In [6]:
llm.shutdown()
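
In a script, as opposed to a notebook, it can be safer to guarantee that `shutdown()` runs even when generation raises. A minimal sketch of that pattern follows. `engine_session` is a hypothetical helper, and `FakeEngine` is a stand-in so the example runs without loading a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(FakeEngine) as engine:
    outputs = engine.generate(["Hello, my name is"])

print("shut down:", engine.is_shut_down)
```

With the real engine, the factory could be a lambda that constructs `sgl.Engine` with the same `model_path` argument as in the first cell.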