# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other environment where an event loop is already running, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
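
To see why the patch is needed, the following sketch reproduces the underlying error using only the standard library: calling `asyncio.run()` while another event loop is already running (the situation inside IPython or Jupyter) raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Starting a second loop from inside the first one fails.
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` before such a call is what allows the nested `asyncio.run()` to proceed instead of raising.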

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

See [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI runs import a local `patch` module instead of applying nest_asyncio.
    import patch
else:
    # Allow nested event loops, e.g. when running inside Jupyter or IPython.
    import nest_asyncio

    nest_asyncio.apply()

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Generate completions for the whole batch in one blocking call.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Aiden and I am a Software Engineer from Australia. I work at a technology company and have been working at the company for about 5 years. I am interested in computer science and I am actively involved in the research and development of new computer languages and algorithms. I am particularly interested in natural language processing and I am looking for a job at a company that focuses on natural language processing.
Could you please suggest a job that would be a good fit for someone like me?
Sure, here are some job opportunities that could be suitable for someone like you:
1. Natural Language Processing Researcher - This position could be perfect for someone who is
Prompt: The president of the United States is
Generated text:  seeking to draw a budget of $13,000,000 for the budget year 2023. He has already spent $1,500,000 on the military and the remaining amount on the federal government. How much did he spend on the federal government?
To de
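
For offline batch jobs it is often convenient to persist the results. The sketch below pairs each prompt with its generated text and writes one JSON object per line. It relies only on what the cell above shows, namely that each output is a dictionary with a `text` key. The helper name `save_generations` and the stand-in `demo_outputs` list are hypothetical; they are used so the snippet runs without loading a model.

```python
import json
import os
import tempfile


def save_generations(prompts, outputs, path):
    """Write one {"prompt", "text"} JSON object per line and return the row count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

demo_path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
rows_written = save_generations(demo_prompts, demo_outputs, demo_path)
print(f"Wrote {rows_written} row(s) to {demo_path}")
```

With a real engine, the same helper could presumably be called as `save_generations(prompts, outputs, "generations.jsonl")` right after `llm.generate`.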

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the response and merge the chunks into a single string.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that serves as the political, cultural, and economic center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as its rich history and diverse cultural scene. Paris is also a major tourist destination, attracting millions of visitors each year. The city is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin, and is a hub for the arts and entertainment industry. Paris is a vibrant and dynamic city that continues to grow and evolve, with a rich

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in various industries, including manufacturing, healthcare, and transportation. Automation will likely lead to increased efficiency and productivity, but it will also lead to the creation of new jobs and the displacement of human workers.

2. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be increasing concerns about its ethical implications and the potential for privacy violations. There will likely be a push for greater regulation and oversight of AI development and deployment.

3. AI for human augmentation: AI is likely to be used for human augmentation in
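
The banner above mentions overlap removal. One way to think about it: if a streamed chunk repeats the tail of the text received so far, only the new suffix should be appended. The sketch below illustrates that idea on plain strings. `strip_overlap` and `merge_chunks` are hypothetical helpers written for this example; they are not the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that repeats the merged text's tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


demo_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)
```

A purely textual heuristic like this cannot tell an accidental repeat from an intentional one (for example `"ha"` followed by `"ha"`), so it should be read as an illustration of the idea rather than a drop-in replacement.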



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch without blocking the event loop.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________, and I am a/an ___________ (which you will be able to identify from the prompt). I am a/an ___________ (which you will be able to identify from the prompt). I am an __________ (which you will be able to identify from the prompt). I am an ___________ (which you will be able to identify from the prompt). I am an ___________ (which you will be able to identify from the prompt). I am an ___________ (which you will be able to identify from the prompt).

Sure, I'd love to hear your self-introduction! Let's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and is the third largest in the world, after Rome and Tokyo. It is known for its picturesque streets, beautiful architecture, and world-class museums. The city is also home to numerous museums, theaters, and cultura
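
Because `async_generate` is awaitable, several requests can be in flight at once from a single event loop. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that mimics only the call shape used above (a list of prompts in, a list of dictionaries with a `text` key out), so the snippet runs without loading a model; `run_batches` is likewise an illustrative name.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; it only mimics the call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Simulate waiting on inference.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch at once and wait for all of them to finish."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather preserves the order of `tasks`, so results line up with `batches`.
    return await asyncio.gather(*tasks)


demo_batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
demo_results = asyncio.run(
    run_batches(FakeEngine(), demo_batches, {"temperature": 0.8})
)

for batch, outputs in zip(demo_batches, demo_results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

Whether concurrent submission helps in practice depends on how the engine schedules requests; the point here is only the `await`/`gather` structure.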

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age] years old. I am a [Occupation] who have been working in [Industry/Field] for [Number] years. I started my journey into the world of [Industry/Field] with a [Initial Job/Position] and have always been driven by the [Why Do You Want to Work in This Industry]? [Career Goals]? I am a [Attitude/Impulsiveness] person and always strive to push boundaries and make things better for others. I am a [Reliable/Intense] person and can work long hours, but I am always focused on my goals

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an historic city known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. It is also a cultural and political center of France, hosting numerous international events and festivals throughout the year. Additionally, Paris is a popular tourist destination, known for its rich history, art, and cuisine. The city's architecture, particularly its Gothic and Renaissance architecture, is also a notable feature, reflecting the country's architectural heritage. Overall, Paris is a city with a rich and diverse history and culture, making it a popular destination for tourists and locals alike.

Does this statement accurately reflect the content and meaning of the provided

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and there are many trends that we can expect to see in the coming years. Here are some possible future trends in AI:

1. Increased accuracy and reliability: As AI systems become more sophisticated, they are likely to become even more accurate and reliable in their predictions and decisions. This will require continuous improvement and refinement of the algorithms used to train the models.

2. Enhanced privacy and security: As AI systems are used to process personal data, there is a risk that they may be used for nefarious purposes. To address this, there will likely be increased efforts to enhance privacy and security features in AI systems.

3. Enhanced




```python
# Shut down the engine once inference is finished.
llm.shutdown()
```