# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
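The underlying issue is that asyncio does not allow a second event loop to be started from inside one that is already running, which is the situation in IPython and Jupyter. The following is a minimal standard-library sketch, independent of SGLang, that reproduces this class of error with `asyncio.run()`. The coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.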

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI runs load a local patch module instead of nest_asyncio.
    import patch
else:
    # Allow nested event loops so the engine works inside notebooks.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts returns one output dict per prompt, in order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Grace, and I am a Level 1 Digital Marketing Consultant who specializes in helping businesses grow their digital presence through digital marketing. Currently, I work for a digital marketing firm in Nashville, Tennessee, where I work with businesses to create content that is compelling and valuable to their target audience. My goal is to provide businesses with a well-rounded approach to digital marketing that not only improves their visibility on the internet, but also increases their conversion rates and revenue. I have experience with a variety of digital marketing strategies, including SEO, social media marketing, email marketing, and content marketing.
I have a strong background in digital marketing and have helped businesses to
Prompt: The president of the United States is
Generated text:  a man. Does it follow that the president of the United States is not a woman?
To answer this question, let's analyze the logical implications step by s
```

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# stream_and_merge consumes the stream for one prompt and returns the merged text.
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population. The city is home to iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as a vibrant arts and culture scene. Paris is also known for its fashion industry, with many famous designers and boutiques. The city is a major transportation hub, with many major highways and rail lines connecting it to other parts of France and the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management, fraud detection, and investment decision
```
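To illustrate what "overlap removal" means here, the following is a minimal pure-Python sketch of one way to merge streamed chunks whose beginnings repeat text that was already received. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for illustration; this is not presented as the actual implementation behind `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and stop at the first match.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = ["Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach is that a chunk which legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is also trimmed, so it is only appropriate when repeated boundaries are known to be artifacts of streaming.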



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in the same order as prompts.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [general category] [Name]. I recently graduated from [University], where I studied [major]. In my free time, I enjoy [relaxation hobbies], [eating habits], [health habits], [attitude towards work and relationships], [friends and family], [goals for the future], and [any other interesting facts]. I am confident in [ability], [strengths], [weaknesses], [glory], [honor], and [pride]. I am always ready to learn and grow, and I am a [positive trait] person. Thank you for having me!

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.
Paris is known for its vibrant culture, delicious cuisine, and iconic landmarks, including the Eiffel Tower, Louvre Museum, and Notre-
```
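When requests arrive independently rather than as one list, the same awaitable style can be combined with `asyncio.gather`. The sketch below shows only the concurrency pattern and runs without a model: `FakeEngine` is a hypothetical stand-in that mimics the `async_generate` call and the `{"text": ...}` result shape used above, and `generate_all` is an illustrative helper name.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them concurrently.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Because `asyncio.gather` preserves input order, the results can be zipped with the prompts just as in the batch example.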

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I am a/an [Last Name] [First Name]. I have always had a passion for [briefly describe your favorite hobby or activity] and have always been fascinated by the idea of [briefly describe the idea or interest that drives you]. I love to [describe a thing that can motivate you], and I am passionate about [briefly describe the thing that motivates you]. What makes you unique among your peers or classmates? In my opinion, I am [insert your opinion or attribute that makes you stand out from others]. I also like to [describe a lesson or activity that interests you or makes

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River in the Île de la Cité. It is the largest city in France and the second-largest city in the European Union by population, with a population of over 7 million people as of 2021. The city is known for its historical architecture, vibrant arts scene, and iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. Paris is also a major transportation hub, home to the headquarters of many major French companies and a major international airport. The city is an important cultural and political center for France and the world. The population is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities and potential challenges. Some of the possible trends in AI are:

1. Increased automation: AI is already making automation more efficient, from manufacturing to customer service. As more AI technologies are developed, we can expect automation to become even more prevalent.

2. Enhanced creativity and imagination: AI is capable of generating new ideas and creative solutions, much like humans. AI can assist in the creative process, helping researchers and engineers to come up with new solutions.

3. Improved healthcare: AI can help doctors and medical professionals make better decisions, by analyzing large amounts of medical data and identifying patterns and trends.

4. Environmental impact:
```

```python
# Shut down the engine and release its resources.
llm.shutdown()
```