# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
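
As a rough illustration of what "a custom server on top of the engine" can mean, the sketch below shows only the request-handling core: decode a JSON request, call the engine, and encode a JSON response. `StubEngine` and `handle_generate` are hypothetical names invented for this example, and the JSON field names are assumptions. The stub mimics the `generate(prompts, sampling_params)` call shape used later in this document so that the code runs without a model. The linked `custom_server.py` remains the reference for a real server.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mirror the batch call shape used in this document:
        # a list of prompts in, a list of {"text": ...} dicts out.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, request_body: bytes) -> bytes:
    """Turn one JSON request body into one JSON response body."""
    payload = json.loads(request_body)
    prompt = payload["prompt"]  # assumed request field
    sampling_params = payload.get("sampling_params", {})  # assumed request field
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"prompt": prompt, "text": output["text"]}).encode("utf-8")


engine = StubEngine()
body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
response = json.loads(handle_generate(engine, body))
print(response["text"])
```

Keeping the handler independent of any web framework means the same function could be wired into whichever HTTP library the server uses.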



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
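
The reason this is needed: environments such as IPython already run an event loop, and plain `asyncio.run()` refuses to start inside one. The standard-library sketch below reproduces that error without SGLang. `inner` and `outer` are hypothetical names used only for this demonstration. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here,
    # which is the situation in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```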

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    # Allow asyncio.run() inside an already running event loop (e.g. a notebook)
    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Sampling settings shared by every prompt in the batch
sampling_params = {"temperature": 0.8, "top_p": 0.95}

# One blocking call for the whole batch
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  [Name] and I am a Game Designer/Developer at [Company]. Currently, I'm working on a project called [Project Name]. My project revolves around developing a new app that allows users to create and share art collections. The app will be available on both Android and iOS devices. The app will have a user-friendly interface and will allow users to upload their own art, add a description, and share it with friends. The app will also include a search function to help users find and share art collections that match their interests. Additionally, the app will have a social sharing feature that allows users to share their art collections with their friends
Prompt: The president of the United States is
Generated text:  3 feet tall. If it's a special event day, the president can grow an additional 5 feet. On a special event day, how tall will the president be?
To determine the height of the president on a special event day, we need to consider the additio

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the completion and merge the chunks into one string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a popular tourist destination and a major economic center. Paris is home to many cultural institutions, including the Louvre Museum, the Musée d'Orsay, and the Centre Pompidou. The city is also known for its cuisine, including its famous croissants and its traditional French wine. Paris is a vibrant and diverse city with a rich history and a strong sense of French identity. It is a popular destination for tourists and locals alike. The city is also home to many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the potential for AI to be used for harmful purposes.

2. Advancements in machine learning: As AI technology continues to advance, we are likely to see more sophisticated models that can learn from data and make more accurate predictions and decisions. This will require significant improvements in algorithms and
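
The "overlap removal" mentioned in the example above can be pictured as follows. This is a minimal sketch, assuming that a streamed chunk may begin with text that was already emitted; it is not taken from the SGLang source. `strip_repeated_prefix` and `merge_chunks` are hypothetical names, and the sample chunks are made up.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
chunks = [" Paris", " Paris, which", " which is known"]
merged = merge_chunks(chunks)
print(merged)  # prints " Paris, which is known"
```

One limitation of this approach is that it cannot tell an overlap from a genuine repetition: merging the chunks `"ha"` and `"ha"` yields `"ha"`, not `"haha"`.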



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch without blocking the event loop
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [profession/character] at [Company/Entertainment Organization]. I was born and raised in [Location] and I've always had a natural inclination towards [Your hobby, interest, or talent]. I'm constantly learning new things and having a lot of fun with my friends.

I enjoy [X] and I love [Y]. I love being able to make people laugh and I'm always looking for new ways to entertain people with [Z]. And most importantly, I love the feeling of success. I'm very good at [Your skill or ability] and I'm always looking for ways to improve

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, officially known as the City of Paris, is the capital city of France and the largest city in the European Union. It is the 29th largest city in the world and the most populous city in the European Union. It 
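
One practical reason to prefer the asynchronous call is that the event loop stays free for other coroutines while generation is pending. The sketch below illustrates this with a stub: `StubEngine`, `heartbeat`, and `run_both` are hypothetical names, and the stub only sleeps instead of running a model. Its `async_generate` copies the call shape used in the cell above.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend the model is busy
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def heartbeat(log):
    # Unrelated coroutine that should keep running during generation.
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0.01)


async def run_both():
    engine = StubEngine()
    log = []
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(run_both())
print(len(outputs), log)
```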

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character's Name] and I'm a [role]. I am a [occupation] at [company/website], [character's profession]. [Character's profession] has been the leader of [number] of people for [number] years. I have a reputation as a [advantage] person, and my team always performs at its best. I have a talent for [specialism], and I am always eager to learn new things. I believe in [value or principle], and I strive to make the world a better place. I'm an [honesty] person, and I don't take personal advantage or seek

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The Paris (France) coordinates the administrative, political, and economic life of the country. It has been known as the largest city in Europe since the 15th century. Paris is the cultural and economic hub of the country and hosts many of France’s major festivals and events. It is also home to the European Parliament. The city sits on the river Seine and has a long history dating back to Roman times. It is a UNESCO World Heritage site and a UNESCO City of Literature and Culture. Paris has a rich history and culture, with many landmarks and attractions worth visiting. Its population of over 2 million people

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but some possible trends are:

1. Increased efficiency and accuracy: As AI technology continues to improve and become more capable, it is likely to become even more efficient and accurate in its tasks. This could lead to a wider range of applications being developed, from self-driving cars to medical diagnosis.
2. Integration with human systems: AI is already being integrated into many systems and services, from social media to healthcare, but it is likely to become even more integrated in the future. This could lead to a more seamless and seamless user experience.
3. AI-powered solutions: AI is already being used in

```python
# Release the engine's resources once you are done
llm.shutdown()
```
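
In a script, it can be useful to guarantee that `shutdown()` runs even if generation raises an exception. The sketch below shows one way to do that with a small context manager. `StubEngine` and `managed_engine` are hypothetical names, and the stub fails on purpose to exercise the cleanup path; only the `shutdown()` call itself comes from this document.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
error_message = ""
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello"], {"temperature": 0.8})
except RuntimeError as exc:
    error_message = str(exc)

print(engine.is_shut_down, error_message)
```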