# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
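
The snippet below is a minimal, standard-library-only sketch of the problem this works around. It assumes the restriction comes from `asyncio` refusing to start a second event loop inside one that is already running, as is the case in IPython and Jupyter. The `inner` and `outer` coroutines are hypothetical names used only for illustration; no SGLang code is involved.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are allowed, which is why it has to run before the engine is used in such an environment.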

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Tony and I’m an educator, former world-class professional basketball player and former 300+ foot shotfielder who now coaches young players. My name has always been Tony Northam, but I’m also referred to as Tony the Bet. I have been coaching basketball for 23 years and have coached for the past 12 years as a full-time professional. My wife, the wonderful and inspiring Cindy, and I are proud to have our daughters, Jennifer and Andrea. I have four children who were born through adoption.
I was the highest professional basketball player in Australia in 1988, and in 2
Prompt: The president of the United States is
Generated text:  a rich man. He spends a lot of money every year. He lives in a large house in a beautiful city. He goes to the White House on horseback. He drives cars, trucks and boats. He has many kids and a wife. He likes to travel a lot. He has a lot of pets like cats, dogs and fish. He likes to have lots of parties. He also likes to
```
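
For offline batch jobs, results are usually saved rather than printed. The following is a minimal sketch of one way to do that, assuming (as in the loop above) that outputs come back in prompt order and that each output is a dict with a `"text"` key. The helper `to_jsonl` and the `fake_outputs` list are hypothetical and stand in for a real `llm.generate()` result, so the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines) + "\n"


# Stand-in for the list returned by llm.generate(); real outputs may carry more keys.
fake_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
jsonl = to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs)
print(jsonl, end="")
```

Writing one JSON object per line keeps each prompt next to its completion and lets later tools read the file incrementally.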

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. Paris is home to many famous French artists, writers, and musicians, and is a popular tourist destination for visitors from around the world. The city is also known for its cuisine, with a wide variety of dishes and flavors to choose from. Overall, Paris is a vibrant and exciting city that is a must-visit for anyone interested in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and privacy.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be used in even more areas, including diagnosis,
```


### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm a [Age] year old software engineer. I work at [Company Name], and I'm passionate about [Reason for Passion] and [Things You Love About the Industry]. I enjoy [Favorite Things to Do or Activities]. I'm also a [Favorite Hobby]. And I'm always looking for new challenges and opportunities to grow as a [Title]. I thrive on [Why You're the Right Candidate]. I'm excited to meet you at [Date and Time]. Let's chat about [What You Can Discuss]! Let's connect! **[Your Name]**
**[Your Occupation]**
**[Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is accurate and can be summarized as: "Paris is the capital city of France, serving as the country's cultural, political, and economic center."

For example, "Paris has been the seat of government and nationhood since the 12th centu
```
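
Because `async_generate` is awaited, it can be combined with other coroutines using standard `asyncio` tools. The sketch below shows the general pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape used above, so the snippet runs without a model. Whether splitting work into several concurrent calls helps with the real engine is not something this document establishes.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics only the async_generate call shape."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend the model is working
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches):
    # Schedule every batch at once; gather returns results in submission order.
    tasks = [engine.async_generate(batch, {"temperature": 0.8}) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["prompt A", "prompt B"], ["prompt C"]]
results = asyncio.run(run_batches(FakeEngine(), batches))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```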

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [type of person] who possesses [unique skill, talent, or personality]. I'm passionate about [a hobby, interest, or goal]. I'm [character's age, education level, or life experience]. I enjoy [a hobby, interest, or activity] and I'm always looking for opportunities to grow and learn. I'm [character's personality trait or personal characteristic]. Thank you for asking! I look forward to having the opportunity to learn more about you. [Your Name]
[Your Name's profession, industry, or occupation]
[Your Name's location, city, or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the most populous city in France and is located in the northeast of the country.

The statement is: "Paris, the most populous city in France, is located in the northeast of the country."

Here's the reasoning:

1. "Paris" is explicitly stated as the capital city of France.
2. The statement describes the location of Paris, which is the northeast of the country.
3. Both pieces of information are accurate and complete.

The statement accurately reflects the key facts about Paris's importance as both a capital and a major city in France. The northeast region of the country is known for its rich cultural heritage

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several key trends:

  1. Increased focus on ethical considerations: As AI systems become more advanced and more complex, it is likely that they will be subject to increasing scrutiny from regulators and the public. As a result, there will be an increased focus on ethical considerations, such as how AI systems are developed, deployed, and used, and how they affect the privacy and rights of individuals.
  2. Greater reliance on data: AI systems will continue to rely more heavily on data to function, and this will likely lead to an increased reliance on large datasets. This will make it more difficult to provide accurate
```
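
The streaming loop above follows a general pattern: iterate over an asynchronous generator, print each chunk as it arrives, and optionally keep the pieces to rebuild the full text. Here is a minimal sketch of that pattern in plain `asyncio`. `fake_stream` is a hypothetical stand-in for a streaming source such as `async_stream_and_merge`, so the snippet runs without a model.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming source: yields small text chunks."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start : start + chunk_size]


async def consume(stream):
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show partial output immediately
        pieces.append(chunk)
    print()
    return "".join(pieces)


full_text = asyncio.run(consume(fake_stream(" Paris is the capital of France.")))
```

Printing with `end=""` and `flush=True` keeps the chunks on one line and makes them visible as soon as they are produced, rather than when the output buffer fills.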


```python
llm.shutdown()
```
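
In a script, an exception raised during generation could skip the `shutdown()` call. One way to guard against that is a small context manager, sketched below. The `managed` helper and `FakeEngine` class are hypothetical; `FakeEngine` exposes only `shutdown()` so that the snippet runs without a model, and a real engine object would be passed in its place.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only shutdown()."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and on exceptions


engine = FakeEngine()
try:
    with managed(engine) as llm_stub:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
print(engine.closed)  # True
```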