# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
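
The reason this step is needed is that `asyncio.run()` refuses to start while another event loop is already running, which is the situation inside IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are permitted.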

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # `patch` is only imported when running in CI.
    import patch
else:
    # Allow nested event loops (needed in IPython/Jupyter).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Albert! I'm from the United States of America, and I'm living in a small town in Chicago. Here's what I like to do most: I like to play soccer. I'm in a school league and I play with other kids in the school. There are lots of teams. Some of them are stronger than me, but most of them are weaker than me. I play soccer and I'm very good at it. I like to help my teammates. I am always ready to help others. I also love the smell of smoke and the sound of the rain. I like the smell of smoke because it reminds me of my
Prompt: The president of the United States is
Generated text:  assassinated.
A. Correct
B. Incorrect
Answer:
A

According to the passage, what is the result of the "fourth" letter of the first name of the president?
A. His daughter.
B. His son.
C. His granddaughter.
D. His nephew.
Answer:
D

Which of the following is NOT an example of a joint venture? 
A. The merger between Tsinghua University and Shanghai
B. The merger between Tsing
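
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings usually mean in text generation, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution. It is a generic illustration and not SGLang's sampler; `softmax_with_temperature` and `top_p_filter` are hypothetical helper names, and the logits and probabilities are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a distribution again.
    return {i: probs[i] / total for i in kept}


sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.2)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.8)
filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)

print("top token probability at T=0.2:", sharp[0])
print("top token probability at T=0.8:", flat[0])
print("token ids kept by top_p=0.9:", sorted(filtered))
```

With these toy numbers, the lower temperature concentrates more probability on the top token, and the top-p filter drops the least likely token before renormalizing.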

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm passionate about [Your Passion], and I'm always looking for ways to [Your Goal]. I'm a [Your Character Trait], and I'm always ready to [Your Response]. I'm [Your Age], and I'm [Your Personality]. I'm [Your Education Level], and I'm [Your Skills]. I'm [Your Interests], and I'm always looking for ways to [Your Hobby]. I'm [Your Favorite Color], and I'm always looking for ways to [Your Hobby]. I'm [Your Favorite Book], and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on the arts and culture of the world. It is also a popular tourist destination, attracting millions of visitors each year. Paris is a city of contrasts, with its elegant architecture, vibrant nightlife, and diverse cultural scene. It is a city of history, art, and culture, and a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn from and adapt to the behavior and preferences of humans.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be a growing concern about their impact on privacy and security. There will be efforts to develop more secure and transparent AI systems.

3. Greater emphasis on ethical considerations: As AI systems become more complex and sophisticated, there will be a growing emphasis on ethical
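
The cell above relies on `stream_and_merge` to stream chunks and remove overlap between them. One way such merging could work is sketched below with a plain list standing in for the stream. This is an assumption about the general technique, not a copy of the `sglang.utils` implementation; `trim_overlap` and `merge_chunks` are hypothetical names, and the chunks are invented.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# The second chunk repeats "capital", which the merge removes.
chunks = ["The capital", "capital of France", " is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

A caveat of this simple approach is that a chunk which legitimately starts with the same characters that end the previous text would also be trimmed, so it suits streams that are known to repeat text rather than arbitrary ones.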



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane, and I'm a professional journalist. I've been working in the field for over a decade and have covered everything from politics to sports to culture. My writing has won several awards and my articles have been featured in major publications around the world. I love sharing my knowledge and experience with anyone who wants to learn. I'm also a public speaker and enjoy sharing my expertise on a wide range of topics. What are your interests or hobbies? As an AI language model, I don't have personal interests or hobbies like humans do. However, I do have a lot of knowledge and can help answer questions to the best of my ability based

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the most populous city in France and the capital of France. It is also the oldest city in Europe, founded in 787 A
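
A practical benefit of the asynchronous call is that the event loop stays free for other work while generation is awaited. The sketch below illustrates this with a stand-in coroutine instead of a real engine: `fake_async_generate` and `heartbeat` are hypothetical, and the sleep only simulates generation time.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Stand-in for an awaited batch generation call (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated generation latency
    return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    """Other work that keeps running while generation is awaited."""
    for i in range(3):
        await asyncio.sleep(0.01)
        ticks.append(i)


async def main():
    ticks = []
    # gather() runs both coroutines concurrently and keeps result order.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["A", "B"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
for output in outputs:
    print(f"Generated text: {output['text']}")
print("heartbeat ticks recorded:", ticks)
```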

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm an [occupation] who has been following the path of [career goal] for [number] years. I'm always ready to learn and grow in my field, and I'm excited to share my knowledge and experiences with you. My background is in [related field], and I'm always looking for new and exciting challenges. I'm a true adventurer, always willing to explore new places and learn new things. I'm a reliable and dependable person, always ready to help when needed. I love making new friends and exploring the world with others, and I'm always looking for new opportunities to grow and learn.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city located in the northwestern part of the country, known for its iconic Eiffel Tower, museums, and rich cultural heritage.

To prepare for this task, the student needs to:
1. Identify the capital city of France.
2. Note its geographical location and notable features.
3. Summarize the cultural and historical significance of Paris.
4. Provide a concise statement that captures the essence of the capital city's character and importance.
5. Include a reference to historical facts or information about the city. To sum up this information concisely, the student should craft a statement that reflects the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  poised to be shaped by a diverse range of trends and advancements that are both exciting and unpredictable. Here are some potential future trends in artificial intelligence:

1. Increased integration with natural language processing (NLP) - NLP is becoming more integrated with AI as AI algorithms become better at understanding and processing human language. This could lead to more intelligent and natural language-driven AI systems, such as virtual assistants and chatbots.

2. Increased focus on ethical considerations - As AI systems become more complex, there will be a greater emphasis on ensuring that they are developed and used ethically. This could involve addressing issues such as bias and fairness in AI systems
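
The streaming cell above consumes chunks with `async for` and prints each one with `end=""` so that the pieces join into continuous text. The same consumption pattern can be tried without a model using a stand-in async generator; `fake_stream` and its chunks are hypothetical.

```python
import asyncio


async def fake_stream(prompt):
    """Stand-in async generator that yields text pieces one at a time."""
    for piece in [" Paris", ",", " a", " historic", " city"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def main():
    pieces = []
    async for piece in fake_stream("The capital of France is"):
        # Print each piece as soon as it arrives, without a newline.
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


text = asyncio.run(main())
```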




In [6]:
llm.shutdown()
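
In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below shows one possible pattern using a context manager. `StubEngine` and `running_engine` are hypothetical stand-ins so that the example runs without a model; only the `shutdown()` method name is taken from the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for an engine object with a shutdown() method."""

    def __init__(self):
        self.is_running = True

    def shutdown(self):
        self.is_running = False


@contextmanager
def running_engine(engine):
    """Yield the engine and always shut it down afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with running_engine(engine) as llm:
        # Simulate a failure in the middle of a batch job.
        raise ValueError("simulated failure during generation")
except ValueError as exc:
    print(f"caught: {exc}")

print("engine still running:", engine.is_running)
```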