# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
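
The patch is needed because `asyncio.run` refuses to start when another event loop is already running in the same thread, which is the situation in a Jupyter kernel, for example. The sketch below uses only the standard library to show how that situation can be detected; `has_running_loop` and `check_inside_loop` are hypothetical helper names chosen for illustration and are not part of SGLang.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine always executes inside a running loop.
    return has_running_loop()


outside = has_running_loop()  # plain script: no loop is running yet
inside = asyncio.run(check_inside_loop())  # inside a coroutine: a loop is running
print(outside, inside)
```

When the first check returns `True` in your environment, calling `asyncio.run` directly would fail, and applying `nest_asyncio` as shown above is the workaround this document relies on.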

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Nia and I'm a Senior at the University of Melbourne. I'm a small and curious person who has a love for science and technology. I like to apply my knowledge to solve problems and to learn about new things. I'm really excited to share what I learn on this website. This website is a hub for students and researchers who want to learn about new technologies and technologies that can help solve future challenges.
I'm going to talk about some of the things that I've learned through this website. I'll also discuss my research interests, as well as some of the technical skills I've developed. Also, I'll share a link
Prompt: The president of the United States is
Generated text:  45 years older than the president of Brazil. The president of Brazil is 30 years younger than the president of the United States. How old are the presidents of the United States and Brazil together?
To determine the ages of the presidents of the United States and Brazil, we star
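
In an offline batch job, the generated text is usually stored rather than printed. The following sketch writes prompt and output pairs as JSON Lines, assuming only what the loop above shows: each output is a dict with a `"text"` key. The helper name `write_jsonl` and the demo data are hypothetical, and an in-memory buffer stands in for a real file.

```python
import io
import json
from typing import Dict, Iterable, List, TextIO


def write_jsonl(prompts: List[str], outputs: Iterable[Dict], sink: TextIO) -> int:
    """Write one {"prompt", "text"} JSON object per line and return the count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        sink.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
        count += 1
    return count


# Hypothetical outputs, shaped like the list returned by llm.generate above.
demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = [{"text": " Paris."}, {"text": " bright."}]

buffer = io.StringIO()  # stand-in for open("results.jsonl", "w")
written = write_jsonl(demo_prompts, demo_outputs, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, records[0])
```

Keeping the prompt next to its completion in each record makes the file self-describing, so later steps do not depend on the original prompt order.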

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title] and I'm always looking for ways to [job title] and [job title]. I'm a [job title] at [company name] and I'm always looking for ways to [job title] and [job title]. I'm a [job title] at [company name] and I'm always looking for ways

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the Louvre Museum. It is the largest city in France and the seat of the French government. Paris is also known for its rich history, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It is a popular tourist destination and a major cultural center. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is known for its fashion, art, and food, and is a hub for business and commerce. Paris is a city of people, with a diverse population and a rich cultural heritage

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, AI is likely to play an increasingly important role in shaping the future of work, with the rise of automation and artificial intelligence becoming more prevalent in industries such as manufacturing, finance, and healthcare. Finally, AI is likely to continue to be used for good, with the goal of creating more efficient, sustainable, and equitable societies. However, it is important to note that AI
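
The banner printed above mentions overlap removal. One way such merging could work is to drop the longest prefix of each new chunk that the accumulated text already ends with. The sketch below illustrates that idea only; the function names are chosen for this example, and it is not presented as SGLang's actual implementation of `stream_and_merge`.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `existing` already ends with."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found, keep the chunk unchanged


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries repeat " is" and " capital".
demo_chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(demo_chunks)
print(repr(merged))
```

A heuristic like this cannot tell a boundary overlap from a genuine repetition: `trim_overlap("ha", "ha")` returns an empty string, so it only suits streams that are known to resend text.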



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah, I'm 25 years old, and I have a degree in marketing. I'm a kind and outgoing person who thrives on meeting new people and connecting with others. I enjoy helping people find their way in life, and I'm always looking for new opportunities to contribute to the world. I'm always ready to learn new skills and stay up-to-date with the latest trends in marketing. I'm confident in my abilities and look forward to sharing my knowledge and experiences with others. I'm excited to meet new people and make new connections, and I'm always looking for new ways to help people succeed in life. If you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest and most populous city in the European Union. Paris is known for its iconic architecture, including the Eiffel Tower, Notre-Dame Cathedral, and Louvr
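
Because generation is awaitable, it can be combined with other coroutines, for instance when requests arrive independently in a custom server. The sketch below caps the number of requests in flight with a semaphore. `FakeAsyncEngine` is a stand-in defined here so the example runs without a model; its per-prompt `async_generate(prompt, sampling_params)` signature is an assumption for illustration, as is the helper name `generate_bounded`.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Stand-in with an async_generate coroutine; not the real engine."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion of {prompt!r}>"}


async def generate_bounded(
    engine, prompts: List[str], sampling_params: Dict, max_in_flight: int = 2
) -> List[Dict]:
    """Issue one request per prompt, with at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> Dict:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
results = asyncio.run(
    generate_bounded(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
print([r["text"] for r in results])
```

When all prompts are available up front, passing the whole list in one call, as in the cell above, remains the simpler option.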

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________. I'm a/an ___ and ___ (job title and job description). I'm a/an ___ by ___. I enjoy ___ and ___ (any hobbies or interests you're passionate about). I'm ___ (your favorite color, pet, or something else) and ___ (any other details that make you stand out). I'm ___ and ___ (your strength, weakness, or biggest strength/difficulty). I'm ___ (how you want people to think of you). I'm excited to meet you and learn more about you. It's a pleasure to meet you, ___. Have a great day! (Author's name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement about the capital city of France is: Paris is the capital of France.

I've found it:

Paris is the capital of France.

I've found it:

The capital of France is Paris.

I've found it:

Paris is the capital of France.

I've found it:

Paris is the capital of France.

I've found it:

Paris is the capital of France.

I've found it:

Paris is the capital of France.

I've found it:

Paris is the capital of France.

I've found it:

Paris is the capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see significant advancements in several key areas:

1. Improved interpretability: AI systems are becoming more transparent and explainable, making it easier for humans to understand how and why they are making decisions. This will lead to more trust and confidence in AI, as people will be better able to gauge the effectiveness and impact of AI applications.

2. Natural language processing: As AI becomes more sophisticated, natural language processing (NLP) will become even more powerful. This will enable AI systems to understand and respond to human language, including complex queries and conversations. This will lead to more accurate and effective communication between humans and AI applications.

3
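
The streaming loop in this section prints each chunk with `end=""`, so the chunks form one continuous text. If the full string is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows that pattern with a stand-in async generator; `fake_stream` and `collect` are hypothetical names and do not call the engine.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Stand-in for an async chunk stream; yields the given chunks in order."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(
    collect(fake_stream([" Paris", " is", " the", " capital", "."]))
)
```

Appending to a list and joining once at the end avoids repeated string concatenation on long generations.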




```python
llm.shutdown()
```
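
In a script, it can help to make sure `shutdown()` runs even when generation raises an exception. A minimal sketch of that pattern, assuming only that the engine exposes a `shutdown()` method as used above: `FakeEngine` is a stand-in so the example runs without a model, and `shutting_down` is a hypothetical helper, not an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def shutting_down(engine):
    """Yield the engine and call its shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with shutting_down(engine) as active:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass  # the error still propagates out of the with block

print(engine.is_shut_down)
```

The same effect can be had with a plain `try`/`finally` around the generation code; the context manager only packages it for reuse.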