# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
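As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a request handler that takes a JSON body and returns a JSON response. `FakeEngine` and `handle_request` are hypothetical names introduced only for this sketch, and the single-prompt call signature is an assumption modeled on the `async_generate` calls shown later in this document; the linked `custom_server.py` remains the reference example.

```python
import asyncio
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # simulate asynchronous work
        return {"text": f" echo of: {prompt}"}


async def handle_request(engine, raw_body: str) -> str:
    """Parse a JSON request, call the engine, and return a JSON response."""
    request = json.loads(raw_body)
    output = await engine.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return json.dumps({"prompt": request["prompt"], "text": output["text"]})


async def demo():
    engine = FakeEngine()  # swap in a real engine object in practice
    body = json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.2}})
    return await handle_request(engine, body)


response = asyncio.run(demo())
print(response)
```

A handler of this shape can be mounted behind whichever web framework or transport you prefer, since it only depends on the engine exposing an awaitable generate call.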



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
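To see why the patch is needed, the following sketch reproduces the underlying problem with plain `asyncio` only (it does not call SGLang): standard `asyncio` refuses to start a new event loop from inside a running one, which is the situation in IPython or Jupyter. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like notebook code.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested `asyncio.run` call is allowed instead of raising.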

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Annabelle and I am 14 years old and I am single. I have a strong interest in books and reading, and I am in my second year at Winchester College. I have a passion for social issues and I have been involved in a variety of charities, including the Durham Women's Shelter and The Children's Research Centre. I have worked with various volunteers in the communities and I have provided support and guidance to children in need. I am an excellent communicator and I am able to speak confidently in front of groups of people. I have a very positive attitude towards life and I take pride in what I am and what I do.
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to build in a certain country. The country has 100 bases, and the president knows that each base costs a different amount to build. The cost of each base is given as a fraction of a million dollars. For example, a base that costs $1 million w
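The `temperature` and `top_p` values passed in `sampling_params` shape how each next token is chosen. The sketch below shows the general idea of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It is a textbook-style illustration, not SGLang's sampler; `top_p_filter` and the toy logits are hypothetical, and `temperature` is assumed to be greater than zero.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token: prob} after temperature scaling and top-p filtering."""
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {" Paris": 2.0, " Lyon": 1.5, " Nice": 0.5, " Rome": -1.0}
candidates = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(sorted(candidates))
```

In this toy case the least likely token falls outside the 0.95 probability mass and is excluded from sampling.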

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. What can I expect from our conversation? [Name]: Hello! I'm [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. What can I expect from our conversation? [Name]: Of course! I'm here to learn more about your interests and passions, and to get to know you better. What do you like to do? [Name]: I like to read, watch movies

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Middle Ages, and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major cultural and economic center, with a diverse population and a thriving arts scene. The city is home to many famous museums, including the Louvre and the Musée d'Orsay, as well as the Notre-Dame Cathedral and the Champs-Élysées. Paris is a popular tourist destination and a major hub for international business and diplomacy. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there will be a greater emphasis on developing AI that is designed to be ethical and responsible. This could involve developing AI that is transparent, accountable, and accountable to humans.

2. AI will become more integrated into our daily lives: As AI becomes more integrated into our daily lives, we will see more widespread adoption of AI technologies. This could include things like smart homes, self-driving cars, and virtual
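The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that adjacent chunks share. A minimal sketch of that idea follows, assuming chunks may repeat the tail of what has already been received. The helper names `strip_repeated_prefix` and `merge_chunks` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented in `sglang.utils`.

```python
def strip_repeated_prefix(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix already present in the output."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Toy chunks with deliberate overlaps between neighbours.
chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

Checking the longest overlap first keeps the merge greedy, so a chunk that fully repeats earlier text contributes nothing new.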



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm here to say hello to everyone. I'm a young woman with an infectious sense of humor and a warm smile. I love spending time with my friends and trying new things, and I'm always eager to learn something new. I'm always on the lookout for new experiences and adventures, and I'm always ready to explore. So, if you're looking for a new friend or just someone to share your adventures with, I'm your man. Take a look around, and you'll see why I'm a special kind of friend. #Friendship #Exploration #NewExperiences #FriendlyFriendly. Can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. 

This statement encapsulates the capital city's status as one of the world's most important and vibrant cities, characterized by its historical significance, cultural richness, and 
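One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`, using a hypothetical `FakeEngine` stand-in so it runs without a model; the `heartbeat` task and all names here are assumptions for illustration only.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text after a delay."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated work that keeps running while generation is awaited."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def run_demo():
    engine = FakeEngine()
    log = []
    # Both coroutines share the event loop; neither blocks the other.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(run_demo())
print(outputs, log)
```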

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah and I'm a young environmental advocate. I am passionate about protecting the planet and taking action to reduce my impact on the world. I'm an active member of local environmental groups and am committed to sustainable living. I have a strong work ethic and love to help others learn about environmental issues.

Please let me know how you would like me to start the conversation. Sure, feel free to start! Let's begin the conversation! What's your favorite book? Sarah, what's your favorite book? That's an interesting question. As an AI language model, I don't have personal preferences like humans do, but I can suggest some



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city and the largest metropolitan area in the European Union.

What are the two main types of French cuisine and which country does it come from? The two main types of French cuisine are classic French cuisine and continental French cuisine. Classic French cuisine is influenced by French colonialism and uses ingredients and techniques from the French colonies. Continental French cuisine is based on regional traditions and dishes that were developed by the people of a particular region.

When is the French New Year celebrated? The French New Year is celebrated on December 25th.

Is it true that the French are the only country in Europe to have a monarchy?



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and rapidly evolving, with many potential directions and trends shaping its direction. Some possible future trends in AI include:

1. Advancements in Machine Learning: With the continuous development of machine learning algorithms, AI models are becoming more sophisticated and capable of performing increasingly complex tasks. This will enable AI to learn from new data, adapt to new situations and make more accurate predictions and decisions.

2. Personalization and Tailored Experience: AI will continue to improve its ability to personalize and tailor user experiences to meet individual needs and preferences. This will be realized through the use of natural language processing, speech recognition and other technologies that allow machines to understand and




```python
llm.shutdown()
```