# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
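
To illustrate why this is needed: `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside a notebook cell. The sketch below reproduces that case with the standard library only; the `inner` and `outer` names are illustrative and no SGLang code is involved. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, made without nest_asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested call succeeded"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio`, the nested call is expected to fail with a `RuntimeError` that mentions the running event loop.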

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.34it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Haya and I am a freshman in college. I was born in July 1996 and have been studying English for more than 10 years. I have learned many different languages and am fluent in at least 10 different languages. I can read and write in about 20 languages.

I have been involved in a number of club activities, particularly in the English club at my university. I have been an English champion many times and won many competitions and awards.

I have also helped teach Spanish, French, German and other languages as a volunteer for the college's Spanish club. I have always been a hard worker
Prompt: The president of the United States is
Generated text:  a high-ranking government official of the country. The position of president in the United States is elected. Who may nominate him or her to be president?
A) The Secretary of State
B) The Speaker of the House of Representatives
C) The Vice President
D) The President of the Senate

To determine who may nomin
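
For offline batch jobs it is often convenient to store each prompt next to its completion. The following is a minimal sketch of that pattern, relying only on what the cell above shows: `generate` takes a list of prompts and returns one dict per prompt with a `"text"` key. `collect_records` and `StubEngine` are hypothetical names; the stub stands in for `sgl.Engine` so the sketch runs without loading a model.

```python
import json
from typing import Any, Dict, List


def collect_records(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Run one batched generate() call and pair each prompt with its completion."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns placeholder completions."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


records = collect_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# One JSON object per line (JSONL) is easy to append to and to stream back in.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

With the real engine, passing `llm` instead of `StubEngine()` should work the same way, assuming the output format shown above.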

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the country and the second largest in Europe. It is located on the Seine River and is home to many of France's most famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its rich cultural heritage, including its museums, theaters, and art galleries, and its vibrant nightlife. The city is a major economic and cultural center, and is home to many of France's most important institutions and institutions of higher learning. It is a popular tourist destination, with millions of visitors annually.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be a growing concern about privacy and security. There will be a need for more robust privacy protections and measures to ensure that AI systems are not used to harm or mislead individuals.

3. Greater reliance on data: AI systems will become more reliant on large amounts of
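
The cell above describes `stream_and_merge` as streaming with overlap removal. The idea can be sketched as follows, under the assumption that a streamed chunk may repeat text that was already received. This is an illustration of the technique, not necessarily how `sglang.utils` implements it; `drop_overlap` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def drop_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the end of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


overlapping = ["The capital", " capital of France", " France is Paris."]
cumulative = ["Hel", "Hello", "Hello, world"]

print(merge_chunks(overlapping))  # The capital of France is Paris.
print(merge_chunks(cumulative))   # Hello, world
```

Checking the longest overlap first keeps the merge greedy, so cumulative chunks (each one containing the full text so far) and partially overlapping chunks are both handled by the same loop.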



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm [Your Age]. I'm an experienced [Your Profession], and I'm passionate about [Your Interest or hobby]. How would you like to meet you? [Your Name] [Your Profession] (Note: Fill in the blank with the full name and profession) [Your Age] [Your Age] [Your Age] Hello, my name is [Your Name] and I'm [Your Age]. I'm an experienced [Your Profession], and I'm passionate about [Your Interest or hobby]. How would you like to meet you? [Your Name] [Your Profession] (Note: Fill in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France by population and is the seat of government, education, commerce, and industry in France.

This statement encapsulates the main facts about Paris, including its population size, its role as the capital, and its functions in French society.
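
Because `async_generate` is a coroutine, several batches can be awaited together. The sketch below shows one way to do this with `asyncio.gather`; it assumes only the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). `StubAsyncEngine` and `run_batches` are hypothetical names, and the stub replaces `sgl.Engine` so the example runs without a model.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate()."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(
    engine: Any, batches: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, str]]]:
    # Start one async_generate() call per batch and wait for all of them.
    # gather() returns results in the same order as the input batches.
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*calls)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

In a notebook, the final `asyncio.run` call needs `nest_asyncio.apply()` to have been called first, as noted earlier.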

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert profession or title] with [insert short biography]. I've always been fascinated by the human condition, and [insert why this interest is important], and I've always been interested in learning more about the world around me. I enjoy exploring new places, trying new foods, and reading interesting books. I'm a [insert what you would say is your unique personality trait or something you're good at]. I'm always eager to learn and grow. What's your name, and what's your profession or title? Hello, my name is [insert name], and I'm a [insert profession



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in the country and the city where the French Revolution took place.

1. Paris is a city located in the center of France, at the foothills of the Alps.
2. It is the most populous city in France and the second-largest city in the European Union after London.
3. The city is known for its museums, historical sites, and art galleries, as well as its unique cuisine and fashion industry.
4. The city has a rich and diverse cultural heritage, with its many medieval and modern neighborhoods, including the Eiffel Tower and the Louvre Museum.
5. Paris is home to numerous



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by many factors, including advances in computing power and processing power, improvements in data and model capabilities, and the increasing focus on ethical and societal implications. Here are some potential future trends in AI:

1. Greater integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and robotics. This could lead to a more cohesive and cohesive system that can learn and adapt to new situations more quickly.

2. Increased use of AI in healthcare: AI is already being used in healthcare to improve diagnosis, treatment, and patient care. As AI technology continues to improve,




In [6]:
llm.shutdown()
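
In a standalone script, it can help to guarantee that `shutdown()` runs even when generation raises. One way to do that is a small context manager, sketched below. `managed_engine` is a hypothetical helper, not part of SGLang, and `StubEngine` stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call shutdown() when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raised an exception


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self) -> None:
        self.closed = True


with managed_engine(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.closed)  # True
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, matching the launch cell at the top of this section.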