# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
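
To illustrate why this patch is needed, the following standard-library sketch reproduces the failure that occurs when `asyncio.run()` is called while an event loop is already running, which is the situation inside IPython or a notebook. The coroutine names `inner` and `outer` are illustrative only, and the snippet is meant to be run as a plain script without `nest_asyncio` applied.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the nested call is allowed instead of raising, which is why the notebook cells in this document can call `asyncio.run(main())` directly.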

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Edward and I'm a former American mathematician who used to teach mathematics at the University of Toronto. I would like to explain some of my work to you, particularly how I came to the conclusion that only a vast amount of mathematics was needed to explain the universe. I have never used this method in my own work, but have found it very useful in my attempt to explain the universe.\nThere are two concepts that are very important to me. The first is “circularity.” It’s the idea that it’s very important to have connections between different pieces of mathematics. Mathematics is not merely a separate thing. It’s essential to
Prompt: The president of the United States is
Generated text:  a person. Which of the following statements is true?
A. The president of the United States is a government official.
B. The president of the United States is a person who has the right to run for president.
C. The president of the United States is a person who h

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on the arts and sciences. It is also a popular tourist destination, with millions of visitors annually. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its reputation as a city of art, culture, and sophistication is well-documented

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and efficient AI systems.

2. Enhanced machine learning capabilities: AI is likely to become more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions.

3. Greater emphasis on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations and the responsible use of AI. This could lead to more stringent regulations and
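
The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat the tail of the text received so far, and it strips that repeated prefix before appending. The helper names `strip_overlap` and `merge_stream` are hypothetical and written only for this illustration; this document does not show the internals of `stream_and_merge`, so treat the code as one plausible approach rather than the library's implementation.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate chunks, removing text that repeats across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_stream(chunks)
print(merged)
```

One caveat of this heuristic is that a chunk which legitimately begins with the same characters that end the previous text would also be trimmed, so it is only appropriate when chunks are known to overlap.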



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [profession] with expertise in [specific field of expertise]. I enjoy [amusement or hobby]. I am [age] years old. What brings you to this field or profession? Your experience, skills, and passion could really help someone else in their journey. Your background and education would also be very valuable for a potential employer. Thank you for taking the time to meet me. 

[Your Name] - [Your Profession] - [Your Age] - [Your Education] - [Your Experience]
Your background and education would also be very valuable for a potential employer. Thank you for taking the time

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 
(1 point) 
- Yes
- No

Your answer is: No. 
The correct answer is: Yes. 
Paris is the capital city of France. It is the largest and most populous city in Europe, located on the Î
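
The asynchronous pattern above can be tried without loading a model by substituting a stand-in object. In the sketch below, `StubEngine` is a hypothetical class (not part of SGLang) whose `async_generate` returns a dictionary with a `"text"` key, mirroring the output shape used in the cell above. The `generate_all` helper is also an assumption made for illustration: it issues one request per prompt and awaits them together with `asyncio.gather`.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; NOT part of SGLang."""

    async def async_generate(self, prompt, sampling_params=None):
        # Simulate inference latency, then return a dict with a "text" key.
        await asyncio.sleep(0.01)
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one awaitable per prompt and wait for all of them together.
    # gather() returns results in the same order as its inputs.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(generate_all(StubEngine(), prompts, {"temperature": 0.8}))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `gather` preserves input order, the results can be paired with the prompts using `zip`, just as in the real cell.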

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], a highly skilled and experienced software developer with a proven track record of delivering successful, innovative, and robust solutions. I have a passion for technology and always strive to improve my skills and knowledge in order to be a valuable asset to my clients. My team and I have the ability to work efficiently and effectively, and we are constantly seeking ways to innovate and solve complex problems. I am dedicated to ensuring that my clients are fully satisfied with the results they receive, and I am committed to delivering work that meets or exceeds their expectations. As an experienced software developer, I am confident in my abilities and ready to take on any challenge

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 
A. True
B. False

B. False

France's capital city is not Paris. The capital is Lyon. Lyon is the official capital of France. Therefore, the correct answer is B. False. Lyon is the official capital of France, not Paris. Paris is the main capital of France. Therefore, the correct answer is A. True. France's capital city is not Paris. The official capital is Lyon, not Paris. Lyon is the capital of France. Therefore, the correct answer is B. False. France's capital is Lyon. Lyon is the official capital of France, not Paris. Therefore,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a rapid and significant increase in the capabilities and applications of AI, as well as an increasing focus on ethical considerations and societal implications. Some possible future trends in AI include:

1. Increased autonomy and self-learning: AI is increasingly being used in a wide range of applications, including healthcare, transportation, and manufacturing. Autonomous vehicles, for example, are already being tested on the road, and AI systems are being designed to learn and adapt to new situations.

2. Natural language processing: Natural language processing (NLP) is becoming increasingly important as AI is used to automate customer service, language translation, and other areas.


```python
# Shut down the engine when finished
llm.shutdown()
```