# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
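As a rough illustration of what a custom server layer on top of the engine can look like, the sketch below wraps an engine-like object in a plain request handler. `StubEngine` and `handle_request` are hypothetical names introduced for this example. The stub only mimics the `generate(prompts, sampling_params)` call and the `{"text": ...}` output shape used in the cells of this document, so the code runs without a model. The linked `custom_server.py` remains the reference for a real server.

```python
import json
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for an engine object; not part of SGLang."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mimic the output shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_request(engine: Any, body: str) -> str:
    """Parse a JSON request body, call the engine in-process, return JSON."""
    payload = json.loads(body)
    prompts = payload.get("prompts", [])
    sampling_params = payload.get("sampling_params", {})

    # Reject malformed input before it reaches the engine.
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return json.dumps({"error": "'prompts' must be a list of strings"})

    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [o["text"] for o in outputs]})


request_body = json.dumps(
    {"prompts": ["The capital of France is"], "sampling_params": {"temperature": 0.8}}
)
response = json.loads(handle_request(StubEngine(), request_body))
print(response)
```

Because the handler takes the engine as an argument, the same function could be called from any web framework's route, with the stub swapped for a real engine object.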



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
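To see why the patch is needed, the following standard-library sketch reproduces the underlying error. Calling `asyncio.run()` from code that is already executing inside an event loop (as happens in a notebook cell) raises `RuntimeError`. The function names `inner` and `outer` are illustrative only, and the sketch deliberately does not import `nest_asyncio`, whose `apply()` call patches asyncio so that this kind of nested use is allowed.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, similar to a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside the first one is rejected by asyncio.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```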

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.63it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.62it/s]
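This document ends by calling `llm.shutdown()`. In a standalone script, one possible way to make sure that call also runs when generation fails is a `try`/`finally` block. The sketch below uses a hypothetical `StubEngine` and `run_job` helper, neither of which is part of SGLang, so that the pattern can be run without loading a model.

```python
class StubEngine:
    """Hypothetical stand-in for the engine; it only records whether shutdown() ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


def run_job(engine, prompts):
    try:
        return engine.generate(prompts, {"temperature": 0.8, "top_p": 0.95})
    finally:
        # Runs on success and on failure, so the engine is always released.
        engine.shutdown()


engine = StubEngine()
try:
    run_job(engine, [])  # an empty prompt list makes the stub raise
except ValueError as exc:
    print(f"Job failed: {exc}")
print("Engine shut down:", engine.is_shut_down)
```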



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  a noun.
I'm part of a language learning community. My name is added to a list of names so I can be easily identified for query purposes. As a noun, I'm part of a vocabulary list which is the basis of my understanding of the language.
I'm named after a general noun that I am part of a lexical database for.
The meaning of this word depends on the context of the sentence, so I must be distinguished from other words with the same meaning. As a noun, I am named after a general noun that I am part of a vocabulary list which is the basis of my understanding of the language.
I'm
Prompt: The president of the United States is
Generated text:  24 years older than the president of Brazil. The president of Brazil is 20 years younger than the president of the United States. How old is the president of the United States? Let's denote the age of the president of the United States as \( U \) and the age of the president of Brazil as \( B \).

From the problem,
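The `sampling_params` dictionary above sets `temperature` and `top_p`. As a general illustration of what these two parameters conventionally mean (temperature scaling followed by nucleus filtering), here is a small pure-Python sketch on a toy distribution. It is not a description of SGLang's sampler. The function name `apply_temperature_and_top_p` and the toy logits are assumptions made for this example.

```python
import math
from typing import Dict


def apply_temperature_and_top_p(
    logits: Dict[str, float], temperature: float, top_p: float
) -> Dict[str, float]:
    """Return the renormalized distribution left after temperature scaling and top-p filtering."""
    # Temperature scaling: dividing logits by a value below 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}

    # Nucleus (top-p) filtering: keep the most likely tokens until their
    # cumulative probability reaches top_p, and drop the rest.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 again.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


toy_logits = {" Paris": 3.0, " Lyon": 2.0, " a": 1.0, " the": -2.0}
filtered = apply_temperature_and_top_p(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy values the least likely token falls outside the nucleus and is removed, while the remaining three tokens are renormalized and stay available for sampling.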

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character or profession]. I enjoy [insert a short description of your hobbies or interests]. What brings you to this company? I'm drawn to [insert a short description of your reason for joining the company]. I'm looking forward to [insert a short description of your next steps in your career]. Thank you for taking the time to meet me. I look forward to our conversation. [Name] [Company Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is a bustling metropolis with a rich history dating back to the Roman Empire and a modern city with a diverse population. The city is home to many famous landmarks and attractions, including the Palace of Versailles and the Champs-Élysées. Paris is a cultural and political center of France and a major tourist destination. It is also known for its cuisine, fashion, and art scene. The city is home to many international organizations and has a strong economy, with a thriving service sector. Paris is a city of contrasts

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could shape the future of AI:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and use, as well as
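The cell above relies on `stream_and_merge` to produce one clean string from streamed chunks "with overlap removal". One plausible way to do such merging is sketched below: before appending a chunk, drop the longest prefix of the chunk that the accumulated text already ends with. The helpers `strip_overlap` and `merge_chunks` and the `fake_stream` data are hypothetical and written for this example. They illustrate the idea and are not claimed to be the helper's actual implementation.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Simulated stream in which some chunks repeat the tail of the previous text.
fake_stream = [" Paris", " Paris, known", " known for its", " iconic Eiffel Tower"]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

A heuristic like this cannot tell an accidental overlap from a legitimate repetition: `strip_overlap("ha", "ha")` returns an empty string, so it is only appropriate when chunks are known to possibly repeat earlier text.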



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
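The cell above awaits a single batched call. A common reason to prefer an awaitable call is that several requests can be in flight at once while the event loop stays free for other work. The sketch below shows that pattern with `asyncio.gather`, using a hypothetical `StubAsyncEngine` whose `async_generate` only mimics the call shape and the `{"text": ...}` outputs used in this document. It is a stand-in, not the SGLang engine.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in that mimics an awaitable generate call; not part of SGLang."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # pretend the model is working
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(
    engine: StubAsyncEngine, batches: List[List[str]]
) -> List[List[Dict[str, str]]]:
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # gather() schedules all batches concurrently and returns results in input order.
    calls = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*calls)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubAsyncEngine(), batches))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```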


````
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm an [occupation] with [number] years of experience in the field. I'm currently [current position], and I'm excited to bring my [abilities or strengths] to this role. I'm confident that I have the skills and knowledge required to make a positive impact on [target audience], and I'm eager to help make a difference in [target audience's life]. I'm here to learn and grow in [field] and I'm here to provide a new perspective and a fresh approach to [target audience's problem]. I'm excited to meet you and I look forward to working with you.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

[Mark down code]
```python
def summarize_city_info(city):
    """This function takes a city name as input and returns a concise factual statement about its capital."""
    # Example output: "Paris is a f
````

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream chunks through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation or profession] by day, but [hobbies, interests, or passions] in my free time. Despite my busy schedule, I maintain a balance by spending time with [relatives, friends, or community members], engaging in [activities, such as cooking, reading, or playing sports]. I'm dedicated to [goals or objectives, such as helping others, promoting kindness, or improving my skills]. Despite my busy schedule, I maintain a peaceful and focused mind, often reflecting on [thoughts or emotions, such as gratitude, kindness, or personal growth]. I'm an [age,


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the largest metropolitan area in the European Union. It is located on the Seine River and is known for its historic architecture, art, music, and cuisine. The city is also home to the French government, the French parliament, and the French universities. Its nickname is "la ville ouverte," which means "the open city." Paris is known as the "City of Light" and is a UNESCO World Heritage site. It is home to numerous museums, monuments, and cultural institutions, including the Louvre Museum and the Palace of Versailles. Paris is a bustling and diverse city with



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities and possibilities. Here are some possible future trends that could shape how AI is used and developed:

1. Personalized AI: AI will become more personalized as it learns from user data and behavior, offering increasingly tailored solutions to help users achieve their goals.

2. Autonomous vehicles: AI will play a key role in autonomous vehicles, reducing accidents and improving road safety. This will also lead to the development of driverless taxis, self-driving shopping carts, and even self-driving drones.

3. Cognitive computing: AI will be able to process and interpret complex cognitive tasks, such as understanding natural language, learning from experience, and making




In [6]:
llm.shutdown()