# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
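
To see why this is needed, the following standalone sketch (standard library only, no SGLang involved) reproduces the error that nested event loops trigger: calling `asyncio.run` from code that is already running inside an event loop, which is the situation in IPython or Jupyter. The function names `inner` and `outer` are illustrative and do not come from SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as it is in a notebook cell.
    coro = inner()
    try:
        # Plain asyncio refuses to start a second loop from inside a running one.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.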

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.89it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Vivian. I'm a bright and curious student in my fourth year of high school. I'm really interested in taking up photography. I'd like to buy a camera to learn to take good pictures. What would you recommend to me?

Can you help me choose a good camera to buy? Or do you have any other suggestions?

Please don't give me any product reviews.

Thank you!
Vivian
Sure, I'd be happy to help guide you in choosing a good camera! Let's break down the factors to help you make a decision:

### 1. **Purpose and Purpose**
   - **Photography**: If
Prompt: The president of the United States is
Generated text:  now the world's most powerful person, and he or she is every leader's dream. Unfortunately, however, the United States faces many problems. There is a saying that: "The more money one makes, the bigger the room to be rich." To make more money, Americans depend on the fast food industry, foreign trade, and money. While the United States has many advantages
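
The cell above relies on `generate` returning one result per prompt, in order, with the completion stored under the `"text"` key. As a minimal sketch of collecting those results for later processing, the helper below pairs each prompt with its completion. The names `pair_prompts_with_text` and `_EchoEngine` are hypothetical; `_EchoEngine` is only a stand-in so the sketch runs without loading a model, and in practice you would pass the `llm` object created earlier.

```python
def pair_prompts_with_text(engine, prompts, sampling_params):
    """Run one batched generate call and pair each prompt with its completion."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


class _EchoEngine:
    """Hypothetical stand-in that mimics the output shape used in this document."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (completion of {p!r})"} for p in prompts]


records = pair_prompts_with_text(
    _EchoEngine(), ["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95}
)
print(records)
```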

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and politics, and is home to many of the world's most famous museums and attractions. Paris is a vibrant and diverse city with a rich history and a strong sense of French identity. Its status as the world's most populous city is reflected in its population of over 25 million people. The city is also home to many international organizations and institutions, including the European Union and the United Nations. Paris is a city of contrasts, with its modern architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with other technologies: AI will continue to be integrated with other technologies such as machine learning, robotics, and natural language processing, creating a more complex and interconnected system.

2. Enhanced privacy and security: As AI becomes more integrated with other technologies, there will be increased concerns about privacy and security. Governments and organizations will need to develop new privacy and security policies to protect the data and information that is generated and used by AI systems.

3. Increased focus on ethical considerations: As AI becomes more integrated with other technologies, there will be increased focus on ethical
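
The helper `stream_and_merge` is imported from `sglang.utils`, and its implementation is not shown in this document. To illustrate what "overlap removal" can mean when streamed chunks repeat text that was already received, here is a minimal sketch under that assumption. The functions `trim_overlap` and `merge_chunks` are written for this example and should not be read as the library's code.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world!"])
print(merged)
```

One caveat of this simple suffix/prefix matching: it cannot tell a repeated chunk from text that legitimately repeats, so `merge_chunks(["ha", "ha"])` yields `"ha"` rather than `"haha"`.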



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a 25-year-old individual who is always looking to learn and grow. I'm not afraid to ask questions and have a positive attitude. I love exploring new things and trying new things, and I'm always up for a challenge. I also enjoy working with people and helping them achieve their goals. I'm confident and passionate about my work and want to continue pushing myself to do better. What is your name? What's your occupation? What's your favorite hobby? What's your greatest achievement? What's your favorite book? What's your favorite food? What's your favorite movie? What's your favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city in Europe and home to the City of Light and the Eiffel Tower. Paris is known for its historical significance, world-renowned museums, art gal
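
One reason to prefer the asynchronous call is that several requests can be awaited at the same time. The sketch below awaits two batches concurrently with `asyncio.gather`. It assumes only that the engine exposes an `async_generate(prompts, sampling_params)` coroutine, as used in the cell above; `_FakeAsyncEngine` and `generate_two_batches` are hypothetical names, and the stand-in exists so the example runs without a model. How a real engine schedules concurrent requests is not covered here.

```python
import asyncio


class _FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" (completion of {p!r})"} for p in prompts]


async def generate_two_batches(engine, batch_a, batch_b, sampling_params):
    """Await two batched requests concurrently and return both result lists."""
    return await asyncio.gather(
        engine.async_generate(batch_a, sampling_params),
        engine.async_generate(batch_b, sampling_params),
    )


results_a, results_b = asyncio.run(
    generate_two_batches(
        _FakeAsyncEngine(),
        ["The capital of France is"],
        ["The future of AI is", "Hello, my name is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(len(results_a), len(results_b))
```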

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [occupation] who loves [your interest or hobby]. I am a [job title] at [company name], where I work hard to [your professional goal or responsibility]. I am a [language] speaker who can speak [language] fluently. I have [number of years of experience] years of experience in [your field or area of work]. I am a [type of character], [your personality traits] and I am always [your character trait] at all times. I am a [job title] at [company name], where I work hard to [your professional goal or responsibility



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which has been the seat of government, capital, and largest city in France since the 19th century. It is a major international center and the largest metropolitan area in the world, and is known for its rich cultural heritage, including the Eiffel Tower, Notre-Dame Cathedral, and iconic landmarks like the Louvre Museum and the Palace of Versailles. The city is also known for its artistic, literary, and entertainment scenes, and is a major cultural center for Europe and beyond. Paris has a population of over 2 million people and is home to many world-renowned museums, theaters, and restaurants. The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and there are several trends that are likely to shape the development of this technology in the years to come:

1. Increased integration: AI systems will become more integrated with other technologies, such as sensors and IoT devices, to better understand and interact with the physical world.

2. Enhanced natural language processing: AI will be able to understand and respond to human language in a way that is more natural and easier to understand.

3. Personalization: AI will be able to learn from users' interactions and provide personalized recommendations and solutions based on their preferences.

4. Autonomous systems: Self-driving cars, drones, and other autonomous systems will become more




In [6]:
llm.shutdown()