# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
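To make the custom-server idea concrete, here is a minimal sketch using only the standard library. It is not the linked `custom_server.py` example. `EchoEngine` is a hypothetical stand-in for `sgl.Engine` so the snippet runs without a GPU, and the names `handle_generate`, `make_handler`, and the `prompt` / `sampling_params` request fields are assumptions made for illustration. The stub also assumes that a single-prompt call returns a dict with a `text` key.

```python
import json
from http.server import BaseHTTPRequestHandler


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine (assumption, not the real API)."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, payload):
    """Validate a decoded JSON payload and call the engine once."""
    if not isinstance(payload, dict):
        return 400, {"error": "request body must be a JSON object"}
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return 400, {"error": "'prompt' must be a non-empty string"}
    result = engine.generate(prompt, payload.get("sampling_params") or {})
    return 200, {"text": result["text"]}


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                status, body = 400, {"error": "invalid JSON"}
            else:
                status, body = handle_generate(engine, payload)
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return GenerateHandler


# Exercise the request logic directly, without opening a socket.
status, body = handle_generate(EchoEngine(), {"prompt": "Hello"})
print(status, body)
```

To serve requests, the handler class returned by `make_handler(engine)` can be passed to `http.server.HTTPServer` together with an address of your choice. Keeping the validation in a plain function makes it easy to test without a running server.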



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
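The need for this patch can be reproduced with the standard library alone. The sketch below, which does not involve SGLang at all, calls `asyncio.run()` from inside a coroutine that is already running on an event loop. Without `nest_asyncio`, Python rejects the nested call with a `RuntimeError`. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # Starting a second event loop from inside a running one is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

An interactive IPython kernel is in the same position as `outer()`: a loop is already running, so any library call that starts its own loop fails unless nesting has been enabled.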

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling: you pass a list of prompts in one call and receive one output per prompt.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # In CI, a local `patch` module is imported instead of nest_asyncio.
    import patch
else:
    # Allow asyncio.run() inside an already-running event loop (e.g. IPython).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.13s/it]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Taro. I'm a 22-year-old science student. I'm interested in AI. I like the idea of creativity, and I want to do a job that I can express my creativity. I have some experience in teaching, so I think I can use my experience to do this job.

Please answer the following question about my profile: Do you have any relevant skills, experience, or education that I can use to apply to this job? Additionally, can you provide me with some suggestions on what specific skills or experiences I should focus on in my application? Taro's profile is a description of a job seeker who is interested
Prompt: The president of the United States is
Generated text:  a member of the Legislative branch of the government. Which of the following is true regarding the President of the United States?

A) He is the most powerful person in the government
B) He is the only person in the government
C) He serves a shorter term than the Vice President
D) He is elected by the peopl
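In a batch job the outputs are usually stored rather than printed. The following sketch pairs each prompt with its completion and serializes the result as JSON Lines. It relies only on what the cell above shows, namely that each output is a dict with a `text` key. The helper names `to_records` and `to_jsonl` and the `fake_outputs` data are hypothetical, used here so the snippet runs without a model.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the text of its output, preserving order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Fake outputs in the same shape as the list returned in the cell above.
fake_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
records = to_records(["The capital of France is", "The future of AI is"], fake_outputs)
jsonl = to_jsonl(records)
print(jsonl)
```

With a real engine, `fake_outputs` would be replaced by the return value of `llm.generate(prompts, sampling_params)`.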

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville-Marie" or "La Ville de Paris". It is the largest city in France and the second-largest city in the European Union. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for the arts, music, and fashion. Paris is known for its rich history, culture, and cuisine, and is a popular tourist destination. The city is also home to many international organizations and institutions, including the French Academy of Sciences and the French National Library. Paris is a vibrant and dynamic city with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation: As AI continues to become more advanced, it is likely to automate more and more tasks, freeing up human workers to focus on more complex and creative work. This could lead to a shift in the job market, with many jobs being replaced by AI.

2. AI ethics and privacy: As AI becomes more advanced, there will be a need to address ethical and privacy concerns. This could lead to new regulations and standards being
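The "overlap removal" mentioned in the cell above can be illustrated with a small sketch. When successive streamed chunks repeat the tail of the text received so far, the repeated prefix of each new chunk is dropped before appending. This is an illustration of the idea under that assumption, not necessarily how `sglang.utils.stream_and_merge` is implemented, and `merge_chunks` is a hypothetical helper name.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing the overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of", " of France"])
print(merged)
```

One limitation of this approach is that it cannot tell an overlap from a legitimate repetition: if the model really emits the same text twice in a row, the second copy is trimmed as well.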



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a/an [Job Title] at [Company Name]. I have [Number of years in Position], [Number of years in Role] experience in [Industry]. I was [First Job] at [Company Name] for [Number of Years], and I have [Number of Years] years of experience in [Industry]. I have a passion for [Your Passion], and I'm looking to [What You Desire to Achieve]. I'm excited to learn more about you and how I can help you achieve your goals. [Tell a little bit about yourself that is not job-related or industry-specific]. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its stunning architecture, vibrant culture, and romantic atmosphere.

That's correct! Paris, also known as the "City of Light," is the capital of France and is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Not
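Asynchronous generation is most useful when requests originate from several coroutines. The sketch below shows one common asyncio pattern for that case: issuing one request per prompt while capping how many are in flight with a semaphore. `FakeAsyncEngine` is a hypothetical stand-in that only mimics the shape of `async_generate` (an awaitable returning a dict with a `text` key), and `generate_bounded` and `max_concurrency` are illustrative names, not SGLang APIs.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine (assumption, not the real API)."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_bounded(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(prompt) for prompt in prompts))


prompts = ["a", "b", "c"]
outputs = asyncio.run(
    generate_bounded(FakeAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

When all prompts are available up front, passing the whole list in a single `async_generate` call, as in the cell above, is simpler and leaves batching to the engine.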

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Occupation] with [Number] years of experience. I am a [Degree] with [Number] years of education, and I specialize in [Your Field of Interest]. I have a [Number] of years of experience in [Your Field of Interest], and I am always looking for ways to [What You Do Best]. I am a [Number] person, and I thrive on [Why It's Important]. And, I am an [Number] person, and I love [What You Do Best]. If you would like to chat about anything, feel free to reach out. I look



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the world's third-largest city by population, after Tokyo and Delhi. It is home to the Louvre Museum, Notre-Dame Cathedral, and the Eiffel Tower. Paris is known for its beautiful architecture, world-famous museums and art galleries, and iconic landmarks. The city is also famous for its food, music, and cultural events. Paris has a rich history and has been a major center of European and world culture for centuries. As a result, it has developed a unique and distinctive identity that is both cultural and political. The city has a population of over 2 million



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving, with potential areas of development that are already shaping the future of technology. Some potential areas of development include:

1. Autonomous vehicles: With advances in AI and robotics, autonomous vehicles may become more common and efficient. They could also reduce accidents and improve traffic flow.

2. Personalized healthcare: AI could be used to create more personalized healthcare plans based on an individual's genetic, lifestyle, and medical history. This could lead to better treatments and potentially reduce costs.

3. Smart cities: AI could be used to improve the efficiency and sustainability of cities by optimizing energy use, traffic flow, and public services.

4.
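The loop in this section consumes one stream at a time. As a sketch of an alternative asyncio pattern, several streams can be consumed concurrently, with each one collected into its own buffer so the chunks do not interleave in the output. `fake_stream` is a hypothetical stand-in for an async generator such as the one returned by `async_stream_and_merge`, and `collect` and `stream_all` are illustrative names. Whether a real engine gains throughput from this depends on its scheduler; the snippet only demonstrates the consumption pattern.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming async generator (assumption)."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(stream):
    """Consume one stream fully and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


async def stream_all(streams):
    # One consumer per stream; gather returns results in input order.
    return await asyncio.gather(*(collect(stream) for stream in streams))


texts = asyncio.run(
    stream_all(
        [
            fake_stream([" Paris", ".", " It", " is"]),
            fake_stream([" exciting", " and"]),
        ]
    )
)
print(texts)
```

Buffering per stream trades live display for clean output. To print chunks as they arrive from several streams, each chunk would need to be tagged with the prompt it belongs to.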




In [6]:
llm.shutdown()
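In a script, as opposed to a notebook, an exception raised during generation would skip the final `shutdown()` call. One way to guard against that is a small context manager built with `contextlib`, sketched below. `FakeEngine` is a hypothetical stand-in that exposes only the `shutdown()` method used in the cell above, and `managed` is an illustrative helper, not part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine (assumption, not the real API)."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm_handle:
        raise RuntimeError("simulated failure during generation")
except RuntimeError as exc:
    print("caught:", exc)

print("engine closed:", engine.closed)
```

With a real engine, the body of the `with` block would contain the `generate` calls, and the engine's resources would be released whether or not they succeed.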