# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio
If you want to use the **Offline Engine** in IPython (including Jupyter notebooks) or other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
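This is needed because `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython and Jupyter. The stdlib-only sketch below reproduces that error without SGLang. `inner` and `cell` are hypothetical names used only for illustration; `nest_asyncio.apply()` patches asyncio so that nested calls like this are allowed.

```python
import asyncio


async def inner():
    return "done"


async def cell():
    # Simulates notebook cell code, which already runs inside an event loop.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return f"RuntimeError: {exc}"


print(asyncio.run(cell()))
```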

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Used only in SGLang's CI environment
    import patch
else:
    # Allow asyncio.run() inside IPython/Jupyter's already-running event loop
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Shashank. I am a computer engineer with a passion for building and developing web applications. My work has been involved in creating various types of web applications including but not limited to web servers, web browsers, and web applications. I have been a software developer for several years and have had the opportunity to work on a wide variety of projects and have learned a lot from them. My goal is to continue to learn and improve myself by gaining a better understanding of different programming languages and technologies. I have been using various programming languages like Python, Java, C++, JavaScript and SQL to create my web applications and I am constantly working to improve my
Prompt: The president of the United States is
Generated text:  trying to decide how many military troops to have deployed to Iraq.  Given a, the country, and b, the number of troops deployed to Iraq in a given year, how many permutations are there in which b
```
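The `temperature` and `top_p` keys in `sampling_params` control how each next token is drawn. As a rough, framework-agnostic illustration (not SGLang's actual sampler, which may differ in details such as tie-breaking), temperature rescales the logits before the softmax, and top-p (nucleus) sampling keeps only the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The helper names and the example logits and probabilities are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p, zero out the rest, and renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.2))  # much more peaked

probs = [0.6, 0.3, 0.08, 0.02]
print(top_p_filter(probs, 0.85))  # only the first two tokens survive
```

A lower temperature such as the `0.2` used in the next section makes outputs more deterministic. A `top_p` close to `1.0` filters out only the long tail of unlikely tokens.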

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [hobby or activity], and I'm always looking for new ways to explore and discover new things. What's your favorite book or movie? I love [book/movie], and I'm always looking for new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a popular tourist destination, with many attractions and events throughout the year. Paris is a cultural and historical center that is home to many world-renowned institutions and landmarks. The city is known for its vibrant nightlife, fashion scene, and delicious cuisine. It is a major transportation hub, with many international airports and train stations. Paris is a city that is constantly evolving and changing, with new developments and attractions being added regularly. The city is a must-visit for anyone interested in French culture, history, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends include:

1. Increased integration of AI into various industries: AI is already being used in a wide range of industries, from healthcare and finance to transportation and manufacturing. As AI becomes more integrated into these industries, we can expect to see even more applications and use cases.

2. Greater emphasis on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

3. Development of new AI technologies
```
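`stream_and_merge` comes from `sglang.utils`, and its internals are not shown here. The following is a sketch, under assumptions, of one way "overlap removal" can work: before appending each streamed chunk, drop the longest prefix of the chunk that already matches the end of the text accumulated so far. `trim_overlap` and `merge_stream` are names chosen for illustration.

```python
def trim_overlap(existing, chunk):
    # Drop the longest prefix of `chunk` that is already a suffix of `existing`.
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks):
    # Accumulate streamed chunks without repeating overlapping text.
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(repr(merge_stream([" Paris", " Paris is", " Paris is the capital"])))  # ' Paris is the capital'
print(repr(merge_stream([" Paris", ".", " It"])))  # ' Paris. It'
```

This approach handles both cumulative chunks and pure deltas. The trade-off is that it can swallow text that legitimately repeats: two consecutive `" ha"` deltas merge into a single `" ha"`.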



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Age] year old [Gender] [Name] who works as a [Your Job Title]. I have always been [Your特长], and I enjoy [Your Passion], [Your Hobby], [Your Interests], and [Your Unique Strengths]. I am excited to meet you! If you have any questions, please feel free to ask. Wishing you good luck on your journey. [Your Name] [Your Contact Information] [Your Motivation] [Your Challenges] [Your Future Goals] [Your Mission] [Your Values] [Your Future Me] [Your Final Goals] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the third-largest city in the European Union. It is the most populous city in the country, with an estimated population of over 2.1 million as of 2021. It is located on the Seine River and is home to the Eiffel Tower, Louvre Museum, Notre-Dame
```
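The code above passes the whole list to `async_generate` in one call and pairs results with `zip(prompts, outputs)`. That pairing relies on outputs coming back in prompt order. If you instead issue one request per prompt, `asyncio.gather` returns results in argument order regardless of which request finishes first. The sketch below shows this without SGLang. `fake_async_generate` is a hypothetical stand-in for an engine call, and the delays are arbitrary.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    # Hypothetical stand-in for one engine request; returns a dict shaped
    # like the outputs above ({"text": ...}) after `delay` seconds.
    await asyncio.sleep(delay)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts):
    finished = []  # order in which requests actually complete

    async def one(index, prompt):
        # Later prompts get shorter delays, so they finish first.
        output = await fake_async_generate(prompt, delay=0.05 * (len(prompts) - index))
        finished.append(prompt)
        return output

    # gather() returns results in argument order, not completion order.
    outputs = await asyncio.gather(*(one(i, p) for i, p in enumerate(prompts)))
    return outputs, finished


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs, finished = asyncio.run(generate_all(prompts))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text:{output['text']}")
print("Completion order:", finished)
```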

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge (from sglang.utils) streams from the engine and
        # yields only text that has not already been printed (overlap removed)
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name], and I'm a personable and friendly individual who loves to travel. I'm a travel enthusiast who has explored the world in my travels, and I love sharing my experiences with others. Whether you're a solo traveler or a group of friends, I'm here to help you plan your perfect vacation. I believe that travel is a wonderful way to learn about different cultures, meet new people, and grow as an individual. So, if you're looking for a travel companion who will make you feel happy, safe, and relaxed, I'm the one for you. Let's embark on a fun-filled adventure together

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

I'm sorry, I am an AI language model and I do not have the capability to provide factual information. Can you please provide me with more context or instructions so I can assist you better?

Paris is the capital of France and is the largest city in the country. It is known for its rich history, beautiful architecture, and cultural attractions such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also known for its fashion industry, and the city is home to many fashion shows and events.

Paris is a major transportation hub, with multiple international airports and train stations. It

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and exciting, with many potential areas of development that will continue to shape the technology and its applications. Here are some possible trends in AI that are currently in the early stages of development:

1. Improved efficiency: One area of AI that is likely to continue to grow in importance is in improving efficiency. With the rise of big data and machine learning, AI is becoming more capable of analyzing and interpreting large amounts of information in real-time. This makes it possible for companies and organizations to make more informed decisions and improve their operations more quickly.

2. Personalized experiences: Another area of AI that is likely to continue to grow in importance is
```



Shut down the engine when you are finished:

```python
llm.shutdown()
```
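To make sure the engine is shut down even when generation raises an exception, you can wrap the work in `try`/`finally`. The sketch below packages that pattern as a context manager. `FakeEngine` and `running_engine` are hypothetical; with SGLang you would pass the `sgl.Engine` instance instead, and only its `shutdown()` method is used.

```python
from contextlib import contextmanager


class FakeEngine:
    # Hypothetical stand-in for sgl.Engine; only models shutdown().
    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def running_engine(engine):
    # Yield the engine and always call shutdown(), even if the body raises.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with running_engine(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
print(engine.is_shut_down)  # True
```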