# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
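
The reason is that IPython and Jupyter already run an event loop, and standard `asyncio` refuses to start a second one inside it. The standalone sketch below uses only the standard library (the names `inner` and `outer` are illustrative) and reproduces that failure without SGLang. This is the situation `nest_asyncio.apply()` is meant to patch.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so asking
    # asyncio.run() to start another one raises RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```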

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.85it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Hannah and I'm a passionate, hardworking 23-year-old women who will always strive to be the best at whatever I do! I have a desire to be a multicultural, inclusive, and well-rounded person who will be able to grow and learn as a whole. I am eager to start my college years and make new friends, but I also want to be able to trust and rely on others to guide me and make me feel supported. I am currently in the process of applying for college and am ready to start at a new institution. I am looking for advice on how to make a good first impression at a new school and
Prompt: The president of the United States is
Generated text:  a public official who serves as the head of the executive branch of the federal government. The position is also the highest office in the United States, and the office has a long and storied history. The position can be held by either an elected or a popularly elected candidate, although it is typically held by a candida

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that hosts the Eiffel Tower and is known for its rich history and culture. It is also the seat of the French government and is home to many of the country's most famous landmarks, including the Louvre Museum and the Notre-Dame Cathedral. Paris is a vibrant and diverse city with a rich cultural heritage that has been shaped by its history and its people. It is a city that is constantly evolving and changing, with new developments and attractions being added to the city's list of attractions. Paris is a city that is a true reflection of France's rich history and culture, and it is a city that is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that could shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, we can expect to see even more sophisticated applications in healthcare, such as personalized medicine, disease diagnosis, and drug discovery.

2. Increased use of AI in manufacturing: AI is already being used in manufacturing to improve efficiency, reduce costs, and increase productivity. As AI becomes more advanced
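
The "overlap removal" named in the banner above can be pictured as follows. This is a standalone sketch of the idea, not necessarily how `stream_and_merge` is implemented. It assumes a stream may re-send text that was already emitted, and the helper names `trim_overlap_sketch` and `merge_chunks` are made up for this illustration.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Hypothetical stream in which the second and third chunks repeat earlier text.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

A purely textual check like this can also trim genuine repetitions in the generated text, so it only fits streams that are known to re-send earlier output.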



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a friendly and approachable individual. I love meeting new people and sharing interesting stories with them. I have a great sense of humor and am always looking for new and exciting things to do. I enjoy trying new foods, drinks, and cultures, and I'm always up for a good laugh. I am always looking for new opportunities to contribute to the community and help people in need. Thank you for taking the time to meet me. What kind of experiences are you looking to contribute to the community? As an AI language model, I don't have personal experiences to contribute to the community, but I can provide

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the 19th largest city in the world and the third most populous city in Europe.

The Paris train station is one of the oldest in the world, having b
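
When requests arrive independently, as they would in a custom server, one common pattern is to issue them as separate coroutines and gather the results. The sketch below illustrates only the `asyncio` pattern. `StandInAsyncEngine` is a hypothetical placeholder that mimics the `async_generate` call shape used above, and its return value for a single prompt (a dict with a `text` key) is an assumption, so the snippet runs without a model.

```python
import asyncio


class StandInAsyncEngine:
    """Hypothetical stand-in for an engine; it echoes instead of running a model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate waiting on inference
        return {"text": f" <completion for: {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in the order given.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(StandInAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```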

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane, a hard-working and reliable woman. I am a certified public accountant with a proven track record of delivering high-quality work and maintaining a strong reputation in my industry. I have a keen eye for detail and am always prepared to work hard to exceed expectations. My commitment to excellence and the importance of accuracy is a hallmark of my work. I am a positive and assertive person who is always ready to take on new challenges and tackle difficult problems head-on. I am also a great listener and enjoy helping others understand the complexities of their financial situation. Overall, I am a reliable, competent and caring professional with a passion for helping others succeed

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the capital city of France. Here are some key facts about Paris:

1. Official Name: "Ville de Paris" (Paris) in French.
2. Capital: The city has been the capital since the 13th century.
3. Population: As of 2021, the city has a population of approximately 2.1 million people.
4. Language: Official language: French. It is also the working language in the French departments.
5. History: Founded by the Gauls, Paris was a major trade center during the Roman Empire.
6. Notable

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and diverse, with a wide range of possibilities and advancements. Here are some possible future trends in AI:

1. Integration of AI with other technologies: AI will continue to merge with other emerging technologies like machine learning, blockchain, and quantum computing. For example, AI-based predictive analytics can improve energy efficiency, financial decision-making, and healthcare outcomes.

2. Personalization of AI: AI will continue to improve its ability to provide personalized experiences, making it easier for users to interact with AI-powered devices, applications, and services. Personalized AI can also be used to improve the accuracy of medical diagnoses, financial predictions, and customer satisfaction surveys

In [6]:
llm.shutdown()
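
In a script, as opposed to a notebook, it can help to place the shutdown call in a `finally` block so the engine is released even if generation raises. The sketch below shows only that control flow. `StandInSyncEngine` is a hypothetical placeholder with the same `generate` and `shutdown` method names used in the cells above, not the real engine.

```python
class StandInSyncEngine:
    """Hypothetical placeholder exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p!r}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = StandInSyncEngine()
try:
    engine.generate([], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # runs on success and on failure

print(engine.is_shut_down)
```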