# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
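
To see why this is needed, the following minimal sketch (standard library only, no SGLang involved) reproduces the error that plain `asyncio` raises when `asyncio.run` is called while a loop is already running, which is the situation inside an IPython or Jupyter cell. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into an already running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are allowed instead of raising.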

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Samantha. I live in the United States, I'm 16. I have a very big heart and I love sports, especially tennis. I play tennis with my friends. I also love animals, and I keep them in my house. My favorite color is purple and I also love to make music. 

What is Samantha's hobby? (If the question contains a typo, fix the typo and test again.) Samantha's hobby is playing tennis. She plays tennis with her friends and also loves animals and makes music. The question should be corrected to "What is Samantha's hobby?" to match the given text. The corrected version is
Prompt: The president of the United States is
Generated text:  a member of the Cabinet, and he is the head of the executive branch of the United States government. Given these facts, what is the head of the executive branch of the United States? The answer to the riddle is the President of the United States. The President of the United States is a member of the Cabinet, which is a group of
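
In offline batch jobs the results are usually saved rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. It assumes only what the cell above shows: one dict per prompt, in input order, with the completion under the `"text"` key. `fake_generate` and `to_jsonl` are hypothetical helpers, not part of SGLang, and the placeholder completions are not real model output.

```python
import json
from typing import Dict, List


def fake_generate(prompts: List[str]) -> List[Dict[str, str]]:
    # Hypothetical stand-in for llm.generate(prompts, sampling_params):
    # one dict per prompt, in input order, with the completion under "text".
    return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    # Guard against silently dropping items, since zip() stops at the shorter input.
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    records = [{"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)]
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


prompts = ["Hello, my name is", "The capital of France is"]
jsonl = to_jsonl(prompts, fake_generate(prompts))
print(jsonl)
```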

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What are some of your favorite things to do? I love [insert a short description of your favorite activity or hobby]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite book or movie? I love [insert a short description of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a bustling metropolis with a diverse population and a rich cultural heritage. It is a popular tourist destination and a major economic center in Europe. The city is home to many famous museums, theaters, and restaurants,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Enhanced privacy and security: As AI becomes more prevalent, there will be a need for increased privacy and security measures to protect the data and personal information that is generated and processed by AI systems.

3. Greater focus on ethical considerations: As AI becomes more advanced, there will
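
The "overlap removal" mentioned in the cell above can be illustrated without an engine. The following is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat the end of the text received so far. It is an assumption about the general technique, and SGLang's own `stream_and_merge` helper may be implemented differently. `fake_stream` is a hypothetical stand-in for a streaming engine call.

```python
from typing import Dict, Iterable, Iterator


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, str]]) -> str:
    # Append only the part of each chunk that has not been seen yet.
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


def fake_stream() -> Iterator[Dict[str, str]]:
    # Hypothetical chunks; each one repeats the tail of the previous one.
    yield {"text": " Paris, also"}
    yield {"text": " also known as"}
    yield {"text": " as the City of Light."}


merged = merge_stream(fake_stream())
print(repr(merged))
```

One caveat of this approach: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is treated as overlap and dropped.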



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane Smith, and I'm a skilled writer and a freelance writer. I love exploring different genres and constantly learning new things. I enjoy writing short stories and novels and have published a few books in the past. I'm very creative and have a natural ability to convey emotions and ideas in writing. I love to write on the go and love to collaborate with other writers. I'm also an avid reader and enjoy reading a variety of genres.
Question: How can I get started with writing my own stories? How do I become a successful writer? Jane Smith: As for getting started with writing your own stories, you can start by gathering your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city in the country and a UNESCO World Heritage site. It is known for its historical landmarks, cuisine, fashion, and annu
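
The benefit of the asynchronous variant is that other coroutines can make progress while generation is awaited. The sketch below shows this with a stub in place of the engine: `fake_async_generate` is a hypothetical stand-in for `await llm.async_generate(prompts, sampling_params)`, and `heartbeat` represents any unrelated work running on the same event loop.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompts: List[str]) -> List[Dict[str, str]]:
    # Hypothetical stand-in for the engine call; the sleep simulates generation time.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion {i}>"} for i in range(len(prompts))]


async def heartbeat(ticks: List[int]) -> None:
    # Unrelated work that keeps running while generation is awaited.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0)


async def main() -> List[Dict[str, str]]:
    ticks: List[int] = []
    # gather() schedules both coroutines on the same loop and waits for both.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["prompt a", "prompt b"]),
        heartbeat(ticks),
    )
    print("heartbeat ticks:", ticks)
    return outputs


outputs = asyncio.run(main())
print(outputs)
```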

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Type] with a passion for [Purpose]. I thrive in [Role] and am always looking for opportunities to [Benefit]. What excites me most about my work is [What Excites Me Most]. I believe in [Core Values] and I strive to [Why I Strive for Excellence]. I'm always learning and growing, driven by [Why I Am A Marketer]. And I'm always looking for the next big idea and a chance to [What I Hope for]. I believe that success in my work is not just about reaching a certain goal, but also about creating meaningful impact

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and is home to many of France's most famous landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also the seat of the French government and is one of the most important cities in the world for its cultural and historical significance. Paris has a diverse population and is known for its elaborate architecture and museums. It is a city of contrasts, with its towering buildings and narrow streets, but its famous landmarks and museums make it a global destination for those who love art, history, and culture. Paris is often referred to as "The City of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see an increase in the integration of AI into various industries, including healthcare, transportation, and manufacturing. AI technologies such as machine learning and deep learning are expected to play a significant role in these areas, as they can help automate tasks, improve efficiency, and provide more accurate predictions. Additionally, AI will continue to become more sophisticated and more capable, with the ability to learn and adapt to new situations, which will make it easier for humans to interact with AI systems. Finally, the integration of AI into everyday life will continue to grow, with the ability to use AI to assist with tasks such as grocery shopping or scheduling appointments. Overall
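
The pattern used in this cell, an async generator that yields only the not-yet-seen part of each streamed chunk, can be sketched without an engine. This is an illustration under assumptions, not SGLang's actual `async_stream_and_merge` implementation. `fake_async_stream` is a hypothetical stand-in for a streaming engine call, and the chunk contents are made up.

```python
import asyncio
from typing import AsyncIterator, Dict, List


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_async_stream() -> AsyncIterator[Dict[str, str]]:
    # Hypothetical chunks; the second and third repeat the tail of earlier text.
    for text in (" Paris", " Paris.", ". It is"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_without_repeats(
    chunks: AsyncIterator[Dict[str, str]],
) -> AsyncIterator[str]:
    seen = ""
    async for chunk in chunks:
        cleaned = trim_overlap(seen, chunk["text"])
        seen += cleaned
        if cleaned:  # skip chunks that contained nothing new
            yield cleaned


async def collect() -> List[str]:
    return [piece async for piece in stream_without_repeats(fake_async_stream())]


pieces = asyncio.run(collect())
print(pieces)
```

Printing each piece with `end=""` and `flush=True`, as the cell above does, makes the text appear incrementally on a single line.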




In [6]:
llm.shutdown()