# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a separate HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
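
To make the custom-server idea concrete, here is a minimal, framework-agnostic sketch of a request handler that forwards a JSON body to an engine. `StubEngine` and `handle_generate` are hypothetical names introduced only for illustration, and the stub assumes that `async_generate` returns a dict with a `text` key when given a single prompt, matching the output shape used elsewhere in this document. In a real server you would pass an `sgl.Engine` instance instead of the stub and call the handler from your web framework of choice.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields one dict with a "text" field.
        return {"text": f" [completion of {prompt!r}]"}


async def handle_generate(engine, raw_body: str) -> tuple[int, str]:
    """Parse a JSON request body, call the engine, and return (status, JSON body)."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body must be valid JSON"})

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    sampling_params = payload.get("sampling_params") or {}
    output = await engine.async_generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})


engine = StubEngine()
ok = asyncio.run(handle_generate(engine, '{"prompt": "The capital of France is"}'))
bad = asyncio.run(handle_generate(engine, '{"prompt": ""}'))
print(ok)
print(bad)
```

Keeping the handler independent of any web framework makes it easy to test and lets the same function back an HTTP route, a queue consumer, or a CLI.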



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
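
The following standard-library sketch shows the underlying problem that `nest_asyncio` works around: starting an event loop from code that is already running inside one raises a `RuntimeError`. It does not use SGLang at all. That the engine hits this situation in a notebook because it drives an event loop internally is an assumption here, not something stated above.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() already runs inside an event loop, much like code in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, nested calls of this kind are permitted, which is why the patch is applied before creating the engine.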

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops (needed in IPython and notebooks)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  John. I'm 18 years old and I have a lot of friends. We all like to go to the movies and watch funny movies. 

My best friend is Mike. Mike and I like the same movies. We both like to be funny. 

This week, our parents told us that we will go to the movies on Friday night. We're going to go to the movies with Mike. I love the movies and I have a lot of friends who also like the movies. 

So, what would you do if you were a movie goer? Would you go to the movies? Why or why not?

Based on
Prompt: The president of the United States is
Generated text:  represented by a 1/5 doctor and a 3/5 nurse. If the number of doctors is twice the number of nurses, and there are 20 nurses, how many people are in total represented by the presidents? Let's denote the number of doctors as D, the number of nurses as N, and the number of nurses who are doctors as D_nurse, and the number of nurses who are nurses as D_nurse_nurse.

Given:
- The president is represente

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is home to many famous museums, including the Musée d'Orsay and the Musée Rodin, and is known for its food and wine culture. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and artificial intelligence: As AI continues to advance, we can expect to see more automation and artificial intelligence in various industries. This could lead to increased efficiency, cost savings, and job displacement.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there will be a greater emphasis on protecting privacy and security. This could lead to new regulations and standards for AI development and use.

3. Enhanced human-computer interaction:
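
The "overlap removal" mentioned above can be illustrated with a small, self-contained sketch. The idea is that a streamed chunk may repeat text that was already emitted, so only the part not yet seen is appended. `trim_overlap` and `merge_chunks` below are illustrative helpers written for this page, and the chunk dicts are made-up sample data. This is one plausible way to do it and not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while removing repeated overlaps."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


# Made-up chunks that overlap with the text emitted before them.
sample_chunks = [
    {"text": "The cap"},
    {"text": "capital of"},
    {"text": " of France"},
]
merged = merge_chunks(sample_chunks)
print(merged)
```

The same logic also handles streams in which every chunk contains the full text so far, because the whole previous text is then the overlap.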



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old aspiring author. I am passionate about writing because I find it a great way to express my thoughts and ideas in a unique and creative way. I like to experiment with different genres and styles to find what works best for me. I love to read and listen to music, and I am always looking for new ways to expand my knowledge and improve myself as an author. What other hobbies or interests do you have? Hi there! I'm a 32-year-old aspiring writer named [Name]. I'm passionate about storytelling and writing. I like to experiment with different writing styles and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city in the country and one of the most visited cities in the world. The city is home to the seat of government, the Louvre Museum, and a rich cultural
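
When prompts arrive independently rather than as one list (for example, from separate clients of a custom server), one option is to issue one asynchronous request per prompt and bound how many are in flight. The sketch below is a hedged illustration: `StubEngine`, `generate_concurrently`, and `max_in_flight` are hypothetical names, and the stub only mimics the `async_generate` call shape used above so that the example runs without a GPU.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" [completion of {prompt!r}]"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Send one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() preserves input order, so results line up with prompts.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(StubEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

If all prompts are known up front, passing the whole list to a single `async_generate` call, as in the cell above, is simpler.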

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah. I'm a writer and editor with a passion for storytelling and creativity. I'm always looking for new ways to connect with people and create meaningful content. I'm also someone who enjoys exploring different genres and learning about the art of writing. I believe in the power of words and love to make them come to life on the page. How can I get started with my writing career? Are there any particular books, authors, or writing workshops that you would recommend? As a writer, what's the most challenging aspect of your work and how do you overcome it? Lastly, what kind of experiences do you think would help someone starting out



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an iconic city that is known for its rich history, beautiful architecture, and vibrant cultural scene. The city is also home to many famous landmarks, such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is a bustling city with a diverse population, and it has a strong emphasis on culture, food, and entertainment. It is often referred to as "the City of Light" due to its vibrant nightlife and fashion scene.Human: Can you provide a list of France's three famous landmarks? The three famous landmarks are: Notre-Dame Cathedral, Eiffel Tower, and


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and is likely to continue to develop rapidly. Some possible trends in the future of AI include:

1. Advancements in machine learning: As machine learning algorithms become more sophisticated, they will be able to process and analyze vast amounts of data more efficiently and accurately. This will allow AI to solve increasingly complex problems and make better predictions.

2. AI in healthcare: AI will play an increasingly important role in healthcare, with the ability to analyze medical records, detect diseases, and provide personalized treatment plans. This will help to improve patient outcomes and reduce costs.

3. AI in transportation: AI will be used to optimize traffic flow, improve road




```python
llm.shutdown()
```
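
In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps engine creation in a context manager. `managed_engine`, `make_engine`, and `StubEngine` are hypothetical helpers introduced for illustration, and the stub stands in for the real engine so that the example runs without a GPU. In real use the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" [completion of {p!r}]"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


created = []


def make_engine():
    engine = StubEngine()
    created.append(engine)  # keep a reference so we can inspect it afterwards
    return engine


try:
    with managed_engine(make_engine) as engine:
        outputs = engine.generate(["The capital of France is"])
        raise ValueError("simulated failure after generation")
except ValueError:
    pass

print(created[0].is_shut_down)
```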