# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
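Notebook environments such as IPython already run an event loop, and the standard library refuses to start a second one inside it. The following minimal sketch uses only the standard library, with no SGLang calls, to reproduce the error that `nest_asyncio.apply()` is meant to work around. That the engine hits this same error internally is an assumption, and `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the nested call is expected to succeed instead of raising.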

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Matt. My age is the product of my sister's age and 10. My sister's age is 7 years. How old am I?

To determine Matt's age, we need to use the information given in the problem. We know that Matt's age is the product of his sister's age and 10, and his sister's age is 7 years. Let's break it down step by step.

1. Identify the age of Matt's sister:
   Matt's sister is 7 years old.

2. Calculate the product of the age of Matt's sister and 10:
   \[
   \
Prompt: The president of the United States is
Generated text:  a popular position, but the president also serves as a role model for the nation and makes important decisions that affect the lives of every American. What are the major responsibilities of the president?

As a role model for the nation, the president serves as an example of how to lead and be a good leader. He or she sets an example by being honest and transparent, and by being open to criticism and making changes to improve upon the
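For offline batch jobs it is often convenient to persist the results rather than print them. The sketch below assumes, as the cell above suggests, that each output is a dict with a `"text"` key and that outputs come back in the same order as the prompts. `results_to_jsonl` is a hypothetical helper, and the example data is a stand-in, not real engine output.

```python
import json


def results_to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

jsonl = results_to_jsonl(example_prompts, example_outputs)
print(jsonl)
```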

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive, enthusiastic statement about yourself]. I'm always looking for new challenges and opportunities to grow and learn. What do you like to do for fun? I love to read, travel, and explore new places. What's your favorite hobby? I love to cook and bake delicious meals. What's your favorite book? I love to read books that challenge me and inspire me. What's your favorite movie? I love to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich history and a vibrant culture. It is the largest city in France and is home to many of the country's most famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also known for its fashion industry, art scene, and food culture, and is a popular tourist destination. The city is home to many important institutions, including the French Academy of Sciences, the Musée d'Orsay, and the Louvre Museum. Paris is a city of contrasts, with its modern architecture and cultural attractions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. As AI becomes more integrated into our daily lives, we may see a shift towards more ethical and responsible use of the technology, with greater emphasis on transparency, accountability, and fairness in its implementation. Additionally, AI will likely continue to evolve and adapt to new challenges and opportunities, with new applications and technologies emerging on a regular basis. Overall, the future of AI is likely to be one of
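The "overlap removal" message suggests that streamed chunks may repeat text that was already emitted. The following is a minimal sketch of one way such merging could work; it is not necessarily how `stream_and_merge` is implemented. `strip_overlap` and `merge_chunks` are hypothetical names, and the chunk list is made-up data.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris is", " is the capital"]
merged_text = merge_chunks(chunks)
print(repr(merged_text))
```

A suffix/prefix heuristic like this can also remove a genuine repetition (for example `"ha"` followed by `"ha"`), so it only suits streams where overlap is actually expected.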



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I’m a [type of person] with a passion for [career or hobby]. I’m always up for a challenge, and I love spending my free time exploring new places and trying out new food. And of course, I have a knack for storytelling, whether it’s in the form of words or through my work as a writer. I love to share my experiences with others and make them come alive on the page. As a storyteller, I want to take the reader on a journey that will leave them feeling engaged and inspired. What brings you to this world, and what is your ultimate goal? Let me

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is a sprawling, historic city with a rich history and culture. It is the third-largest city in the European Union, home to more than 10 million people and is one of the largest and most populous cities 
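One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending, provided the awaited call yields control to the event loop. The sketch below illustrates the pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` so that the example runs without a model, and `heartbeat` represents any unrelated background task.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for llm.async_generate; returns dicts with 'text'."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated background work that runs while generation is pending."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    log = []
    outputs, _ = await asyncio.gather(
        fake_async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs[0]["text"], log)
```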

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm here to help people with their most difficult problems. I'm here to listen, to understand, and to provide the best possible support. Feel free to ask me anything, and I'll do my best to answer in a way that's helpful and informative. Let me know if you need anything, and I'll be here to assist you. How can I help you today? [Name] [Type of work experience [Year of experience] in [Field of work], and the skills you have developed during that experience. [Name] [Type of work experience [Year of experience] in [Field of work



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, which is located in the central regions of the country and is the largest city in the country. It is home to the iconic Eiffel Tower, a UNESCO World Heritage Site, and is the seat of the French government. Paris is also a cultural and economic hub, with a rich history dating back over 2, 500 years. The city is known for its vibrant nightlife, world-class museums, and world-renowned art and architecture. Paris is a popular tourist destination, drawing millions of visitors each year. The city is also home to many important historical sites, including the


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see exponential growth and transformation in the coming years. Some potential trends include:

1. Autonomous vehicles: AI-driven autonomous vehicles will become more widespread, allowing for faster, safer, and more efficient transportation.

2. Virtual and augmented reality: AI will continue to improve the quality and accessibility of VR and AR experiences, providing new ways to learn, work, and interact with the world.

3. Personalized healthcare: AI will help develop more accurate and efficient medical treatments, and improve patient outcomes by analyzing large amounts of data and identifying patterns.

4. Smart homes: AI-powered devices will become more ubiquitous, allowing for smarter home automation,
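The streaming loop prints each chunk as it arrives but does not keep the full text. If the complete string is also needed, the chunks can be accumulated while iterating. The sketch below shows that pattern with a hypothetical `fake_stream` async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`; the chunk values are made up.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for async_stream_and_merge(llm, prompt, sampling_params)."""
    for piece in [" Paris", ",", " also", " known"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield piece


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for piece in stream:
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("The capital of France is")))
```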




In [6]:
llm.shutdown()
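In a script, as opposed to a notebook, an exception raised during generation would skip a trailing `llm.shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch: `managed_engine` and `FakeEngine` are hypothetical names, and the stub only mimics the `shutdown()` method used above so that the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stub exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

With a real engine, the same wrapper would be used as `with managed_engine(sgl.Engine(model_path=...)) as llm:`.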