# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
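
To see why the patch is needed, the sketch below reproduces the underlying problem using only the standard library: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, which is the
    # situation a notebook or IPython session puts your code in.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, such nested calls are permitted, which is why the notebook cells in this document can call `asyncio.run(main())` directly.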

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.23it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Kate and I'm a teacher and I want to try something new, so I asked some people what they wanted to try when they grew up. Here's what they said: The first person said he wanted to read too many fantasy books. The second person said he wanted to go to a school where everyone speaks English. The third person said he wanted to be a doctor, and he wanted to know what each year has been like since he was born. Which of the following is the most likely order in which people described their problems?
A) Fantasy books → English speaking schools → Doctors → Years
B) English speaking schools → Fantasy books
Prompt: The president of the United States is
Generated text:  traveling around the country and decides to give a speech. He has 100 guests. He wants to include an introduction and an ending. He decides that for every 5 guests, he should have 2 more guests in the introduction and 3 more in the ending. How many more guests should he have in the introd
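
The `temperature` and `top_p` keys in `sampling_params` control how random the output is. The sketch below illustrates the conventional definitions of temperature scaling and top-p (nucleus) filtering on a toy four-token distribution. It is a plain-Python illustration under that assumption, not SGLang's sampler, and the function names `apply_temperature` and `top_p_filter` are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(sorted(nucleus))  # indices of the tokens that remain eligible
```

In this toy case the least likely token falls outside the 0.95 nucleus and is never sampled, while the remaining three are drawn in proportion to their renormalized probabilities.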

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [job title] at [company name]. I have been working at [company name] for [number of years] years. I am passionate about [job title] and I am always looking for ways to [job title] my skills and knowledge. I am always eager to learn and improve, and I am always willing to help others. I am a team player and I am always willing to work with others to achieve our goals. I am a [job title] and I am always looking for ways to [job title] my skills and knowledge. I am a [job title] and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its cuisine, fashion, and art scene. It is a popular tourist destination and a cultural hub for Europe. The city is home to many important institutions such as the French Academy of Sciences and the French National Library. It is a city of contrasts, with its modern architecture and historical landmarks blending together. Paris is a city of light and color, and it

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some possible future trends in AI:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more efficient and effective AI systems that can better understand and respond to human needs and preferences.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations and responsible use of AI. This could lead to more stringent regulations and
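
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks whose text may repeat the tail of what has already been received. The helpers `drop_repeated_prefix` and `merge_stream` and the fake chunk list are illustrative assumptions, and the real `stream_and_merge` in `sglang.utils` may work differently.

```python
def drop_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_stream(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk["text"])
    return merged


# Fake stream standing in for the chunks a streaming engine call might yield.
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris, also"},
    {"text": " also known as"},
]
merged = merge_stream(fake_chunks)
print(repr(merged))
```

One limitation of this suffix/prefix matching is that it cannot tell a re-sent fragment from a genuine repetition, so text that legitimately repeats across a chunk boundary would be trimmed as well.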



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex, and I'm a 28-year-old software engineer. I have a love for technology and enjoy solving complex problems with my creativity. I'm passionate about helping people and learning new things every day. I'm constantly seeking out new challenges and growing my skills in software development. I'm also a big fan of social media and have an active presence on various forums. What other information would you like to know about me? As an AI language model, I don't have a physical form or the ability to social media, but I'm always here to assist you with any questions or concerns you may have. What's the best way

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

I'm sorry, I cannot provide an answer as I am an AI language model and do not have access to real-time information about current events. Please provid
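
The main reason to prefer the asynchronous API is that generation can be awaited alongside other coroutines. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the shape of the outputs shown above (a dict with a `"text"` key), so the example runs without a model. It says nothing about how the real engine schedules requests.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text after a short delay."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is working
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them together.
    # gather() returns results in the same order as the awaitables passed in.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"{prompt} ->{output['text']}")
```

Because `gather` preserves input order, the `zip(prompts, outputs)` pairing stays correct even when individual requests finish at different times.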

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Career] at [Company/Incident]. Before joining this organization, I was [previous employment details and achievements]. Now, I am here to [briefly describe my role and responsibilities]. I am passionate about [career objective/interests/achievements]. I enjoy [any additional relevant points]. I am always looking for ways to [share any relevant experiences or skills]. Thank you for having me. [Name] I am a [Role/Position] at [Company/Incident]. Before joining this organization, I was [previous employment details and achievements]. Now, I am here to [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Answer the following question based on the given context:

What is the name of the current Prime Minister of France? The current Prime Minister of France is Emmanuel Macron.

Based on the information provided, answer the following question: Who is the leader of the party that supports the US President? The leader of the party that supports the US President is the Republican Party.

In the context of American history, the Republican Party is the party that supported the US President, who is currently Donald Trump.

Using the provided context, formulate a question about France and answer it by providing a brief statement about the capital of France and the leader of


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of factors, including the development of new technologies and the improvement of existing ones. Here are some possible trends in AI:

1. Increased reliance on machine learning: One of the most promising areas of AI development is the development of machine learning algorithms that can learn from large amounts of data on their own. This could lead to a wider range of applications, from self-driving cars to personalized medicine.

2. Integration with human decision-making: AI is increasingly being used in decision-making processes, with models that can analyze large amounts of data and provide insights that are difficult or impossible for humans to achieve on their own.




```python
llm.shutdown()
```
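
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises. The sketch below wraps an engine in a context manager for that purpose. `FakeEngine` and `managed_engine` are hypothetical names used only for illustration, and the stub deliberately fails so that the cleanup path is exercised.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params):
        raise ValueError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate("Hello, my name is", {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")

print("shut down:", engine.is_shut_down)
```

The same wrapper could be applied to a real engine object, assuming it exposes a `shutdown()` method as used in the cell above.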