# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
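
To illustrate the problem this works around, here is a minimal sketch using only the standard library. It shows that `asyncio.run()` refuses to start while another event loop is already running, which is the situation inside IPython. The function names `inner` and `outer` are illustrative only; how the engine drives its event loop internally is not shown here.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython.
    coro = inner()
    try:
        # A nested asyncio.run() is rejected by stock asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted.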

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.00it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I'm a 15 year old. I need to write a 2 paragraph essay about something I've recently learned about about my parents. The topic is "My Father's Day," and I want to talk about how my dad always likes to read books. I know I should write a 2 paragraph essay but I'm not sure how to begin. Can you help me? Absolutely! Writing an essay about your father's Day and how he likes to read books can be a fantastic way to explore the topic. Here’s a template you can use to help you get started:

---

**Title: My Father's Day:
Prompt: The president of the United States is
Generated text:  trying to arrive at a new slogan to promote his administration. The slogan should not be exactly the same as the one he is currently using, but it should be a clever reference to his administration's policies. To make the slogan more memorable, he decided to use four words. The first word should be related to healthcare, the second word should be related to educa
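
For batch jobs it is often more useful to persist results than to print them. The following is a minimal sketch, assuming the list-of-dicts shape with a `text` key used in the cell above. The `outputs` list is hand-written stand-in data rather than real engine output, and `to_jsonl` is a hypothetical helper, not part of SGLang.

```python
import io
import json


def to_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair and return the count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Sarah."}, {"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") in a real job
written = to_jsonl(prompts, outputs, buffer)
lines = buffer.getvalue().splitlines()
print(written, "records written")
```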

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the seat of the French government and the country's cultural and political center. Paris is a bustling metropolis with a rich history and a diverse population of over 2 million people. The city is known for its fashion, art, and cuisine, and is a popular tourist destination. It is also home to many famous landmarks and attractions, including the Louvre, the Arc de Triomphe, and the Champs-Élysées. Paris is a city of contrasts, with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This will lead to the development of new jobs and the creation of new industries that require specialized skills.

2. AI-powered healthcare: AI is already being used in healthcare to diagnose and treat diseases, but it has the potential to revolutionize the field. AI-powered healthcare will likely be used to
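
The "overlap removal" mentioned in the cell above can be sketched as follows. This is an illustrative suffix/prefix heuristic under the assumption that a new chunk may repeat the tail of the text received so far; `trim_overlap` and `merge_chunks` are hypothetical helpers written for this example and are not claimed to match what `stream_and_merge` does internally.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that repeats the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlapping text between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks whose boundaries overlap.
chunks = ["Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this heuristic: text that legitimately repeats across a chunk boundary would also be trimmed.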



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a/an [职业] at [Company Name]. I am excited to meet you and discuss how I can help you. What can I do for you? Let's get started. [Name]: Hello, my name is [Name] and I am a/an [职业] at [Company Name]. I am excited to meet you and discuss how I can help you. Let's get started. [Name]: [A brief and engaging introduction] [Name]: [A short and respectful introduction] Hey, my name is [Name] and I am a/an [职业] at [Company Name]. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city famous for its historic landmarks, renowned cuisine, and vibrant nightlife.
Paris is the capital city of France. It is known for its iconic landmarks such as Notre-Dame Cathedral, Eiffel Tower, Louvre Museum, and St. Louis Cathedral. The city is also famous for its cuisine, such as beignets, croissants, and b
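
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape of `async_generate` used above (a list of prompts in, a list of dicts with a `text` key out); it does not run a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; returns canned completions."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Unrelated work that runs while generation is in flight.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def main():
    engine = FakeEngine()
    log = []
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs, log)
```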

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I have been an advocate of improving human interaction through various mediums like poetry, storytelling, and music. In addition to being an author, I have a passion for marketing and have been marketing my business to many people. I have a solid grasp of financial management and have learned to make sound decisions that will lead to long-term success. I am a constant learner and always looking for new ways to improve myself. My current occupation is marketing and I am passionate about creating positive change in the world. Thank you for considering me for a potential job. 

## Create a neutral self-introduction for a fictional character

Hello, my



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Hence, the answer is: Paris. 

The statement is factual as it is a well-known fact about Paris, the capital city of France. 

To summarize the answer concisely: France's capital city is Paris. 

The French term for "capital city" is "capital" or "首都", both of which refer to the capital city. 

For example, in English, we might say "Paris is the capital of France". In French, we would say "Paris est la capitale de la France". 

Since the French city of Paris is known worldwide as its capital, it is often abbreviated as



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a wide range of emerging technologies, including:
1. Deep Learning: AI that can learn from large amounts of data and make complex decisions without explicit instructions.
2. Natural Language Processing: AI that can understand, interpret, and generate human language, including speech and writing.
3. Robotics: AI that can perform tasks such as assembling robots, cleaning, and operating in dangerous environments.
4. Autonomous Vehicles: AI that can navigate roads and roads of its own accord, as well as make decisions on its own based on real-time data.
5. Virtual and Augmented Reality: AI that can create and manipulate virtual and augmented realities




In [6]:
llm.shutdown()
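
In a script, as opposed to a notebook, it can help to make sure `shutdown()` runs even when generation raises. A minimal sketch of that pattern with `try`/`finally` follows. `FakeEngine` and `run_batch` are hypothetical names for this example; the stand-in only mirrors the `generate` and `shutdown` calls used in this document.

```python
class FakeEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


def run_batch(engine, prompts, sampling_params):
    """Run one batch and always release the engine afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = FakeEngine()
failed = False
try:
    run_batch(engine, [], {"temperature": 0.8})
except ValueError:
    failed = True

print("failed:", failed, "closed:", engine.closed)
```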