# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
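The four modes differ only in whether the call blocks and whether the text arrives in chunks. The sketch below illustrates the call shapes with a hypothetical `StubEngine` stand-in, so it runs without a model. It assumes, for illustration only, that streaming is selected with a `stream=True` flag on `generate` and `async_generate`; check the engine's own signature before relying on that.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it echoes prompts instead of running a model."""

    def _chunks(self, prompt):
        # Pretend the model produces three text chunks per prompt.
        return [{"text": f"<{prompt}:{i}>"} for i in range(3)]

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            return iter(self._chunks(prompt))  # iterator of chunk dicts
        return {"text": "".join(c["text"] for c in self._chunks(prompt))}

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:

            async def gen():
                for chunk in self._chunks(prompt):
                    await asyncio.sleep(0)  # yield control, as real I/O would
                    yield chunk

            return gen()
        return self.generate(prompt, sampling_params)


async def run_async_modes(engine, prompt):
    full = await engine.async_generate(prompt)  # non-streaming, asynchronous
    stream = await engine.async_generate(prompt, stream=True)  # streaming, asynchronous
    pieces = [chunk["text"] async for chunk in stream]
    return full["text"], pieces


engine = StubEngine()
sync_full = engine.generate("hi")["text"]  # non-streaming, synchronous
sync_pieces = [c["text"] for c in engine.generate("hi", stream=True)]  # streaming, synchronous
async_full, async_pieces = asyncio.run(run_async_modes(engine, "hi"))
print(sync_full, sync_pieces, async_full, async_pieces)
```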

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
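The reason is that plain asyncio refuses to start a second event loop inside one that is already running, which is the situation in IPython and Jupyter. The following standard-library sketch reproduces that error without SGLang; `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # raises: a loop is already running in this thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.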

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Mike. I am a student at a middle school. I love ice cream. I like to eat it with my friends. On the weekends I like to go to the ice cream shop to buy some ice cream. One day I went to the ice cream shop. After I got home, I went to the ice cream shop again. I found a new shop. The ice cream there is much better. I was very happy to eat ice cream there. But when I went to the ice cream shop, it was closed. I was very sad. What do you think about the writer? A) He is happy to eat ice cream.
Prompt: The president of the United States is
Generated text:  a very important person. He or she has many important duties. He or she has the power to make many important decisions. He or she has to be honest. He or she has to be nice. He or she has to always keep his or her promise. The president of the United States is the leader of the country. He or she represents the country in the world. He or she is the head of state of the United States. The preside
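For offline batch jobs it is often useful to persist results rather than print them. The sketch below pairs each prompt with its output and writes one JSON object per line. It relies only on the `output["text"]` field used above; the helper name `save_results`, the record keys, and the file name are illustrative choices, and the demo data stands in for real engine outputs.

```python
import json
import tempfile
from pathlib import Path


def save_results(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Illustrative data standing in for real engine outputs.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

out_path = Path(tempfile.gettempdir()) / "sglang_demo_results.jsonl"
save_results(demo_prompts, demo_outputs, out_path)
saved = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(saved)
```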

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As an AI language model, I don't have a physical presence, but I'm always ready to assist you with any questions or tasks you may have. How can I help you today? [Name] [Company name] is a leading [industry] company that specializes in [specific product/service]. We're always looking for talented individuals to join our team and help us achieve our goals. If you're interested in joining our team, please feel free to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Parliament building. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is known for its fashion, art, and cuisine, and is a popular destination for tourists from around the world. It is also home to the French Parliament building, which is the oldest building in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial general intelligence: As AI continues to advance, we are likely to see more automation and the development of AI that can perform tasks that were previously done by humans. This could lead to the creation of more efficient and cost-effective systems, but it could also lead to job displacement for some workers.

2. Enhanced privacy and security: As AI becomes more integrated into our daily lives, there will be an increased need for privacy and security measures to protect personal data
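"Overlap removal" here means that text repeated between consecutive streamed chunks is dropped before the chunks are joined. The sketch below shows one way such a merge could work, by trimming the longest prefix of each new chunk that already ends the merged text. It is an assumption about the general technique, not a copy of `stream_and_merge`; `trim_overlap`, `merge_chunks`, and the demo chunks are illustrative.

```python
def trim_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Join streamed chunk dicts into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Illustrative chunks: the second and third repeat part of what came before.
demo_chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital"}]
merged_demo = merge_chunks(demo_chunks)
print(repr(merged_demo))
```

One trade-off of this approach is that a chunk which legitimately begins with the same characters that end the previous text would also be trimmed.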



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Your Profession] with [Your Job Title] experience in [Your Industry/Field]. My journey has taken me from [Your Starting Position] to [Your Current Position], and I've always been passionate about [Your Passion]. I believe that every person has the potential to make a difference in the world, and I am committed to using my skills and knowledge to help others achieve their goals. [Your Name] is a [Your Address, Position, or Experience] who is [Your Job or Role] at [Your Company Name]. I strive to be a resource for those who need help and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city and country’s cultural and political center. 

This statement encapsulates the core facts about Paris, including its role as the nation's capital and its historical importance as a major city 
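When requests arrive separately, for example in a custom server, an asynchronous caller may want to cap how many calls are in flight at once. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in for an awaitable engine call, and `generate_bounded` and `max_concurrency` are illustrative names; the engine's own batching is not modeled here.

```python
import asyncio


async def fake_async_generate(prompt):
    """Hypothetical stand-in for an awaitable engine call."""
    await asyncio.sleep(0.01)
    return {"text": f" (completion for: {prompt})"}


async def generate_bounded(prompts, max_concurrency=2):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:  # at most max_concurrency calls in flight
            return await fake_async_generate(prompt)

    # gather preserves the order of its arguments, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
results = asyncio.run(generate_bounded(demo_prompts))
print([r["text"] for r in results])
```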

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [career/subject area] specialist.

Please state the type of career or subject area your self-introduction is for. Here are some options to consider:

- Fictional character

- Academic or professional staff

- Academic or professional staff

- Non-fitting

Note: Choose one option that best fits the context of your story and fits the tone you want to convey.

Once you've chosen your career or subject area, provide a brief summary of what you do or are qualified for. In your self-introduction, focus on your key skills, experiences, and achievements. Use active voice, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe by population and is renowned for its rich history, beautiful architecture, and annual cultural events. Paris is also known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is home to many renowned artists, writers, and musicians, and is a major financial center for the country. Its population density is among the highest in Europe, and it is a center of politics, education, and business in France and the world. Paris is also home to the European Parliament and the Louvre Museum. It is known for its exquisite cuisine,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of potential, but it will be shaped by a complex interplay of technological, social, economic, and ethical factors. Here are some potential future trends in AI:

1. Enhanced intelligence: AI will continue to gain in intelligence, with machines becoming more capable at tasks that require complex decision-making and problem-solving.

2. Autonomous systems: Autonomous systems will continue to evolve and become more sophisticated, with machines being able to make autonomous decisions and take on a variety of tasks without human intervention.

3. Personalization: AI will continue to enable more personalized experiences, with machines able to understand and adapt to user preferences and behaviors.

4. Increased
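The "no repeats" behavior can be pictured as turning a stream of chunks into a stream of new text only. The sketch below assumes, purely for illustration, that each streamed chunk carries the cumulative text generated so far; `fake_stream` is a hypothetical stand-in for a streaming call, and `stream_deltas` is an illustrative helper rather than the implementation of `async_stream_and_merge`.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming call; each chunk repeats all text so far."""
    text = ""
    for word in [" Paris", " is", " lovely"]:
        text += word
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the text that is new relative to what was already emitted."""
    emitted = ""
    async for chunk in chunks:
        text = chunk["text"]
        if text.startswith(emitted):
            # Cumulative chunk: keep only the unseen suffix.
            delta = text[len(emitted):]
            emitted = text
        else:
            # Chunk is not cumulative; pass it through unchanged.
            delta = text
            emitted += text
        if delta:
            yield delta


async def collect(prompt):
    return [delta async for delta in stream_deltas(fake_stream(prompt))]


deltas = asyncio.run(collect("The capital of France is"))
print(deltas)
```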




```python
llm.shutdown()
```
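In a script, an exception raised during generation would skip a trailing `shutdown()` call. A minimal sketch of one way to guarantee cleanup is to wrap the engine in a context manager. `StubEngine` is a hypothetical stand-in that exposes only the `shutdown()` method used above, and `managed` is an illustrative helper, not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.running = True

    def shutdown(self):
        self.running = False


@contextmanager
def managed(engine):
    """Yield the engine and always shut it down afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.running)
```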