# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when a separate HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example as a standalone Python script is available at [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an asyncio event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
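
The examples below call `asyncio.run()`, which refuses to start a new event loop while one is already running, as it is inside a notebook cell. The stdlib-only sketch below reproduces that error by calling `asyncio.run()` from inside a coroutine, and shows one way to check for a running loop. The helper names `in_running_loop`, `answer`, and `cell_body` are hypothetical and not part of SGLang.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True if called while an asyncio event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def answer() -> int:
    return 42


async def cell_body() -> str:
    # Inside a coroutine an event loop is already running, just as it is
    # in a Jupyter/IPython cell. Starting a second loop with asyncio.run() fails.
    coro = answer()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


print(in_running_loop())         # False at module level
print(asyncio.run(cell_body()))  # asyncio.run() cannot be called from a running event loop
```

`nest_asyncio.apply()` patches the running loop so that such nested calls are allowed, which is why the notebook applies it before using the engine.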

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


# launch the offline engine
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Dasha and I'm a science fiction fan. I'm currently a student at a university and I enjoy reading science fiction and fantasy books. Currently, I'm in the process of converting my essay into an article. I've noticed that the example I wrote is not fully aligned with the style of science fiction.

Could you please help me improve my essay by outlining the structure and tone of the essay in the style of science fiction?

Sure, I'd be happy to help you improve your essay by outlining its structure and tone in the style of science fiction! Please provide me with the essay and let me know the desired tone of the essay.
Prompt: The president of the United States is
Generated text:  3 feet 6 inches tall. If a person with that height can lift a 120-pound barbell, how many more pounds does the president weigh compared to a person who is 5 feet 10 inches tall and can lift a 240-pound barbell?

To determine how much more weight the president of the United
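
Every cell in this page passes `temperature` and `top_p` in `sampling_params`. As a rough illustration of what these two settings mean in standard temperature-plus-nucleus (top-p) sampling, the pure-Python sketch below rescales logits by the temperature and then samples only from the smallest set of most-likely tokens whose cumulative probability reaches `top_p`. The function `nucleus_sample` and the toy logits are hypothetical; this is not SGLang's sampling kernel.

```python
import math
import random


def nucleus_sample(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample an index from `logits` with temperature scaling and top-p filtering.

    Illustrative plain-Python version of the standard technique.
    """
    # Temperature: divide logits before the softmax. Lower values sharpen
    # the distribution; higher values flatten it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw one kept token in proportion to its (renormalized) probability.
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]  # guard against floating-point round-off


rng = random.Random(0)
logits = [4.0, 3.0, 0.0, -5.0]
samples = [nucleus_sample(logits, 0.8, 0.9, rng) for _ in range(1000)]
print(sorted(set(samples)))  # [0, 1]
```

With these toy logits the two most likely tokens already cover more than 90% of the probability mass, so tokens 2 and 3 are never sampled.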

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What do you like to do in your free time? I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What's your favorite hobby or activity? I'm always looking for new experiences and learning opportunities.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is known for its cuisine, fashion, and art, and is home to many famous landmarks and attractions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to many international organizations and events, making it a hub for global affairs

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there will be a greater emphasis on developing AI that is designed to be ethical and responsible. This could involve developing AI that is transparent, accountable, and accountable to human values.

2. AI will become more integrated with other technologies: As AI becomes more integrated with other technologies, such as the Internet of Things (IoT), the Internet of Things (IoT), and the Internet
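
`stream_and_merge` comes from `sglang.utils`, and the output header refers to "overlap removal". The sketch below shows one way such merging can work, assuming a streamed chunk may repeat the tail of the text already received: it drops the longest suffix of the accumulated text that is also a prefix of the new chunk. The names `trim_overlap` and `merge_stream` and the hard-coded chunk list are hypothetical; the library helper may be implemented differently.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks) -> str:
    """Accumulate streamed chunks into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which later chunks repeat part of the earlier text.
chunks = [" Paris", " Paris, which", "which is known", " for its landmarks."]
print(repr(merge_stream(chunks)))  # ' Paris, which is known for its landmarks.'
```

One caveat of this heuristic: it cannot tell a genuine repetition from an overlap, so a chunk that legitimately starts with the same characters the text ends with will also be trimmed.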



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I have been in the field of art for over [number] years now, and I have been known for my unique style and approach to the subject matter. I enjoy creating beautiful pieces that express my own personal style and ideas. I like to work with a variety of mediums, including painting, drawing, and sculpture, and I am always looking for new and exciting ideas to bring to the table. I am confident in my abilities and look forward to making a difference in the world through my work. Thank you! Happy to help! How can I assist you today? Your name is [Your Name] and you are

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

The statement is:
Paris is the capital of France. 

This statement accurately conveys that Paris is the largest city in France and serves as the official seat of government for th
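
Because `async_generate` is a coroutine, it composes with ordinary asyncio code. The sketch below uses a hypothetical `FakeEngine` stand-in (no SGLang needed) that mirrors only the call shape used above, `await llm.async_generate(prompts, sampling_params)` returning one dict with a `"text"` key per prompt, to show two independent batches awaited together with `asyncio.gather` while outputs stay aligned with their prompts. Whether a real engine actually overlaps such requests depends on its scheduler.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in mimicking the async_generate call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        # One result dict per prompt, in the same order as the input list.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def main():
    engine = FakeEngine()
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    batch_a = ["Hello, my name is", "The capital of France is"]
    batch_b = ["The future of AI is"]

    # Await both batches concurrently; gather returns results in argument order.
    out_a, out_b = await asyncio.gather(
        engine.async_generate(batch_a, sampling_params),
        engine.async_generate(batch_b, sampling_params),
    )
    for prompt, output in zip(batch_a + batch_b, out_a + out_b):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    return out_a, out_b


results = asyncio.run(main())
```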

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, I'm 25 years old, a creative writing student. My writing style is informal and often humorous, with a strong focus on creating relatable characters. I enjoy exploring different genres and experimenting with narrative techniques. I'm always looking for new ways to express myself and experiment with different styles. I'm excited to share my work with others and help them find their own voice. Thank you for asking! What do you think of my introduction? Let me know if you need any further assistance. Write a short, neutral self-introduction for a fictional character. My name is Emily, I'm 25 years old, a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Please provide your response in Spanish: Cual es la capital de Francia? El capital de Francia es París.

Esta respuesta se traduce al español como: "La capital de Francia es París."

Este es un breve comentario sobre el lugar de Nápoles, donde se encuentra la capital francesa, Paris.

Para presentarlo de manera precisa en español, aquí está la frase completa:

"La capital francesa de Nápoles es París."

Esta frase proporciona una manera fácil de comunicarse entre países europeos y francófonos.

El nombre

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be one of rapid and significant changes, driven by advancements in technology, policy changes, and human values. Some potential future trends in AI include:

1. Increased use of AI in healthcare: AI can improve the accuracy and speed of diagnosis, help in developing personalized treatment plans, and assist in preventing and managing diseases.

2. AI integration with automation: The integration of AI with automation can lead to significant efficiency gains in industries such as manufacturing, transportation, and finance.

3. AI integration with personalization: AI can help companies to better understand consumer behavior and preferences, enabling them to provide personalized products and services.

4. AI in
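
`async_stream_and_merge` is used as an async generator: the loop prints each cleaned chunk as soon as it arrives, which is why the text appears piece by piece. The self-contained sketch below imitates that pattern with a hypothetical `fake_stream` async generator instead of a real engine, and trims suffix/prefix overlap before yielding each piece. `fake_stream` and `stream_without_repeats` are illustrative names, not the library's source.

```python
import asyncio


async def fake_stream():
    """Hypothetical async token stream; some chunks repeat the end of earlier text."""
    for chunk in [" Emily", "Emily,", " I'm", " 25", " 25 years old."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def stream_without_repeats(stream):
    """Yield only the part of each chunk that has not been emitted yet."""
    emitted = ""
    async for chunk in stream:
        # Find the longest suffix of `emitted` that is a prefix of `chunk`.
        overlap = 0
        for size in range(min(len(emitted), len(chunk)), 0, -1):
            if emitted.endswith(chunk[:size]):
                overlap = size
                break
        piece = chunk[overlap:]
        emitted += piece
        yield piece


async def main():
    pieces = []
    async for piece in stream_without_repeats(fake_stream()):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)


text = asyncio.run(main())
```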




```python
llm.shutdown()
```
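
In a standalone script, one way to make sure `shutdown()` runs even if generation raises is a `try`/`finally` block. The sketch below uses a hypothetical `FakeEngine` stub exposing `generate()` and `shutdown()` so it runs without SGLang; with a real engine you would construct `sgl.Engine(...)` instead.

```python
class FakeEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.running = True

    def generate(self, prompts, sampling_params):
        if not self.running:
            raise RuntimeError("engine is shut down")
        return [{"text": f" <completion of {p!r}>"} for p in prompts]

    def shutdown(self):
        self.running = False


engine = FakeEngine()
try:
    outputs = engine.generate(["The capital of France is"], {"temperature": 0.8})
    print(outputs[0]["text"])
finally:
    # Runs whether or not generate() raised, so the engine is always released.
    engine.shutdown()

print(engine.running)  # False
```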