# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
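To see why the patch is needed: IPython and Jupyter already run an event loop, and the standard `asyncio.run()` refuses to start a second loop inside it. The sketch below reproduces that error using only the standard library. It does not call SGLang, and the function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a nested loop is rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.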

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  David and I am a lawyer. I specialize in civil litigation. I practice both in the District of Columbia and in the U.S. Supreme Court, and also in a handful of other places where I feel I can help the legal community.
My first case was the first discrimination lawsuit I ever filed, which was about the impact of racial discrimination in the private sector. I was a first-year law student at the University of Pennsylvania Law School at the time. I wrote the brief that was accepted for oral argument before the U.S. Supreme Court. It was the first time I wrote a brief in the Supreme Court.
In this case, I
Prompt: The president of the United States is
Generated text:  a person, and the president of the United States is the head of state. The correct option to fill in the blank is ____
A. President
B. President of the United States
C. Head of State
D. Head of State of the United States
Answer:
C

After nearly 40 years of exploration, the People's Repu

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for art, fashion, and cuisine, and is home to many world-renowned museums, theaters, and restaurants. The city is also known for its annual festivals and events, including the World Cup and the Eiffel Tower Festival. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its status as the capital of France

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, but there is a growing trend towards using AI to assist in diagnosis and treatment, as well as to improve patient care.

4. Greater use of

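The "overlap removal" mentioned above can be pictured as follows. This is a minimal sketch of one way to merge streamed chunks, assuming a chunk may repeat text that was already emitted. It is not necessarily how `stream_and_merge` is implemented. The names `trim_overlap`, `merge_stream`, and `fake_stream` are illustrative, and the chunk contents are made up. Only the `"text"` key mirrors the `output['text']` field used in this document.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk unchanged


def merge_stream(chunks) -> str:
    """Concatenate streamed chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Made-up chunks standing in for a real stream; each one repeats part of the previous text.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris, also known"},
    {"text": " known as the City of Light."},
]
merged = merge_stream(fake_stream)
print(merged)
```

Searching from the longest candidate overlap downward keeps the function short and makes sure the largest repeated span is removed first.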


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane and I'm a young, ambitious writer who has just completed my first novel. I love to write about new and exciting things, and I'm always looking for new projects to work on. I'm a bit of a people person, but I also have a strong sense of humor and I enjoy making my writing come to life through the words I write. What's your favorite book or movie, and why? I love books like the Harry Potter series and the movies - I think the plot and character development is amazing. I'm always looking for new ways to make my writing more interesting and engaging. What's your favorite hobby, and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the capital city of France? Paris.Simplify the sentence "Paris is a city" into a single word. Paris. 

Would you like me to continue the sentence with a thought or qu

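Because `async_generate` is awaited, it can share an event loop with other coroutines. The sketch below illustrates that pattern with `asyncio.gather`. `StubEngine` is a hypothetical stand-in so the snippet runs without a model or GPU. Only the shape of its `async_generate(prompt, sampling_params)` method mirrors the call used above, and `generate_concurrently` is an illustrative helper, not part of SGLang.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it echoes the prompt instead of running a model."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one request per prompt and wait for all of them together.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)  # results keep the order of `prompts`


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(StubEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

When all prompts are known up front, passing the whole list in a single call, as in the cell above, is the simpler option. Gathering separate calls is mainly useful when requests arrive at different times, for example in a custom server.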
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name] and I'm a [insert job title or field], [insert job title or field]. I come from [insert country, state or city], and I live in [insert hometown, city, etc.]. I'm a [insert hobby, activity, or part of my life] who is always up for [insert a short statement describing a challenge or adventure you enjoy]. I have a lot of [insert a short personality trait or quality] and I strive to [insert a short statement describing a goal or purpose you have]. I love [insert a short positive attribute] and I'm always looking for new



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is concise as it captures the most important fact about Paris without being overly detailed. It directly addresses the main point of being the capital of France, which is provided in the blank. The statement is straightforward and easy to understand, making it accessible for someone new to the city. Additionally, it includes the capital city in the format specified, ensuring a clear and correct representation of the answer.

The chosen answer is: "Paris" (or "le París" in French). This accurately represents the capital of France while being concise and straightforward.Routinely managing a restaurant that serves only vegetarian dishes is often



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  fascinating and rapidly evolving. Here are some potential trends that could shape the development of AI:

1. Increased use of AI for healthcare: AI could be used to improve the accuracy and efficiency of medical diagnosis, drug development, and patient care.

2. More advanced AI in transportation: AI could be used to optimize traffic flow, reduce congestion, and improve safety in the transportation system.

3. Enhanced AI in education: AI could be used to personalize learning experiences, create more engaging and interactive educational materials, and improve student learning outcomes.

4. AI in manufacturing: AI could be used to optimize production processes, reduce waste, and improve product quality


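The `async for` loop in the cell above consumes an asynchronous generator and prints each chunk as soon as it arrives. The sketch below isolates that consumption pattern using only the standard library. `fake_chunk_stream` and its chunk list are made up and stand in for a real streaming source, and `print_stream` is an illustrative helper rather than an SGLang function.

```python
import asyncio


async def fake_chunk_stream(chunks, delay=0.0):
    """Hypothetical stand-in for a streaming engine: yields one text chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulate waiting for the next token
        yield chunk


async def print_stream(stream):
    """Print chunks as they arrive and return the full text once the stream ends."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # flush so partial text shows immediately
        pieces.append(chunk)
    print()  # finish the line after the last chunk
    return "".join(pieces)


chunks = [" Paris", " is", " the", " capital", "."]
full_text = asyncio.run(print_stream(fake_chunk_stream(chunks)))
```

Using `end=""` together with `flush=True` keeps the streamed text on one line, and collecting the chunks in a list makes the complete text available once streaming finishes.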


In [6]:
llm.shutdown()