# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
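
As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core: it turns a JSON request body into one engine call and a JSON reply. `StubEngine` and `handle_generate` are hypothetical names introduced here so the example runs without a model; the linked `custom_server` script remains the reference, and a real server would pass an `sgl.Engine` instance and wrap the handler in a web framework of its choice.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine so this sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the list-in, list-of-dicts-out shape used later in this document.
        return [{"text": f" (echo) {prompt}"} for prompt in prompts]


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Translate a JSON request body into one engine call and a JSON reply."""
    payload = json.loads(raw_body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


if __name__ == "__main__":
    body = json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.8}})
    print(handle_generate(StubEngine(), body.encode("utf-8")).decode("utf-8"))
```

Keeping the handler independent of any web framework makes it straightforward to test and to move between server libraries.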



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
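
If the same script has to work both as a plain Python program and inside a notebook, one option is to apply the patch only when an event loop is already running. This is a minimal sketch using the standard `asyncio.get_running_loop()` check; the helper names `in_running_event_loop` and `apply_nest_asyncio_if_needed` are illustrative and not part of SGLang.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from code that is already inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # asyncio raises RuntimeError when no loop is running in this thread.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running; report whether it did."""
    if not in_running_event_loop():
        return False
    import nest_asyncio  # third-party package, imported lazily

    nest_asyncio.apply()
    return True
```

In a plain script the second helper does nothing and returns `False`, so `nest_asyncio` does not even need to be installed there.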

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alicia. I'm a senior student at the University of Houston. In my spare time, I like to play with toys. I love watching cartoons and listening to music. And I also really enjoy playing computer games. I have a great sense of humor. I have a lot of friends and I like to play with them. I'm also good at playing the guitar. It is my favorite instrument. I have been playing guitar for about 10 years now. I like to practice every day because it helps me improve my playing skills. I'm planning to go to a music school soon. I'm very excited about this because I think
Prompt: The president of the United States is
Generated text:  running for a second term. He is currently 55 years old. His rival is running for a second term and is 30 years younger than he is. How old will the president be when the rival reaches 30 years old?

To determine how old the president of the United States will be when the rival reaches 30 years old, we need to follow these ste
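
For offline batch jobs the generated text usually has to be stored with its prompt. The sketch below pairs the two and serializes them as JSON Lines. It assumes only what the cell above shows: one output dict with a `"text"` key per prompt, in prompt order. The helpers `to_records` and `to_jsonl` are hypothetical, and the sample outputs are hand-written stand-ins for what `llm.generate` returns.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, refusing mismatched lengths."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


if __name__ == "__main__":
    # Hand-written stand-ins for the list returned by llm.generate(prompts, ...).
    sample_prompts = ["The capital of France is", "The future of AI is"]
    sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
    print(to_jsonl(to_records(sample_prompts, sample_outputs)))
```

Unlike a bare `zip`, the explicit length check fails loudly instead of silently dropping unmatched prompts.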

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [job title] at [company name], and I have [number of years of experience] years of experience in [specific field or industry]. I'm always looking for new opportunities to grow and learn, and I'm always eager to learn new things. What's your favorite hobby or activity to do? I love [mention a hobby or activity]. What's your favorite book or movie to watch? I love [mention a favorite book or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a cultural and economic hub, with a diverse population of over 2 million people and a rich history dating back to the Roman Empire. It is a popular tourist destination, with millions of visitors annually. The city is known for its cuisine, including French cuisine, and is home to many famous restaurants and cafes. Paris is a city of contrasts, with its modern architecture and historical landmarks blending

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI continues to advance, we can expect to see more automation and AI-driven technologies becoming more prevalent in various industries. This could lead to increased efficiency, productivity, and cost savings for businesses and individuals.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI continues to advance, we can expect to see even more sophisticated AI-powered healthcare solutions being developed.

3. AI-powered education
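
The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk begins with text that has already been emitted, only the unseen tail is appended. This is a simplified sketch of that idea and may differ from the actual `stream_and_merge` helper in `sglang.utils`; the names `trim_overlap` and `merge_stream` are used here for illustration.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate streamed chunks while skipping repeated overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    # Each chunk repeats part of what was already received.
    print(repr(merge_stream([" Paris", " Paris, the", ", the city"])))
```

One caveat of this heuristic is that genuinely repeated text (for example `"ha"` followed by `"ha"`) is also collapsed, so it is only appropriate when chunks are known to overlap.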



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [characteristic] person. I'm currently [age] years old and I have a passion for [occupation or hobby]. I love to [describe something that makes you proud or excited about something]. And I'm always [describe something about me that makes me unique].
I hope this self-introduction is helpful! Let me know if you need anything else. 
[Name] 
[Your job] [Your hobby] 
[Your passion] 
[Your age]
[Your profession/occupation/hobby] 
[Your unique trait/curiosity] 
[Your personality trait/characteristic]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville Marie."

Analyze the provided sentences and select the one that best represents the overall message conveyed in the text. (a) Capital city of France, (b) A city where people speak French, (c) A city with a population of 2 mil
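
A practical reason to prefer the asynchronous call is that several requests can be awaited concurrently with `asyncio.gather`, which returns results in the order the awaitables were passed. The sketch below demonstrates the pattern with `StubAsyncEngine`, a hypothetical stand-in whose `async_generate` mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out); how a real engine schedules overlapping requests is not shown here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics the async_generate call shape."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {prompt!r}>"} for prompt in prompts]


async def generate_groups(engine, groups, sampling_params):
    """Submit several prompt groups at once and wait for all of them."""
    pending = [engine.async_generate(group, sampling_params) for group in groups]
    # gather preserves the order of `pending` in its result list.
    return await asyncio.gather(*pending)


if __name__ == "__main__":
    groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
    results = asyncio.run(generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
    for group, outputs in zip(groups, results):
        for prompt, output in zip(group, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```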

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm a [industry] expert with [number of years] years of experience and a passion for [specific interest or hobby]. I'm a [number of] people I know, and I'm constantly learning and growing in my field. My main goal is to [motivation for career]. [Optional: mention a notable achievement or accomplishment]. As a [main goal], my goal is to [main goal achievement]. I believe in [attitude or values that inspire me]. [Optional: mention a personal characteristic or trait that sets me apart]. Lastly,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world-famous city with its iconic Eiffel Tower and vibrant culture. Paris is also a major tourist destination and home to numerous world-renowned museums, landmarks, and artists. The city is known for its rich history and cultural heritage, with many museums and theaters showcasing the region's cultural contributions. Paris is a popular tourist destination with a rich history, stunning architecture, and a vibrant nightlife. It is considered to be the most livable city in Europe. The city is also home to the headquarters of major global companies and has a diverse population of over 10 million people. Paris has a long and storied history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and complex, with many unknowns and challenges. However, there are several possible trends that have been identified and are likely to shape the AI landscape in the coming years:

1. Integration of AI into everyday life: AI is already becoming more integrated into our daily lives, from voice-activated assistants like Siri and Alexa to autonomous vehicles and smart home devices. We can expect AI to continue to integrate into our daily routines, making our lives more convenient and efficient.

2. Personalized AI: As AI technologies advance, we can expect to see more personalized AI systems that adapt to individual users and provide more accurate predictions and recommendations. Personalized
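
In streaming mode the text arrives as a sequence of small chunks. A consumer often needs to forward each chunk as soon as it arrives while still keeping the full text. The sketch below shows that pattern against `stub_stream`, a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`; `collect_stream` is likewise an illustrative helper, not an SGLang function.

```python
import asyncio


async def stub_stream(pieces):
    """Hypothetical async generator that yields text chunks one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect_stream(stream, on_chunk=None) -> str:
    """Forward every chunk to on_chunk (if given) and return the joined text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    chunks = [" Paris", ",", " the", " world", "-f", "amous", " city"]
    text = asyncio.run(
        collect_stream(stub_stream(chunks), lambda c: print(c, end="", flush=True))
    )
    print()
    print(repr(text))
```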




In [6]:
llm.shutdown()
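
In a script, as opposed to a notebook, an exception raised between engine creation and `shutdown()` would skip the cleanup call. One way to guard against that is a small context manager that always calls `shutdown()`. This is a sketch under the assumption that the engine exposes `shutdown()` as used above; `managed_engine` and `StubEngine` are hypothetical names, and with the real engine the factory could be `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (echo) {prompt}"} for prompt in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always shut it down afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


if __name__ == "__main__":
    with managed_engine(StubEngine) as engine:
        print(engine.generate(["Hello"], {})[0]["text"])
    print("shut down:", engine.is_shut_down)
```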