# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
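Whether the patch is needed depends on whether an event loop is already running in the current thread. The following is a minimal sketch, not part of SGLang, that checks this with the standard library before importing `nest_asyncio`. The helper names `has_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from a thread whose event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running, e.g. in a plain Python script.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running (hypothetical helper)."""
    if not has_running_loop():
        return False
    import nest_asyncio  # third-party package, imported lazily

    nest_asyncio.apply()
    return True
```

In a plain script the second helper should be a no-op, while in IPython or Jupyter it is expected to apply the patch.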

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops when running inside a notebook
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts returns one output dict per prompt, in order
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sandy. I am a student in a senior high school. Now I will take a new subject "Chinese" at school. It is very interesting and it's also very useful. I like to learn Chinese because I think it's very fun. I have some friends who know Chinese, so I often chat with them. They are very kind to me and I like to be friendly. I have a big family, but my parents work in the USA. They have a nice house and they like to go out for dinner. I like my family. I hope to learn Chinese, but I am not sure if I can because I have to
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to have. He has 3 possible options and 4 possible bases per option. However, if the base is military history, he cannot have any modern bases. How many different combinations of bases can the president choose? Let's break down the problem step-by-step. The president has 3 possible options, and for each option, there are 4 possible 
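
For very large prompt lists, it can be convenient to submit prompts in slices so that partial results can be saved between calls. The sketch below assumes only what the cell above shows: an `engine` object whose `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt. The helpers `batched` and `generate_in_batches`, and the default `batch_size`, are hypothetical and not part of SGLang.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


def generate_in_batches(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    batch_size: int = 256,
) -> List[Tuple[str, str]]:
    """Return (prompt, generated_text) pairs in the original prompt order."""
    results: List[Tuple[str, str]] = []
    for batch in batched(prompts, batch_size):
        # Assumption: one output dict per prompt, in the same order.
        outputs = engine.generate(batch, sampling_params)
        results.extend((p, o["text"]) for p, o in zip(batch, outputs))
    return results
```

With the engine from this page, a call would look like `generate_in_batches(llm, prompts, sampling_params, batch_size=2)`.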

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. Let's chat! [Name] [Company Name] is a [brief description of the company]. I'm [age] years old and I'm [job title]. I enjoy [mention a hobby or activity that you enjoy]. I'm always looking for new experiences and learning new things. What's your favorite hobby or activity? [Name] [Company Name] is a [brief description of the company]. I'm [age] years old and I'm [job title]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also home to the French Parliament and the French Parliament House. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub and a major tourist destination. The city is known for its cuisine, fashion, and art, and is a major center for business and commerce. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that is both old and new, and is a symbol

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI technology continues to advance, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient monitoring.

2. AI in finance: AI is already being used in finance to improve fraud detection, risk management, and investment decision-making.
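
The "overlap removal" named in the banner can be pictured as trimming, from each streamed chunk, the prefix that the accumulated text already ends with. The following is an illustrative sketch of that idea only; it is not claimed to match the implementation of `stream_and_merge` in `sglang.utils`, and the names `trim_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while removing repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

One trade-off of this naive approach is that a chunk which legitimately repeats the end of the previous text (for example `"ha"` after `"ha"`) is treated as overlap and dropped.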



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old [Gender] who graduated from [School Name] with a [Degree] in [Field of Study]. I'm passionate about [Personal Interest or Hobby], and I enjoy [Opinion or Action]. How can I be a valuable member of your team? I believe in [Personality Trait or Value]. I'm always ready to learn and grow, and I'm always open to new experiences and challenges. Thank you for considering me for this role. Let me know if you have any questions or need any information. [Name]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower stands and where the French Revolution took place. It is a vibrant and cosmopolitan metropolis known for its rich history and stunning architecture. Paris has a diverse population of over 10 million people and is home to many famous landmarks s
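
When requests arrive one at a time rather than as a ready-made list, the asynchronous API can also be driven per prompt. The sketch below caps the number of in-flight requests with a semaphore. It assumes that `async_generate` called with a single prompt string returns a single dict with a `"text"` key; that behavior, the helper name `generate_concurrently`, and the `max_concurrency` default are assumptions, not something this page confirms.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_concurrently(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_concurrency: int = 8,
) -> List[str]:
    """Run one async request per prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:
            # Assumption: a single prompt yields a single output dict.
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in the same order as the submitted prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Inside the `main()` coroutine above, this could be awaited as `texts = await generate_concurrently(llm, prompts, sampling_params)`.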

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert occupation or hobby]. I enjoy [insert a short reason why you enjoy your profession/hobby]. I'm [insert your age or date of birth]. I'm [insert your gender] and I live in [insert your city or country]. I'm a [insert any notable qualities or skills that define you as a character]. I love [insert a short reason why you like [insert a hobby or activity that you enjoy] or why you're passionate about [insert a hobby or activity that you enjoy]]. I love [insert a short reason why you get [insert a job title or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the capital city of France, known for its iconic Eiffel Tower, iconic landmarks like the Louvre Museum, and a rich cultural scene with a diverse range of art, literature, and cuisine. Paris is also renowned for its vibrant nightlife and world-class fashion scene. Its historical significance dates back to ancient times, making it a city steeped in history and culture. The city is also home to numerous museums, art galleries, and theaters, further enhancing its appeal as a destination for art enthusiasts and cultural travelers alike. Paris is considered one of the most important cities in the world, and the capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly dynamic and unpredictable, with new developments on a regular basis. Here are some possible future trends in artificial intelligence:

1. Advancements in AI-powered autonomous vehicles: Autonomous vehicles are becoming increasingly common, but their development is still in the early stages. However, AI is expected to continue improving its ability to safely drive and navigate on the roads, making autonomous driving more and more practical.

2. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, such as by predicting which patients are at higher risk of developing certain diseases. AI is also being developed to assist in the diagnosis and treatment of diseases,

```python
# Release the engine's resources once inference is finished
llm.shutdown()
```