# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
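To illustrate the general shape of a custom server, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py` example. `StubEngine`, `GenerateHandler`, `demo_request`, and the `/generate` path are hypothetical names introduced for illustration, and the stub merely mimics an engine whose `generate` call returns a dict with a `"text"` field so that the sketch runs without a model.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" <completion for: {prompt}>"}


engine = StubEngine()


class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Call the engine in-process and return its text as JSON.
        output = engine.generate(payload.get("prompt", ""), payload.get("sampling_params"))
        body = json.dumps({"text": output["text"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def demo_request(prompt):
    # Bind to port 0 so the OS picks a free port for this demo.
    server = HTTPServer(("127.0.0.1", 0), GenerateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        request = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/generate",
            data=json.dumps({"prompt": prompt}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # Bypass any system proxy settings for the loopback request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()


server_reply = demo_request("Hello, my name is")
print(server_reply)
```

In a real deployment the stub would be replaced by an actual engine instance, and a production web framework would likely be a better fit than `http.server`.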



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
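The reason this patch is needed can be reproduced with the standard library alone. The sketch below (with the illustrative names `inner`, `outer`, and `nested_result`) calls `asyncio.run()` from inside an already running event loop, which is roughly the situation in an IPython or Jupyter cell, and shows that plain `asyncio` rejects the nested call.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # try to start a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.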

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  David and I'm a Senior Designer at VUE Design. We are based in Boston and I am starting a new project to design a digital platform for a new retail store. The idea behind the design is to create a user-friendly experience for the customers through digital platforms, a digital marketplace and a digital store. What are some of the unique features and features that you would like to see implemented in the new platform?
Creating a digital platform for a retail store is a fascinating and ambitious project that brings together several key aspects of online commerce, customer engagement, and digital marketing. Here are some unique features and features that you might consider implementing:

1
Prompt: The president of the United States is
Generated text:  a man. The United States does not have a president because it is a republic. The president of the United States is elected by the people.

Does this course of events satisfy the antecedent condition 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name]. I'm excited to meet you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. The city is also known for its cuisine, including its famous croissants and its famous French fries. Paris is a city that is both a cultural and historical center of France, and it is a popular destination for tourists and locals alike. The city is known for its beautiful architecture, including its Gothic and Renaissance architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and personalized interactions. This could lead to more efficient and effective use of AI in various fields, such as healthcare, finance, and transportation.

2. Enhanced machine learning capabilities: AI is likely to become more capable of learning and adapting to new situations, allowing for more sophisticated and nuanced decision-making. This could lead to more effective and efficient use of AI in various fields, such as healthcare, finance, and transportation.

3. Increased reliance on AI for decision
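The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. This is not the implementation behind `stream_and_merge`. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for illustration. The idea is to drop the longest prefix of each new chunk that repeats the tail of the text collected so far.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap by a few characters.
chunks = [" Paris is", " is the capital", "capital of France."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

Note that a purely textual suffix/prefix match like this one can also trim text that is legitimately repeated, so it should be treated as an illustration of the idea rather than a robust merging strategy.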



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an artificial intelligence designed to assist users in a variety of ways. I'm here to help them with any questions they might have, from learning how to program to answering their questions and providing advice on how to improve their writing skills. I'm also here to help users stay informed about current events and trends in the world. My goal is to be a helpful and informative resource for anyone seeking knowledge or guidance. Thank you for choosing me! [Name] I'm an AI designed to assist users in a variety of ways. I'm here to help them with any questions they might have, from learning how to program

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the capital city of France, located on the Mediterranean coast in the south of the country. It is known for its i
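One reason to prefer the asynchronous API is that generation can be awaited alongside other coroutines. The sketch below shows the general `asyncio.gather` pattern with a stand-in coroutine. `fake_async_generate` and `generate_all` are hypothetical and only imitate an awaitable call that returns a dict with a `"text"` field, so the example runs without a model.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, sampling_params):
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather returns results in input order, so they line up with the prompts.
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = asyncio.run(
    generate_all(demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```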

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character Name] and I'm a [job title] with [number of years of experience] years of experience in the industry. I specialize in [specific skills or expertise]. I'm always looking for opportunities to grow and learn, so I'm open to new challenges and experiences. What brings you to this job? I was first introduced to [job title] as a [skill or experience], and I have always been passionate about [why it's relevant to the company]. I believe in [why] and am excited to continue growing and developing my skills in this field. What do you think sets you apart from other job applicants?



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and the 16th-largest city in the world. Paris is known for its historical landmarks, vibrant culture, and fashion industry. It is also the home of the Eiffel Tower and the Louvre Museum. The city is famous for its annual festivals, such as the Eiffel Tower Festival and the Musée d'Orsay. Paris is also home to many world-renowned artists, including Pablo Picasso and Rembrandt van Rijn. The city is also known for its rich culture and cuisine, with many French dishes that are popular around the world. With its many historical



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a complex and evolving field, with many potential directions that could shape its development. Here are some possible trends in AI that could emerge in the coming years:

1. Deep learning: Deep learning is the use of large neural networks with many layers to process and analyze data. It has the potential to revolutionize a wide range of applications, including natural language processing, computer vision, and speech recognition.

2. Explainability: With the rise of machine learning and deep learning, it is becoming more important to understand how AI systems work. The ability to explain the reasoning behind AI decisions and predictions could lead to more transparent and accountable AI systems.



3
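The consumption pattern used in the asynchronous streaming cell, `async for` over a stream of chunks and printing each piece as it arrives, can be sketched without a model. The example assumes, purely for illustration, a stream whose chunks carry the cumulative text generated so far. `fake_stream`, `stream_deltas`, and `collect` are hypothetical helpers and are not part of SGLang.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming call: yields cumulative text."""
    for end in range(chunk_size, len(text) + chunk_size, chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text[:end]}


async def stream_deltas(stream):
    """Turn cumulative chunks into only the newly added piece of each chunk."""
    seen = 0
    async for chunk in stream:
        delta = chunk["text"][seen:]
        seen = len(chunk["text"])
        if delta:
            yield delta


async def collect(text):
    pieces = []
    async for delta in stream_deltas(fake_stream(text)):
        print(delta, end="", flush=True)  # print each piece as it arrives
        pieces.append(delta)
    print()
    return pieces


pieces = asyncio.run(collect(" Paris is the capital."))
```

Tracking how many characters have already been emitted avoids printing the same text twice when every chunk repeats the full prefix.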




In [6]:
llm.shutdown()