# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
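
As a rough illustration of the custom-server use case, the sketch below shows a framework-agnostic request handler that validates a JSON payload and forwards it to an engine. It assumes only that the engine exposes `async_generate(prompt, sampling_params)` and, for a single prompt, returns a dict with a `"text"` key, as in the examples in this document. The function name `handle_generate` and the payload fields are hypothetical; the linked `custom_server` script remains the reference.

```python
import json
from typing import Any


async def handle_generate(engine: Any, raw_body: str) -> tuple[int, str]:
    """Hypothetical handler: returns (HTTP status code, JSON response body)."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body must be valid JSON"})

    # Reject payloads that are not objects or that lack a usable prompt.
    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    sampling_params = payload.get("sampling_params", {})
    # Assumption: for a single prompt, the engine returns a dict with a "text" key.
    output = await engine.async_generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]})
```

Keeping the handler independent of any web framework means the same function can be wired into whichever server library you prefer and tested with a stand-in engine.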



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
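
If the same script has to run both inside and outside a notebook, one option is to apply the patch only when an event loop is already running. The following is a minimal sketch of that idea using the standard `asyncio.get_running_loop()` call; the helper name `in_running_event_loop` is hypothetical.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from code that is already inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


if in_running_event_loop():
    # Only needed in nested-loop environments such as IPython.
    import nest_asyncio

    nest_asyncio.apply()
```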

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Isabelle, and I have recently started a new job that involves working with an AI system. The AI system is highly advanced and unique in its approach to solving problems. How can I become more skilled in utilizing this AI system? How can I learn more about its capabilities and limitations?
Additionally, could you provide me with some tips on how to effectively communicate with the AI system? It would be greatly appreciated if you could share some examples of how to convey information or request information to the AI system in a clear and concise manner. Lastly, how can I ensure that the AI system is handling sensitive information appropriately? Please provide some best practices for
Prompt: The president of the United States is
Generated text:  a powerful man. He has the power to make and enforce laws. He also holds a lot of important positions in the government. He is like a king or a dictator in most countries. What can the president do? The 
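
When the prompt list is very large, it can help to submit it in slices and handle results as they arrive, for example to write them to disk incrementally. The sketch below assumes, as in the cell above, that `generate` accepts a list of prompts and returns one dict with a `"text"` key per prompt in input order. The helper name `generate_in_batches` and the default `batch_size` are hypothetical.

```python
from typing import Any, Iterator


def generate_in_batches(
    engine: Any,
    prompts: list[str],
    sampling_params: dict,
    batch_size: int = 64,
) -> Iterator[tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting `batch_size` prompts at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start : start + batch_size]
        # Assumption: generate() returns one dict per prompt, in input order.
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            yield prompt, output["text"]
```

With the engine from this document, usage would look like `for prompt, text in generate_in_batches(llm, prompts, sampling_params, batch_size=2): ...`.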

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill or Hobby] enthusiast. I enjoy [Reason for Hobby]. I'm always looking for new experiences and learning new things. I'm always eager to try new things and make new friends. I'm a [Favorite Activity] lover. I love [Reason for Activity]. I'm always looking for new challenges and adventures. I'm a [Favorite Book or Movie] fan. I love [Reason for Fan]. I'm always looking for new ways to express myself and connect with others. I'm a [Favorite Music] lover. I love

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. The city is known for its diverse cuisine, including French cuisine, and is home to many museums, theaters, and other cultural institutions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city of art, culture, and history that is a must-visit for anyone interested in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced interactions between humans and machines. This could lead to more natural and intuitive interactions, as well as more effective problem-solving and decision-making.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be increased emphasis on ethical considerations, such as privacy, fairness, and accountability. This could lead to more robust and transparent AI systems that are designed to be fair and unbiased.

3. Increased use of AI for creative and artistic purposes: AI
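
The "overlap removal" mentioned above can be pictured as follows: when successive streamed chunks repeat text that has already been received, only the new suffix is appended. This is a minimal sketch of that idea, not necessarily how `stream_and_merge` is implemented; the names `trim_overlap` and `merge_chunks` are used here for illustration only.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that already ends `existing_text`."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that overlaps with what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged
```

One limitation of this approach is that genuinely repeated output is indistinguishable from overlap: `merge_chunks(["ha", "ha"])` yields `"ha"`, so it is only appropriate when chunks are expected to overlap.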



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a/an [Age] year old [Gender] [Race] or [Profession], [Name]. I'm here to [Your Job/Profession] to [Your Goal]. I'm always ready to [Your Priorities], [Your Passion], or [Your Humor]. I'm [Your Characteristic, Personality, or Quality]. I believe in [Your Core Values, Beliefs, or Ideals]. I'm confident in [Your Strengths, Skills, or Abilities]. [Your Character] is [Your Characteristic, Personality, or Quality]. I strive to [Your Goals, Deeds

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its ancient history, diverse culture, and iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is home to many renowned museums, including the Musée d'Orsay and the Centre Pompidou, and a rich culinary scene with dishes like croissants, baguettes, a
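
An alternative to passing the whole list to a single `async_generate` call is to issue one request per prompt and cap how many are in flight. The sketch below uses `asyncio.Semaphore` and `asyncio.gather` for this. It assumes that `async_generate` called with a single prompt string returns a dict with a `"text"` key; the helper name `generate_concurrently` and the `max_concurrency` parameter are hypothetical.

```python
import asyncio
from typing import Any


async def generate_concurrently(
    engine: Any,
    prompts: list[str],
    sampling_params: dict,
    max_concurrency: int = 8,
) -> list[str]:
    """Run one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:
            # Assumption: a single-prompt call returns a dict with a "text" key.
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather() preserves the order of the input prompts in its result list.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Per-prompt requests make it easier to attach individual timeouts or error handling, at the cost of more bookkeeping than a single batched call.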

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm a computer science graduate with a passion for technology and innovation. I am currently working on a team that is focused on developing AI-driven solutions for our company's business operations. I believe that technology can make our company's future much brighter by providing us with the means to make better decisions and increase efficiency. I'm excited to work with the team and help them achieve their goals. Thank you. [insert name] [insert occupation] [insert company name] [insert job title] [insert job location] [insert contact information] [insert background information] [insert personal information] [insert achievements] [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

If you need any further clarification or additional information, please feel free to ask.

I hope this detailed and comprehensive answer meets your expectations! Let me know if you need any other assistance.

Sincerely,
[Your Name]  
[Your Position]  
[Your Contact Information]  
[Your Company Name]  
[Your Company's Website]  
[Your Company's Social Media Handles]  
[Your Company's Blog]  
[Your Company's Podcast]  
[Your Company's YouTube Channel]  
[Your Company's LinkedIn Profile]  
[Your Company's Twitter Handle]  
[Your Company's Facebook Page]

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but there are a few potential trends that are likely to shape the industry. Here are some potential future trends in AI:

1. Increased Integration with Everyday Products: AI is expected to become even more integrated into our daily lives, with products that have AI features like voice recognition, smart home assistants, and self-driving cars becoming more widespread.

2. Emergence of Ethics and Safety Concerns: As AI becomes more sophisticated, there will be increased scrutiny of how it is used and how it affects human beings. This could lead to new ethical and safety standards for AI, such as privacy laws and regulations.

3. Shift towards Automation


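In the streaming cell, each chunk is printed as it arrives and then discarded. If the complete text is also needed afterwards, a small helper can consume the stream, forward each chunk to a callback, and return the joined result. This is a minimal sketch that works with any async iterator of strings; the helper name `collect_stream` and its `on_chunk` parameter are hypothetical.

```python
from typing import AsyncIterator, Callable, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream of text chunks and return the concatenated text."""
    parts: list[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            # For example, print the chunk immediately for live display.
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)
```

With the names used in this document, a call inside an `async` function might look like `text = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params), on_chunk=lambda c: print(c, end="", flush=True))`.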


In [6]:
llm.shutdown()
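
To make sure `shutdown()` is called even when generation raises an exception, the engine can be wrapped in a context manager. This is a minimal sketch built on `contextlib.contextmanager`; the helper name `engine_session` is hypothetical, and it assumes only that the engine object has a `shutdown()` method as shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def engine_session(make_engine: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine, hand it to the caller, and always shut it down afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


# Hypothetical usage with the engine from this document:
# with engine_session(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:
#     outputs = llm.generate(prompts, sampling_params)
```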