# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
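
As a rough idea of what such a server involves, the sketch below puts an engine-like object behind a single POST endpoint using only the Python standard library. It is not the linked `custom_server.py`. `StubEngine`, `handle_generate`, `GenerateHandler`, `serve`, the `/generate` path, and port 8000 are all hypothetical names chosen for illustration. In real use, `StubEngine()` would be replaced by an `sgl.Engine(...)` instance, assuming its `generate` call returns a dict with a `text` field for a single prompt (this document only shows list inputs).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for an engine; returns canned text."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (reply to: {prompt})"}


engine = StubEngine()


def handle_generate(body: bytes) -> bytes:
    """Decode a JSON request, call the engine, encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one route is served in this sketch.
        if self.path != "/generate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_generate(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Blocking call; run it manually to start the sketch server."""
    HTTPServer(("127.0.0.1", port), GenerateHandler).serve_forever()
```

Keeping the request handling in a plain function (`handle_generate`) separates the engine call from the HTTP layer, so the same logic can be reused with another web framework.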



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested event loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
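
The underlying issue is that IPython and Jupyter already run an event loop, and plain `asyncio` refuses to start another one inside it. The sketch below reproduces that error using only the standard library; `inner` and `outer` are illustrative names, not part of SGLang. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted, which is presumably why the engine needs it in notebooks.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```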

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex, and I'm a 15-year-old who's really into VR games. I wanted to know about some of the more futuristic games that you guys have been working on. What do you think? As an AI language model, I don't play games, but I can provide you with some information about the latest virtual reality games developed by various studios.

One of the most exciting and futuristic virtual reality games that I can think of is called "Ecco" by Studio Ghibli. It was developed by Studio Ghibli and was released in 2020. "Ecco" is a space-themed video game
Prompt: The president of the United States is
Generated text:  an important leader in the country. Many people in the United States believe that the president is an important leader because the president holds a higher position than any other official in the country, which gives the president a great deal of power. He can issue executive orders, declare war, appoint Supreme Court justices, and take other actions t
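
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The sketch below is a generic, pure-Python illustration of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It is not SGLang's sampler, and `apply_temperature`, `top_p_filter`, and the example logits are assumptions made for illustration.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
probs = apply_temperature(logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(sorted(candidates))  # token ids that remain eligible for sampling
```

A lower temperature sharpens the distribution and a lower `top_p` cuts more of the unlikely tail, so both make outputs more deterministic.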

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or experience here]. I enjoy [insert a short description of your hobbies or interests here]. I'm always looking for new opportunities to grow and learn, and I'm always eager to share my knowledge and experience with others. What do you do for a living? I'm a [insert a short description of your job here]. I'm always looking for ways to improve my skills and knowledge, and I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union, with a population of over 2.7 million people. Paris is known for its rich history, art, and culture, and is a popular tourist destination. It is also home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is a major economic center and a major center of education and research in the world. It is also a hub for business and finance, with many international companies and institutions headquartered there. The city is also known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, including manufacturing, transportation, and healthcare. This will lead to increased efficiency, cost savings, and job displacement, but also create new opportunities for innovation and creativity.

2. Enhanced privacy and security: As AI systems become more sophisticated, we can expect to see increased emphasis on privacy and security. This will require developers to implement stronger encryption
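
"Overlap removal" here refers to merging streamed chunks so that text repeated between consecutive chunks is not emitted twice. The sketch below shows one way such merging could work. It is an illustration under that assumption, not the code of `stream_and_merge`, and `strip_repeated_prefix` and `merge_chunks` are hypothetical helper names.

```python
def strip_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


chunks = [" Paris", " Paris is", " is the capital"]
print(merge_chunks(chunks))
```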



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am an experienced professional with a strong background in [industry]. I have [number] years of experience in [industry] and have led numerous successful projects. My key skills include [list of skills relevant to the industry]. Additionally, I am a [colorful adjective] and [describe your personality traits] personality type. I am [add a brief personal statement] about myself and why I am the right candidate for this position. Thank you. 

[Your name]

---

**Professional Introduction:** Hi, my name is [Your Name]. I am an experienced professional with a strong background in [industry]. I have [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Note: As of 2021, Paris has a population of around 2.7 million people and is the largest city in Europe by land area. It is also the world's 30th-larg
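
The asynchronous API is most useful when several requests should be in flight at once. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` and `generate_all` are hypothetical, and the stub only mimics the call shape of `async_generate` for a single prompt so the example runs without a model.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; mimics only the call shape used above."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(StubAsyncEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, results):
    print(prompt, "->", output["text"])
```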

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a computer science student at [University/Institution]. I have been programming for the past [Number of years] years, and I am always up to date with the latest programming languages and tools. I enjoy experimenting with new technologies and I am always looking for ways to improve my skills and stay up-to-date with the latest trends in the field of computer science. I also enjoy spending time with my family and trying out new hobbies. What can you tell me about yourself? I'm a computer science student at [University/Institution]. I have been programming for the past [Number of years] years,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Therefore, the answer is Paris.

To expand further, Paris is the largest city and the most populous city of France, and it is also the capital of France. It is known for its picturesque architecture, cultural and historical significance, and important political offices. Paris is a major financial center and is the world's second-largest metropolitan area. It is also home to several world-class museums, art galleries, theaters, and opera houses. The city is also famous for its fashion and food scenes, and it hosts numerous festivals, events, and art exhibitions throughout the year. Additionally, Paris has a rich history, including being the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but here are some possible trends that could shape the direction of the technology in the coming years:

1. Increased autonomy: With the advent of machine learning algorithms, AI systems could become more autonomous in their decision-making processes. This could lead to more advanced AI systems that can make decisions without human intervention.

2. Enhanced creativity and innovation: AI could become even more capable of generating new ideas and solutions to complex problems, opening up new avenues for innovation and creativity.

3. Greater integration with human emotions: AI systems could learn to interpret and respond to human emotions, providing more emotional intelligence in the workplace and social interactions.

4.



In [6]:
llm.shutdown()
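
In a script, as opposed to a notebook, it can help to make sure `shutdown()` runs even if generation raises an exception. The sketch below wraps an engine in a small context manager for that purpose. `StubEngine` and `managed_engine` are hypothetical names, and the stub stands in for a real engine so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params=None):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": f" (reply to: {prompt})"}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate("")  # raises, but shutdown() still runs
except ValueError as exc:
    print("generation failed:", exc)

print("engine closed:", engine.closed)
```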