# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
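
To see why this is needed: environments such as Jupyter notebooks already run an event loop, and `asyncio.run` refuses to start a second loop inside a running one. The following sketch reproduces that error with the standard library alone. The coroutine names `inner` and `outer` are illustrative and not part of SGLang; `outer` merely stands in for a notebook cell that is already executing inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are permitted.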

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Pam, and I am the school librarian at Meridian High School in Idaho. I am thrilled to be a part of the library community and I am excited to connect with other librarians through this blog.
I have been a librarian for 5 years now, and I have to say that it is a dream come true. I have always been passionate about reading and learning, and I feel incredibly fortunate to be able to make a difference in the lives of my students every day.
I love that as a librarian, I get to be a part of so many different aspects of the school. I get to help students find books that they love
Prompt: The president of the United States is
Generated text:  elected through the Electoral College system. Candidates run in each state, and the candidate who wins the most votes in a state gets all of that state’s electoral votes (except in Maine and Nebraska, which allocate their electoral votes proportionally). The candidate who wins the majority of the electoral votes,
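
For offline batch jobs it is often convenient to persist the results rather than print them. The sketch below is one possible approach, not part of the SGLang API: the helper `write_jsonl` is hypothetical, and the demo data is a stand-in shaped like the list of dictionaries with a `"text"` key that `llm.generate` returns in the cell above.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/output pair to a file-like object."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data; in practice pass the real `prompts` and `outputs`.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
write_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```

With a real run, `fp` would typically be a file opened with `open("results.jsonl", "w", encoding="utf-8")`.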

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Elianore Quasar. I'm a 25-year-old astrophysicist who works at a research facility on the outskirts of a small town. I enjoy reading about the history of space exploration and collecting antique telescopes. I'm currently working on a project to develop a more efficient method for detecting exoplanets. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
I think your self-introduction is clear and concise. It gives a good sense of who Elianore is and what they do. However, it might be a bit dry and lacking in personality

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. The city is known for its rich history, art, fashion, and cuisine. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also a major hub for international business, finance, and culture. Paris is a popular tourist destination and is considered one of the most beautiful and romantic cities in the world. The city has a population of over 2.1 million people and is the largest city in France. Paris is a global center for fashion,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by the convergence of multiple factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased Adoption in Industries: AI will continue to be adopted in various industries, including healthcare, finance, transportation, and education. This will lead to increased efficiency, productivity, and innovation.
2. Advancements in Machine Learning: Machine learning will continue to be a key area of focus in AI research, with advancements in deep learning, natural language processing, and computer vision.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need for
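
The "overlap removal" mentioned in the banner above can be pictured as follows. This is a minimal sketch rather than SGLang's actual `stream_and_merge` implementation: the helper names `strip_overlap` and `merge_chunks` are hypothetical, and the sketch assumes that a streamed chunk may repeat some text from the end of what has already been received.

```python
def strip_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Stand-in chunks; a real stream would come from the engine.
print(merge_chunks(["Hello", "lo wor", "world!"]))  # prints "Hello world!"
```

Checking the longest overlap first keeps the merged text as short as possible when several overlap lengths would match.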



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Amaranth Jones. I work as a part-time librarian and a freelance writer. I enjoy reading and learning about various topics, and I find joy in sharing that knowledge with others.
What is a neutral self-introduction?
A neutral self-introduction is a brief statement that presents a person's identity, profession, and interests in a factual and non-judgmental way. It does not convey strong emotions, biases, or personal opinions. The goal is to introduce oneself in a clear and concise manner, without revealing too much or trying to impress others. A neutral self-introduction is often used in formal or professional settings, such as

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is situated in the northern part of the country, along the Seine River. It is the largest city in France, home to over 2 milli

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia Wilder. I'm a 25-year-old botanist living in the small town of Ravenshire. I'm currently working on a project to document the local flora and have a particular interest in rare plant species. I enjoy hiking, gardening, and reading about science and history. That's me in a nutshell.
This is a fairly standard neutral self-introduction, providing some basic information about the character. The inclusion of specific details, such as the character's age and the location of their work, adds a touch of realism. The mention of hobbies helps to give a glimpse into the character's personality and interests. Overall,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris.
Paris is known for its iconic landmarks and historic architecture. It is a popular tourist destination and a hub for art, fashion, and culture.
The city has a rich history dating back to the Middle Ages and has been the seat of power for several monarchs and emperors throughout its history. Today, it is a major business and financial center, home to many international companies and organizations.
Paris is also famous for its cuisine, including dishes such as escargots, croissants, and macarons. The city is known for its romantic atmosphere, with its picturesque streets, charming cafes

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of ongoing debate and speculation. However, here are some possible future trends in AI:
1. Increased use of AI in various industries: AI is likely to be used more widely in various industries such as healthcare, finance, transportation, and education.
2. Advances in natural language processing (NLP): AI systems will become more proficient in understanding and generating human language, leading to better chatbots, virtual assistants, and language translation systems.
3. More emphasis on explainability and transparency: As AI becomes more ubiquitous, there will be a growing need for AI systems to be more transparent and explainable



```python
llm.shutdown()
```