# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
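
As a rough illustration of the second use case, the core of a custom server can be a plain function that turns a request body into an engine call. The sketch below is framework-agnostic and is not the linked `custom_server.py`. The payload fields (`prompt`, `sampling_params`), the `handle_generate` helper, and the `StubEngine` stand-in are hypothetical; only the `generate(prompts, sampling_params)` call and the `text` key of each result follow the usage shown in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mimic the result shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" [stub completion for: {p}]"} for p in prompts]


def handle_generate(engine, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response using an engine-like object."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return json.dumps({"error": "body must be valid JSON"})
    if not isinstance(payload, dict):
        return json.dumps({"error": "body must be a JSON object"})

    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return json.dumps({"error": "'prompt' must be a non-empty string"})

    # Assumed default: an empty dict when the client sends no sampling parameters.
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]})


body = json.dumps(
    {"prompt": "The capital of France is", "sampling_params": {"temperature": 0.2}}
)
print(handle_generate(StubEngine(), body))
```

Keeping the handler independent of any web framework makes it easy to test with a stub and to mount later behind whichever HTTP layer the server uses.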



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
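
The patch is needed because notebook kernels such as IPython already run an event loop, and plain `asyncio` refuses to start another one inside it. The standard-library sketch below reproduces that error without SGLang, using the same `asyncio.run(...)` call that the asynchronous examples in this document use. `inner` and `outer` are illustrative names, and the exact error text may differ between Python versions.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the patch is applied before the engine is created.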

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Dr. Amro Al-Akkad, and I'm a board-certified neurosurgeon specializing in complex spine surgery and minimally invasive spine procedures. I'm honored to have the opportunity to help you navigate your spine health journey.
As a neurosurgeon, I've had the privilege of treating patients with a wide range of spinal conditions, from herniated discs and spinal stenosis to spondylolisthesis and spinal tumors. My goal is always to provide compassionate, evidence-based care that is tailored to each patient's unique needs and circumstances.
My approach is centered on a deep understanding of the latest advancements in spinal
Prompt: The president of the United States is
Generated text:  scheduled to visit Israel in May to promote a peace deal between the Israelis and Palestinians.
As the president prepares for his trip, his administration is making diplomatic efforts to secure Israeli and Palestinian support for a peace deal.
The administration has alread
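
When the results feed a later processing step rather than the console, it can help to pair each prompt with its completion explicitly. The following is a minimal sketch, assuming only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict with a `text` key per prompt. `generate_records` and `EchoEngine` are hypothetical names, and the stub stands in for `sgl.Engine` so the snippet runs without a model.

```python
from typing import Any, Dict, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; returns a fixed-format completion."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        return [{"text": f" [stub completion for: {p}]"} for p in prompts]


def generate_records(engine, prompts: List[str], sampling_params: Dict[str, Any]):
    """Run one batch and return a list of {"prompt", "completion"} records."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        # Guard against silently mispairing prompts and outputs in zip().
        raise ValueError("engine returned a different number of outputs than prompts")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```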

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and editor living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm a bit of a introvert and prefer to spend my free time alone, but I'm always up for a good conversation or a quiet evening with friends. I'm a bit of a perfectionist, which can sometimes make me come across as stubborn or critical, but I'm working on being more open-minded and flexible. I'm excited to meet new people and learn more about their interests and experiences.
This self-introduction is neutral because it doesn

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about the population of France’s capital city. The population of Paris is approximately 2.1 million people.
Provide a concise factual statement about the location of France’s capital city. Paris is located in the northern part of France, in the Île-de-France region.
Provide a concise factual statement about the climate of France’s capital city. Paris has a temperate oceanic climate, characterized by mild winters and cool summers.
Provide a concise factual statement about the economy of France’s capital city. Paris is a major economic hub, with a diverse economy that includes finance, fashion, and tourism.
Provide

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems may be able to analyze medical images, identify patterns in patient data, and provide personalized treatment recommendations.
2. Widespread adoption of AI in customer service: AI-powered chatbots and virtual assistants are likely to become more common in customer service, helping to answer customer queries, resolve issues, and provide personalized support.
3.
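
The "overlap removal" mentioned above matters when successive streamed chunks may repeat text that was already received. The sketch below shows one way to merge such chunks: trim the longest prefix of each new chunk that already ends the accumulated text. It illustrates the idea and is not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this sketch.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "Hello wor", "world!"]))  # prints "Hello world!"
```

One caveat of this heuristic is that it cannot tell an overlap from a genuine repetition: merging `"ha"` followed by `"ha"` yields `"ha"`, not `"haha"`.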



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida Akatsuki. I'm a 17-year-old high school student from a middle-class family. I live in the suburbs of Tokyo with my parents and younger brother. My hobbies include playing the violin and reading science fiction novels. I enjoy playing basketball and spending time outdoors. I'm currently taking advanced classes in mathematics and science. I'm a bit of a perfectionist and strive to achieve my goals in an efficient manner. What I like most about myself is my ability to stay focused and composed under pressure.
Here are some things I'd like to change or add:
Change "middle-class family" to "typical Japanese household

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Write a short descriptive passage about France’s capital city. Located on the Seine River in northern France, the City of Light, as Paris i
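
Because `async_generate` is awaited, several independent requests can in principle be scheduled from the same coroutine. The sketch below shows that pattern with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that sleeps instead of running a model, and `generate_groups` is an illustrative helper; how concurrent calls behave on a real engine should be checked against the SGLang documentation.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting for the model
        return [{"text": f" [stub completion for: {p}]"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Issue one async_generate call per group and wait for all of them."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```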

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex Chen. I'm a 25-year-old student at the University of California, Berkeley. I study computer science and enjoy playing video games and reading science fiction novels in my free time. I'm a bit of a introvert, but I'm working on being more outgoing and meeting new people. I'm looking forward to learning and growing in college.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions about the character. It provides basic information about Alex Chen's background, interests, and personality traits in a straightforward and factual manner. A neutral self-introduction is suitable for a variety of situations, such as



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The city of Paris is located in the northern part of France on the banks of the Seine River. The Seine River runs through the heart of the city and its islands. It flows through Paris from its source in Burgundy, passes through the city of Paris and eventually empties into the English Channel near the Normandy region.
Paris is the most populous city in France with a population of over 2 million people. The metropolitan area of Paris has a population of over 12 million people, making it the largest metropolitan area in France.
Paris is a major business and financial center, hosting many of France’s most important institutions



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, and it's difficult to predict exactly what will happen. However, based on current trends and developments, here are some possible future trends in artificial intelligence:
1. Increased use of AI in various industries: AI is already being used in various industries such as healthcare, finance, education, and transportation. As AI technology improves, we can expect to see its use expand to more industries and applications.
2. Advancements in natural language processing (NLP): NLP is a key area of AI research, and significant advancements are expected in the future. This could lead to more effective chatbots, virtual assistants, and language translation systems.
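
If the full text is needed in addition to the live printout, the chunks can be accumulated while they are consumed. The sketch below shows that pattern with a hypothetical `fake_stream` async generator in place of `async_stream_and_merge(llm, prompt, sampling_params)`. It assumes, as in the cell above, that each yielded item is a string with no repeated text; `collect` is an illustrative helper name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in: yields a string word by word, like streamed chunks."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield word if index == 0 else " " + word


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```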





```python
llm.shutdown()
```
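
In a script, it can be convenient to guarantee that `shutdown()` runs even if generation raises an exception. One way to sketch this is a small context manager. `managed_engine` and `StubEngine` are hypothetical names introduced here, and the wrapper relies only on the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " [stub completion]"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with managed_engine(stub) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(stub.is_shut_down)  # prints True
```

With a real engine, the object passed to `managed_engine` would be the `sgl.Engine` instance created at the start of this document.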