# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
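A likely reason is that IPython and Jupyter already run an event loop, and `asyncio.run()` refuses to start a second loop inside a running one. The minimal sketch below reproduces that error with the standard library only. The names `inner` and `outer` are hypothetical, and the snippet assumes it runs as a plain Python script rather than inside a notebook.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        # asyncio.run() refuses to start a second loop inside a running one.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document can call `asyncio.run(main())`.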

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Vanessa and I am a 25 year old Australian woman. I am a passionate about making the world a better place and have been working in the not for profit sector for the past 6 years. I have a strong background in program management and project management and I am eager to continue working in this field.
My expertise lies in project management, program management, community development, community engagement and volunteer management. I have a strong understanding of social impact and am passionate about creating positive change.
I am highly organised, a strong communicator and have excellent problem solving skills. I am also highly adaptable and able to work effectively in fast-paced environments.
In
Prompt: The president of the United States is
Generated text:  responsible for signing bills into law, which they can do by either signing them with their name or by vetoing them. If the president vetoes a bill, Congress can then try to override the veto

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, hiking, and trying new foods. I'm currently working on a novel and experimenting with different writing styles. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm always looking for new experiences and learning opportunities. I'm a bit of a perfectionist, but I'm working on being more flexible and open-minded. I'm excited to see where life takes me next.
This self-introduction is neutral because it doesn't reveal too much about Kaida's personality, background, or motivations.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major hub for international business, finance, and tourism. Paris is also known for its romantic atmosphere, beautiful parks, and vibrant nightlife. The city has a long history dating back to the 3rd century BC and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in AI:
1. Increased Adoption in Industries: AI is expected to become increasingly adopted in various industries, including healthcare, finance, transportation, and education. This will lead to improved efficiency, productivity, and decision-making.
2. Advancements in Machine Learning: Machine learning, a subset of AI, is expected to continue to advance, enabling AI systems to learn from data and improve their performance over time.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing need to understand
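The "overlap removal" mentioned in the cell above can be pictured as follows: when consecutive streamed chunks repeat text that has already been received, only the new suffix is appended. The sketch below illustrates that idea in plain Python. It is not necessarily how `stream_and_merge` is implemented; the helper names `trim_overlap` and `merge_chunks` and the example chunks are assumptions made for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
example_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(example_chunks)
print(merged_text)  # prints " Paris is the capital of France."
```

One trade-off of this heuristic is that it cannot tell an accidental overlap from a legitimate repetition, so a chunk that genuinely repeats the end of the previous text would also be trimmed.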



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jaxon Lee, and I'm a 22-year-old college student studying environmental science. I enjoy hiking and playing guitar in my free time. I'm a bit of a introvert, but I'm working on coming out of my shell. That's me in a nutshell.
The key points in this introduction are:
    - A simple greeting
    - The person's name and age
    - Their major or field of study
    - A hobby or interest
    - A personal trait or quirk
    - A bit of self-awareness or personal growth

    - The introduction should be short and concise
    -

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is one of the most famous cities in the world, known for its beautiful architecture, world-class museums, and vibrant culture.
It is located in the north-central part of France, along the River Seine. Paris is often called the City of 
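Because `async_generate` is awaited, the event loop is free to run other coroutines while a batch is in flight. The sketch below shows that pattern with `asyncio.gather`. To keep it runnable without a GPU, it uses a hypothetical `StubEngine` whose `async_generate` only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out); `heartbeat` and `run_batch_with_heartbeat` are likewise illustrative names, not part of SGLang.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend to run inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that makes progress while generation is pending.
    for _ in range(3):
        await asyncio.sleep(0.01)
        ticks.append("tick")


async def run_batch_with_heartbeat():
    engine = StubEngine()
    ticks = []
    # Both coroutines share the event loop and run concurrently.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["prompt A", "prompt B"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch_with_heartbeat())
print(len(outputs), ticks)
```

With the real engine, `StubEngine()` would be replaced by the `llm` object created earlier; whether other work overlaps usefully depends on what that work is.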

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Zara and I’m a freelance writer. I enjoy writing about science, technology, and philosophy. When I’m not writing, you can find me reading a book or exploring the outdoors. I'm interested in meeting new people and learning about their experiences and perspectives.
Zara is a freelance writer who specializes in science, technology, and philosophy. She spends most of her free time reading and exploring the outdoors. Zara is looking to expand her professional network and learn from others.
Zara is a freelance writer who specializes in science, technology, and philosophy. She spends most of her free time reading and exploring the outdoors. Zara is



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is situated in the northern part of France. It is known for its cultural and artistic significance, as well as its historic landmarks such as the Eiffel Tower and the Notre-Dame Cathedral. Paris is home to many museums, including the Louvre, which houses the famous Mona Lisa painting. The city is also known for its fashion and cuisine, with popular dishes such as croissants, baguettes, and escargots. Paris is a popular tourist destination and is often considered the most romantic city in the world. Its famous landmarks, rich history, and vibrant culture make it a must-visit destination for many



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by advances in various fields, including machine learning, natural language processing, computer vision, and robotics. Some potential future trends in AI include:
More focus on explainability and transparency: As AI becomes increasingly pervasive, there is a growing need to understand how AI systems make decisions and predictions. Future AI systems may be designed to provide more transparent and explainable decision-making processes.
Increased use of edge AI: The proliferation of IoT devices and the need for real-time processing and decision-making are driving the adoption of edge AI, which involves processing AI models on device rather than in the cloud.
Growing emphasis on human-AI collaboration:




```python
llm.shutdown()
```
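In a script, as opposed to a notebook, it can be convenient to make sure `shutdown()` still runs when an earlier statement raises. One way to sketch this is a small context manager built on `try`/`finally`. The `engine_session` helper and the `StubEngine` class below are hypothetical and exist only so the example runs without a GPU; the only engine method assumed is `shutdown()`, which is the call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only a shutdown() method."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        # Simulate a failure in the middle of a generation loop.
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(stub.is_shut_down)  # prints True
```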