# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful in cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
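
As a rough illustration of the custom-server idea, the sketch below puts an engine-like object behind a plain request handler. It does not reproduce the linked `custom_server.py`. `EchoEngine` and `handle_generate` are hypothetical names, and the stand-in only mimics the `generate(prompt, sampling_params)` call and the `{"text": ...}` result shape used later in this document, so the snippet runs without a GPU or a model.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for an engine; it only echoes the prompt."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
response = json.loads(handle_generate(EchoEngine(), body))
print(response["text"])
```

In a real server, a handler like this would be called from whichever web framework you choose, with the stand-in replaced by the engine instance.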



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily and I am excited to share my love for the creative process with you. As a textile designer, I have spent years honing my skills in design, illustration, and pattern making. My passion is bringing my imagination to life through color, texture, and shape.
My design aesthetic is a mix of modern, bohemian, and eclectic influences. I draw inspiration from nature, vintage textiles, and the unique beauty of everyday objects. I am constantly experimenting with new techniques, materials, and technologies to push the boundaries of textile design.
Through my work, I aim to create textiles that tell stories, evoke emotions, and bring people together
Prompt: The president of the United States is
Generated text:  the leader of the United States government. The president is both the head of state and head of government of the country. The president is elected by the Electoral College system and serves a four-year term. The president has many important 
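
For offline batch jobs it is often convenient to persist results rather than print them. The following is a minimal sketch, assuming the outputs are dictionaries with a `"text"` key as in the loop above. `to_jsonl` and the sample data are hypothetical and stand in for real engine results.

```python
import io
import json


def to_jsonl(prompts, outputs, stream):
    """Write one JSON object per prompt/output pair to a text stream."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        stream.write(json.dumps(record) + "\n")


# Placeholder data with the same shape as the results above.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
to_jsonl(sample_prompts, sample_outputs, buffer)
print(buffer.getvalue(), end="")
```

Passing an open file instead of the `io.StringIO` buffer would write the same records to disk.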

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in Tokyo. I enjoy exploring the city's hidden corners and trying new foods. I'm a bit of a introvert, but I'm always up for a good conversation. I'm currently working on a novel and a graphic novel, and I'm excited to see where my creative projects take me. I'm looking forward to meeting new people and making connections in the city. That's me in a nutshell. How would you describe Kaida? What are her strengths and weaknesses? What kind of story could she be a part of? Kaida is a creative and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  The city is located in the northern part of the country, along the Seine River.  It is known for its beautiful architecture, art museums, and romantic atmosphere.  The Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral are some of the most famous landmarks in the city.  Paris is a popular tourist destination and is often referred to as the "City of Light."  It is also a major cultural and economic center, with a rich history dating back to the Middle Ages.  The city has a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. While it's difficult to predict exactly what the future holds, here are some possible trends that could shape the development and impact of artificial intelligence:
1. Increased Adoption in Everyday Life: AI will become increasingly integrated into our daily lives, from virtual assistants like Siri and Alexa to more sophisticated applications in healthcare, finance, and education.
2. Advancements in Machine Learning: Machine learning, a subset of AI, will continue to improve, enabling AI systems to learn from data and adapt to new situations more effectively.
3. Rise of Explainable AI: As AI becomes more pervasive, there will be a growing
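
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk repeats text that has already been received, only the new part is appended. This is a sketch of that idea under stated assumptions, not the implementation of `stream_and_merge`. The function names and the sample chunks are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping any text repeated from what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris. The city", " city is located"]
print(repr(merge_chunks(chunks)))
```

One caveat of suffix/prefix matching is that it cannot tell a re-sent fragment from text the model genuinely repeated, so it may trim a legitimate repetition.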



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Aria Flynn, and I'm a 24-year-old freelance writer and artist living in New York City.
Aria Flynn is a 24-year-old freelance writer and artist living in New York City. She’s passionate about her work, and her creativity is evident in everything she does. With a background in fine arts, Aria is skilled in painting, drawing, and sculpture. She has also honed her writing skills, penning short stories, poetry, and articles for various publications. Aria is a curious and adventurous person, always looking for new experiences and inspiration to fuel her creative pursuits. She is an avid reader, a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is famous for its beautiful architecture and art museums. The most notable landmark in the city is the Eiffel Tower, which was built for the 1889 World’s Fair
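
The cell above awaits a single batched call. When prompts arrive independently, another common asyncio pattern is to issue one awaitable per prompt and cap how many are in flight. The sketch below shows that pattern with a hypothetical `StubEngine` whose `async_generate` only mimics the call shape used above. Whether client-side bounding helps with a real engine depends on the workload, since the engine does its own scheduling.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in exposing an awaitable async_generate method."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(engine, prompts, sampling_params, max_concurrency=2):
    """Issue one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_bounded(StubEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```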

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ethan Wright. I work as a freelance journalist in the city of Ashwood, covering a variety of topics from local politics to entertainment events. I'm a bit of a night owl, often finding my best ideas and inspiration in the early hours of the morning. When I'm not working, you can find me exploring the city's hidden corners and trying new restaurants.
Here are a few things to consider when writing a neutral self-introduction:
  1. Start with a simple greeting, such as "Hello" or "Hi, my name is."
  2. Keep the introduction brief and to the point. Aim for a few



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city of Paris is located in the north central part of the country. It is situated at the Seine River. Paris is one of the most populated cities in the world. Paris is also the second largest city in the European Union after London. Paris is a major center for fashion, art, and culture. The city is home to famous landmarks such as the Eiffel Tower and the Louvre Museum. Paris is a popular tourist destination. The city is known for its romantic atmosphere, historic architecture, and culinary delights. The city has a rich history dating back to the Roman Empire. It has been the capital of France



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast, with many possibilities for growth, and various areas where AI will impact our lives. AI could potentially take on more complex tasks, become more integrated into various industries, and change the way we interact with technology.
Possible Future Trends in Artificial Intelligence:
1. Increased Adoption in Industries:
Artificial intelligence will be adopted in more industries, such as healthcare, finance, education, and transportation. AI will help improve efficiency, accuracy, and decision-making in these sectors.
2. Advancements in Robotics:
Robotics will become more advanced, with AI-powered robots performing complex tasks, such as surgery, manufacturing, and customer service. AI will enable
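
When the streamed pieces are needed as a single string as well as for live printing, they can be accumulated while iterating. The following is a minimal sketch with a hypothetical `fake_stream` async generator standing in for the streaming call above. It assumes each yielded piece is already free of repeated text.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming call; yields already-cleaned text pieces."""
    for piece in [" Paris", ".", " The", " city"]:
        await asyncio.sleep(0)  # hand control back to the event loop between pieces
        yield piece


async def collect(stream) -> str:
    """Consume an async stream of text pieces and return the joined text."""
    parts = []
    async for piece in stream:
        parts.append(piece)
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("The capital of France is")))
print(repr(full_text))
```

Appending to a list and joining once at the end avoids repeated string concatenation for long generations.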




In [6]:
llm.shutdown()
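
In a script, it can help to guarantee that `shutdown()` runs even if generation raises. A small sketch of that pattern follows, using a hypothetical `StubEngine` and a hypothetical `managed` helper. Only the `shutdown()` method name is taken from the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```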