# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would only add complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rachel. I am a third year student at the University of Limerick, studying Business and Law. I am currently completing an internship with the Intellectual Property Office of Ireland, working on various projects including the preparation of evidence for infringement actions, the analysis of trademark applications and the research of IP case law. My interests include the music industry and the role that copyright law plays within it. I am also passionate about social justice and the protection of human rights, which is evident in the work of non-governmental organisations such as Amnesty International. I am excited to be a part of the Irish IP law blog and to contribute to the discussion on intellectual
Prompt: The president of the United States is
Generated text:  expected to meet with his Ukrainian counterpart to discuss the ongoing conflict between Ukraine and Russia, but it remains unclear if Joe Biden will make a major policy announcement or
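
For offline batch jobs it is often useful to persist the results. The sketch below pairs each prompt with its output and writes one JSON object per line. It relies only on what the cell above shows, namely that each output is a dict with a `text` key. The helper name `to_jsonl` and the sample data are hypothetical stand-ins so that the snippet runs without a loaded model.

```python
import io
import json


def to_jsonl(prompts, outputs, stream):
    """Write one {"prompt", "text"} JSON record per line and return the count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    count = 0
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
written = to_jsonl(demo_prompts, demo_outputs, buffer)
print(written, buffer.getvalue().strip())
```

With a live engine, you would pass the real `prompts` and `outputs` lists from the cell above in place of the demo data.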

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Elianore Quasar. I'm a 25-year-old astrophysicist who works at the Galactic Research Institute. I'm currently studying the properties of black holes and their potential applications in interstellar travel. I'm a bit of a introvert, but I enjoy collaborating with my colleagues and learning from their diverse perspectives. When I'm not working, you can find me reading about the history of space exploration or practicing my piano skills. I'm looking forward to meeting new people and sharing my knowledge with others.
This is a good start, but there are a few things you can do to make it more engaging and effective:


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city. The capital of France is Paris. This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. The statement is also accurate, as Paris is indeed the capital of France. This type of statement is useful for providing a quick and easy-to-understand answer to a question, and can be used in a variety of contexts, such as in a trivia game or as a reference in a research paper. The statement is also neutral and objective, without any emotional or biased language. Overall,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze large amounts of medical data, identify patterns, and make predictions, leading to more accurate diagnoses and personalized treatment plans.
2. Advancements in natural language processing: Natural language processing (NLP) is a subset of AI that enables computers to understand and generate human language. Future advancements in NLP could lead to more sophisticated chat
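
The cell above mentions "overlap removal". As a rough illustration of that idea, the sketch below merges text chunks while dropping any prefix of a new chunk that already ends the accumulated text. It is a minimal sketch under assumptions, not necessarily how `stream_and_merge` is implemented; `merge_chunks` and the demo chunks are hypothetical.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate text chunks, dropping any overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats the tail of the previous one.
demo_chunks = ["The capital", " capital of France", "France is Paris."]
print(merge_chunks(demo_chunks))
```

Note that a purely textual heuristic like this cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it fits best when chunks are known to repeat earlier text.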



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elara. I'm a 22-year-old woman living in the city of Elyria, working as a librarian at the city's central library. I'm interested in reading and learning about history, especially the ancient civilizations of the world. I enjoy quiet evenings spent reading or listening to music, and I'm an avid collector of rare books and manuscripts. That's a bit about me. What do you want to know next? ( neutral, just facts) Feel free to ask follow-up questions! 
Ask Elara a follow-up question: What do you like about the ancient civilizations of the world, and is there a particular one

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Note: This answer is 10 words long. This response is factual, concise, and accurate. It directly answers the question without any unnecessary elaboration. It meets the required length and

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akira Nakamura. I'm a 25-year-old freelance writer living in Tokyo. I enjoy writing short stories and poetry, and I'm currently working on a novel. When I'm not writing, you can find me exploring the city, reading at a local cafe, or practicing yoga. That's me in a nutshell.
Is the text a neutral self-introduction?
Yes, the text is a neutral self-introduction.
Can you identify the character's personality traits from the text?
The character is likely to be introverted, creative, and possibly reserved. They value their alone time and enjoy engaging in solo activities like reading and writing



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the north-central part of the country. Paris is situated on the Seine River and is home to numerous iconic landmarks and historical monuments.
Paris, the capital city of France, is known for its rich history, cultural significance, and picturesque landscapes. Some of its famous landmarks include the Eiffel Tower, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its art museums, fashion, and cuisine.
Paris is a popular tourist destination, attracting millions of visitors each year. The city is known for its romantic atmosphere, beautiful parks, and charming neighborhoods. Visitors can enjoy



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, and the possibilities are endless. Here are some potential future trends in AI:
1. Increased use of AI in various industries: As AI becomes more pervasive, it's likely to be used in a wider range of industries, such as healthcare, finance, education, and transportation. AI can help automate tasks, improve decision-making, and enhance customer experiences.
2. Advancements in natural language processing (NLP): NLP is a subset of AI that enables computers to understand and generate human language. Future advancements in NLP could lead to more sophisticated chatbots, virtual assistants, and language translation tools.
3. Rise of explainable
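
To illustrate how streamed chunks can be printed without repeats in an async loop, the sketch below wraps an async stream and yields only the text that has not been emitted yet. It assumes each chunk is a dict whose `text` field may hold everything generated so far; check the behaviour of your SGLang version. `fake_stream` is a stand-in for an engine stream and `stream_deltas` is a hypothetical helper, not the implementation of `async_stream_and_merge`.

```python
import asyncio


async def fake_stream():
    """Stand-in for an engine stream: yields dicts with cumulative text."""
    for text in [" Paris", " Paris is", " Paris is the capital."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the text that is new relative to what was already emitted."""
    seen = ""
    async for chunk in chunks:
        text = chunk["text"]
        if text.startswith(seen):
            # Cumulative chunk: keep only the part after the known prefix.
            delta = text[len(seen):]
            seen = text
        else:
            # Chunk is not cumulative: treat it as a plain increment.
            delta = text
            seen += text
        if delta:
            yield delta


async def collect():
    return [delta async for delta in stream_deltas(fake_stream())]


print(asyncio.run(collect()))
```

In a real loop you would print each delta with `end=""` and `flush=True`, as the cell above does, instead of collecting the deltas into a list.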




In [6]:
llm.shutdown()