# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
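
As a rough illustration of what such a server does, the sketch below wraps an engine's `async_generate` call in a plain request handler. It avoids any web framework and does not import SGLang: `EchoEngine` is a hypothetical stand-in that mimics only the `{"text": ...}` output shape used later in this document, and `handle_generate` and the payload keys (`prompt`, `sampling_params`) are assumptions made for illustration. The linked `custom_server.py` remains the reference example.

```python
import asyncio
from typing import Any, Dict


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; returns a canned completion."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]:
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" <completion for: {prompt!r}>"}


async def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a request payload, call the engine, and shape the response."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}

    sampling_params = payload.get("sampling_params", {})
    output = await engine.async_generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}


# Exercise the handler with a valid and an invalid payload.
ok = asyncio.run(handle_generate(EchoEngine(), {"prompt": "Hello"}))
bad = asyncio.run(handle_generate(EchoEngine(), {}))
```

Keeping the handler independent of the transport means the same function could sit behind whichever HTTP or RPC framework you choose.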



## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # imported only when running in CI


# Load the model and start the engine; no HTTP server is involved.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Laila and I am a 22-year-old university student. I am a non-practicing Muslim, but I still maintain a deep respect and love for my religion and culture. I have been feeling increasingly disconnected from my faith lately and I'm hoping to find a community and support system that can help me get back on track.
I've been attending a local mosque, but I've found that the culture and atmosphere there is quite conservative, and I often feel like an outsider. I'm not sure if I'm ready to commit to a specific branch of Islam or follow all of the traditional practices, but I do want to explore
Prompt: The president of the United States is
Generated text:  the leader of the federal government and the head of state of the United States. The president is elected through the Electoral College and serves a four-year term. The president has various powers and responsibilities, including commanding the armed forces, conducting foreign policy, and making appoi
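
When a prompt list is large, one option is to submit it in fixed-size chunks and consume results as each chunk finishes. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts plus sampling parameters and returns one dict with a `text` key per prompt, in order. `generate_in_chunks` and `FakeEngine` are hypothetical names, and the fake engine exists only so the snippet runs without a model. Whether chunking helps depends on your workload; it is mainly a convenience for saving progress incrementally.

```python
from typing import Any, Dict, Iterator, List, Tuple


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that upper-cases each prompt."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        return [{"text": prompt.upper()} for prompt in prompts]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, one engine call per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


demo_prompts = ["a", "b", "c"]
results = list(
    generate_in_chunks(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
```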

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys reading and playing video games in my free time. I'm a bit of a introvert, but I'm working on being more outgoing. I'm a junior, so I'm trying to balance schoolwork and extracurricular activities. I'm not really sure what I want to do with my life yet, but I'm exploring different options. That's me in a nutshell. What do you think? Is this a good self-introduction for a character?
This is a good start, but it's a bit too straightforward and lacks some depth. Consider adding

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris. 
This statement is a concise factual statement about France’s capital city. It provides a clear and direct answer to the question, without any additional information or context. It is a simple and straightforward statement that can be used as a fact or a piece of trivia. 
Note: This response is a direct answer to the question and does not require any additional information or analysis. It is a simple and concise statement that provides a clear and direct answer.  The tone is neutral and objective, providing a factual statement without any emotional or persuasive language.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even larger role in healthcare, with applications such as:
a. Predictive analytics: AI will be used to analyze large amounts of medical data to predict patient outcomes and identify high-risk patients.
b. Personalized medicine: AI will be used to develop personalized treatment plans based on a patient's genetic profile, medical history, and lifestyle.
c. Robotics
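
The "overlap removal" mentioned in the cell above can be pictured like this: if a streamed chunk begins with text that has already been received, drop that repeated prefix before appending the chunk. The sketch below shows one simple way to do so. It is not necessarily identical to what `stream_and_merge` does internally, and `drop_repeated_prefix` and `merge_chunks` are illustrative names.

```python
from typing import Iterable


def drop_repeated_prefix(received: str, chunk: str) -> str:
    """Return `chunk` minus its longest prefix that is also a suffix of `received`."""
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(min(len(received), len(chunk)), 0, -1):
        if received.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital"])
```

One caveat of this heuristic: it cannot tell an accidental overlap from intentional repetition, so a chunk that legitimately repeats the end of the previous text would also be trimmed.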



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Juliana Patel. I'm a 25-year-old communications specialist with a background in marketing and public relations. I work for a small creative agency in Chicago, helping businesses develop their brand identities and online presence. Outside of work, I enjoy practicing yoga, reading contemporary fiction, and exploring local art galleries. I'm an organized and driven individual who values creativity and community. That's me in a nutshell!  
What can be improved in this self-introduction?
A. The self-introduction is too short and lacks details about the character's background.
B. The self-introduction is too long and includes unnecessary personal details.
C. The

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
This response is a simple and factual statement about the capital of France. There is no need for ad
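
An asynchronous call can be awaited alongside other work. As a sketch of that idea, the snippet below issues one request per prompt and uses a semaphore to bound how many are in flight at once. It assumes `async_generate` accepts a single prompt and returns a dict with a `text` key; `SlowFakeEngine`, `generate_all`, and `max_in_flight` are hypothetical and exist only so the example runs without a model.

```python
import asyncio
from typing import Any, Dict, List


class SlowFakeEngine:
    """Hypothetical stand-in for sgl.Engine with a small artificial delay."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]:
        await asyncio.sleep(0.01)
        return {"text": f" [reply to {prompt}]"}


async def generate_all(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[str]:
    """Run one request per prompt, with at most `max_in_flight` running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def generate_one(prompt: str) -> str:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # asyncio.gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


texts = asyncio.run(
    generate_all(SlowFakeEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
```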

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge wraps async_generate and yields each chunk with repeated text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Akira Katsuragi. I'm a 17-year-old high school student from Tokyo, Japan. I've lived in this city all my life, but I don't really know it that well. My family owns a small, family-run shop in a quiet neighborhood, where I spend most of my free time. When I'm not studying or helping out at the shop, I enjoy reading, hiking, and listening to music. I'm a bit of a introvert, but I'm always up for a quiet conversation.
Name: Akira Katsuragi
Age: 17
Location: Tokyo, Japan
Family

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Provide two to three sentences explaining the importance of the city in the context of the country’s history and culture. Paris, the capital of France, has been an important center of politics, culture, and education for centuries. It has been home to many influential artists, writers, and thinkers throughout history, and has been the site of numerous significant historical events, including the French Revolution. The city's rich cultural heritage has had a lasting impact on the country's identity and continues to shape the nation's artistic, literary, and intellectual pursuits.
The following text about France's capital city is accurate but lacks sufficient supporting evidence and detail to provide

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving, with new technologies and applications emerging all the time. Here are some possible future trends in artificial intelligence that are worth considering:
1. Increased adoption of AI in industries: AI is expected to become more prevalent in industries such as healthcare, finance, and transportation, leading to increased efficiency and productivity.
2. Advancements in natural language processing (NLP): NLP is a key area of AI research, and it is expected to improve significantly in the coming years, enabling more sophisticated human-computer interaction.
3. Rise of edge AI: Edge AI refers to the processing of AI models at the edge


In [6]:
llm.shutdown()
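
Since the engine is shut down explicitly, it can be worth guaranteeing that `shutdown()` runs even when generation raises an exception. The sketch below uses `contextlib.contextmanager` for that. `managed_engine` and `RecordingEngine` are hypothetical, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class RecordingEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and call shutdown() on exit, even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


# Normal exit: the engine is open inside the block and closed afterwards.
with managed_engine(RecordingEngine) as engine:
    assert not engine.closed
closed_after_exit = engine.closed

# Exit via an exception: shutdown() still runs.
failed_engine = None
try:
    with managed_engine(RecordingEngine) as failed_engine:
        raise RuntimeError("generation failed")
except RuntimeError:
    pass
```

With SGLang, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, mirroring the launch cell at the top of this section.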