# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## SPECIAL WARNING!!!!

**To launch the offline engine in a Python script, the launch code must sit under an** `if __name__ == "__main__":` **guard, because the engine uses** `spawn` **mode to create subprocesses. Please refer to this simple example**:

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/launch_engine.py
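
Why the guard matters: under the `spawn` start method, Python starts each child process in a fresh interpreter that re-imports the launching script under the name `__mp_main__`, so any unguarded top-level code runs again in every child. The sketch below imitates that re-import without starting real processes. The script text, the `events` list, and the `load_as` helper are illustrative assumptions, and `main()` is only a stand-in for creating `sgl.Engine(...)`.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# A tiny launch script. `main()` stands in for creating the engine.
SCRIPT = textwrap.dedent('''
    events = ["module imported"]          # top-level code: runs on every import

    def main():
        events.append("engine launched")  # stand-in for sgl.Engine(...)

    if __name__ == "__main__":
        main()
''')


def load_as(module_name: str) -> list:
    """Execute SCRIPT under the given module name and return its event log."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "launch_demo.py"
        path.write_text(SCRIPT)
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module.events


# Running the file directly: __name__ is "__main__", so main() is called.
as_script = load_as("__main__")
# What a spawned child sees: the same file, imported as "__mp_main__".
as_child = load_as("__mp_main__")

print(as_script)
print(as_child)
```

With the guard in place, only the directly executed script reaches `main()`; the simulated child stops after the top-level definitions. Without the guard, each child would try to launch the engine again.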

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Mrs. Jones and I am a 4th grade teacher at Timber Creek Elementary. I have been teaching for over 20 years and I have to say that I still love every minute of it! I am excited to be a part of the Timber Creek community and I am looking forward to getting to know each of my students and their families.
My classroom is a fun and engaging place to learn, where students are encouraged to take risks, try new things, and learn from their mistakes. We will have a lot of fun learning this year, and I am excited to see each of my students grow and succeed.
When I am not teaching
Prompt: The president of the United States is
Generated text:  directly accountable to the American people, not to the corporate interests that seek to exploit the country's resources and undermine its sovereignty.
As part of our foreign policy, the president has the authority to impose tariffs on foreign goods to protect American workers and industries from unfair competition.

### Streaming Synchronous Generation
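
As a rough mental model for the `stream_and_merge` helper used in the cell below: a streamed response arrives as a sequence of chunks, and a chunk may repeat text that an earlier chunk already delivered. One way to merge such a stream is to drop the longest prefix of each new chunk that the accumulated text already ends with. The sketch below assumes that behavior; `drop_overlap`, `merge_stream`, and the chunk list are hypothetical names and data, not SGLang's implementation.

```python
from typing import Iterable


def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_stream(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one partly repeats the previous one.
fake_chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
merged_text = merge_stream(fake_chunks)
print(repr(merged_text))
```

Prefix trimming of this kind cannot tell a repeated chunk from a legitimate repetition in the text (for example `"ha"` followed by `"ha"` merges to `"ha"`), so it is a heuristic rather than an exact reconstruction.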

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 25-year-old freelance writer living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new coffee shops. I'm currently working on a novel and trying to get my writing career off the ground. I'm a bit of a introvert, but I love meeting new people and hearing their stories. I'm always up for a good conversation and a cup of coffee. That's me in a nutshell. How would you describe Kaida? What are some possible character traits that could be inferred from this introduction? Based on this introduction, what are some possible directions for Kaida

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris.
Provide a concise factual statement about France’s capital city.
The capital of France is Paris.  Paris is a city located in the northern part of France, along the Seine River. It is known for its iconic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is also a major cultural and economic center, with a rich history dating back to the Middle Ages. The city is home to many famous artists, writers, and intellectuals, and is often referred to as the "City of Light."  Paris is a popular tourist destination, attracting millions of visitors each year.  The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased Adoption in Healthcare: AI is expected to play a significant role in healthcare, particularly in areas such as medical diagnosis, personalized medicine, and patient care. AI-powered systems will help doctors diagnose diseases more accurately and provide personalized treatment plans.
2. Rise of Explainable AI: As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. Explainable AI (XAI) will become increasingly important to ensure transparency and accountability in AI decision-making.
3.



### Non-streaming Asynchronous Generation
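
The point of the asynchronous call is that the event loop stays free while generation is pending, so other coroutines can make progress. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape used in the cell below (a list of prompts in, a list of dicts with a `"text"` key out); it is not part of SGLang and performs no inference.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; it does not run a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(log):
    """Some unrelated work that should keep running during generation."""
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0)


async def run_demo():
    engine = FakeEngine()
    log = []
    # Both coroutines share one event loop; awaiting generation does not block it.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(run_demo())
print(outputs)
print(log)
```

The results come back in the same order as the prompts, and the heartbeat finishes while the fake generation call is still pending.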

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Maya Blackwood. I'm a 28-year-old librarian who works at the local library in a small town in the Pacific Northwest. I spend my free time reading, hiking, and practicing yoga. I'm a bit of a introvert, but I enjoy connecting with others and learning about their interests.
Use at least two of the following words in your introduction: quiet, book-loving, nature-obsessed, thoughtful, reserved
Here is the revised self-introduction: Hi, I'm Maya Blackwood, a 28-year-old librarian who works at the local library in a small town surrounded by lush forests and scenic trails. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country on the Seine River. Paris is one of the world’s most populous cities with over 2.1 million residents within the city limits. However, the me

### Streaming Asynchronous Generation
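
The cell below consumes `async_stream_and_merge` with `async for`, printing each cleaned chunk as it arrives. A minimal sketch of that shape is an async generator that wraps another async stream and yields only text that has not been emitted yet. It assumes that chunks may repeat earlier text (here each chunk carries the full text so far); `fake_async_stream`, `stream_without_repeats`, and `drop_overlap` are hypothetical names, not SGLang's implementation.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


async def fake_async_stream(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming response: yields chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def stream_without_repeats(stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the part of each chunk that has not been emitted yet."""
    merged = ""
    async for chunk in stream:
        fresh = drop_overlap(merged, chunk)
        merged += fresh
        if fresh:
            yield fresh


async def collect() -> List[str]:
    # Made-up cumulative chunks: each one repeats everything before it.
    source = fake_async_stream([" Paris", " Paris is", " Paris is the capital"])
    return [piece async for piece in stream_without_repeats(source)]


pieces = asyncio.run(collect())
print(pieces)
```

Because the wrapper yields as soon as each chunk is cleaned, the caller can print incrementally instead of waiting for the full response.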

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Ethan Martin, and I'm a 22-year-old graphic designer living in the city. I enjoy hiking and reading in my free time.
This response is neutral because it doesn't express a personal preference or attitude towards graphic design, hiking, or reading. It simply states facts about the character.
Next, I'd like to introduce the character in a more creative way. Here's a possible example:
As I trudge through the city streets, my sketchbook clutched tightly in my hand, I feel most at home. That's because I'm Ethan Martin, a 22-year-old graphic designer with a passion for bringing ideas to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country. The city is built along the Seine River. Paris is a major economic, political, and cultural center. The city is famous for its iconic landmarks, art museums, and fashion industry.
Here is a summary of the main points about France’s capital city: Paris is the capital of France, located in the north of the country, built along the Seine River, and is a major economic, political, and cultural center. The city is famous for its iconic landmarks, art museums, and fashion industry.
The main arguments for Paris being the capital of France are:
1

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very dynamic and is expected to evolve significantly over the coming years. As AI continues to advance, we can expect to see many new trends and innovations that will shape the future of AI.
Possible Future Trends in Artificial Intelligence:
1. Increased Adoption of Edge AI:
Edge AI is a technique of processing AI models in real-time, directly on devices and sensors. This approach is becoming increasingly popular due to its ability to provide faster and more efficient AI processing, reducing latency and improving performance. We can expect to see more widespread adoption of edge AI in various industries, such as healthcare, finance, and manufacturing.
2. Growing Use of Explainable AI


In [6]:
llm.shutdown()