# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an HTTP server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

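As a rough illustration of the custom-server idea, the sketch below shows a framework-agnostic request handler that wraps an engine object. The function name `handle_generate_request` and the JSON field names (`prompt`, `sampling_params`) are hypothetical choices made for this sketch and are not part of SGLang. The only engine behavior assumed is a `generate(prompt, sampling_params)` call that returns a dict with a `text` key for a single prompt. The linked `custom_server` script remains the reference for a real server.

```python
import json


def handle_generate_request(engine, body: bytes):
    """Turn a JSON request body into a (status_code, response_body) pair.

    `engine` is any object exposing generate(prompt, sampling_params) that
    returns a dict with a "text" key (an assumption made for this sketch).
    """
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object payload, or a missing "prompt" field.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    # Sampling parameters are optional; default to an empty dict.
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()
```

Keeping the handler independent of any web framework means it can be called from whichever server library you prefer, and it can be tested with a stub engine before a model is loaded.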
## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.utils import async_stream_and_merge, stream_and_merge

# Creating the engine loads the model weights, so this step can take a while.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lexie, and I am a 3-year-old, 22-pound, adorable, and playful Chihuahua. I was born and raised in a loving home with my sister, Lily. My mom is a Chihuahua, and my dad is a mix of Chihuahua and Pomeranian. I have a big personality for a small dog. I love to play with toys, go on walks, and cuddle with my humans.
I am a quick learner and have mastered basic obedience commands, including "sit," "stay," "come," and "shake." I also know "down" and "roll
Prompt: The president of the United States is
Generated text:  not the head of the Federal Reserve. The president appoints the head of the Federal Reserve, known as the Chairman of the Federal Reserve, subject to Senate confirmation. The Chairman of the Federal Reserve serves a 14-year term, though is not automatically reappointed. The Chairman of the Federal Reserve is responsible for setting monetary policy and overseeing the operation of the Federal Reserve, the central bank of the United States

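The batch call above can also be wrapped in a small helper when the prompt list is very long and you would rather process results incrementally, for example to write them to disk as you go. This is only a sketch: `generate_in_chunks` and `chunk_size` are hypothetical names, and the helper assumes nothing about the engine beyond the `generate(prompts, sampling_params)` call shown above, which returns one dict with a `text` key per prompt.

```python
from typing import Iterator, List, Tuple


def generate_in_chunks(
    engine,
    prompts: List[str],
    sampling_params: dict,
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, sending chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        # One batched call per chunk; outputs come back in prompt order.
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]
```

With the engine from this document, usage would look like `for prompt, text in generate_in_chunks(llm, prompts, sampling_params): ...`. A single large batch is usually simpler; chunking mainly helps when you want partial results early.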
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  Kaida. I'm a 17-year-old high school student who enjoys playing the guitar and listening to music. I'm a bit of a bookworm and like to read in my free time. I'm a bit shy, but I'm working on being more outgoing. I'm a junior at Springdale High School. That's me in a nutshell.
This is a good start, but it's a bit too straightforward and lacks some personality. Let's try to add a bit more flair to it. Here's a revised version: Hi, I'm Kaida. I'm a junior at Springdale High School, and when I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country, near the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a population of over 2.1 million people and is a major center for business, education, and tourism. Paris is also known for its romantic atmosphere and is often referred to as the City of Light. The city has a rich history dating back to the 3rd century BC and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  expected to be shaped by several factors, including advancements in machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems can analyze medical images, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI aims to provide transparency and interpretability into AI decision-making processes, which will be essential for


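The "overlap removal" mentioned in the header above can be pictured with a small standalone sketch. It assumes, purely for illustration, that successive streamed chunks may repeat text that has already been received; the helper names `strip_overlap` and `merge_chunks` are hypothetical, and this is not presented as how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the first match is the largest.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged
```

For example, `merge_chunks(["Hello", "lo wor", "world"])` produces `"Hello world"`. One caveat of this approach is that a chunk which legitimately repeats the end of the text so far would be trimmed as well.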

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Bridget Cranshaw and I'm a 25-year-old freelance writer living in a small town on the outskirts of New York City. I've been working on various writing projects for several years now, including articles, short stories, and even a novel in progress. When I'm not writing, I enjoy hiking, trying out new local breweries, and spending time with my family and friends. That's a bit about me! I'm looking forward to seeing what this online community has to offer. How do you like the self-introduction?
The self-introduction is neutral, which is good for an online community. It provides a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the north-central part of the country. It is situated on the Seine River and is known for its rich history, art museums, and architectural landmarks. The Eiffel Tower, No

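When prompts arrive one at a time rather than as a ready-made list, asynchronous calls can be issued individually and capped with a semaphore. The sketch below is an assumption-laden illustration: `generate_concurrently` and `max_in_flight` are hypothetical names, and it assumes `async_generate` also accepts a single prompt string and returns a single dict with a `text` key, which this document does not demonstrate.

```python
import asyncio
from typing import List


async def generate_concurrently(
    engine,
    prompts: List[str],
    sampling_params: dict,
    max_in_flight: int = 2,
) -> List[str]:
    """Run one async_generate call per prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> str:
        async with semaphore:
            # Assumption: a single prompt yields a single output dict.
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather preserves the order of the awaitables it was given.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

If all prompts are known up front, passing the whole list to `async_generate` as shown above is the simpler option and leaves scheduling to the engine.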
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge is an overlap-aware wrapper used in place of a
        # direct async_generate call; each chunk it yields is printed as it arrives.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Cyrus Blackwood, and I'm an unassuming twenty-five-year-old man living in a quiet neighborhood in the city. I work as a data analyst at a small firm and enjoy collecting vinyl records and playing chess on my weekends. I'm often lost in thought, pondering the intricacies of life and the mysteries of the universe. I'm not particularly outgoing, but I find comfort in the solitude and quiet reflections that allow me to recharge.
This self-introduction is neutral because it doesn't reveal much about Cyrus's personality, values, or motivations. It presents him as a relatable, ordinary person without any exceptional characteristics or traits that


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is one of the world’s largest and most famous cities, known for its art, fashion, cuisine, and historic landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. Paris is situated in the north-central part of the country and is the center of French culture, politics, and economy.
Here is a more detailed factual statement about Paris:
Paris, the capital city of France, is a global center for art, fashion, gastronomy, and entertainment. It is situated in the north-central part of the country, on the Seine River. With a population of over 2.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be a mix of the good and the bad. While AI has the potential to revolutionize numerous industries and improve lives, it also poses significant risks and challenges. Here are some possible future trends in AI:
1. Increased adoption in various industries:
AI is expected to become ubiquitous in various industries, including healthcare, finance, education, and transportation. Its applications will range from personalized medicine and financial forecasting to intelligent tutoring systems and autonomous vehicles.
2. Advancements in natural language processing:
Natural Language Processing (NLP) will continue to improve, enabling more sophisticated human-computer interaction. This will lead to better chatbots, voice assistants

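The streaming loop above prints chunks as they arrive. If you also want the complete text afterwards, a small collector can wrap any async iterator of text chunks. This is a sketch: `collect_stream` and `fake_token_stream` are hypothetical helpers, and the fake stream exists only so the example runs without a loaded model.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming engine call, for local testing."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield token
```

With the engine from this document, the call inside a coroutine would look like `text = await collect_stream(async_stream_and_merge(llm, prompt, sampling_params), on_chunk=lambda c: print(c, end="", flush=True))`.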

In [6]:
llm.shutdown()