# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emma and I am a teacher. I'm here today to talk about the importance of promoting literacy in our schools and communities.
Reading is a fundamental skill that is essential for academic success and personal growth. It's not just about reading books, but also about developing critical thinking, problem-solving, and communication skills.
As a teacher, I've seen firsthand the impact that literacy can have on a child's life. A child who struggles with reading may feel frustrated and disengaged, while a child who is a confident reader may feel empowered and motivated.
That's why it's so important for us to promote literacy in our schools and communities.
Prompt: The president of the United States is
Generated text:  a key figure in the country's political landscape, with both significant powers and great responsibility. The president serves as the head of state and the head of government, and is both the commander-in-chief of the military and the ch
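
For offline jobs it is often convenient to keep each prompt together with its completion and write the results to disk. The following is a minimal sketch of that idea, not part of the SGLang API: `run_batch`, `to_jsonl`, and `EchoEngine` are hypothetical names introduced here, and `EchoEngine` is only a stand-in so the snippet runs without a GPU. It assumes, as in the cell above, that `generate` returns one dict with a `"text"` field per prompt.

```python
import json


def run_batch(engine, prompts, sampling_params):
    """Generate once for the whole batch and pair each prompt with its text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


class EchoEngine:
    """Hypothetical stand-in so the sketch runs without a GPU; not part of SGLang."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


records = run_batch(EchoEngine(), ["Hello, my name is"], {"temperature": 0.8})
print(to_jsonl(records))
```

With the real engine, passing `llm` in place of `EchoEngine()` should behave the same way, provided the output format matches the assumption above.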

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Rachael and I am the current editor-in-chief of the CJP student newspaper. My goal is to ensure that the students at Huddersfield Royal Infirmary have a voice and are represented fairly through the newspaper. The CJP is a student-led organization that aims to provide a platform for students to express their views, experiences, and concerns. We strive to create engaging content, covering a wide range of topics relevant to the student body, including healthcare, education, and lifestyle.
I am a nursing student and have been involved with the CJP for a few years now, starting out as a contributor and then working my way up

Prompt: The capital of France is
Generated text:  a city of eternal beauty and a city of vibrant culture. It is a city that has been at the forefront of history and art for centuries. From the Eiffel Tower to the Louvre Museum, the City of Light has something to offer to everyone.
Paris is a city of many moods. It can be dark and mysterious, like the Seine River, or bright and cheerful, like the Eiffel Tower. It can be elegant and refined, like the Palace of Versailles, or bohemian and artistic, like Montmartre.
The city has a wide range of attractions and activities that cater to all interests and

Prompt: The future of AI is
Generated text:  bright, but it's not all about the technology

The future of AI is bright, but it's not all about the technology

Artificial intelligence (AI) is no longer a futuristic concept. It’s here and now, transforming various industries and aspects of our lives. While the technology itself is indeed exciting, its future success largely depends on factors beyond the code and algorithms.
Here are some non-technical aspects that will shape the future of AI:
1. Data governance and regulation: AI relies heavily on data, which raises concerns about data privacy, security, and ownership. As AI becomes more pervasive, governments and organizations will need to
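
When the full completion is needed after streaming has finished, the chunks can be accumulated while they are displayed. The helper below is a minimal sketch under one assumption: each chunk's `"text"` field carries only the newly generated piece, as the streamed output above suggests. If chunks turned out to be cumulative in your version, the last chunk alone would already hold the full text. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely imitates `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks):
    """Concatenate the "text" field of each streamed chunk into one string."""
    pieces = []
    for chunk in chunks:
        pieces.append(chunk["text"])
    return "".join(pieces)


def fake_stream():
    """Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True)."""
    for piece in [" Paris", " is", " a", " city", "."]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(repr(full_text))
```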




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Laura and I am a 25-year-old graduate student in pursuit of a master’s degree in marine science. I am originally from Florida, where I grew up surrounded by the ocean and developed a deep passion for the marine ecosystem. During my undergraduate studies, I became fascinated with the study of marine ecology, particularly in the realm of coral reefs and their associated biodiversity.
My current research focuses on the effects of climate change on coral reefs, specifically in the Caribbean. I am investigating how increased sea surface temperature and ocean acidification impact coral growth and recruitment. My ultimate goal is to contribute to the development of effective conservation strategies for coral reefs, which

Prompt: The capital of France is
Generated text:  one of the most popular tourist destinations in the world, and for good reason. The City of Light, as it is known, is famous for its stunning architecture, rich history, and vibrant
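
The asynchronous API also makes it possible to issue separate requests concurrently, for example when each prompt needs its own sampling parameters. The sketch below illustrates this with `asyncio.gather`. It assumes that calling `async_generate` with a single prompt string returns a single output dict (the cell above only shows the list form), and `generate_each` and `FakeAsyncEngine` are hypothetical names; the fake engine exists only so the snippet runs without a GPU.

```python
import asyncio


async def generate_each(engine, jobs):
    """Issue one async_generate call per (prompt, sampling_params) pair, concurrently."""
    tasks = [engine.async_generate(prompt, params) for prompt, params in jobs]
    outputs = await asyncio.gather(*tasks)  # results come back in job order
    return [output["text"] for output in outputs]


class FakeAsyncEngine:
    """Hypothetical stand-in so the sketch runs without a GPU; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" [T={sampling_params['temperature']}] {prompt}"}


jobs = [
    ("The capital of France is", {"temperature": 0.0}),
    ("The future of AI is", {"temperature": 0.8, "top_p": 0.95}),
]
texts = asyncio.run(generate_each(FakeAsyncEngine(), jobs))
print(texts)
```

`asyncio.gather` preserves the order of its arguments, so `texts[i]` corresponds to `jobs[i]` regardless of which request finishes first.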

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Sean (aka Lux) and I am a 23 year old american living in Tokyo. I've been living here for over 5 years now and I have to say, it's one of the most amazing places I've ever been to. The food, the culture, the people, the scenery... everything is just so unique and fascinating.
When I first arrived, I was a bit overwhelmed by the language barrier and the sheer amount of people, but I quickly adapted and began to explore the city on my own. I fell in love with the city's vibe and the food, especially the ramen and udon noodles. I also became

Prompt: The capital of France is
Generated text:  a must-visit destination for anyone who loves history, culture, art, fashion, and food. Here are the top 10 things to do in Paris:
1. Visit the Eiffel Tower: The iconic iron lady of Paris, the Eiffel Tower is a must-visit attraction. You can take the stairs or elevator to the top for breathtaking views of the city.
2. Explore the Louvre Museum: The world-famous museum is home to an impressive collection of art and artifacts, including the Mona Lisa. Don't miss the stunning glass pyramid entrance designed by I.M. Pei.
3. Walk along the Se

Prompt: The future of AI is
Generated text:  increasingly being shaped by edge AI, which brings AI computing closer to the end-user devices. The result is reduced latency, lower power consumption, and increased efficiency. Here’s an overview of edge AI, its applications, and the key benefits.
What is Edge AI?
Edge AI refers to the deployment of AI algorithms and models on edge devices such as smartphones, smart home devices, IoT sensors, and industrial equipment. These devices are typically connected to the cloud or a local network, but the AI processing occurs on the device itself, rather than in the cloud.
The main goal of edge AI is to reduce the latency and bandwidth requirements associated with sending
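
One reason to stream is to observe how quickly the first piece of output arrives. The sketch below wraps the streaming pattern from the cell above and records the time until the first chunk. `stream_with_latency` and `FakeStreamingEngine` are hypothetical names introduced for illustration; the fake engine only mimics the awaited `async_generate(..., stream=True)` call so the snippet runs without a GPU, and the sketch assumes each chunk carries only newly generated text.

```python
import asyncio
import time


async def stream_with_latency(engine, prompt, sampling_params):
    """Consume a stream; return (full_text, seconds until the first chunk arrived)."""
    start = time.perf_counter()
    first_chunk_latency = None
    pieces = []
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    async for chunk in generator:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        pieces.append(chunk["text"])
    return "".join(pieces), first_chunk_latency


class FakeStreamingEngine:
    """Hypothetical stand-in so the sketch runs without a GPU; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params, stream=False):
        async def chunks():
            for piece in [" edge", " AI", "."]:
                await asyncio.sleep(0)  # yield control between chunks
                yield {"text": piece}

        return chunks()


text, latency = asyncio.run(
    stream_with_latency(FakeStreamingEngine(), "The future of AI is", {"temperature": 0.8})
)
print(repr(text), latency)
```

If the stream yields no chunks at all, the returned latency stays `None`, which callers may want to handle explicitly.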




In [6]:
llm.shutdown()
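
In a standalone script, an exception raised during generation would skip the final `llm.shutdown()` call. One way to guard against that is a small context manager, sketched below. `managed_engine` and `FakeEngine` are hypothetical names, not part of SGLang; the only engine method the sketch relies on from this document is `shutdown()`.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(make_engine):
    """Create an engine and always call shutdown() when the block exits."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in so the sketch runs without a GPU; not part of SGLang."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


with managed_engine(FakeEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.closed)
```

With the real engine, the factory argument could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, reusing the constructor call from the first cell.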