# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling to prevent out-of-memory (OOM) errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

[2024-11-03 05:05:53] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=711836784, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_

[2024-11-03 05:06:08 TP0] Init torch distributed begin.


[2024-11-03 05:06:09 TP0] Load weight begin. avail mem=78.59 GB


[2024-11-03 05:06:09 TP0] lm_eval is not installed, GPTQ may not be usable


INFO 11-03 05:06:10 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.23it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.12it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.11it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.33it/s]

[2024-11-03 05:06:13 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.50 GB
[2024-11-03 05:06:14 TP0] Memory pool end. avail mem=8.37 GB
[2024-11-03 05:06:14 TP0] Capture cuda graph begin. This can take up to several minutes.


[2024-11-03 05:06:21 TP0] max_total_num_tokens=442913, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
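The startup log above reports `max_total_num_tokens=442913`. As a rough sanity check on that number, the sketch below estimates how much memory a KV cache of that size would need. It assumes the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128) and bfloat16 storage; these values are assumptions that do not appear in the log, and the calculation is a generic estimate rather than SGLang's own accounting.

```python
# Rough KV-cache size estimate for the run logged above.
# The model shape values are assumptions taken from the published
# Llama 3.1 8B configuration, not from the SGLang log.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bfloat16

MAX_TOTAL_NUM_TOKENS = 442_913  # from the max_total_num_tokens log line


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    # One key vector and one value vector per layer and per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
pool_gib = per_token * MAX_TOTAL_NUM_TOKENS / 2**30

print(f"KV bytes per token: {per_token}")
print(f"Estimated KV pool: {pool_gib:.2f} GiB")
```

Under these assumptions the estimate comes to roughly 54 GiB, which is in the same range as the drop in available memory between the "Load weight end" line (63.50 GB) and the "Memory pool end" line (8.37 GB). The log does not say what else is allocated in that step, so the match is only approximate.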


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

[2024-11-03 05:06:21 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:21 TP0] Prefill batch. #new-seq: 3, #new-token: 17, #cached-token: 3, cache hit rate: 11.54%, token usage: 0.00, #running-req: 1, #queue-req: 0


[2024-11-03 05:06:22 TP0] Decode batch. #running-req: 4, #token: 181, token usage: 0.00, gen throughput (token/s): 195.51, #queue-req: 0


[2024-11-03 05:06:22 TP0] Decode batch. #running-req: 4, #token: 341, token usage: 0.00, gen throughput (token/s): 550.13, #queue-req: 0


[2024-11-03 05:06:22 TP0] Decode batch. #running-req: 4, #token: 501, token usage: 0.00, gen throughput (token/s): 545.61, #queue-req: 0
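The `sampling_params` dictionary in this cell sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy list of logits. It is a generic pure-Python illustration, not SGLang's sampler, and the function name `sample_top_p` is hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling and nucleus filtering.

    Assumes temperature > 0 and 0 < top_p <= 1.
    """
    # Temperature rescales the logits before the softmax:
    # values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one index from the kept tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_top_p(toy_logits))
```

With a dominant logit and a small `top_p`, only the most likely token survives the filter, so the draw becomes deterministic; with `top_p` close to 1, almost every token stays eligible.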


### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Donna and I'm a Program Associate for the General Education Program at West Virginia University.
The General Education Program at WVU has a mission to provide a liberal education that fosters intellectual curiosity, creativity, critical thinking, and a commitment to lifelong learning. I work with our general education faculty to develop curriculum, policies, and programming that support this mission.
In my free time, I enjoy hiking, reading, and trying out new recipes in the kitchen. I'm also a proud WVU alumna, and I love cheering on the Mountaineers on game days! I'm excited to be a part of this community and look forward to connecting

[2024-11-03 05:06:22 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 25.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:22 TP0] Decode batch. #running-req: 1, #token: 39, token usage: 0.00, gen throughput (token/s): 190.06, #queue-req: 0
[2024-11-03 05:06:23 TP0] Decode batch. #running-req: 1, #token: 79, token usage: 0.00, gen throughput (token/s): 140.06, #queue-req: 0
[2024-11-03 05:06:23 TP0] Decode batch. #running-req: 1, #token: 119, token usage: 0.00, gen throughput (token/s): 138.81, #queue-req: 0


Generated text: Paris, the most visited city in the world and the City of Light, famous for its stunning architecture, fashion, art museums, and romantic atmosphere.
Paris is a must-visit destination for anyone interested in history, culture, art, fashion, and romance. Here are some of the top attractions and experiences to explore:
1. The Eiffel Tower: The iconic symbol of Paris and one of the most recognizable landmarks in the world.
2. The Louvre Museum: Home to the Mona Lisa and an impressive collection of art and artifacts from around the world.
3. Notre-Dame Cathedral: A beautiful and historic Gothic church that has

[2024-11-03 05:06:23 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 34.21%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:23 TP0] Decode batch. #running-req: 1, #token: 32, token usage: 0.00, gen throughput (token/s): 133.97, #queue-req: 0
[2024-11-03 05:06:24 TP0] Decode batch. #running-req: 1, #token: 72, token usage: 0.00, gen throughput (token/s): 140.02, #queue-req: 0
[2024-11-03 05:06:24 TP0] Decode batch. #running-req: 1, #token: 112, token usage: 0.00, gen throughput (token/s): 139.03, #queue-req: 0


Generated text: bright, but we need to ensure it's also fair

The development and application of Artificial Intelligence (AI) is rapidly changing the world. AI has been used in healthcare to help diagnose diseases more accurately and quickly, in finance to detect and prevent fraud, and in education to personalize learning experiences for students. AI is increasingly being used in various sectors to improve efficiency, productivity, and decision-making. However, with the increasing adoption of AI, there are concerns about its potential impact on society and the need to ensure its development and application are fair and equitable.
Bias in AI systems

One of the significant challenges in AI development is bias. AI

[2024-11-03 05:06:24 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 40.91%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:24 TP0] Decode batch. #running-req: 1, #token: 25, token usage: 0.00, gen throughput (token/s): 133.43, #queue-req: 0
[2024-11-03 05:06:24 TP0] Decode batch. #running-req: 1, #token: 65, token usage: 0.00, gen throughput (token/s): 140.20, #queue-req: 0
[2024-11-03 05:06:25 TP0] Decode batch. #running-req: 1, #token: 105, token usage: 0.00, gen throughput (token/s): 139.29, #queue-req: 0




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())

[2024-11-03 05:06:25 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 46.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:25 TP0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 10, cache hit rate: 53.23%, token usage: 0.00, #running-req: 1, #queue-req: 0
[2024-11-03 05:06:25 TP0] Decode batch. #running-req: 3, #token: 51, token usage: 0.00, gen throughput (token/s): 201.91, #queue-req: 0


[2024-11-03 05:06:25 TP0] Decode batch. #running-req: 3, #token: 171, token usage: 0.00, gen throughput (token/s): 414.48, #queue-req: 0


[2024-11-03 05:06:26 TP0] Decode batch. #running-req: 3, #token: 291, token usage: 0.00, gen throughput (token/s): 411.44, #queue-req: 0
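The `cache hit rate` values in the "Prefill batch" log lines so far are consistent with a running ratio of cached tokens to all prefill tokens (new plus cached) since the engine started. The sketch below recomputes that ratio from the `#new-token` and `#cached-token` fields logged in this document. This reading is inferred from the numbers alone and is not confirmed against the SGLang source; `cumulative_hit_rates` is a hypothetical helper.

```python
# (new tokens, cached tokens) for each "Prefill batch" log line up to this cell.
prefill_batches = [(6, 0), (17, 3), (1, 5), (1, 5), (1, 5), (1, 5), (2, 10)]

# Cache hit rates (in percent) printed on the same log lines.
logged_rates = [0.00, 11.54, 25.00, 34.21, 40.91, 46.00, 53.23]


def cumulative_hit_rates(batches):
    """Return the running cached / (new + cached) ratio in percent."""
    rates = []
    total_tokens = 0
    cached_tokens = 0
    for new, cached in batches:
        total_tokens += new + cached
        cached_tokens += cached
        rates.append(100.0 * cached_tokens / total_tokens)
    return rates


for computed, logged in zip(cumulative_hit_rates(prefill_batches), logged_rates):
    print(f"computed {computed:6.2f}%   logged {logged:6.2f}%")
```

The repeated `#cached-token: 5` entries fit the fact that the same prompts are reused across cells: most of each prompt is already cached and only one new token is prefilled.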


### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: María Eugenia García, I am a Psychologist, Coach, and International Coach Federation (ICF) certified coach. I have 15 years of experience in the field of Psychology and 5 years of experience as a coach. My expertise is in personal growth, well-being, and performance. I have helped numerous individuals and organizations to achieve their goals and improve their overall well-being.
My approach is holistic, person-centered, and solution-focused. I use a variety of techniques, including cognitive-behavioral therapy, positive psychology, and mindfulness to help my clients achieve their goals. I believe that each individual has the potential to grow and

[2024-11-03 05:06:26 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 55.88%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:26 TP0] Decode batch. #running-req: 1, #token: 11, token usage: 0.00, gen throughput (token/s): 343.71, #queue-req: 0
[2024-11-03 05:06:26 TP0] Decode batch. #running-req: 1, #token: 51, token usage: 0.00, gen throughput (token/s): 140.99, #queue-req: 0
[2024-11-03 05:06:27 TP0] Decode batch. #running-req: 1, #token: 91, token usage: 0.00, gen throughput (token/s): 139.23, #queue-req: 0
[2024-11-03 05:06:27 TP0] Decode batch. #running-req: 1, #token: 131, token usage: 0.00, gen throughput (token/s): 138.73, #queue-req: 0


Generated text: home to a rich cultural heritage and stunning architecture. The city has a history dating back thousands of years, with various civilizations contributing to its development. Here’s a brief overview of the history of Paris, the City of Light:
Ancient Paris (300 BC – 486 AD)
The earliest known settlement in Paris was established by the Celts around 300 BC. The city was later conquered by the Romans, who renamed it Lutetia. Under Roman rule, Paris became an important center for trade and commerce.
Medieval Paris (486 – 987 AD)
With the fall of the Roman Empire, Paris was conquered by the Fr

[2024-11-03 05:06:27 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 58.11%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:27 TP0] Decode batch. #running-req: 1, #token: 44, token usage: 0.00, gen throughput (token/s): 134.84, #queue-req: 0
[2024-11-03 05:06:27 TP0] Decode batch. #running-req: 1, #token: 84, token usage: 0.00, gen throughput (token/s): 139.96, #queue-req: 0
[2024-11-03 05:06:28 TP0] Decode batch. #running-req: 1, #token: 124, token usage: 0.00, gen throughput (token/s): 138.95, #queue-req: 0


Generated text: not just about computing power, but also about how humans interact with machines. We're witnessing a surge in conversational AI, which is redefining how we engage with technology.
Conversational AI, also known as chatbots or conversational interfaces, is a type of artificial intelligence that enables humans to communicate with machines through natural language. This technology has been around for decades, but recent advancements in machine learning and natural language processing (NLP) have made it more accessible, affordable, and user-friendly.
Conversational AI has numerous applications across industries, from customer service and support to healthcare, finance, and education. It can help

[2024-11-03 05:06:28 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 60.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-03 05:06:28 TP0] Decode batch. #running-req: 1, #token: 37, token usage: 0.00, gen throughput (token/s): 134.47, #queue-req: 0
[2024-11-03 05:06:28 TP0] Decode batch. #running-req: 1, #token: 77, token usage: 0.00, gen throughput (token/s): 140.20, #queue-req: 0




[2024-11-03 05:06:29 TP0] Decode batch. #running-req: 1, #token: 117, token usage: 0.00, gen throughput (token/s): 139.10, #queue-req: 0
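The streaming cells print each chunk as it arrives. If you also need the complete string afterwards, one option is to collect the chunks while iterating. The sketch below shows this with a hypothetical stand-in generator (`fake_stream`) so that it runs without a GPU; it assumes each chunk's `"text"` field holds only the newly generated piece, as the output in this document suggests. If your SGLang version returns cumulative text in each chunk, keep only the last chunk instead of joining.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in for a streaming generator of {"text": ...} chunks."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": piece}


async def collect_text(stream):
    """Print chunks as they arrive and return the joined text."""
    parts = []
    async for chunk in stream:
        print(chunk["text"], end="", flush=True)
        parts.append(chunk["text"])
    print()
    return "".join(parts)


full = asyncio.run(collect_text(fake_stream([" Paris", ",", " the", " City"])))
```

In the notebook, the same `collect_text` helper could be handed the generator obtained from `llm.async_generate(prompt, sampling_params, stream=True)` in place of `fake_stream`, under the chunk-format assumption stated above.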


In [6]:
llm.shutdown()
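Because `llm.shutdown()` is called in its own cell, an exception in an earlier cell or script step can leave the engine running. A minimal sketch of one way to guard against that in a script is a small context manager that always calls `shutdown()`. The helper `shutting_down` and the `FakeEngine` class are hypothetical; `FakeEngine` only stands in for `sgl.Engine` so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for the real engine, used only for this sketch."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with shutting_down(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)
```

With the real engine, the `with` block would wrap the generation calls, so the engine is shut down whether or not they succeed.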