# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling to prevent out-of-memory (OOM) errors on large batches. For details on the cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# Launch the offline engine.

import asyncio

import sglang as sgl
from sglang.utils import print_highlight

# Load the model into the engine; `llm` is reused by every cell below.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-05 08:17:07 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.37it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
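
For offline batch jobs it is often more useful to persist the results than to print them. The following is a minimal sketch, assuming only that each output is a dict with a `"text"` field, as in the cell above. `StubEngine` and `save_results` are hypothetical names introduced for illustration; the stub stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json
import os
import tempfile


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def save_results(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its text."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record) + "\n")


engine = StubEngine()  # assumption: swap in the real `llm` from the first cell
prompts = ["Hello, my name is", "The capital of France is"]
outputs = engine.generate(prompts, {"temperature": 0.8, "top_p": 0.95})

out_path = os.path.join(tempfile.gettempdir(), "sglang_batch_outputs.jsonl")
save_results(prompts, outputs, out_path)
```

Writing JSON Lines keeps each prompt next to its completion, so a later step can read the file back one record at a time.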

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Emily!

I'm a 24-year-old student from London, UK. I'm currently studying journalism and media, and I'm hoping to become a sports journalist or a TV presenter in the future.

I've always been passionate about sports, especially football, and I love following the latest news and developments in the world of football. I'm also a huge fan of The Fader, Complex and i-D, and I love keeping up with the latest music, fashion and pop culture trends.

When I'm not studying or writing for my university newspaper, I love spending time with my friends and family, trying out new restaurants and cafes in London

Generated text: a must-visit destination for any traveler. With its rich history, stunning architecture, and vibrant culture, Paris has something to offer for every kind of visitor. Here are some tips and recommendations to help you plan your trip to Paris and make the most of your time in this beautiful city.

Paris is a large city, and getting around can be time-consuming. The city has an extensive public transportation system, including the metro, buses, and trams. You can buy a Carnet of 10 tickets or a Paris Visite Pass, which allows you unlimited travel on public transportation for a set period.

Paris is famous for its food,

Generated text: a partnership between humans and machines

by Frank Chen, CEO at Clarifai

Artificial intelligence (AI) is often seen as a replacement for humans, but the future of AI is about collaboration, not replacement. The rise of AI has raised questions about its potential impact on the job market, particularly in fields where machines can perform tasks with greater speed and accuracy than humans. However, these concerns overlook the significant benefits that AI can bring when used in conjunction with human expertise.

The current state of AI is characterized by narrow, task-specific models that excel in one area but struggle with others. These models are often trained on large datasets and
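
When the streamed text is needed as a single string rather than printed piece by piece, the chunks can be accumulated. This is a minimal sketch, assuming each chunk carries only the newly generated text in `chunk["text"]`, as the printed output above suggests. `collect_stream` and `fake_stream` are hypothetical names; `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks):
    """Join the "text" field of every streamed chunk into one string."""
    parts = []
    for chunk in chunks:
        parts.append(chunk["text"])
    return "".join(parts)


def fake_stream():
    """Hypothetical stand-in for llm.generate(prompt, params, stream=True)."""
    for piece in [" Paris", ",", " a", " must", "-", "visit", " destination"]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(repr(full_text))
```

If your engine version instead reports the full text so far in each chunk, keeping only the last chunk would be the equivalent.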




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
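
An asynchronous interface also makes it possible to issue several requests and wait for them together. The sketch below illustrates the pattern with `asyncio.gather`. `StubAsyncEngine` and `generate_all` are hypothetical names, and the stub only simulates latency; it assumes that the real engine accepts overlapping `async_generate` calls, which this document does not state explicitly.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The future of AI is"]
results = asyncio.run(
    generate_all(StubAsyncEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

Passing the whole prompt list in a single `async_generate` call, as in the cell above, remains the simpler option when all prompts are known up front.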

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Ben and I'm excited to be your guide on this journey through the world of sci-fi and fantasy literature. I've been an avid reader and fan of the genre for many years, and I'm always looking to discover new authors, books, and worlds to explore.

I'll be sharing my thoughts and recommendations on a wide range of books, from classic authors like Asimov and Tolkien to modern bestsellers like George R.R. Martin and Patrick Rothfuss. I'll also be highlighting some of the lesser-known gems of the genre, as well as introducing you to new authors and series that you might not have tried before.

Generated text: a city steeped in history and culture, with iconic landmarks and breathtaking architecture. Paris is a place where romance is alive and well, with picturesque streets, charming cafes and an arts scene that is second to none. Whether you’re looking to explore the city’s famous museums, indulge in delicious French cuisine, or simply soak up the atmosphere, Paris has something for everyone.

Must-see attractions

The Eiffel Tower: This iconic iron lady is one of the most recognizable landmarks in the world, and offers stunning views of the city from its top level.

The Louvre: One of the world’s greatest museums, the Louvre is

Generated text: being shaped by developments in Machine Learning (ML) and Deep Learning (DL). These technologies have been successful in various applications, such as image and speech recognition, natural language processing, and predictive analytics. However, they also raise important concerns around bias, transparency, and accountability.

Advancements in AI have created new opportunities for applications in healthcare, education, and finance, but also raise questions about data ownership and control, and the potential for job displacement.

This research group focuses on the development and evaluation of AI and ML systems, as well as the ethical and societal implications of these technologies. Our goal is to contribute to the responsible development and use
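
In a script, a streamed generation is usually consumed under a time limit rather than printed indefinitely. The sketch below follows the same calling pattern as the cell above (await `async_generate`, then iterate with `async for`) and wraps it in `asyncio.wait_for`. `StubStreamingEngine`, `stream_to_string`, and the 5-second timeout are assumptions chosen for illustration, not part of SGLang.

```python
import asyncio


class StubStreamingEngine:
    """Hypothetical stand-in: async_generate returns an async generator."""

    async def async_generate(self, prompt, sampling_params, stream=False):
        async def chunks():
            for piece in [" a", " city", " of", " light"]:
                await asyncio.sleep(0)  # yield control to the event loop
                yield {"text": piece}

        return chunks()


async def stream_to_string(engine, prompt, sampling_params):
    """Consume a streamed generation and return the concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def main():
    # wait_for raises asyncio.TimeoutError if streaming exceeds the limit.
    return await asyncio.wait_for(
        stream_to_string(
            StubStreamingEngine(), "The capital of France is", {"temperature": 0.8}
        ),
        timeout=5.0,
    )


text = asyncio.run(main())
print(repr(text))
```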




In [6]:
llm.shutdown()
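
Outside a notebook, it can help to guarantee that `shutdown()` runs even when generation raises an exception. One way to do this is a small context manager, sketched below. `engine_session` and `StubEngine` are hypothetical names; the stub replaces `sgl.Engine` so the example runs anywhere.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records shutdown calls."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(make_engine):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})
```

With the real engine, `make_engine` could be a function that returns `sgl.Engine(model_path=...)`, keeping model loading and cleanup in one place.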