# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an extra HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
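
The four modes differ only in how `generate` and `async_generate` are called and how their results are consumed. As a rough illustration, here is a hypothetical `StubEngine` (not part of SGLang) that imitates the call shapes used in this notebook, so the loops can be tried without a GPU. Its return types are our assumptions based on the cells below: a list of `{"text": ...}` dicts for a batch, one dict for a single prompt, and incremental chunks when streaming. They are not a specification of the real engine.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that returns a fixed completion.

    Not part of SGLang: it only mimics the call shapes used in this notebook.
    """

    def __init__(self, completion=" Paris is lovely."):
        self.completion = completion

    def _chunks(self):
        # Split the fake completion into word-sized incremental pieces.
        return [" " + word for word in self.completion.split()]

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            # Synchronous streaming: an iterator of {"text": piece} chunks.
            return ({"text": piece} for piece in self._chunks())
        if isinstance(prompt, list):
            # Batch: one output dict per prompt, in prompt order.
            return [{"text": self.completion} for _ in prompt]
        return {"text": self.completion}

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            async def chunk_stream():
                for piece in self._chunks():
                    await asyncio.sleep(0)  # let other tasks run between chunks
                    yield {"text": piece}

            # Awaiting the call yields an async generator, as in the streaming cell.
            return chunk_stream()
        return self.generate(prompt, sampling_params)
```

Binding `llm = StubEngine()` instead of `sgl.Engine(...)` lets the generation loops in the cells below run end to end, printing the stub's fixed text rather than model output.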

You can also build a custom server on top of the SGLang offline engine. A detailed example, implemented as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
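
One way such a server could look is sketched below using only Python's standard `http.server`. This is an illustration under assumptions: the `/generate` route, the JSON request shape (`prompt` plus optional `sampling_params`), and the `EchoEngine` placeholder are all hypothetical, and the linked example may use a different web framework. Any object with a compatible `generate(prompt, sampling_params)` method could be passed in place of `EchoEngine`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical placeholder for sgl.Engine that echoes the prompt."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (completion for: {prompt})"}


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, run engine.generate, and encode the reply."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request-handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404, "unknown route")
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Blocking call: handle requests one at a time until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()
```

Calling `serve(EchoEngine())` and POSTing `{"prompt": "Hello"}` to `http://127.0.0.1:8000/generate` would return the echoed text. Keeping the JSON handling in `handle_generate` separates it from the HTTP plumbing, so it can be tested without opening a socket. Because `HTTPServer` processes one request at a time, a server expecting concurrent clients would probably pair an async framework with `async_generate` instead.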

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling that prevents out-of-memory (OOM) errors on large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).
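
The core idea behind cache-aware scheduling is, roughly, to run first the requests whose prompts share the longest prefix with already-cached content, so that more of their computation can be reused. The toy sketch below is our own simplification, not SGLang's scheduler: it compares characters instead of tokens and scans a flat list of cached prompts instead of a radix tree. The function names are hypothetical.

```python
from __future__ import annotations

import os


def matched_prefix_len(prompt: str, cached: list[str]) -> int:
    """Length of the longest prefix `prompt` shares with any cached prompt."""
    return max((len(os.path.commonprefix([prompt, c])) for c in cached), default=0)


def cache_aware_order(waiting: list[str], cached: list[str]) -> list[str]:
    """Sort waiting prompts so those with longer cached prefixes come first.

    sorted() is stable (also with reverse=True), so prompts with equal
    matches keep their arrival order.
    """
    return sorted(waiting, key=lambda p: matched_prefix_len(p, cached), reverse=True)


cached = ["The capital of France is Paris."]
waiting = ["Hello, my name is", "The future of AI is", "The capital of France is"]
print(cache_aware_order(waiting, cached))
# ['The capital of France is', 'The future of AI is', 'Hello, my name is']
```

Here `"The capital of France is"` matches 24 cached characters, `"The future of AI is"` matches only `"The "`, and `"Hello, my name is"` matches nothing, which gives the order shown.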

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts runs them as one batch; outputs follow prompt order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # stream=True yields chunks as they are generated; print each one immediately.
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```

```text
Prompt: Hello, my name is
Generated text: Tim Brown, and I am a candidate for the U.S. Senate in 2024. I am running on a platform of progressive values and a commitment to representing the interests of the working class.

As a third-generation Arizonan, I understand the unique challenges and opportunities facing our state. I believe that every Arizonan deserves access to quality education, affordable healthcare, and a living wage. I will fight tirelessly to ensure that our state's resources are used to benefit all Arizonans, not just the wealthy and well-connected.

I am proud to have been endorsed by a wide range of organizations, including the Arizona AFL-CIO

Prompt: The capital of France is
Generated text: home to many wonderful things, including, of course, the famous Eiffel Tower. But there's more to Paris than just this iconic landmark. Here are some more must-visit spots in the City of Light:

The Louvre Museum: This world-famous museum is a must-visit for art lovers, with a collection that includes the Mona Lisa. Be sure to explore the beautiful glass pyramid entrance and the stunning Tuileries Garden.

Notre-Dame Cathedral: This beautiful Gothic cathedral is one of the most famous landmarks in Paris. Although it suffered a devastating fire in 2019, it is still worth visiting to see its

Prompt: The future of AI is
Generated text: all about human-centric design

Artificial Intelligence (AI) is changing the world, but its impact will be greatest when it is designed to serve humanity. As AI becomes increasingly ubiquitous, its design and development must be grounded in a deep understanding of human needs, values, and behaviors. In this article, we explore the importance of human-centric design in AI and its potential to drive positive change.

What is human-centric design?

Human-centric design is a design philosophy that prioritizes the needs, goals, and well-being of humans. It involves understanding the complexities of human behavior, emotions, and cognition, and using that knowledge to create products,
```
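
The streaming loop only prints each chunk. To keep the whole completion as a string, the chunks can be accumulated. The helper below (`collect_stream`, a hypothetical name) assumes, as the printed output suggests, that each chunk's `"text"` holds only the newly generated piece. The `cumulative` flag covers the case where a given SGLang version sends the full text so far in every chunk instead, which we have not verified.

```python
from typing import Iterable, Mapping


def collect_stream(chunks: Iterable[Mapping[str, str]], cumulative: bool = False) -> str:
    """Return the full completion from a stream of {"text": ...} chunks.

    cumulative=False: each chunk holds only new text, so concatenate them.
    cumulative=True: each chunk holds everything so far, so keep the last one.
    """
    if cumulative:
        text = ""
        for chunk in chunks:
            text = chunk["text"]
        return text
    return "".join(chunk["text"] for chunk in chunks)


# Simulated incremental chunks, shaped like the streamed output above.
chunks = [{"text": " Tim"}, {"text": " Brown"}, {"text": ","}]
print(repr(collect_stream(chunks)))  # ' Tim Brown,'
```

With the engine, the same helper would wrap the generator directly, e.g. `collect_stream(llm.generate(prompt, sampling_params, stream=True))`.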




### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate is a coroutine, so it must be awaited inside an async function.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
```
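
The cell above submits the whole batch in a single `async_generate` call. When prompts arrive as separate requests (for example, one per client of a custom server), each request can be awaited concurrently with `asyncio.gather`, which returns results in submission order. The sketch uses a hypothetical `fake_async_generate` coroutine in place of `llm.async_generate`; passing the real method as `generate_fn` assumes it accepts a single prompt string, as it does in the streaming cell.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params=None):
    """Hypothetical stand-in for llm.async_generate on one prompt."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(prompts, sampling_params, generate_fn=fake_async_generate):
    # Start one coroutine per prompt; gather waits for all of them and
    # returns their results in the same order as `prompts`.
    return await asyncio.gather(*(generate_fn(p, sampling_params) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(generate_concurrently(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(f"{prompt} ->{output['text']}")
```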

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, the awaited call returns an async generator of chunks.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```

```text
Prompt: Hello, my name is
Generated text: Anthony, I’m a freelance writer, and I’m here to help you with your writing needs. I have extensive experience writing articles, blog posts, social media content, and more.

I specialize in writing engaging and informative content that meets the needs of your audience. I take the time to understand your goals and target audience to ensure that my writing accurately reflects your brand and resonates with your readers.

I’m a quick learner, and I have a proven track record of producing high-quality content on a tight deadline. Whether you need help with a one-time project or ongoing content creation, I’m here to help.

Some of the services I

Prompt: The capital of France is
Generated text: a city of unparalleled beauty, rich history, and vibrant culture. Paris is the perfect destination for anyone looking to experience the essence of France. From the iconic Eiffel Tower to the artistic treasures of the Louvre, Paris has something for every taste and interest.

Must-see attractions in Paris include:

The Eiffel Tower: The most iconic landmark in Paris, the Eiffel Tower is a must-visit attraction. You can take a lift to the top for breathtaking views of the city.

The Louvre Museum: One of the world's largest and most famous museums, the Louvre is home to an incredible collection of art

Prompt: The future of AI is
Generated text: here, and it's being shaped by the energy sector

Artificial intelligence (AI) is transforming industries, and the energy sector is no exception. From optimizing energy consumption to predicting equipment failures, AI is being used to make energy production, transmission, and consumption more efficient, reliable, and sustainable. In this article, we'll explore the ways in which AI is shaping the future of the energy sector.

Predictive maintenance: AI-powered sensors and algorithms are being used to predict when equipment will fail, allowing for proactive maintenance and reducing downtime.

Energy forecasting: AI can analyze historical data and real-time weather forecasts to predict energy demand and supply,
```
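
The streaming cell handles the prompts one after another. If each completion is only needed once it finishes, the streams can instead run concurrently: wrap the same `await` plus `async for` pattern in one coroutine per prompt and gather them. The sketch uses a hypothetical `fake_async_generate` in place of `llm.async_generate(..., stream=True)` and assumes incremental chunks, as in the output above.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params=None, stream=True):
    """Hypothetical stand-in for llm.async_generate(..., stream=True)."""

    async def chunk_stream():
        for word in ("lorem", "ipsum", "dolor"):
            await asyncio.sleep(0)  # yield to the event loop between chunks
            yield {"text": " " + word}

    return chunk_stream()


async def stream_one(prompt, sampling_params, generate_fn):
    # Same pattern as the streaming cell: await the call, then iterate the chunks.
    generator = await generate_fn(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])  # assumes each chunk holds only new text
    return "".join(pieces)


async def stream_all(prompts, sampling_params, generate_fn=fake_async_generate):
    # Run one streaming consumer per prompt concurrently; results keep prompt order.
    return await asyncio.gather(
        *(stream_one(p, sampling_params, generate_fn) for p in prompts)
    )


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
texts = asyncio.run(stream_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, text in zip(prompts, texts):
    print(f"Prompt: {prompt}\nGenerated text:{text}")
```

Printing chunks from several concurrent streams to one console would interleave them, which is why this version collects each completion before printing it.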




When you are done, shut down the engine to release its resources:

```python
llm.shutdown()
```