# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling that helps prevent out-of-memory (OOM) errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-04 09:43:10 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.20it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
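
For very long prompt lists, it can be convenient to submit the work in fixed-size chunks while keeping each prompt paired with its result. The sketch below is illustrative only: `generate_in_chunks` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that mimics the list-of-dicts shape returned by `llm.generate` in the cell above, so the example runs without a GPU or a model.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def generate_in_chunks(
    generate_fn: Callable[[List[str], dict], List[Dict[str, str]]],
    prompts: List[str],
    sampling_params: dict,
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, calling generate_fn on one chunk at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate_fn(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


def fake_generate(prompts: List[str], sampling_params: dict) -> List[Dict[str, str]]:
    """Hypothetical stand-in for llm.generate: one dict with a 'text' key per prompt."""
    return [{"text": f" <completion for: {p}>"} for p in prompts]


pairs = list(
    generate_in_chunks(fake_generate, ["a", "b", "c"], {"temperature": 0.8}, chunk_size=2)
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

With a real engine, `llm.generate` would be passed in place of `fake_generate`. Because the engine already schedules large batches internally, chunking like this is optional and mainly helps when results should be processed or saved incrementally.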

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Blaine and I am a bird enthusiast. I'm an avid birder and nature lover with a passion for learning about all things related to birds. When I'm not working as a wildlife rehabber or volunteering at a local nature center, you can find me out in the field, binoculars in hand, searching for my next birding adventure.

I am an active member of the eBird community, contributing to the global database of bird sightings. I also maintain a personal blog where I share my birding adventures, stories, and photos. I am always eager to connect with fellow birders and learn from their experiences.

What draws




Generated text: the only city in the world that does not have a municipal police department. The capital city of France is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre Dame Cathedral. Many countries have some form of municipal police force, but Paris is the only major city that does not have one. Paris is divided into twenty arrondissements (districts) and has a police department that is responsible for maintaining order and enforcing the law throughout the city. The police department is headed by the Prefect of Police, who is appointed by the central government. Despite not having a municipal police department,




Generated text: being driven by advancements in several key areas, including computer vision, natural language processing, and deep learning. These areas are experiencing rapid innovation and will have a significant impact on various industries.

Computer vision, for instance, is being used to develop more sophisticated image and video recognition systems. These systems have a wide range of applications, including surveillance, healthcare, and autonomous vehicles.

Natural language processing (NLP) is being used to develop more sophisticated language understanding systems. These systems have a wide range of applications, including chatbots, virtual assistants, and language translation.

Deep learning, which is a type of machine learning that uses artificial neural networks,
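
Printing each chunk as it arrives suits interactive use, but a script often needs the complete text afterwards. The sketch below accumulates chunks into one string. It assumes each chunk's `"text"` field carries only the newly generated piece, as the output above suggests; if a given version returns the cumulative text instead, the `incremental=False` branch keeps only the last chunk. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely imitates the iterator returned by `llm.generate(prompt, sampling_params, stream=True)`.

```python
from typing import Dict, Iterable, Iterator


def collect_stream(chunks: Iterable[Dict[str, str]], incremental: bool = True) -> str:
    """Assemble the full generated text from a stream of chunk dicts.

    incremental=True  -> each chunk["text"] holds only the new piece (assumed here).
    incremental=False -> each chunk["text"] holds the whole text generated so far.
    """
    pieces = []
    last = ""
    for chunk in chunks:
        pieces.append(chunk["text"])
        last = chunk["text"]
    return "".join(pieces) if incremental else last


def fake_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a synchronous streaming generate call."""
    for piece in [" Paris", ",", " the", " City", " of", " Light", "."]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(repr(full_text))
```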




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
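
When requests are issued one prompt at a time from application code, it can help to cap how many are in flight at once. The following sketch uses `asyncio.Semaphore` and `asyncio.gather` for that. It is not part of SGLang: `generate_with_limit` and `fake_async_generate` are hypothetical names, and the stand-in assumes that an async generate call with a single prompt resolves to one dict with a `"text"` key.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_with_limit(
    async_generate_fn: Callable[[str, dict], Awaitable[Dict[str, str]]],
    prompts: List[str],
    sampling_params: dict,
    max_concurrency: int = 2,
) -> List[Dict[str, str]]:
    """Run one request per prompt with at most max_concurrency in flight.

    asyncio.gather returns results in the order the awaitables were passed,
    so the outputs line up with the prompts.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> Dict[str, str]:
        async with semaphore:
            return await async_generate_fn(prompt, sampling_params)

    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_async_generate(prompt: str, sampling_params: dict) -> Dict[str, str]:
    """Hypothetical stand-in for an async generate call with a single prompt."""
    await asyncio.sleep(0.01)  # simulate generation latency
    return {"text": f" <completion for: {prompt}>"}


results = asyncio.run(
    generate_with_limit(fake_async_generate, ["a", "b", "c"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```

Passing the whole prompt list to a single `llm.async_generate` call, as in the cell above, is usually simpler; a limiter like this is mainly useful when prompts arrive from independent callers.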

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Janet and I'm a first-year student at Columbia University. I'm excited to be interning with the New York City Department of Parks and Recreation this summer.

In my free time, I love hiking, reading, and trying out new restaurants. I'm also an avid soccer player and enjoy playing on the Columbia University women's soccer team.

Growing up, I was always passionate about environmental issues and wanted to pursue a career in sustainability. My experiences volunteering at a local community garden and interning with a non-profit organization focused on environmental conservation have deepened my understanding of the importance of sustainable practices.

During my time at Columbia, I've been




Generated text: a city that is steeped in history, art, fashion, and culture. From the iconic Eiffel Tower to the stunning Notre-Dame Cathedral, Paris is a city that is full of iconic landmarks and breathtaking beauty. Here are some of the top things to see and do in Paris:

1. The Eiffel Tower: The Eiffel Tower is a must-see attraction in Paris, and one of the most iconic landmarks in the world. Built for the 1889 World’s Fair, the tower stands 324 meters tall and offers stunning views of the city from its observation decks.

2. The Louvre Museum:




Generated text: here – and it’s all about user experience!

AI is now a part of our lives, and its impact is being felt across industries and sectors. But what does the future of AI hold for us, and how will it shape our user experience?

As AI becomes more integrated into our daily lives, we can expect to see significant improvements in the way we interact with technology. From personalized recommendations and intuitive interfaces to predictive maintenance and automated decision-making, AI is poised to revolutionize the way we experience the world around us.

Here are some exciting trends and developments that will shape the future of AI and user experience:

1. **Personalization**:
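
The loop above streams one prompt at a time. If several streams should make progress concurrently, one option is to give each prompt its own coroutine that collects its chunks, then run them together with `asyncio.gather`. The sketch below is a hedged illustration: `fake_async_stream`, `collect_async_stream`, and `stream_all` are hypothetical names, the stand-in imitates the async generator obtained from `llm.async_generate(prompt, sampling_params, stream=True)`, and chunks are assumed to carry only newly generated text.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_async_stream(prompt: str) -> AsyncIterator[Dict[str, str]]:
    """Hypothetical stand-in for an async streaming generate call."""
    for piece in [" one", " two", " three"]:
        await asyncio.sleep(0.01)  # simulate per-token latency
        yield {"text": piece}


async def collect_async_stream(stream: AsyncIterator[Dict[str, str]]) -> str:
    """Consume an async stream of chunk dicts and return the joined text."""
    pieces = []
    async for chunk in stream:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def stream_all(prompts: List[str]) -> Dict[str, str]:
    """Consume one stream per prompt concurrently and map each prompt to its text."""
    texts = await asyncio.gather(
        *(collect_async_stream(fake_async_stream(p)) for p in prompts)
    )
    return dict(zip(prompts, texts))


streamed = asyncio.run(stream_all(["Hello, my name is", "The capital of France is"]))
for prompt, text in streamed.items():
    print(f"Prompt: {prompt}\nGenerated text:{text}")
```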




In [6]:
llm.shutdown()
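
In a script, an exception during generation would skip the final `llm.shutdown()` call. A small context manager can guarantee the cleanup. This is a minimal sketch under stated assumptions: `managed_engine` and `FakeEngine` are hypothetical names, and `FakeEngine` only mirrors the `generate` and `shutdown` methods used in this document so the example runs without loading a model.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str], sampling_params: dict) -> List[dict]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(engine) -> Iterator[object]:
    """Yield the engine and always call shutdown(), even if generation raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
with managed_engine(engine) as llm_like:
    outputs = llm_like.generate(["Hello"], {"temperature": 0.8})

print(outputs[0]["text"], engine.is_shut_down)
```

With a real engine, the object created by `sgl.Engine(...)` would be passed to `managed_engine` in place of `FakeEngine()`.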