# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling to prevent out-of-memory (OOM) errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-04 06:37:12 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.38it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
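
For offline batch jobs it is often convenient to persist the results rather than print them. The following is a minimal sketch, not part of SGLang: `write_results_jsonl` is a hypothetical helper that assumes only what the cell above shows, namely that each output is a dict with a `"text"` key. The demo data stands in for a real `llm.generate` call so the snippet runs without a model.

```python
import io
import json


def write_results_jsonl(prompts, outputs, fp):
    """Write one JSON object per prompt/output pair to the open text file `fp`.

    Hypothetical helper: assumes each output is a dict with a "text" key,
    as in the non-streaming example above.
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()
count = write_results_jsonl(demo_prompts, demo_outputs, buffer)
print(count, buffer.getvalue().strip())
```

With a real engine, you would pass `prompts` and `outputs` from the cell above and an open file handle instead of the in-memory buffer.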

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Gareth and I am a web developer.

I'm a self-taught web developer with a passion for building modern, fast and responsive web applications. I have a strong background in front-end development, but also have experience with back-end development and database management.

I'm proficient in a range of web development technologies, including HTML5, CSS3, JavaScript, Node.js, and MongoDB. I have also worked with various frameworks such as React, Angular and Vue.js.

Some of my key skills include:

* Front-end development with HTML5, CSS3 and JavaScript

* Back-end development with Node.js and Express

* Database management with

Generated text: known for its famous landmarks, museums, and romantic atmosphere, which is why it's a popular destination for tourists. However, it's also a city with a rich history and culture, which is reflected in its cuisine. French cuisine is known for its sophistication and elegance, but it's also hearty and comforting, often featuring rich flavors and sauces.

One of the most famous French dishes is the croque-monsieur, a grilled ham and cheese sandwich that's often served with béchamel sauce. Another classic dish is the bouillabaisse, a fish soup originating from the port city of Marseille. And of course, no trip

Generated text: shaped by the present. How are we harnessing the power of AI to solve the world’s most pressing challenges today? We’re speaking with the innovators, researchers, and leaders pushing the boundaries of AI to drive meaningful impact.

AI for Social Good: How AI Can Be a Force for Change

In this episode, we explore the intersection of AI and social good, highlighting initiatives and projects that are harnessing the power of AI to drive positive change in the world.

AI for Social Good: How AI Can Be a Force for Change

AI has the potential to address some of the world's most pressing challenges, from climate change to healthcare
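
The streaming loop above prints each chunk as it arrives. If you also need the complete text afterwards, you can accumulate the chunks while iterating. This is a minimal sketch under one assumption: each chunk's `"text"` field holds only the newly generated piece, as the output above suggests. If your SGLang version returns cumulative text in each chunk, keep the last chunk instead of joining. `fake_stream` is a hypothetical stand-in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
def collect_stream(chunks):
    """Join the "text" field of every streamed chunk into one string.

    Assumes incremental chunks (each carries only new text).
    """
    return "".join(chunk["text"] for chunk in chunks)


def fake_stream():
    # Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True).
    for piece in [" Paris", ",", " a", " city"]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(repr(full_text))
```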




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
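
The cell above sends the whole prompt list in one call. When requests arrive separately, for example inside a custom server, you may instead want to issue one call per prompt and await them together. The sketch below illustrates that pattern with plain `asyncio`. `generate_many` and `fake_async_generate` are hypothetical names, and the stub replaces `llm.async_generate` so the snippet runs without a model. The concurrency limit is an illustrative choice, not something the engine requires.

```python
import asyncio


async def generate_many(generate_one, prompts, sampling_params, max_concurrency=2):
    """Run one async generation call per prompt, at most `max_concurrency` at a time.

    `generate_one` is any coroutine function taking (prompt, sampling_params).
    Results are returned in the same order as `prompts`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(prompt):
        async with semaphore:
            return await generate_one(prompt, sampling_params)

    return await asyncio.gather(*(run(p) for p in prompts))


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for `await llm.async_generate(prompt, sampling_params)`.
    await asyncio.sleep(0)
    return {"text": f" <completion for {prompt!r}>"}


results = asyncio.run(
    generate_many(fake_async_generate, ["a", "b", "c"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```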

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Suresh and I am a senior systems engineer with over 15 years of experience in IT industry. I have worked in various domains such as Finance, HealthCare and Manufacturing. My expertise includes architecture design, implementation, and support of complex IT systems. I am skilled in a variety of technologies including Unix, Windows, Oracle, Java, and Python.

In my free time, I enjoy reading books, learning new programming languages, and experimenting with new technologies. I am also an avid photographer and love capturing the beauty of nature and the world around me.

This is my personal blog where I share my thoughts, ideas, and experiences related to

Generated text: famous for its beautiful parks and gardens. Among the many beautiful parks and gardens in Paris, one of the most famous is the Luxembourg Gardens. The Luxembourg Gardens are one of the most beautiful parks in Paris, and they offer a variety of activities and attractions for visitors.

The Luxembourg Gardens were created in the 17th century by Queen Marie de Medici, who built the Palais du Luxembourg on the site. The gardens were designed to be a peaceful retreat from the hustle and bustle of city life, and they have been a popular destination for Parisians and visitors ever since.

The Luxembourg Gardens are a beautiful example of French garden design.

Generated text: in the hands of the people

In the latest episode of the AI Alignment Podcast, Stuart Russell and guests discuss the challenges and opportunities of aligning AI with human values.

What is the future of AI, and how can we ensure that it aligns with human values?

Stuart Russell, a renowned AI researcher and professor at the University of California, Berkeley, has been at the forefront of the discussion on AI alignment. He has written extensively on the subject and has even argued that AI could become a existential threat to humanity if not designed with human values in mind.

In the latest episode of the AI Alignment Podcast, Russell is joined by guests
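
The async streaming cell above handles prompts one after another. As a rough sketch of how several streams could be consumed concurrently and collected per prompt, the example below uses `async for` together with `asyncio.gather`. It assumes each chunk's `"text"` field holds only newly generated text. `fake_async_stream` is a hypothetical stand-in for the generator returned by `await llm.async_generate(prompt, sampling_params, stream=True)`.

```python
import asyncio


async def collect_async_stream(chunks):
    """Accumulate the "text" of every chunk from an async iterator into one string."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk["text"])
    return "".join(parts)


async def fake_async_stream(pieces):
    # Hypothetical stand-in for an SGLang async streaming generator.
    for piece in pieces:
        await asyncio.sleep(0)
        yield {"text": piece}


async def main():
    streams = [
        fake_async_stream([" in", " the", " hands"]),
        fake_async_stream([" famous", " for"]),
    ]
    # gather preserves the order of the input streams.
    return await asyncio.gather(*(collect_async_stream(s) for s in streams))


texts = asyncio.run(main())
print(texts)
```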




In [6]:
llm.shutdown()