# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling that prevents out-of-memory (OOM) errors on large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104).

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-03 22:22:04 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.39it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

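For offline batch jobs, the generated text usually needs to be stored next to its prompt. The following is a minimal sketch and not part of the SGLang API: `save_generations` is a hypothetical helper that assumes `outputs` is a list of dicts with a `"text"` key, as in the cell above, and writes one JSON record per line. The demo data and the output file name are placeholders.

```python
import json
import tempfile
from pathlib import Path


def save_generations(prompts, outputs, path):
    """Write one JSON object per line pairing each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Placeholder data with the same shape as `outputs` above; with a live engine
# you would pass the result of llm.generate(prompts, sampling_params) instead.
demo_prompts = ["The capital of France is", "Hello, my name is"]
demo_outputs = [{"text": " Paris."}, {"text": " Ada."}]
demo_path = Path(tempfile.gettempdir()) / "generations.jsonl"
n_written = save_generations(demo_prompts, demo_outputs, demo_path)
```

JSON Lines keeps each record independent, so a partially written file from an interrupted run can still be read line by line.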
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Sabine Stachel, and I am a young and ambitious professional in the field of financial services. I am proud to have been chosen to be a Global Shaper at the World Economic Forum in Davos, Switzerland.

As a Global Shaper, I am excited to be part of a community that is dedicated to making a positive impact on the world. The Global Shapers Community is a network of young leaders who are committed to driving change and creating a better future for all.

My passion for financial services and my desire to make a positive impact on the world led me to join the Global Shapers Community. I am excited to be part

Generated text: Paris. It is a city of great beauty, architecture, and history. Paris is known for its famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also famous for its fashion, cuisine, and romantic atmosphere.

Paris is the capital of France, located in the northern part of the country. It is situated along the Seine River and has a population of over 2.1 million people. The city is known for its rich history, cultural attractions, and iconic landmarks. Paris is a popular tourist destination and a hub for fashion, art, and cuisine.

Paris has a

Generated text: filled with promise and potential, but it also raises a plethora of ethical concerns. In this article, we'll explore some of the key challenges and considerations that need to be addressed in the development and deployment of AI systems.

The rise of AI has led to significant advances in various industries, including healthcare, finance, and transportation. However, the increasing use of AI also raises concerns about accountability, bias, and job displacement. As AI systems become more autonomous and complex, it's essential to consider the ethical implications of their development and deployment.

Some of the key ethical challenges associated with AI include:

1. Accountability and Responsibility: As AI systems become

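When the full text is needed after streaming finishes, the chunks can be accumulated while they are displayed. This is a minimal sketch under one assumption: each chunk's `"text"` field carries only the newly generated piece, as the output above suggests. If your engine version returns cumulative text instead, keep only the last chunk. `collect_stream` and the fake chunk list are hypothetical names for illustration, not SGLang functions.

```python
def collect_stream(chunks, on_text=None):
    """Concatenate the "text" field of every streamed chunk into one string.

    `on_text`, if given, is called with each piece as it arrives
    (for example, to print it immediately).
    """
    parts = []
    for chunk in chunks:
        piece = chunk["text"]
        if on_text is not None:
            on_text(piece)
        parts.append(piece)
    return "".join(parts)


# Fake chunks standing in for llm.generate(prompt, sampling_params, stream=True).
fake_stream = iter([{"text": " Paris"}, {"text": "."}, {"text": " It"}])
seen = []
full_text = collect_stream(fake_stream, on_text=seen.append)
```

With a live engine, passing `on_text=lambda s: print(s, end="", flush=True)` would reproduce the printing behavior of the loop above while also returning the complete string.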

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())

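Asynchronous generation is also useful when prompts arrive independently and should be awaited concurrently rather than as one list. The sketch below shows that pattern with `asyncio.gather` and a semaphore. It rests on assumptions: `StubEngine` is a hypothetical stand-in that mimics only the call shape of `async_generate`, and it assumes a single-prompt call returns a single dict with a `"text"` key. `generate_concurrently` and `max_in_flight` are illustrative names.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in; mimics only the call shape of async_generate."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control to the event loop once
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Send one request per prompt, with at most `max_in_flight` pending at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_concurrently(StubEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
```

Because the results keep the input order, they can be zipped with the prompts exactly as in the batch example above.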
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Kristen and I am so glad you found me! I'm a photographer based in Denver, Colorado and I specialize in capturing life's precious moments through my lens. I am honored to be a part of your special day and work with you to create memories that will last a lifetime.

My style is natural, timeless, and authentic. I believe that your wedding day is a once-in-a-lifetime celebration of love and commitment, and I am here to help you capture its beauty. I strive to make you feel comfortable and relaxed in front of my camera, allowing you to be your true selves and shine.

From the first glance, to the

Generated text: a city like no other. Steeped in history, culture and romance, Paris is the epitome of elegance and sophistication. From the iconic Eiffel Tower to the world-renowned Louvre Museum, there's no shortage of iconic landmarks to explore. But beyond the tourist traps, Paris has a whole lot more to offer.

Start your day with a leisurely stroll along the Seine River, taking in the city's picturesque bridges and picturesque streets. Visit the beautiful Notre Dame Cathedral, a stunning example of Gothic architecture that's over 850 years old. Or head to the Musée d'Orsay, home to an impressive

Generated text: shaped by innovation, collaboration, and societal need

Advances in Artificial Intelligence (AI) are rapidly transforming industries and reshaping society. From automated healthcare and personalized education to intelligent transportation and secure data management, AI's potential impact is vast and multifaceted. However, its development is not solely the domain of tech giants; instead, it involves a global, multi-disciplinary community of researchers, practitioners, policymakers, and stakeholders.

While there are many areas where AI has improved significantly, there are also challenges to be addressed. Here are some of the key themes that will shape the future of AI:

Collaboration and Knowledge Sharing

The

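The asynchronous stream can be accumulated in the same way with `async for`. This is a minimal sketch, assuming each chunk's `"text"` holds only the new piece, as the output above suggests. `collect_async_stream` and `fake_chunks` are hypothetical names, and the fake generator stands in for the object returned by `llm.async_generate(..., stream=True)`.

```python
import asyncio


async def collect_async_stream(chunks):
    """Consume an async iterator of chunks; return (full_text, chunk_count)."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk["text"])
    return "".join(parts), len(parts)


async def fake_chunks():
    """Fake async generator yielding chunks shaped like the ones printed above."""
    for piece in [" Paris", " is", " a", " city"]:
        yield {"text": piece}


full_text, n_chunks = asyncio.run(collect_async_stream(fake_chunks()))
```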

In [6]:
llm.shutdown()
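
In a script, an exception raised during generation would skip the final `llm.shutdown()` call. One way to guard against that is a small context manager. This is a sketch under stated assumptions: `engine_session` is a hypothetical helper, not an SGLang feature, and `StubEngine` only records whether `shutdown()` was called so the pattern can be shown without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() on exit, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


stub = StubEngine()
try:
    with engine_session(stub):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass  # the error still propagates; shutdown() has already run
```

With a real engine, the same helper would wrap the object created by `sgl.Engine(...)` at the top of this document.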