# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
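
To make the custom-server idea concrete, here is a minimal sketch of the request-handling layer such a server might put in front of the engine. It is not the linked `custom_server.py`. `StubEngine` and `handle_generate` are hypothetical names introduced only for illustration, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU. It assumes that calling `generate` with a single prompt returns a dict with a `"text"` key, which matches how the cells in this document index the results.

```python
import json
from typing import Tuple


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed return shape for a single prompt: a dict with a "text" key.
        return {"text": f" (completion for: {prompt})"}


def handle_generate(engine, raw_body: str) -> Tuple[int, str]:
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body must be valid JSON"})

    # Reject bodies that are not objects or that lack a usable prompt.
    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    # Fall back to the sampling parameters used throughout this document.
    sampling_params = payload.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    result = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"generated_text": result["text"]})


status, body = handle_generate(StubEngine(), '{"prompt": "Hello, my name is"}')
print(status, body)
```

Keeping the handler independent of any web framework means the same function can be wired into whichever HTTP layer you prefer, with the stub replaced by a real engine instance.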

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-08 02:25:09 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.26s/it]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
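
For offline batch jobs it is often useful to keep each prompt together with its completion rather than only printing them. The sketch below pairs them up and serializes the result as JSON Lines. `to_records` and `to_jsonl` are hypothetical helpers, not part of SGLang, and `fake_outputs` stands in for the list returned by `llm.generate(prompts, sampling_params)`. It assumes the outputs come back in the same order as the prompts, as the `zip` in the loop above implies.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, preserving input order."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Stand-in for the engine's batch result; real outputs may carry more keys.
fake_outputs = [{"text": " Alice."}, {"text": " Paris."}]
records = to_records(["Hello, my name is", "The capital of France is"], fake_outputs)
print(to_jsonl(records))
```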

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Tanya and I am a friend of Jane’s and I have been asked to help her get this website up and running. I am not a very technical person, so please bear with me as I learn and figure things out.

Jane has asked me to start by adding some new content to the website, including a photo gallery and a list of services she offers. I will try to get this done as soon as possible.

I would also like to add a contact form to the website, so that people can get in touch with Jane easily. I have heard that this can be a bit tricky to set up, but I am willing to

Generated text: in a state of disarray, and no one knows how to fix it. In the midst of the crisis, the city’s residents are beginning to band together to try and find a solution.

After the French government officially announced the city would be un-governing itself, residents of Paris took it upon themselves to establish a new way of managing the city.

The residents are putting their differences aside to find common ground and come up with a plan to keep the city running smoothly.

With the government absent, residents are stepping up to take on roles like trash collection, public safety and transportation.

A group of residents, calling themselves the "Paris

Generated text: in the cloud

The future of AI is in the cloud

We are living in an era of unprecedented technological advancements, with AI being at the forefront of innovation. The development and deployment of AI are rapidly evolving, and one of the most significant trends shaping the industry is the shift to cloud-based AI.

Cloud-based AI offers numerous benefits over traditional on-premises AI solutions. These include scalability, flexibility, cost-effectiveness, and the ability to access advanced AI capabilities on-demand.

Here are some key factors driving the adoption of cloud-based AI:

1. Scalability: Cloud-based AI allows organizations to scale their AI capabilities up or down
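
If you need the full completion as a single string in addition to printing it live, the streamed chunks can be accumulated as they arrive. The sketch below assumes each chunk carries only newly generated text, which is what the output above suggests. If your SGLang version yields cumulative text instead, keep only the last chunk. `fake_stream` is a hypothetical stand-in for `llm.generate(prompt, sampling_params, stream=True)` and `collect_stream` is an illustrative helper, not an SGLang API.

```python
from typing import Dict, Iterable, Iterator


def fake_stream(text: str) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for llm.generate(..., stream=True)."""
    for i, word in enumerate(text.split(" ")):
        # Every chunk after the first carries its leading space.
        yield {"text": word if i == 0 else " " + word}


def collect_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Concatenate incremental chunk texts into the full completion."""
    return "".join(chunk["text"] for chunk in chunks)


full_text = collect_stream(fake_stream("Paris is the capital"))
print(full_text)
```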




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
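
One reason to prefer the asynchronous API is that several requests can be in flight at once. The sketch below submits multiple prompt batches concurrently with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only mimics the call shape of `async_generate` used above, so the snippet runs without a model. Whether concurrent calls improve throughput on a real engine depends on its scheduler and is not verified here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit several prompt batches concurrently; results keep batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each entry of `results` lines up with the batch at the same index.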

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Suzanne, and I'm a bodywork therapist. I help people relax, feel better, and get back on track with their lives.

I've been a licensed massage therapist since 2002, and I've worked with all sorts of clients, from athletes to seniors, from expectant mothers to new parents. I've also worked in hospitals, clinics, and private practice settings.

I believe that the body has a natural ability to heal itself, and that sometimes all it needs is a little help getting back on track. That's where I come in.

I offer a range of bodywork modalities, including Swedish massage, deep tissue massage

Generated text: the City of Light, and a popular destination for both locals and international tourists. The city is famous for its art, history, fashion, and architecture. Let’s explore some of the top attractions in Paris!

The Eiffel Tower is a must-visit in Paris. The iconic iron tower is a symbol of the city and offers stunning views of the city from its top level. You can take the elevator to the top or climb the stairs for a more adventurous experience.

The Louvre Museum is one of the world’s largest and most famous museums. It is home to an impressive collection of art and artifacts from around the world, including

Generated text: here today. As AI continues to evolve and improve, we can expect to see significant advancements in its capabilities and impact on various industries. Here are some of the key trends and predictions for the future of AI:

1. Increased Adoption in Healthcare: AI will play a more significant role in healthcare, improving diagnosis accuracy, personalized medicine, and patient outcomes.

2. Rise of Edge AI: As the Internet of Things (IoT) expands, AI will be deployed at the edge of the network, enabling real-time processing and analysis of data from various devices and sensors.

3. Growing Importance of Explainability: As AI becomes more pervasive, there
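
The loop above streams one prompt at a time, so the second prompt only starts after the first has finished. As a sketch of an alternative, the snippet below consumes several streams concurrently and collects the text per prompt. `StubStreamingEngine` is a hypothetical stand-in that mirrors the call shape used above (awaiting `async_generate(..., stream=True)` yields an async generator of chunks with a `"text"` key). It also assumes chunks are incremental, as the output above suggests. Printing from concurrent streams would interleave tokens, so the sketch collects text instead of printing it live.

```python
import asyncio


class StubStreamingEngine:
    """Hypothetical stand-in mimicking async_generate(..., stream=True)."""

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for piece in (" echo:", f" {prompt}"):
                await asyncio.sleep(0)  # let other streams make progress
                yield {"text": piece}

        return chunks()


async def stream_one(engine, prompt, sampling_params):
    """Consume one stream and return the concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    parts = []
    async for chunk in generator:
        parts.append(chunk["text"])
    return "".join(parts)


async def stream_all(engine, prompts, sampling_params):
    """Run one stream per prompt concurrently and map prompt -> full text."""
    texts = await asyncio.gather(
        *(stream_one(engine, p, sampling_params) for p in prompts)
    )
    return dict(zip(prompts, texts))


completions = asyncio.run(
    stream_all(StubStreamingEngine(), ["alpha", "beta"], {"temperature": 0.8})
)
print(completions)
```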




In [6]:
llm.shutdown()