# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Generate completions for the whole batch in a single call
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
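For offline batch jobs it is common to persist results rather than print them. The sketch below pairs each prompt with its output and writes one JSON object per line (JSON Lines). It assumes only what the example above shows: `llm.generate` returns a list of dicts aligned with the prompts, each with a `"text"` key. The helper `to_jsonl` and the stand-in outputs are hypothetical, used so the snippet runs without a GPU.

```python
import io
import json


def to_jsonl(prompts, outputs, fp):
    """Write one {"prompt", "text"} record per line (hypothetical helper).

    Assumes `outputs` is aligned with `prompts` and each output is a dict
    with a "text" key, as returned by llm.generate in the example above.
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record) + "\n")


# Stand-in outputs so this runs without a model; in practice use
# outputs = llm.generate(prompts, sampling_params)
demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = [{"text": " Paris."}, {"text": " bright."}]

buf = io.StringIO()
to_jsonl(demo_prompts, demo_outputs, buf)
print(buf.getvalue(), end="")
```

To write to disk, pass a file opened with `open("results.jsonl", "w")` instead of the in-memory buffer.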

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # With stream=True, generate yields chunks as tokens are produced
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```

```text
Generated text: Derek and I'm a 35-year-old accountant from New York. I've been married to my lovely wife, Rachel, for 10 years and we have two adorable kids, Emma and Max. In my free time, I enjoy playing golf, watching sports, and trying out new restaurants in the city.
I'm a bit of a numbers guy and love solving puzzles and brain teasers. I'm also a big fan of sci-fi and fantasy novels, and have a huge collection of books by my favorite authors, including George R.R. Martin and J.R.R. Tolkien.

When I'm not working or spending time with my family

Generated text: a city like no other, full of history, art, fashion, and romance. Paris, or the City of Light, is a destination that has captivated travelers for centuries. From the iconic Eiffel Tower to the world-class museums, including the Louvre and Orsay, Paris has endless attractions that cater to all interests.
In this article, we'll explore the top things to do in Paris, including its must-visit landmarks, cultural experiences, and hidden gems.
Must-visit landmarks:
The Eiffel Tower (La Tour Eiffel): The most iconic symbol of Paris, the Eiffel Tower is a must

Generated text: intertwined with the future of work. As automation and machine learning advance, jobs will be displaced, and new opportunities will emerge. As we navigate this transformation, it's essential to prioritize human-centered values and principles in the design of AI systems. Here are some key principles to consider:
1. **Transparency and Explainability**: AI systems should be transparent in their decision-making processes, and their outcomes should be explainable to users. This is crucial for trust-building and accountability.
2. **Fairness and Bias Mitigation**: AI systems should be designed to minimize bias and ensure fairness in their decision-making processes. This requires ongoing monitoring and evaluation
```
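If you also need the complete completion after streaming, you can accumulate the chunks as they arrive. The sketch below assumes, as the printed output above suggests, that each chunk's `"text"` holds only the newly generated piece rather than the full text so far; if your SGLang version returns cumulative text instead, keep only the last chunk. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
import sys


def collect_stream(chunks, out=sys.stdout):
    """Echo each chunk's text as it arrives and return the joined result.

    Assumes every chunk is a dict whose "text" is an incremental piece.
    """
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]
        out.write(piece)
        out.flush()
        pieces.append(piece)
    out.write("\n")
    return "".join(pieces)


def fake_stream():
    # Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True)
    for piece in [" Paris", ",", " the", " City", " of", " Light", "."]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
```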




### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate is a coroutine, so it is awaited inside an async function
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
```
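Because `async_generate` is awaitable and also accepts a single prompt (as in the streaming example), one option when prompts need different sampling parameters is to issue one call per prompt and await them together with `asyncio.gather`, which returns results in the order the awaitables were passed. The sketch below shows only this asyncio pattern; `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` so the snippet runs without a model, and how the engine schedules concurrent requests is left to its scheduler.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for llm.async_generate on a single prompt."""
    await asyncio.sleep(0.01)  # mimic generation latency
    temperature = sampling_params["temperature"]
    return {"text": f" <completion of {prompt!r} at T={temperature}>"}


async def generate_all(requests):
    """Await all (prompt, sampling_params) pairs concurrently.

    asyncio.gather returns results in the same order as the inputs.
    """
    return await asyncio.gather(
        *(fake_async_generate(prompt, params) for prompt, params in requests)
    )


demo_requests = [
    ("The capital of France is", {"temperature": 0.0}),
    ("The future of AI is", {"temperature": 0.8, "top_p": 0.95}),
]
results = asyncio.run(generate_all(demo_requests))
for (prompt, _), result in zip(demo_requests, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```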

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, awaiting async_generate returns an async generator of chunks
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```

```text
Generated text: Mark, and I am the owner of Precision Auto Detailing in North York. I specialize in providing high-quality car wash and detailing services to the residents of North York and surrounding areas. With years of experience and a passion for cars, I am confident in my ability to provide you with a level of service that is second to none.
At Precision Auto Detailing, I take pride in every vehicle that comes through my bay. From a quick exterior wash and dry, to a full interior detailing and exterior paint correction, I use only the best equipment and products to ensure that every vehicle leaves my shop looking like new.
Whether you drive a luxury

Generated text: Paris. The capital of France is the largest city in France, which is the most populous country in Europe. Paris is famous for its stunning architecture, museums, art galleries, fashion, and cuisine. The city is home to some of the world’s most famous landmarks, including the Eiffel Tower, Notre Dame Cathedral, the Louvre Museum, and the Arc de Triomphe. The city is also a hub for international business, finance, and politics.
Paris is a popular tourist destination, attracting over 23 million visitors in 2019. The city has a rich history, with evidence of human habitation dating back to the

Generated text: happening in your backyard

Innovation is all around us, and it's often hidden in plain sight. In this episode of The Future of Tech, we're exploring how advancements in artificial intelligence are transforming the way we live and work, and the surprising places where this technology is being developed and deployed.
Imagine a world where robots are helping us maintain our gardens, detect diseases in plants, and even harvest crops more efficiently. Sounds like science fiction, right? But, in the United States alone, there are over 2 million farms and ranches, and the use of AI and robotics is becoming increasingly common.
One of the pioneers in this
```
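The example above streams the prompts one after another. To stream several prompts concurrently while keeping each completion separate, one approach is to drain each stream in its own task and buffer its text, since printing directly from concurrent streams would interleave tokens from different prompts. In the sketch below, `fake_async_generate_stream` and `FAKE_PIECES` are hypothetical stand-ins for `await llm.async_generate(prompt, sampling_params, stream=True)` and the model's output; it assumes incremental chunk text, as above.

```python
import asyncio

# Hypothetical canned pieces per prompt, standing in for model output.
FAKE_PIECES = {
    "Hello, my name is": [" Mark", "."],
    "The capital of France is": [" Paris", "."],
}


async def fake_async_generate_stream(prompt):
    """Hypothetical stand-in for llm.async_generate(prompt, sampling_params, stream=True).

    Mirrors the shape used in the example above: awaiting it returns an
    async generator that yields chunk dicts with an incremental "text" field.
    """

    async def chunks():
        for piece in FAKE_PIECES[prompt]:
            await asyncio.sleep(0)  # hand control back so streams interleave
            yield {"text": piece}

    return chunks()


async def consume(prompt):
    """Drain one prompt's stream into a buffer and return the full text."""
    stream = await fake_async_generate_stream(prompt)
    pieces = []
    async for chunk in stream:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def stream_all(prompts):
    # One consumer per prompt runs concurrently; gather returns results in input order.
    return await asyncio.gather(*(consume(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is"]
completions = asyncio.run(stream_all(demo_prompts))
for prompt, text in zip(demo_prompts, completions):
    print(f"Prompt: {prompt}\nGenerated text:{text}")
```

The trade-off is that each completion is shown only once its stream finishes; to display tokens live you would need to tag each chunk with its prompt instead of buffering.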




Finally, shut down the engine to release its resources.

```python
llm.shutdown()
```