# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.

import asyncio

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-08 03:25:09 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.19it/s]


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
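
For offline batch jobs it is often more useful to persist results than to print them. The sketch below pairs each prompt with its output and writes one JSON object per line. It assumes only what the cell above shows, namely that each output is a dict with a `text` key. The helper name `write_jsonl` and the output path are hypothetical and not part of SGLang.

```python
import json
from pathlib import Path


def write_jsonl(prompts, outputs, path):
    """Write one {"prompt", "text"} record per line and return the record count.

    Assumes each item of `outputs` is a dict with a "text" key, as in the
    non-streaming example above.
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")

    records = [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII generated text readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

With the variables from the cell above, a call such as `write_jsonl(prompts, outputs, "generations.jsonl")` would store the four completions next to their prompts.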

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Lori and I'm a bit of a wine newbie. I've just started exploring the world of wine and I'm excited to learn more about it. I've been reading a lot of blogs and articles, but I was hoping to get some advice from someone who has been around the block a few times.

I'm particularly interested in learning more about wine and food pairing, as I'm a bit of a foodie at heart. I love trying new recipes and experimenting with different flavors and ingredients. I've heard that wine can be a great addition to many dishes, but I'm not sure where to start.

Can you recommend any good resources

Generated text: one of the most romantic cities in the world, known for its beautiful streets, stunning architecture, and rich history. While many people visit Paris as a couple, it can also be a great destination for solo travelers. Here are some of the top things to do in Paris for solo travelers:

1. Explore the Louvre Museum: The Louvre is one of the world's largest and most famous museums, and it's a must-visit for any art lover. With over 35,000 works of art on display, you can easily spend a whole day exploring the museum's vast collections.

2. Visit the Eiffel Tower:

Generated text: inherently linked to the concept of digital transformation. Many organizations today face the challenge of navigating digital transformation while developing and implementing AI strategies. In this context, understanding the future of AI and its role in digital transformation is crucial for businesses, governments, and individuals alike. Here are some key insights and future trends to consider:

### 1. **Increased Adoption of Explainable AI (XAI)**

Explainable AI is expected to gain more traction as organizations face pressure to understand and explain the decisions made by AI models. This trend is crucial for building trust in AI and ensuring regulatory compliance.

### 2. **Expansion of Edge Computing and



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
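
`asyncio.run` raises `RuntimeError` when it is called from a thread whose event loop is already running, which is typically the case inside a Jupyter notebook cell. The sketch below shows one way to tell the two environments apart. `has_running_loop` is a hypothetical helper name, and `main` here is a stand-in coroutine rather than the one defined above.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def main() -> str:
    # Stand-in for a coroutine that awaits llm.async_generate(...).
    await asyncio.sleep(0)
    return "done"


if not has_running_loop():
    # Plain script: no loop is running yet, so asyncio.run is safe.
    result = asyncio.run(main())
    print(result)
# Inside a notebook cell, a top-level `await main()` is the usual alternative.
```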

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Erika and I am a fourth-year student at Harvard University. I am an undergraduate student studying sociology, with a focus on family and education. I am so excited to be a part of the amazing experience of the Harvard Extension School's student ambassador program. I have always been passionate about education and making it accessible to everyone, and I believe that this program will give me the opportunity to make a positive impact on the students of the Harvard Extension School.

As a student ambassador, I hope to be able to connect with current and prospective students, provide them with information and resources about the university, and help them navigate the process of applying and en

Generated text: Paris and it is also a capital of the Île-de-France region. Paris has a population of approximately 2.2 million people within its city limits, however the urban area has a population of approximately 12.2 million people. Paris is known for its beauty, fashion, and art, and is one of the world’s most popular tourist destinations.

Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, the Arc de Triomphe, Notre-Dame Cathedral, and the Pont des Arts. The city has a rich history and has been influenced by many different cultures, including the

Generated text: not just about creating intelligent machines, but also about understanding the societal implications of these technologies. This is why we at the Future of Humanity Institute (FHI) are launching a new research project on the Ethics of Artificial Intelligence.

Our goal is to explore the potential risks and benefits of AI, and to develop principles and guidelines for the development and deployment of AI that align with human values. We will be working with a diverse team of researchers from various disciplines, including philosophy, computer science, economics, and law.

One of the key challenges we face is the need to develop a comprehensive understanding of the complex relationships between humans and machines. We need
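
The loop above consumes one stream at a time. If the per-prompt results do not need to be printed in order as they arrive, several streams can be consumed concurrently with `asyncio.gather`. The sketch below only demonstrates the asyncio pattern: `fake_async_generate` is a hypothetical stand-in that mimics the "await, then `async for`" shape used above, and whether the real engine benefits from concurrent consumers is not something this example establishes.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params=None, stream=True):
    """Hypothetical stand-in: awaiting it returns an async generator of chunks."""

    async def _chunks():
        for piece in [" one", " two", " three"]:
            await asyncio.sleep(0)  # hand control back to the event loop
            yield {"text": piece}

    return _chunks()


async def collect(prompt):
    """Consume one stream and return the prompt with its joined text."""
    generator = await fake_async_generate(prompt, stream=True)
    pieces = [chunk["text"] async for chunk in generator]
    return prompt, "".join(pieces)


async def run_all(prompts):
    # gather runs the consumers concurrently and returns results in input order.
    return await asyncio.gather(*(collect(p) for p in prompts))


if __name__ == "__main__":
    for prompt, text in asyncio.run(run_all(["A", "B"])):
        print(f"{prompt!r} -> {text!r}")
```

Collecting the text per prompt, rather than printing chunks directly, avoids interleaving tokens from different prompts on the same output line.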




In [6]:
llm.shutdown()
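
In a script, an exception raised during generation would skip a trailing `llm.shutdown()` call. A minimal sketch of one way to make the cleanup unconditional, assuming only that the engine exposes the `shutdown()` method used above: `shutting_down` is a hypothetical helper, and `FakeEngine` is a stand-in used so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield the engine and always call engine.shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    # Hypothetical stand-in exposing only the shutdown() method used above.
    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
try:
    with shutting_down(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

# shutdown() ran even though the body raised.
print(engine.closed)
```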