# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is especially useful where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
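The sketch below illustrates the custom-server idea. It wraps an engine in a minimal WSGI application that uses only the Python standard library. It is not the linked example, which may use a different framework.

`StubEngine` and `make_app` are hypothetical names. The stub mimics only the `generate(prompt, sampling_params)` call and the `output['text']` field used in this document, so the sketch runs without a GPU. The sketch assumes that a real `sgl.Engine` returns a single dict when given a single prompt string.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU.

    Assumption: generate(prompt, sampling_params) returns {"text": ...}
    for a single prompt string, matching how outputs are read in this document.
    """

    def generate(self, prompt, sampling_params):
        return {"text": f"<completion for {prompt!r}>"}


def make_app(engine):
    """Build a minimal WSGI app that forwards a POSTed JSON body to engine.generate."""

    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
            return [b"POST only"]
        # Read the request body: {"text": ..., "sampling_params": {...}}
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        output = engine.generate(payload["text"], payload.get("sampling_params", {}))
        body = json.dumps({"text": output["text"]}).encode("utf-8")
        start_response(
            "200 OK",
            [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
        )
        return [body]

    return app
```

To try it locally, `wsgiref.simple_server.make_server("127.0.0.1", 8000, make_app(llm)).serve_forever()` would serve the app. WSGI is synchronous, so each request blocks its worker while the engine generates. An asyncio-based server calling `async_generate` avoids that.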

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.
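The engine schedules a submitted batch internally, so the examples in this section pass a whole prompt list in one call. If a dataset is too large to hold as a single list, one option is to feed it to `generate` in fixed-size chunks. This is an assumption on our part, not an SGLang feature.

`batched`, `generate_in_chunks`, and `StubEngine` are hypothetical helpers. The stub stands in for `sgl.Engine` so the sketch runs anywhere. Like the `zip` in the examples, it assumes outputs come back in prompt order.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items (similar to itertools.batched in 3.12)."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=256):
    """Call engine.generate once per chunk and return all texts in prompt order."""
    texts = []
    for chunk in batched(prompts, chunk_size):
        outputs = engine.generate(chunk, sampling_params)
        texts.extend(out["text"] for out in outputs)
    return texts


class StubEngine:
    """Hypothetical stand-in for sgl.Engine: upper-cases prompts and records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def generate(self, prompts, sampling_params):
        self.batch_sizes.append(len(prompts))
        return [{"text": p.upper()} for p in prompts]
```

Larger chunks give the engine more requests to schedule together. Smaller chunks bound how many prompts and outputs sit in memory at once.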

```python
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts submits them as one batch; outputs are paired with prompts in order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
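Offline batch results usually end up on disk. The sketch below writes one JSON record per line (JSONL). It assumes the `outputs` list returned above holds one dict with a `'text'` field per prompt, in prompt order. `write_jsonl` is a hypothetical helper, not part of SGLang.

```python
import json
from typing import Dict, List


def write_jsonl(path: str, prompts: List[str], outputs: List[Dict]) -> int:
    """Write one {"prompt", "text"} JSON object per line and return the record count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            # ensure_ascii=False keeps non-ASCII model output readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Usage after the cell above (assumed output shape):
# write_jsonl("results.jsonl", prompts, outputs)
```

JSONL is convenient for batch jobs because each line can be appended or parsed independently.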

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # With stream=True, generate yields chunks as they are produced.
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```

```text
Generated text: Lyle!

I am a second-year student in the Master of Science program in Nutrition at the University of Guelph. My undergraduate degree was also in Nutrition, and I am passionate about learning more about the complex relationships between food, health, and disease.

When I'm not studying, you can find me experimenting with new recipes in the kitchen, hiking with my family, or practicing yoga. I'm also an avid reader and enjoy staying up-to-date on the latest nutrition research and trends.

This blog will be a space for me to share my thoughts, experiences, and knowledge about nutrition, health, and wellness. I hope to provide

Generated text: known as Paris, a city famous for its fashion, cuisine, art, and romance. It's a must-visit destination for travelers from all over the world. Here's a brief guide to help you plan your trip to Paris.

Paris is a city with a rich history that spans thousands of years. It was founded by the Gauls, a Celtic tribe, in the 3rd century BC. Over the centuries, the city has been conquered by various empires and rulers, including the Romans, Franks, and Normans. Today, Paris is a cosmopolitan city with a population of over 2 million people.

Paris is

Generated text: here – and it’s personal

Artificial intelligence (AI) has been gaining momentum in recent years, and its applications are becoming increasingly personalized. From virtual assistants like Siri and Alexa to AI-powered chatbots, AI is changing the way we interact with technology and each other. But what does the future of AI hold, and how will it impact our daily lives?

According to Gartner, by 2022, 85% of customer service interactions will be handled by AI-powered chatbots. This is because AI can provide 24/7 support, reduce response times, and improve customer satisfaction. AI-powered chatbots can also help with
```



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate is a coroutine, so it must be awaited inside an async function.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
```
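Sometimes requests arrive one at a time rather than as a ready-made list, for example in a custom server. In that case, a client can issue separate `async_generate` calls and await them together with `asyncio.gather`.

The sketch below caps how many calls are in flight with an `asyncio.Semaphore`. `StubAsyncEngine` and `generate_concurrently` are hypothetical. The stub assumes a single-prompt `async_generate(prompt, sampling_params)` that returns a dict with `'text'`. For a fixed batch, passing the whole list in one call, as the cell above does, is simpler.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine.async_generate (non-streaming, one prompt)."""

    def __init__(self):
        self.active = 0  # calls currently running
        self.peak = 0    # highest number of simultaneous calls seen

    async def async_generate(self, prompt, sampling_params):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # simulate generation latency
        self.active -= 1
        return {"text": f"<completion for {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Submit one request per prompt, allowing at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the order the awaitables were passed, not completion order
    return await asyncio.gather(*(one(p) for p in prompts))


outputs = asyncio.run(
    generate_concurrently(StubAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
```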

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, awaiting async_generate returns an async generator of chunks.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```

```text
Generated text: JZ and I am a senior in the integrated marketing communications program at Emerson College. I am also a resident of the beautiful city of Boston. I am excited to be joining this community of bloggers and sharing my thoughts and experiences with all of you.

As a resident of Boston, I have grown accustomed to the fast-paced and vibrant atmosphere of the city. From the historic Freedom Trail to the trendy restaurants and bars in the North End, there is always something new to see and do in this amazing city.

As a student, I am also constantly learning and growing. My major in integrated marketing communications has given me a well-rounded education in the

Generated text: a city like no other. Known for its grand architecture, rich history, and vibrant cultural scene, Paris is a must-visit destination for anyone interested in exploring the world's greatest cities.

From the iconic Eiffel Tower to the artistic treasures of the Louvre, Paris has a wealth of attractions that will leave you in awe. Take a stroll along the Seine, visit the stunning Notre Dame Cathedral, and explore the charming streets of Montmartre to discover the city's unique character.

Paris is also a city of fashion, food, and wine, with world-renowned designers, Michelin-starred restaurants, and some of the

Generated text: being shaped by the way we interact with it

The way we interact with artificial intelligence (AI) will have a profound impact on how it develops and evolves in the future. As AI becomes more ubiquitous and integrated into our daily lives, our interactions with it will influence its capabilities, limitations, and even its values. Here are some ways in which the future of AI will be shaped by the way we interact with it:

1. **Natural Language Understanding**: The way we communicate with AI through natural language will shape its ability to understand and respond to human language. As we interact with AI through voice assistants, chatbots, and other interfaces,
```
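The streaming cell above consumes one stream at a time. A hypothetical variation drains several streams concurrently with `asyncio.gather`, which returns results in input order.

`fake_async_stream` stands in for the async generator returned by `await llm.async_generate(prompt, sampling_params, stream=True)`. `drain` and `drain_all` are illustrative helpers, not SGLang APIs. Whether the engine actually overlaps generation across concurrent streams depends on its scheduler, so treat this as a client-side pattern only.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_async_stream(text: str) -> AsyncIterator[Dict[str, str]]:
    """Hypothetical stand-in for the async generator returned by
    `await llm.async_generate(prompt, sampling_params, stream=True)`."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop so streams interleave
        yield {"text": word if i == 0 else " " + word}


async def drain(stream: AsyncIterator[Dict[str, str]]) -> str:
    """Concatenate every delta from one stream (assumes chunks are incremental)."""
    parts = []
    async for chunk in stream:
        parts.append(chunk["text"])
    return "".join(parts)


async def drain_all(streams: List[AsyncIterator[Dict[str, str]]]) -> List[str]:
    """Consume several streams at the same time; results keep the input order."""
    return await asyncio.gather(*(drain(s) for s in streams))


texts = asyncio.run(drain_all([fake_async_stream("a city"), fake_async_stream("being shaped by")]))
```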




```python
# Shut down the engine and release its resources when finished.
llm.shutdown()
```
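`llm.shutdown()` should run even if a cell raises partway through. One way to guarantee that is a small context manager built with `contextlib.contextmanager`. This is a sketch, not an SGLang API. `engine_session` and `StubEngine` are hypothetical, and the stub only records that `shutdown()` was called.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(factory):
    """Create an engine with `factory()` and call its shutdown() on exit, even on errors."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that only tracks whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


# Intended usage with the real engine (not executed here):
# with engine_session(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm:
#     outputs = llm.generate(prompts, sampling_params)
```

Passing a factory rather than an already-built engine keeps construction inside the managed block.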