# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
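The linked script is the example to follow. As a much smaller illustration of the same idea, the sketch below wraps an engine-like object in a JSON endpoint using only Python's standard-library `http.server`. `EchoEngine`, `make_handler`, and `serve` are hypothetical names introduced here, not SGLang APIs. The only assumption about the real engine is the one the examples on this page rely on: `generate(prompt, sampling_params)` returns a dict with a `"text"` key. Passing an `sgl.Engine` instance instead of `EchoEngine()` would serve a real model.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params):
        return {"text": f"[echo] {prompt}"}


def make_handler(engine, lock):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            # Serialize engine calls: this sketch does not assume the real
            # engine tolerates concurrent calls from several server threads.
            with lock:
                output = engine.generate(body["prompt"], body.get("sampling_params", {}))
            payload = json.dumps({"text": output["text"]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=0):
    """Start the server on a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), make_handler(engine, threading.Lock()))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client then POSTs `{"prompt": "...", "sampling_params": {...}}` to `/generate` on `server.server_address`, and calls `server.shutdown()` when done. A production server would also need streaming, batching of concurrent requests, and error handling, which is what the linked script is for.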

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
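For larger offline jobs you will usually read prompts from a file and persist the results. The sketch below relies only on what the example above already uses: `llm.generate` takes a list of prompts and returns one dict per prompt, in input order, each with a `"text"` key. `write_jsonl_results` and `fake_generate` are hypothetical helpers, not SGLang APIs, and splitting the input into chunks of `chunk_size` is an illustrative choice to bound memory on very large files, not an SGLang requirement. Pass `llm.generate` in place of `fake_generate` to use the engine.

```python
import json
from typing import Callable, Iterable, List


def fake_generate(prompts: List[str], sampling_params: dict) -> List[dict]:
    """Hypothetical stand-in for llm.generate: one output dict per prompt, same order."""
    return [{"text": p.upper()} for p in prompts]


def write_jsonl_results(
    generate: Callable[[List[str], dict], List[dict]],
    prompts: Iterable[str],
    sampling_params: dict,
    out_path: str,
    chunk_size: int = 256,
) -> int:
    """Generate completions chunk by chunk and write one JSON line per prompt.

    Returns the number of records written.
    """
    prompts = list(prompts)
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for start in range(0, len(prompts), chunk_size):
            chunk = prompts[start : start + chunk_size]
            outputs = generate(chunk, sampling_params)
            # zip relies on outputs coming back in the same order as the prompts.
            for prompt, output in zip(chunk, outputs):
                f.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
                written += 1
    return written
```

Each line of the output file is an independent JSON object, so a partially written file from an interrupted run can still be read line by line.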

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```

```text
Prompt: Hello, my name is
Generated text: Laura, and I am a junior at the University of Delaware. I am a double major in business administration and psychology. After college, I hope to pursue a career in the field of business, possibly as an international business consultant or a marketing manager. My long-term goal is to start my own business.

I am excited to share my experiences and thoughts on business and entrepreneurship with you through this blog. I have had the opportunity to participate in various business-related projects and internships, which I will be sharing with you in the coming weeks.

In my free time, I enjoy traveling, reading, and trying new foods. I also love hiking

Prompt: The capital of France is
Generated text: a must-visit destination for any traveler. Paris is a city that is steeped in history, culture, and romance, and it's a place where you can experience the best of the French lifestyle. From the iconic Eiffel Tower to the stunning Notre Dame Cathedral, there's no shortage of famous landmarks to see in Paris.

The city is also home to a wide range of art museums, including the Louvre, which houses some of the world's most famous paintings, including the Mona Lisa. And, of course, no trip to Paris would be complete without a visit to the famous Montmartre neighborhood, which is famous for

Prompt: The future of AI is
Generated text: not just about the technology, but about the principles and values that guide its development and use.

When you think about the future of AI, you might imagine a world where machines are smarter and more capable than ever before, but do you ever stop to think about the ethics and values that will guide their development and use?

As AI becomes increasingly integrated into our daily lives, it's crucial to consider the principles and values that will shape its impact on society. This is what we'll be exploring in this article.

The 4 principles of AI ethics

In 2019, the European Commission published a set of guidelines for trustworthy AI, which
```
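In the streaming output each chunk carries only the newly generated piece of text, which is why the loop prints chunks back to back. If you also need the full completion as a string, you can accumulate the chunks while displaying them. A minimal sketch: `fake_stream` is a hypothetical stand-in for `llm.generate(prompt, sampling_params, stream=True)`, and the helper assumes incremental chunks as observed in the output above.

```python
from typing import Iterable, Iterator


def fake_stream(pieces: Iterable[str]) -> Iterator[dict]:
    """Hypothetical stand-in for llm.generate(..., stream=True): yields {"text": piece}."""
    for piece in pieces:
        yield {"text": piece}


def stream_and_collect(chunks: Iterable[dict], echo: bool = True) -> str:
    """Optionally print each incremental chunk as it arrives; return the joined text."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk["text"], end="", flush=True)
        parts.append(chunk["text"])
    if echo:
        print()
    return "".join(parts)
```

With the engine, the call would look like `text = stream_and_collect(llm.generate(prompt, sampling_params, stream=True))`. Collecting pieces in a list and joining once avoids repeated string concatenation on long generations.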




### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
```
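The example above submits the whole batch in a single `async_generate` call. In a larger asyncio application, requests often arrive independently instead; `asyncio.gather` lets them be awaited concurrently rather than one after another. In this sketch `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` called with a single prompt, which is assumed here to return one dict with a `"text"` key.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    """Hypothetical stand-in for llm.async_generate called with a single prompt."""
    await asyncio.sleep(0.01)  # simulate time spent waiting on the engine
    return {"text": f"<completion of {prompt!r}>"}


async def generate_concurrently(
    async_generate: Callable[[str, dict], Awaitable[dict]],
    prompts: List[str],
    sampling_params: dict,
) -> List[str]:
    """Send one request per prompt and await them together.

    asyncio.gather returns results in the order the awaitables were passed,
    so the returned list lines up with `prompts`.
    """
    outputs = await asyncio.gather(
        *(async_generate(p, sampling_params) for p in prompts)
    )
    return [o["text"] for o in outputs]
```

Inside `main()` this would become `texts = await generate_concurrently(llm.async_generate, prompts, sampling_params)`, under the single-prompt assumption stated above.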

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```

```text
Prompt: Hello, my name is
Generated text: Lee Harris, and I am a professional speaker, facilitator, and coach who specializes in helping people navigate significant change. With a background in psychotherapy, counseling, and leadership development, I have a deep understanding of how to support individuals and teams through challenging times.

My approach is grounded in a compassionate and non-judgmental mindset, and I am passionate about empowering people to tap into their inner wisdom and potential. I have a knack for creating a safe and supportive environment where people feel comfortable sharing their concerns and exploring new perspectives.

I have worked with a wide range of clients, from entrepreneurs and small business owners to corporate leaders and teams

Prompt: The capital of France is
Generated text: Paris. The largest city in France is Paris. Paris is the most populous city in France.

What is the capital of France?

The capital of France is Paris. The largest city in France is also Paris. Paris is the most populous city in France.

What is the largest city in France?

The largest city in France is Paris. The capital of France is also Paris. Paris is the most populous city in France.

What is the most populous city in France?

The most populous city in France is Paris. The capital of France is also Paris. Paris is the largest city in France.

What is the capital, largest city, and

Prompt: The future of AI is
Generated text: not just about computers, but about people

Artificial intelligence has the potential to significantly improve the human experience. However, the way we think about AI often neglects the most important factor: people.

Read more about The future of AI is not just about computers, but about people

The Impact of AI on Human Employment and Skills

The increasing use of artificial intelligence (AI) in various industries has sparked concerns about the potential impact on human employment and skills. The effects of AI on employment are multifaceted and have both positive and negative aspects.

Read more about The Impact of AI on Human Employment and Skills

AI for Social Good:
```
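The loop above streams the prompts one at a time. To consume several streams at once, each stream can be drained in its own coroutine and the coroutines awaited with `asyncio.gather`. This is a sketch under stated assumptions: `fake_async_stream_generate` is a hypothetical stand-in that mirrors the pattern used above, where awaiting `llm.async_generate(prompt, sampling_params, stream=True)` returns an async iterator of `{"text": ...}` chunks. The sketch collects each completion instead of printing, since tokens from several prompts would interleave on screen; how much the real engine overlaps the streams is not something this sketch establishes.

```python
import asyncio
from typing import List


async def fake_async_stream_generate(prompt: str, sampling_params: dict, stream: bool = True):
    """Hypothetical stand-in: awaiting it returns an async iterator of chunks."""

    async def chunks():
        for word in ("streamed", "reply", "to", prompt):
            await asyncio.sleep(0)  # yield control so other streams can progress
            yield {"text": " " + word}

    return chunks()


async def drain(async_generate, prompt: str, sampling_params: dict) -> str:
    """Consume one stream to completion and return its joined text."""
    generator = await async_generate(prompt, sampling_params, stream=True)
    parts = []
    async for chunk in generator:
        parts.append(chunk["text"])
    return "".join(parts)


async def stream_concurrently(async_generate, prompts: List[str], sampling_params: dict) -> List[str]:
    """Drain one stream per prompt concurrently; results keep the input order."""
    return await asyncio.gather(
        *(drain(async_generate, p, sampling_params) for p in prompts)
    )
```

With the engine, `await stream_concurrently(llm.async_generate, prompts, sampling_params)` would replace the sequential loop inside `main()`.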




```python
# Shut down the engine and release its resources.
llm.shutdown()
```
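In a standalone script it is easy to skip `llm.shutdown()` when an exception is raised mid-batch. A small context manager guarantees the call. `engine_session` and `DummyEngine` are hypothetical names introduced for this sketch; the only assumption about the real engine is that it exposes `shutdown()`, as used above.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory, *args, **kwargs):
    """Create an engine with factory(*args, **kwargs) and always shut it down on exit."""
    engine = factory(*args, **kwargs)
    try:
        yield engine
    finally:
        # Runs on normal exit and when the with-body raises.
        engine.shutdown()
```

With SGLang this would read `with engine_session(sgl.Engine, model_path="meta-llama/Meta-Llama-3.1-8B-Instruct") as llm: ...`, so the engine is shut down even if generation fails partway through.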