# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-22 21:07:37 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.24it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
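
For an offline batch job, the generated texts usually need to be stored alongside their prompts. The following is a minimal sketch and not part of SGLang: `to_jsonl` is a hypothetical helper, and it assumes only that each output is a dict with a `"text"` key, as in the cell above. The `demo_outputs` list is hand-written stand-in data so that the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines.

    Hypothetical helper: assumes every output is a dict with a "text" key.
    """
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]
jsonl_text = to_jsonl(demo_prompts, demo_outputs)
print(jsonl_text)
```

With a real engine, `demo_prompts` and `demo_outputs` would be replaced by `prompts` and the list returned by `llm.generate(prompts, sampling_params)`.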

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Amie Schaefer, and I'm a singer, songwriter, and composer based in Cincinnati, Ohio. I have been performing and recording music for many years, and I've had the privilege of sharing my music with audiences in the United States and abroad.

My music is a unique blend of folk, pop, and classical influences, with lyrics that explore themes of love, hope, and resilience. I draw inspiration from my own life experiences, as well as the world around me, and I strive to create music that is both authentic and uplifting.

As a composer, I have written music for film, theater, and live events, including

Generated text: a popular destination for both leisure and business travelers. Paris is famous for its stunning architecture, art museums, fashion, cuisine, and romantic atmosphere. Whether you are looking to visit the iconic Eiffel Tower, explore the Louvre Museum, or simply stroll along the Seine River, Paris has something for everyone.

When planning a trip to Paris, consider booking a hotel in the city center to be close to the main attractions. Some popular areas to stay include:

The Latin Quarter: Known for its charming streets, historic buildings, and lively nightlife.

Le Marais: A trendy neighborhood with a mix of old and new buildings, boutique shops

Generated text: likely to be shaped by the choices we make today. As we navigate the rapidly evolving landscape of artificial intelligence, we must consider the implications of AI on society, the economy, and our individual lives. Here are some possible future scenarios for AI, and the opportunities and challenges that come with them:

**Scenario 1: AI-driven Utopia**

In this scenario, AI has become an integral part of our daily lives, making our lives easier, healthier, and more enjoyable. AI-powered systems manage our homes, transport us efficiently, and provide personalized healthcare. Automation has enabled businesses to focus on creative and high-value tasks, leading to unprecedented
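
The streamed output above suggests that each chunk carries only the newly generated piece of text. Under that assumption (it may not hold for every SGLang version, so it is worth checking what `chunk["text"]` contains in yours), the pieces can be joined to recover the full completion. `collect_stream` and `fake_stream` are hypothetical names; `fake_stream` stands in for `llm.generate(prompt, sampling_params, stream=True)` so the sketch runs without a model.

```python
def collect_stream(chunks):
    """Join incremental stream chunks into the full generated text.

    Assumes each chunk is a dict whose "text" holds only the new piece.
    """
    pieces = []
    for chunk in chunks:
        pieces.append(chunk["text"])
    return "".join(pieces)


def fake_stream():
    """Stand-in generator that mimics incremental streaming chunks."""
    for piece in [" a", " popular", " destination", "."]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(repr(full_text))
```

If the chunks turn out to be cumulative instead, keeping only the last chunk's text would be the equivalent operation.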




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
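
One reason to prefer the asynchronous API is that the event loop stays free while generation is in flight, so other coroutines can make progress. The sketch below illustrates that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` (it only sleeps briefly and returns dicts with a `"text"` key), and `heartbeat` is an illustrative placeholder for whatever other work the application does.

```python
import asyncio


async def fake_async_generate(prompts):
    """Stand-in for llm.async_generate: returns one dict per prompt."""
    await asyncio.sleep(0.05)  # simulate time spent generating
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    """Placeholder for other work that runs while generation is pending."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(0.01)
        count += 1
    return count


async def run_concurrently():
    # Both awaitables are scheduled together; neither blocks the other.
    outputs, ticks = await asyncio.gather(
        fake_async_generate(["Hello, my name is", "The future of AI is"]),
        heartbeat(3),
    )
    return outputs, ticks


gathered_outputs, gathered_ticks = asyncio.run(run_concurrently())
print(len(gathered_outputs), gathered_ticks)
```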

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Michael and I am a volunteer moderator here at Cybersecurity 360.

I have a background in IT and have been involved with the field of cybersecurity for over 10 years. I am excited to be a part of this community and look forward to learning from and contributing to the discussions here.

Please feel free to reach out to me if you have any questions or need assistance with anything related to cybersecurity. I'll do my best to provide helpful and informative responses.

Let's get started! What brings you to Cybersecurity 360? Are you looking for advice, guidance, or just interested in learning more about cybersecurity? Let me know, and

Generated text: the most visited city in the world, and for good reason. Paris is a beautiful city that offers stunning architecture, world-class museums, and a rich cultural scene. From the Eiffel Tower to the Louvre, there's no shortage of iconic landmarks to explore. Here are some of the top things to do in Paris:

1. Visit the Eiffel Tower: The iconic iron lattice tower is a must-see attraction in Paris. You can take the stairs or elevator to the top for breathtaking views of the city.

2. Explore the Louvre Museum: The world-famous museum is home to an impressive collection of art and

Generated text: human

The future of AI is human

Artificial Intelligence is at an all-time high in the industry. In fact, AI has become a buzzword that can barely be avoided in conversations about tech, business, and the future. However, in this AI-driven era, there is a growing consensus that the future of AI is not about replacing human intelligence but rather augmenting it.

There are several reasons why the future of AI is human-centric:

1. The complexity of human emotions: AI is far from replicating human emotions, empathy, and compassion. Humans are more inclined to understand and navigate the complexities of human emotions, making them
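
When the streamed text is needed as a value rather than printed, the chunks can be accumulated inside the `async for` loop. This is a minimal sketch that assumes incremental chunks, as the output above suggests. `fake_async_stream` is a hypothetical stand-in for the generator returned by `await llm.async_generate(prompt, sampling_params, stream=True)`, and `collect_async_stream` is an illustrative helper name.

```python
import asyncio


async def fake_async_stream():
    """Stand-in async generator that yields incremental chunks."""
    for piece in [" the", " most", " visited", " city"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": piece}


async def collect_async_stream(generator):
    """Accumulate the "text" of each chunk from an async generator."""
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


streamed_text = asyncio.run(collect_async_stream(fake_async_stream()))
print(repr(streamed_text))
```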




In [6]:
llm.shutdown()
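
If generation raises an exception, a trailing `llm.shutdown()` call like the one above would be skipped. One way to guard against that is to wrap the engine in a context manager. This is a sketch rather than an SGLang feature: `managed_engine` is a hypothetical helper that relies only on the `shutdown()` method used above, and `DummyEngine` is a stand-in so the snippet runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Stand-in for an engine object; records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


dummy = DummyEngine()
try:
    with managed_engine(dummy) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(dummy.is_shut_down)
```

With a real engine, the object passed to `managed_engine` would be the `sgl.Engine` instance created at the top of this document.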