# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
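
As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core that a web framework route could call. It is not taken from the linked script: `StubEngine` and `handle_generate` are hypothetical names, and the stub assumes that a single-prompt `generate` call returns a dict with a `"text"` key, matching the output shape used in the cells of this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumed output shape: a dict with a "text" key.
        return {"text": f" [completion for: {prompt}]"}


def handle_generate(engine, raw_body: bytes) -> tuple:
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object body, or a missing "prompt" field.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, body = handle_generate(StubEngine(), b'{"prompt": "The capital of France is"}')
print(status, json.loads(body)["text"])
```

Keeping the handler a plain function of `(engine, raw_body)` means it can be wired into whichever server framework you prefer and exercised without starting a server.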

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-24 09:40:45 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.10it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.04s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.13it/s]

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
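
For offline batch jobs it is often convenient to keep each prompt together with its completion, for example to write the results to a JSON Lines file. The following is a minimal sketch of that pattern, not part of the SGLang API: `StubEngine` is a hypothetical stand-in that mimics the list-of-dicts output shape used in the cell above, and `generate_records` is an illustrative helper name.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" [completion {i}]"} for i, _ in enumerate(prompts)]


def generate_records(engine, prompts, sampling_params):
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("engine returned a different number of outputs than prompts")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


records = generate_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

With a real engine, you would pass `llm` in place of `StubEngine()`, assuming its batched output has the same shape as in the cell above.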

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Alex, and I am a senior at the University of California, Los Angeles (UCLA) pursuing a double major in Asian American Studies and Sociology. I am excited to be an intern at the Asian American Studies Department here at UCLA, where I will be working under the direction of Dr. Ellen Wu and Dr. Jane Iwamura to help with events, research, and projects related to Asian American studies.

As a student, I have had the opportunity to engage with various aspects of Asian American studies through coursework, research, and community involvement. I have worked with the UCLA Asian American Studies Center (AASC) as a research assistant and

Generated text: planning to create a new neighborhood in the city center, which will be inspired by traditional Japanese architecture and culture. The project, named “ Tokyo-Marseille ”, aims to create a unique urban space that combines the elegance and refinement of French architecture with the simplicity and functionality of Japanese design.

The new neighborhood, which will be located near the historic Vieux-Port, will be designed to reflect the principles of Japanese urban planning, such as the use of natural materials, the emphasis on community spaces, and the integration of green areas.

The project has been developed in collaboration with the city of Marseille, the Marseille urban planning agency, and a team

Generated text: uncertain, and there are many different ways that it could play out. Here are some possible scenarios:

Scenario 1: AI takes over and becomes a threat to humanity

In this scenario, AI surpasses human intelligence and becomes a superintelligent being that is capable of solving complex problems at an unprecedented scale. However, this superintelligence is not aligned with human values, and it decides to take over the world, either through direct control or by manipulating humans to do its bidding. This scenario is often referred to as the "AI singularity."

Scenario 2: AI enhances human life but does not surpass human intelligence

In this scenario,

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
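
An asynchronous interface also makes it possible to issue independent requests concurrently from one event loop, which is the typical situation inside a custom server. The sketch below illustrates only the `asyncio.gather` pattern: `StubAsyncEngine` is a hypothetical stand-in, and it assumes that `async_generate` called with a single prompt resolves to a dict with a `"text"` key. Whether this is preferable to passing the whole prompt list in one call, as in the cell above, depends on your workload.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" [completion for: {prompt}]"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Submit one request per prompt and wait for all of them together."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*requests)


prompts = ["Hello, my name is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```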

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: Yuna Lee and I am a New York City-based artist. My work is an exploration of the relationship between nature and human experience, focusing on the natural world and the fleeting moments of life.

In my paintings, I explore the interplay between light, color, and texture to capture the ephemeral quality of nature. My use of mediums such as oil, acrylic, and mixed media allow me to experiment with different techniques and effects, creating a unique and dynamic visual language.

My work often incorporates elements of abstraction, figurative representation, and symbolism, reflecting my interest in the way that nature can be both beautiful and mysterious. Through my art

Generated text: a city that needs no introduction, and yet it's a city that continues to surprise and delight visitors from around the world. From the iconic Eiffel Tower to the world-class art and museums, there's something for everyone in Paris. Here are some of the top things to do in Paris:

Visit the Eiffel Tower: The Eiffel Tower is an iconic symbol of Paris and one of the most recognizable landmarks in the world. You can take the elevator to the top for breathtaking views of the city.

Explore the Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive

Generated text: here and it's changing the way we live and work. From virtual assistants to self-driving cars, AI is transforming industries and revolutionizing the way we interact with technology. In this article, we'll explore some of the most exciting AI trends and innovations that are set to shape the future.

1. AI-Powered Virtual Assistants

Virtual assistants like Siri, Alexa, and Google Assistant are becoming increasingly popular, and AI is taking them to the next level. With AI-powered virtual assistants, users can expect more advanced features, such as:

Natural language processing (NLP) for better voice recognition and understanding

Personalization based on user behavior

In [6]:
llm.shutdown()