# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
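
To make the custom-server idea concrete, here is a minimal sketch of the request-handling core such a server might wrap around the engine. It is not the linked `custom_server.py`: the `StubEngine` class, the `handle_generate` function, and the JSON field names (`prompts`, `sampling_params`, `texts`) are all hypothetical, chosen only for illustration. The stub mimics the one behavior this document demonstrates, namely that `generate` takes a list of prompts and returns one dict with a `"text"` key per prompt.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the shape used in this document: one {"text": ...} dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response using the engine."""
    request = json.loads(body)
    prompts = request["prompts"]  # hypothetical field name
    sampling_params = request.get("sampling_params", {})  # hypothetical field name
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]}).encode("utf-8")


request_body = json.dumps(
    {"prompts": ["Hello, my name is"], "sampling_params": {"temperature": 0.8}}
).encode("utf-8")
print(handle_generate(StubEngine(), request_body).decode("utf-8"))
```

Keeping the handler a plain function of `(engine, bytes)` means it can be mounted in whichever web framework you prefer and tested without starting a server; with a real engine you would pass the `sgl.Engine` instance in place of the stub.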

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine (loads the model weights; no HTTP server is started)

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-27 22:59:19 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.35it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

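# Sampling settings shared by every prompt in the batch
sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Passing a list of prompts returns one output dict per prompt
outputs = llm.generate(prompts, sampling_params)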
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Amanda and I'm a mom of three kids, ages 7, 9, and 11. My family and I recently moved to the United States from the Philippines. I'm excited to be here and start this new chapter of our lives. We settled in a small town near the city and I'm enjoying the peace and quiet of our new neighborhood.
As a mom, I'm always looking for ways to engage my kids in fun and educational activities. We love spending time outdoors, playing games, and learning new things together. In the Philippines, we were always on the go, exploring new places and trying new foods, and I
Prompt: The president of the United States is
Generated text:  a powerful leader who serves as the head of the government and the commander-in-chief of the armed forces. The president is also the symbol of American democracy and the embodiment of its values and ideals. The president is elected through a democratic process, in which citizens vote for their preferred candidate. The president s
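
For prompt lists much larger than the four shown above, it can be convenient to submit the work in slices so that results can be saved or inspected as they arrive. The sketch below is one way to do that, not something the engine requires: `generate_in_chunks` and `StubEngine` are hypothetical names, and the stub only imitates the list-in, list-of-dicts-out behavior of `llm.generate` seen in this section. Whether slicing affects throughput is not something this document states, so treat it purely as a bookkeeping aid.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine; records how many calls were made."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompts, sampling_params=None):
        self.calls += 1
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2):
    """Yield (prompt, text) pairs, submitting at most chunk_size prompts per call."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


engine = StubEngine()
prompts = [f"Prompt number {i}" for i in range(5)]
results = list(generate_in_chunks(engine, prompts, {"temperature": 0.8}))
for prompt, text in results:
    print(f"{prompt} ->{text}")
```

Because the helper is a generator, each slice is only submitted when the caller asks for more results, which makes it easy to write partial output to disk between slices.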

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # stream=True makes generate() yield chunks as they are produced
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Heather, and I am a software engineer. I have over 10 years of experience in software development, with a strong focus on backend systems, database design, and scalability. I am passionate about building robust and maintainable software systems that deliver value to users.

I have worked on a wide range of projects, from e-commerce platforms to data analytics systems, and have experience with various technologies including Java, Python, Node.js, and .NET. I am also familiar with Agile development methodologies and have worked in team environments to deliver high-quality software products on time.

In addition to my technical skills, I am a strong communicator and team player.



Prompt: The capital of France is
Generated text:  a city like no other, and I was lucky to have spent a day exploring it. Paris is a city that exudes romance, art, fashion, and history, and I fell in love with it. Here are some of my top picks for things to do in Paris:
1. Visit the Eiffel Tower: This iconic tower is a must-visit attraction in Paris. You can take the stairs or elevator to the top for breathtaking views of the city.
2. Explore the Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts from around the



Prompt: The future of AI is
Generated text:  clear: it’s going to be huge. According to a report from MarketsandMarkets, the global AI market is expected to grow from $190 billion in 2021 to $390 billion by 2026, at a Compound Annual Growth Rate (CAGR) of 34.1% during the forecast period.
But what does this mean for businesses? How can they capitalize on the AI revolution?
Here are 3 ways businesses can leverage AI for growth:
1. Enhance Customer Experience:
AI can help businesses personalize customer interactions, streamline processes, and offer proactive support. For example, chatbots can be used to provide
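
If you want the complete text of a streamed response as well as the live printout, the chunks can be accumulated as they arrive. The sketch below assumes, as the output above suggests, that each chunk's `"text"` field holds only the newly generated piece; if your SGLang version returns cumulative text in each chunk instead, joining would duplicate content and you would keep just the last chunk. `collect_stream` and `fake_chunks` are hypothetical names used for illustration.

```python
def collect_stream(chunks, on_piece=None):
    """Join the "text" field of each streamed chunk into one string.

    Assumes each chunk carries only new text (an assumption about the
    engine's streaming format, not a documented guarantee).
    """
    pieces = []
    for chunk in chunks:
        pieces.append(chunk["text"])
        if on_piece is not None:
            on_piece(chunk["text"])  # e.g. print the piece as it arrives
    return "".join(pieces)


# Hypothetical chunks standing in for llm.generate(prompt, sampling_params, stream=True)
fake_chunks = [{"text": " Heather"}, {"text": ","}, {"text": " and"}, {"text": " I"}]
full_text = collect_stream(fake_chunks)
print(repr(full_text))
```

With a real engine, the iterator returned by `llm.generate(prompt, sampling_params, stream=True)` would be passed in place of `fake_chunks`.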

 




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the full batch; one output dict is returned per prompt
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Ben. I am a postdoctoral researcher at the University of California, Santa Barbara. My research focuses on the mathematical modeling of complex biological systems, particularly in the context of plant ecology and population biology.
I was born and raised in the United Kingdom, where I earned my undergraduate degree in Mathematics from the University of Cambridge. I then moved to the United States, where I earned my Ph.D. in Applied Mathematics from the University of Arizona.
My research interests include mathematical modeling of ecological and evolutionary processes, with a focus on understanding the dynamics of plant populations and communities. I am particularly interested in the interactions between plants and their environment, including

Prompt: The capital of France is
Generated text:  the most visited city in the world, attracting over 23 million tourists every year. But what is it about Paris that draws people to it like a magnet? Her
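
The asynchronous API becomes more useful when several independent groups of prompts are in flight at once. The following sketch shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. It assumes that issuing concurrent `async_generate` calls on a single engine is acceptable in your SGLang version, which this document does not confirm. `StubAsyncEngine`, `generate_groups`, and `max_in_flight` are hypothetical names; the stub only imitates the awaitable, list-in, list-of-dicts-out behavior shown above.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control to the event loop once
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, groups, sampling_params, max_in_flight=2):
    """Run several prompt groups concurrently, at most max_in_flight at a time."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def run_one(group):
        async with limiter:
            return await engine.async_generate(group, sampling_params)

    # gather returns results in the same order as `groups`.
    return await asyncio.gather(*(run_one(group) for group in groups))


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt} ->{output['text']}")
```

The semaphore is a design choice rather than a requirement: it keeps the number of outstanding requests bounded when the list of groups is long.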

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, the awaited call returns an async generator of chunks
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Brian Leithead. I am a freelance writer, podcaster, and photographer based in Edinburgh, Scotland. I have a background in journalism and a passion for storytelling, which I like to explore through my work.
You can find more information about my writing and photography on this site, as well as on social media. I'm always looking for new projects and collaborations, so please get in touch if you'd like to chat.
I'm also the host of the Human Story podcast, which explores the complexities of the human experience through in-depth interviews with people from all walks of life. You can find the podcast on Apple Podcasts, Spotify



Prompt: The capital of France is
Generated text:  Paris, and the capital of France is also Paris. The official name of the capital is the City of Paris. The city is located in the northern part of the country, on the Seine River. The city is divided into 20 arrondissements, or districts, which are numbered from 1 to 20.
Paris is a major tourist destination, attracting millions of visitors each year. The city is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, the Notre-Dame Cathedral, and the Arc de Triomphe. The city also has many parks and gardens, including the Luxembourg Gardens



Prompt: The future of AI is
Generated text:  changing, and it’s happening rapidly. The field is shifting from a focus on individual AI applications to a focus on integrating AI into larger systems and processes. This trend is often referred to as “AI convergence” or “systemic AI.”
Convergence refers to the integration of AI with other technologies, such as the Internet of Things (IoT), robotics, and human-machine interfaces. This integration enables AI to interact with the physical world, learn from vast amounts of data, and make decisions based on complex inputs.
Systemic AI, on the other hand, focuses on creating AI systems that can learn, reason, and adapt to changing circumstances
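
The `async for` pattern above can be wrapped in a small helper when a caller needs both the incremental pieces and the final text, for example to forward pieces to a client while also logging the full response. This is a sketch under the same assumption as the output shown here, namely that each chunk's `"text"` holds only the new piece. `fake_stream` and `collect_async_stream` are hypothetical names; `fake_stream` stands in for the async generator obtained from `await llm.async_generate(prompt, sampling_params, stream=True)`.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in for the async generator returned with stream=True."""
    for piece in [" Paris", ",", " and"]:
        await asyncio.sleep(0)  # yield control between chunks
        yield {"text": piece}


async def collect_async_stream(stream, on_piece=None):
    """Consume an async chunk stream, optionally forwarding each piece."""
    pieces = []
    async for chunk in stream:
        pieces.append(chunk["text"])
        if on_piece is not None:
            on_piece(chunk["text"])
    return "".join(pieces)


forwarded = []
text = asyncio.run(collect_async_stream(fake_stream(), on_piece=forwarded.append))
print(repr(text))
```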




In [6]:
llm.shutdown()
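
In a script, as opposed to a notebook, an exception raised between creating the engine and calling `llm.shutdown()` would skip the cleanup. One way to guard against that is a small context manager that always calls `shutdown()`. The sketch below uses a hypothetical `engine_session` helper and a `StubEngine` stand-in so it can run anywhere; the only engine method it relies on beyond `generate` is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raises


with engine_session(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})
    print(outputs[0]["text"])

print("shut down:", engine.closed)
```

With a real engine, `make_engine` could be a `lambda` that constructs `sgl.Engine(model_path=...)` with the same arguments used at the top of this document.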