# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
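
The four modes correspond to four familiar Python call shapes: a plain function, a generator, a coroutine, and an async generator. The sketch below illustrates them with a hypothetical `ToyEngine` that only echoes the prompt word by word. It is not SGLang's implementation, and the method names `generate_stream` and `async_generate_stream` are assumptions made for illustration; in SGLang, streaming is selected with `stream=True` on `generate` and `async_generate`, as the cells later in this document show.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class ToyEngine:
    """Hypothetical stand-in for an inference engine; it only echoes words."""

    def _tokens(self, prompt: str) -> List[str]:
        # Pretend the "model" answers by repeating the prompt word by word.
        return [word + " " for word in prompt.split()]

    def generate(self, prompt: str) -> str:
        # Non-streaming, synchronous: block until the full text is ready.
        return "".join(self._tokens(prompt))

    def generate_stream(self, prompt: str) -> Iterator[str]:
        # Streaming, synchronous: yield each piece as it becomes available.
        yield from self._tokens(prompt)

    async def async_generate(self, prompt: str) -> str:
        # Non-streaming, asynchronous: awaitable that returns the full text.
        await asyncio.sleep(0)  # hand control back to the event loop
        return "".join(self._tokens(prompt))

    async def async_generate_stream(self, prompt: str) -> AsyncIterator[str]:
        # Streaming, asynchronous: consumed with `async for`.
        for token in self._tokens(prompt):
            await asyncio.sleep(0)
            yield token


async def demo_async(engine: ToyEngine, prompt: str):
    full = await engine.async_generate(prompt)
    pieces = [piece async for piece in engine.async_generate_stream(prompt)]
    return full, pieces
```

The synchronous shapes are the simplest choice for scripts, while the asynchronous ones fit code that already runs inside an event loop.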

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ben, and I'm a proud husband and father of two young girls. My wife, Rachel, and I met in college and have been married for over 10 years. We have two amazing daughters, Emily and Abigail, who are the lights of our lives. We live in the beautiful state of Colorado and love the outdoors. When I'm not working, you can find me hiking, camping, or skiing in the mountains. I'm passionate about helping others and making a positive impact on my community. I'm excited to share my story and connect with like-minded individuals.
Hello, my name is Ben, and I'm a proud
Prompt: The president of the United States is
Generated text:  not bound by the same rules as the rest of us. While we have to follow the law, the president can often operate with a degree of impunity, thanks to the Constitution's "executive privilege" doctrine.
Here's a rundown of what executive privilege is and how it's used:
What is executive privilege?
Executive privilege is a legal doc
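
For a prompt list that is too large to submit in one call, or when results should be saved incrementally, one option is to submit the prompts in fixed-size chunks. The helper below is a minimal sketch under stated assumptions: `generate_in_chunks` and `fake_generate` are hypothetical names, and the only thing assumed about the engine is what the cell above shows, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple


def generate_in_chunks(
    generate_fn: Callable[[List[str], Dict[str, Any]], List[Dict[str, Any]]],
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start : start + chunk_size])
        outputs = generate_fn(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


def fake_generate(prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, Any]]:
    # Hypothetical stand-in for llm.generate so the sketch runs without a model.
    return [{"text": f" <completion of {p!r}>"} for p in prompts]
```

With a real engine, the call would presumably look like `generate_in_chunks(llm.generate, prompts, sampling_params, chunk_size=2)`. Because the engine already schedules a batch efficiently, chunking mainly helps with bookkeeping rather than throughput.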

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Amara Karan and I am a television and theatre actress. I am best known for my role as Amara 'Lola' Karan in the BBC television series 'Lip Service' and also as Suki Panesar in the BBC soap 'EastEnders'.

I was born in London and grew up in the UK and India. I have always been passionate about acting and have been involved in various school productions and stage plays throughout my life.

I have worked on a variety of projects including television, theatre and short films. I have appeared in numerous stage productions including 'Columbus' by Sebastian Barry and 'The House That

Prompt: The capital of France is
Generated text:  a city with a rich history and culture, full of famous landmarks and attractions. Paris is a city that has been a center for art, fashion, and cuisine for centuries and has been the home to famous artists, writers, and musicians.

Paris is home to some of the most famous landmarks in the world, including the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden.

In addition to its famous landmarks, Paris is a city with a vibrant cultural scene, with numerous museums, galleries, and performance

Prompt: The future of AI is
Generated text:  not about more computing power, but about more human-like reasoning and decision-making abilities. This is where cognitive architectures come in, which are software frameworks that integrate multiple AI technologies to create more intelligent and human-like systems.

Cognitive architectures are designed to simulate human cognition, allowing AI systems to reason, learn, and interact with the environment in a more human-like way. They combine multiple AI technologies, such as machine learning, natural language processing, and computer vision, to create a more comprehensive and robust AI system.

There are several cognitive architectures, each with its own strengths and weaknesses. Some of the most popular ones include:

1. SOAR

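When the complete text is needed in addition to the live display, the streamed pieces can be accumulated as they arrive. The sketch below assumes, as the output above suggests, that each chunk's `"text"` field carries only the newly generated piece; if a given SGLang version returns the cumulative text in every chunk instead, keeping the last chunk would be the right approach. `collect_stream` and its `on_piece` parameter are hypothetical names, not part of SGLang.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def collect_stream(
    chunks: Iterable[Dict[str, Any]],
    on_piece: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume a stream of {"text": ...} chunks and return the joined text."""
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]  # assumed to be an incremental piece, not cumulative
        if on_piece is not None:
            on_piece(piece)  # e.g. print the piece as soon as it arrives
        pieces.append(piece)
    return "".join(pieces)
```

With a real engine, the call would presumably look like `collect_stream(llm.generate(prompt, sampling_params, stream=True), on_piece=lambda s: print(s, end="", flush=True))`.
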
### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Mark. I am a passionate techie, gamer, and learner. I have a strong background in computer programming and software development. I enjoy solving complex problems, learning new things, and staying up to date with the latest technologies.

Here are some of the things I'm interested in:
- **Programming languages**: I'm proficient in C++, Java, Python, and JavaScript. I enjoy working with different programming languages and exploring their unique features and applications.
- **Game development**: I have a strong passion for game development and enjoy creating games using various game engines such as Unity and Unreal Engine.
- **Artificial intelligence and machine learning**: I

Prompt: The capital of France is
Generated text:  a city of romance, art, fashion and history. As you walk through the streets of Paris, you'll see stunning architecture, world-class museums, and picturesque cafes. From the iconic Eiffel Tower to the charming Seine River, 
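
Asynchronous generation is mostly useful when several requests should be in flight at the same time. The sketch below issues one request per prompt and caps the concurrency with a semaphore. It assumes that the async generate function accepts a single prompt and returns a dict with a `"text"` key; `generate_concurrently` and `fake_async_generate` are hypothetical names used so the example runs without a model.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence


async def generate_concurrently(
    async_generate_fn: Callable[[str, Dict[str, Any]], Awaitable[Dict[str, Any]]],
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[Dict[str, Any]]:
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Dict[str, Any]:
        async with semaphore:
            return await async_generate_fn(prompt, sampling_params)

    # asyncio.gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_async_generate(prompt: str, sampling_params: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical stand-in for llm.async_generate.
    await asyncio.sleep(0.01)
    return {"text": f" <completion of {prompt!r}>"}
```

With a real engine, `llm.async_generate` would presumably take the place of `fake_async_generate`. Passing the whole prompt list in a single call, as in the cell above, is simpler when all prompts are known up front.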

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Joe Peshlakai. I am the new program director of the Pine Hill Elementary School Head Start Program, located in the Tohono O'odham Nation. My family has been living in this community for generations and I am proud to be working with the children and families of my community.

I am looking forward to working with the staff and community to provide a high-quality Head Start program that meets the needs of the children and families we serve. My goal is to build strong relationships with families and to support their children in their early years of learning and development.

I am committed to providing a culturally responsive and inclusive program that values and

Prompt: The capital of France is
Generated text:  a city of romance, art, fashion, and history. It is a place where the past and present blend seamlessly, where iconic landmarks like the Eiffel Tower and the Louvre Museum sit alongside modern boutiques and trendy cafes. From the stunning architecture of the Seine River to the vibrant street life of Montmartre, Paris is a city that has something for everyone. Whether you're looking to indulge in fine dining, explore the world of art and culture, or simply soak up the city's unique atmosphere, Paris is a must-visit destination for any traveler.

Paris, the capital of France, is a city with a rich

Prompt: The future of AI is
Generated text:  human-centric

Artificial intelligence (AI) has been making headlines for years, with its rapid advancements and applications in various industries. However, there is a growing concern that AI is becoming too focused on efficiency and productivity, potentially leading to a neglect of human needs and values. In this article, we'll explore the shift towards a more human-centric approach in AI development.

The current state of AI

Today, AI systems are primarily designed to optimize processes, reduce costs, and increase efficiency. This focus on efficiency has led to the development of AI-powered tools that can perform tasks faster and more accurately than humans. While this has improved productivity and

In [6]:
llm.shutdown()
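
In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch using `contextlib`; `managed_engine` and `FakeEngine` are hypothetical names, and the only engine behavior assumed is the `shutdown()` method shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


class FakeEngine:
    # Hypothetical stand-in so the sketch runs without loading a model.
    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True
```

With a real engine, usage would presumably look like `with managed_engine(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm:` followed by the generation calls in the body.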