# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
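
As a rough orientation, the four modes differ mainly in what the call returns and how the caller consumes it. The sketch below uses `StubEngine`, a hypothetical stand-in written for this example (it is not SGLang code and runs no model); it only mirrors the call shapes used in the cells of this document, for a single prompt string, and assumes each streamed chunk carries only newly generated text.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics the call shapes used in this document.

    It echoes the prompt word by word instead of running a model, so the
    example runs without a GPU. Only single-string prompts are handled.
    """

    def _pieces(self, prompt):
        return [f" {word}" for word in prompt.split()]

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            # Streaming: an iterator of chunk dicts.
            return ({"text": piece} for piece in self._pieces(prompt))
        # Non-streaming: one finished result dict.
        return {"text": "".join(self._pieces(prompt))}

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            async def chunks():
                for piece in self._pieces(prompt):
                    await asyncio.sleep(0)  # yield control, as real I/O would
                    yield {"text": piece}
            # Streaming: awaiting the call hands back an async generator.
            return chunks()
        return {"text": "".join(self._pieces(prompt))}


def run_sync(engine, prompt):
    """Non-streaming synchronous: a plain call that returns the result."""
    return engine.generate(prompt)["text"]


def run_sync_stream(engine, prompt):
    """Streaming synchronous: iterate over chunks with a normal for loop."""
    return "".join(chunk["text"] for chunk in engine.generate(prompt, stream=True))


async def run_async(engine, prompt):
    """Non-streaming asynchronous: await the call to get the result."""
    output = await engine.async_generate(prompt)
    return output["text"]


async def run_async_stream(engine, prompt):
    """Streaming asynchronous: await the call, then consume with async for."""
    generator = await engine.async_generate(prompt, stream=True)
    return "".join([chunk["text"] async for chunk in generator])
```

All four helpers return the same string for the same prompt; only the control flow around the call changes.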

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
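
To give a sense of what such a server has to do, here is a framework-agnostic sketch of a single request handler. It is not taken from the linked script, which may be structured differently. `handle_generate` and `FakeEngine` are hypothetical names; the only assumption about the engine is that it exposes an awaitable `async_generate(prompt, sampling_params)` returning a dict with a `"text"` key, as in the cells later in this document.

```python
import asyncio
import json


class FakeEngine:
    """Hypothetical stand-in for an engine object; it only upper-cases the prompt."""

    async def async_generate(self, prompt, sampling_params=None):
        return {"text": prompt.upper()}


async def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, json.dumps({"error": "body must be valid JSON"}).encode()

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt:
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"}).encode()

    # Fall back to empty sampling parameters when the client sends none.
    sampling_params = payload.get("sampling_params") or {}
    output = await engine.async_generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()
```

A web framework route would read the request body, call `handle_generate` with a shared engine instance, and write back the returned status and body.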

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.
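
The engine accepts a whole list of prompts in one call, so splitting the input is not needed for scheduling. For very large datasets, though, it can be convenient to submit prompts in slices so that results can be saved incrementally. The following is a minimal sketch of that optional pattern; `chunked`, `generate_in_chunks`, and `CountingEngine` are hypothetical helpers for illustration, and the stand-in engine only reverses each prompt.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=256):
    """Call engine.generate once per slice and pair each prompt with its text."""
    results = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        results.extend(
            {"prompt": prompt, "text": output["text"]}
            for prompt, output in zip(batch, outputs)
        )
    return results


class CountingEngine:
    """Hypothetical stand-in: records batch sizes and returns each prompt reversed."""

    def __init__(self):
        self.batch_sizes = []

    def generate(self, prompts, sampling_params=None):
        self.batch_sizes.append(len(prompts))
        return [{"text": prompt[::-1]} for prompt in prompts]
```

With a real engine, the `results` list could be flushed to disk after each slice instead of being held in memory.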

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  [Your Name] and I'm a therapist at [Your Practice/Therapy Center]. Welcome to our center. My specialty is helping individuals, couples, and families work through challenges and find balance in their lives. I'm glad you're taking the first step towards seeking help, and I'm here to support you.
Before we begin, I want to assure you that everything discussed in our sessions is confidential and protected by law. I'll do my best to create a safe, non-judgmental space for you to express yourself freely.

To get started, can you tell me a little bit about what brings you here today?
Prompt: The president of the United States is
Generated text:  the most powerful person in the world. This president is Donald Trump. The president is elected by the American people to make important decisions on their behalf.
President Donald Trump has a very busy schedule. He makes important decisions about the country, meets with foreign leaders, and works with Congre
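
The cell above samples with `temperature` 0.8 and `top_p` 0.95. As a rough illustration of what these two settings conventionally mean, the sketch below applies textbook temperature scaling followed by nucleus (top-p) filtering to a small list of logits. It is a teaching example under that assumption, not a description of SGLang's actual sampler, and `top_p_filter` is a hypothetical helper name.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")

    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    return {index: probs[index] / mass for index in kept}
```

A sampler would then draw the next token from the returned distribution, so a lower `top_p` or a lower `temperature` both narrow the set of tokens that can be chosen.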

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Javi and I am a travel writer and photographer based in Barcelona, Spain. I have been traveling extensively throughout Europe and Latin America for many years, always seeking out the most authentic and unique experiences that showcase the culture, history, and natural beauty of each destination.
My passion for travel has taken me to over 30 countries, where I have immersed myself in the local way of life, learning new languages and traditions, and capturing the essence of each place through my writing and photography.
I am the author of several travel guides and articles, including Lonely Planet's "Spain's Hidden Beaches" and "Costa del Sol" titles.

Prompt: The capital of France is
Generated text:  Paris, which is the second most visited city in the world after Bangkok. The city is a major cultural, economic, and historical center, and the country has a rich and diverse cultural heritage. The official language is French, but many people speak English, especially in tourist areas.
The best time to visit France is in the spring (April to June) or autumn (September to November), when the weather is mild and pleasant, and there are many festivals and events taking place. Summer can be hot and crowded, while winters can be chilly and rainy.
The most popular tourist attractions in France include the Eiffel Tower, the Louvre

Prompt: The future of AI is
Generated text:  being written in the realm of open-source software. Developers worldwide are contributing to the creation of more advanced and intelligent systems. AI, in its current form, is merely the beginning of what is to come. The pace of AI development is quickening and the future is looking bright for those who embrace this technology. Here are some of the most significant open-source AI projects that are revolutionizing the industry.
1. OpenCV (Open Source Computer Vision Library)
OpenCV is a library of programming functions for real-time computer vision. It is widely used in robotics, surveillance, and other applications that require image and video processing. OpenCV's open

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  René and I am a proud owner of a 1986 Mercedes-Benz 190E.
The 190E is a beautiful and refined car that exudes elegance and sophistication. I am honored to be the owner of this classic Mercedes-Benz model and I am excited to share my experience with you.

I purchased my 190E about a year ago, and since then, I have been thoroughly enjoying driving and maintaining it. The car has been well-maintained and has been kept in great condition, which is a testament to its original owner's dedication to preserving the vehicle.

Over the years, I have been learning more about the car

Prompt: The capital of France is
Generated text:  Paris, located in the north-central region of the country. Paris is known as the "City of Light" (La Ville Lumière) and is famous for its stunning architecture, art museums, fashion, and cuisine. Some of the most famous landmarks in Paris include the Eiffel Tower, the Louvre Museum, Notre Dame Cathedral, and the Champs-Élys
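
The cell above awaits one batched call. Another pattern that the asynchronous API makes possible is issuing several independent requests and awaiting them together. The sketch below shows this with `asyncio.gather` and a semaphore that caps the number of requests in flight. It assumes the engine tolerates concurrent `async_generate` calls; `SlowEngine` is a hypothetical stand-in that sleeps briefly instead of running a model, and `generate_concurrently` is an illustrative helper name.

```python
import asyncio


class SlowEngine:
    """Hypothetical stand-in: waits briefly, then echoes the prompt."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0

    async def async_generate(self, prompt, sampling_params=None):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulate generation latency
        self.in_flight -= 1
        return {"text": f" echo: {prompt}"}


async def generate_concurrently(engine, prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, with at most max_concurrency active at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in the same order as the prompts were given.
    return await asyncio.gather(*(one(prompt) for prompt in prompts))
```

Because `asyncio.gather` preserves input order, the returned texts can be zipped with the prompts just as in the batched cell.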

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  John, and I'm a writer.
I'm not writing about myself, though. I'm writing about a character named John, who is a writer. That's where the fun starts. I'm experimenting with the ambiguity of this kind of writing. On the one hand, I'm writing a story about a writer named John, who may or may not be a version of myself. But on the other hand, I'm also writing a story about the act of writing, which is a fundamental aspect of the story itself. This creates a feedback loop of self-reference, where the narrative is constantly commenting on its own construction.
I've been

Prompt: The capital of France is
Generated text:  Paris. Paris is the most visited city in the world, and it is known for its stunning architecture, art museums, fashion, cuisine, and romantic atmosphere. Paris is situated in the northern part of the country, and it is located in the heart of the Île-de-France region. Paris is famous for its landmarks such as the Eiffel Tower, the Louvre Museum, Notre Dame Cathedral, and the Arc de Triomphe.
The city has a rich history and has been the center of politics, culture, and economy for centuries. The city is also known for its cuisine, fashion, and art scene, and it

Prompt: The future of AI is
Generated text:  not a prediction, but a necessity

The future of AI is not a prediction, but a necessity

What is AI and why is it necessary?
AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI is necessary because it has the potential to transform various industries and aspects of our lives, making them more efficient, productive, and beneficial.
The importance of AI in today's world

AI is essential in today's world because it can help us:
  1. Improve productivity: AI can automate repetitive tasks,


In [6]:
llm.shutdown()
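
In a script, as opposed to a notebook, an exception raised before the final cell would skip `llm.shutdown()`. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a minimal sketch: `managed_engine` and `DummyEngine` are hypothetical names, and the only assumption about the engine is that it has the `shutdown()` method used above.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and shut it down on exit, even on error."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


# Intended usage with the engine from this document (not executed here):
#   with managed_engine(lambda: sgl.Engine(model_path="...")) as llm:
#       outputs = llm.generate(prompts, sampling_params)
```

Passing a factory rather than an existing engine keeps construction inside the `with` statement, so the engine cannot be created without a matching shutdown.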