# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an extra HTTP server would add unnecessary complexity or overhead. Two general use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example written as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
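The linked script is the reference example. As a rough, dependency-free illustration of the same idea, the sketch below exposes any engine-like object through a single JSON endpoint built on the standard library's `http.server`. It assumes only that `generate(prompt, sampling_params)` with a single string prompt returns a dict with a `"text"` key; the `/generate` route, the request fields, and every helper name (`EchoEngine`, `serve`, `run_in_background`, `post_generate`) are hypothetical and not part of SGLang.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" echo: {prompt}"}


def make_handler(engine):
    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            # Forward the prompt and optional sampling parameters to the engine.
            output = engine.generate(body["prompt"], body.get("sampling_params", {}))
            payload = json.dumps({"text": output["text"]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # silence the default per-request logging

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=0):
    # HTTPServer handles one request at a time, so engine calls never overlap.
    # port=0 asks the OS for a free port; read it back from server.server_address.
    return HTTPServer((host, port), make_handler(engine))


def run_in_background(server):
    # Serve from a daemon thread so the caller stays responsive.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


def post_generate(port, prompt, sampling_params=None, host="127.0.0.1"):
    body = json.dumps({"prompt": prompt, "sampling_params": sampling_params or {}}).encode()
    request = urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any proxy settings for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read())
```

For real use you would pass an `sgl.Engine` instead of `EchoEngine`. The engine may expect to be called from the thread that created it, so with a real engine prefer calling `serve(llm).serve_forever()` on the main thread rather than `run_in_background`; treat this as an assumption to verify.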

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import sglang as sgl
import asyncio  # used by the asynchronous examples below

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# One call submits the whole batch; outputs come back in prompt order
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Larissa and I'm a 3D artist from Brazil. I'm a self-taught artist and I've been working with 3D art for about 6 years now. I specialize in creating digital illustrations and concept art for games, movies and books.
I'm really passionate about sci-fi and fantasy art, and I love experimenting with different styles and techniques to create unique and captivating pieces. My art often features a mix of realistic and stylized elements, blending digital painting with 3D modeling and texturing.
When I'm not working on my art, I love playing video games, watching anime and reading fantasy novels.
Prompt: The president of the United States is
Generated text:  not above the law, despite the "absolute immunity" that has been claimed by some of his supporters. That is the gist of a ruling handed down by the U.S. Court of Appeals for the D.C. Circuit on Thursday.
The court's decision was a major victory for the House of Representatives, which had sought to 
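The cell above submits the whole prompt list in one `llm.generate` call and lets the engine schedule it. For a large offline dataset you may prefer to submit fixed-size chunks, for example to save partial results between chunks. A minimal sketch, assuming only what the cell above shows: `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt, in input order. `FakeEngine`, `chunked`, `generate_in_chunks`, and the default chunk size of 256 are hypothetical; pass `llm` instead of `FakeEngine()` for real use.

```python
class FakeEngine:
    """Hypothetical stand-in for sgl.Engine: returns a canned completion per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=256):
    """Run offline generation chunk by chunk; return (prompt, text) pairs in input order."""
    results = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(chunk, sampling_params)
        # Outputs are assumed to be aligned with the prompts of this chunk.
        results.extend((p, o["text"]) for p, o in zip(chunk, outputs))
    return results
```

Usage with the real engine would look like `pairs = generate_in_chunks(llm, prompts, sampling_params, chunk_size=2)`. Smaller chunks likely give the scheduler fewer requests to batch together, so keeping chunks reasonably large is probably the better default.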

### Streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # With stream=True, generate() is iterated chunk by chunk as text is produced
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
```


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text: Matt and I am a senior at California State University, Long Beach. I am a communications major with a focus on digital media. My passion is creating engaging stories through various forms of media such as video, photography, and writing. I have experience working in video production, and I am excited to continue pursuing this passion as I move into the industry.

My experience in video production started in my freshman year of college. I was asked to be part of a video production team that was working on a promotional video for our school's recreation and wellness center. This experience sparked my interest in video production and I continued to work on various projects throughout my

Prompt: The capital of France is
Generated text: Paris, and it is known for its beautiful architecture, art museums, and fashion. However, the largest city in France is Marseille, which is located on the Mediterranean coast.

France is a country located in Western Europe, bordered by several countries including Belgium, Luxembourg, Germany, Switzerland, Italy, and Spain. It has a diverse geography, with mountains, forests, and a long coastline along the Atlantic and Mediterranean seas.

The country has a rich history, with ancient civilizations such as the Gauls and Romans leaving their mark. France has also played a significant role in European history, including being a major power in the Middle Ages and a key

Prompt: The future of AI is
Generated text: not in the clouds, but in the hands of people who are creating a new generation of AI-powered robots and machines. With the advent of AI, we are seeing a rise in the development of robots and machines that can learn, adapt, and interact with humans in increasingly sophisticated ways. As AI becomes more ubiquitous, we can expect to see more robots and machines that can perform tasks that were previously thought to be the exclusive domain of humans.

The latest advancements in AI have opened up a wide range of possibilities for robots and machines to perform complex tasks. From healthcare to education, and from manufacturing to transportation, AI-powered robots and machines are changing
 changing




### Non-streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # async_generate is a coroutine; awaiting it yields the list of outputs
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Zairah, and I am a Licensed Massage Therapist (LMT). I graduated from the Institute of Integrative Therapies in Fort Lauderdale, Florida in 2010. I have been practicing massage therapy for over 10 years and I have worked with clients of all ages, backgrounds, and body types. My passion is helping individuals manage stress, alleviate pain, and improve their overall well-being.
I have experience working with clients who have chronic pain, fibromyalgia, sports injuries, and post-operative recovery. My approach to massage therapy is holistic, integrating various techniques to address both physical and emotional needs. I utilize a

Prompt: The capital of France is
Generated text:  one of the world's greatest cities. With its grand history, rich culture, and unforgettable landmarks, Paris is a destination that every traveler should experience at least once. Here are some of the must-visit attractions and experiences that you shouldn't miss while in
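Because `async_generate` is awaited, several independent batches can be awaited together with `asyncio.gather`, for example to sample the same prompts at different temperatures. A minimal sketch; `FakeAsyncEngine` and `generate_variants` are hypothetical, and how efficiently the real engine interleaves concurrent batches is an assumption worth measuring.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine.async_generate (non-streaming)."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would
        temperature = (sampling_params or {}).get("temperature")
        return [{"text": f" [{temperature}] {p}"} for p in prompts]


async def generate_variants(engine, prompts, param_sets):
    """Submit one batch per sampling-parameter set concurrently.

    Returns one list of texts per parameter set, in the order of param_sets.
    """
    batches = await asyncio.gather(
        *(engine.async_generate(prompts, params) for params in param_sets)
    )
    return [[output["text"] for output in batch] for batch in batches]
```

In a script you would run `asyncio.run(generate_variants(llm, prompts, [{"temperature": 0.0}, {"temperature": 0.8}]))`; in IPython or Jupyter, top-level `await generate_variants(...)` also works.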

### Streaming Asynchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, awaiting async_generate returns an async generator of chunks
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())
```


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text: Jack and I'm a travel writer and photographer. I have always been passionate about travel and exploring new cultures. I have visited over 40 countries across the world and I am always on the lookout for new destinations to discover. I write about my travel experiences on my blog and I also work with travel companies and tourism boards to promote their destinations. I am based in London but I love to travel and explore new places. In my free time, I enjoy hiking, rock climbing and trying out new foods. I'm always looking for inspiration for my next travel adventure.

Read more about me on my About page.

You can find me on social

Prompt: The capital of France is
Generated text: Paris, and it is known for its stunning landmarks, historic architecture, fashion, cuisine, and culture. From the Eiffel Tower to the Louvre Museum, Paris has something for everyone.

The French Riviera, also known as the Côte d'Azur, is a popular tourist destination located on the Mediterranean coast of France. It is known for its beautiful beaches, clear waters, and picturesque towns such as Saint-Tropez and Cannes.

The French Alps offer a range of outdoor activities such as skiing, hiking, and mountain biking, and are a popular destination for winter sports enthusiasts. The French Alps are also home to many

Prompt: The future of AI is
Generated text: in the hands of the people who will be impacted by it the most – the public. But how can we, the public, participate in shaping the future of AI in a way that is inclusive, diverse, and equitable? Here are some key takeaways from the AI for Social Good Summit and the 2023 AI for Everyone conference.

The AI for Social Good Summit

The AI for Social Good Summit, held in October 2023, brought together experts from various fields to discuss the role of AI in addressing social issues. The summit highlighted the need for a more inclusive and equitable approach to AI development and deployment. Key takeaways include
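The cell above consumes the streams one prompt after another. To keep all requests in flight at once while still streaming, one coroutine per prompt can be gathered; interleaved printing from concurrent streams would be hard to read, so this sketch collects each stream into a string instead. `FakeStreamingEngine` and `stream_all` are hypothetical, and the sketch assumes, as the output above suggests, that chunks are incremental deltas.

```python
import asyncio


class FakeStreamingEngine:
    """Hypothetical stand-in for sgl.Engine.async_generate(..., stream=True).

    Only the stream=True path is modeled: awaiting the call returns an async generator.
    """

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for word in prompt.upper().split():
                await asyncio.sleep(0)  # let other streams make progress
                yield {"text": " " + word}

        return chunks()


async def stream_all(engine, prompts, sampling_params):
    """Consume one stream per prompt concurrently; return texts in prompt order."""

    async def consume(prompt):
        pieces = []
        generator = await engine.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            pieces.append(chunk["text"])  # assumes incremental (delta) chunks
        return "".join(pieces)

    return await asyncio.gather(*(consume(p) for p in prompts))
```

With the real engine this would be `texts = asyncio.run(stream_all(llm, prompts, sampling_params))`; `asyncio.gather` preserves the order of its arguments, so `texts[i]` belongs to `prompts[i]`.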




```python
llm.shutdown()
```
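In a standalone script it helps to guarantee that `shutdown()` runs even when generation raises. A minimal sketch using `contextlib.contextmanager`; `engine_session` and `FakeEngine` are hypothetical helpers, not part of SGLang.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # stop the engine even if the body raised


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True
```

With the real engine this would read `with engine_session(lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")) as llm:` followed by the generation calls in the block body.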