# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
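
As a rough illustration of the custom-server idea, the sketch below wraps an engine behind a tiny JSON endpoint using only the Python standard library. It is not the linked `custom_server.py` example: the `/generate` route, the `handle_generate`, `make_handler`, and `serve` helpers, and the `EchoEngine` stand-in are all hypothetical names chosen for this sketch. The only engine behaviour it assumes is a `generate(prompt, sampling_params)` call that returns a dict with a `text` key, as used later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, payload):
    """Turn a decoded JSON request into a JSON-serializable response."""
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return {"prompt": prompt, "text": output["text"]}


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical route
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            if "prompt" not in payload:
                self.send_error(400, "missing 'prompt'")
                return
            body = json.dumps(handle_generate(engine, payload)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve requests one at a time until interrupted."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()


# Exercise the request logic directly, without opening a socket.
result = handle_generate(EchoEngine(), {"prompt": "Hello"})
```

Keeping the request logic in `handle_generate` separates it from the HTTP wiring, so it can be tested without a network connection. Passing a real engine to `serve` instead of `EchoEngine` would be the next step, under the assumptions stated above.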

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Summer. I am a 14-year-old girl who loves to write. I am a freshman in high school. I love animals, especially dogs and cats. I love spending time with my family and friends. I love to read and write about fantasy stories. I love fantasy because it takes me to another world where I can be anyone and anything. I love to imagine all the things that could be if I was in a fantasy world. I hope you enjoy reading my stories. I will be posting new stories as I write them. So, be sure to check back often.
The Magical Kingdom of Azura
Once upon a time,
Prompt: The president of the United States is
Generated text:  essentially the head of state and head of government. The president serves as the commander-in-chief of the military and has a variety of other important duties. The president is also responsible for appointing federal judges, ambassadors, and other high-ranking officials. The president also has the power to propose laws to Congress, althoug
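
For a long prompt list, one option is to hand the prompts to `generate` in slices and pair each prompt with its output as results arrive. The helper below is a minimal sketch of that pattern and is not part of SGLang: `generate_in_batches`, `batch_size`, and the `UpperEngine` stand-in are hypothetical. It only assumes that `generate` accepts a list of prompts and returns one dict per prompt, in order, as in the cell above.

```python
def generate_in_batches(engine, prompts, sampling_params, batch_size=2):
    """Yield (prompt, generated_text) pairs, calling generate once per slice."""
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            yield prompt, output["text"]


class UpperEngine:
    """Hypothetical stand-in for sgl.Engine; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompts, sampling_params=None):
        self.calls += 1
        return [{"text": " " + p.upper()} for p in prompts]


engine = UpperEngine()
prompts = ["a", "b", "c"]
pairs = list(generate_in_batches(engine, prompts, {"temperature": 0.8}))
```

Because the helper is a generator, results for the first slice are available before later slices are submitted.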

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Ishay and I am a 16-year-old from Israel. I was born and raised in the city of Tel Aviv and have been passionate about photography for over 5 years now. My love for photography started when my grandfather, who was a great photographer, passed away. He left me his camera and the passion for photography that he had. From that moment on, I started to explore and learn more about photography. I take pictures of everything, from the city to nature, and I love to experiment and try new things. My dream is to become a professional photographer and travel the world to capture its beauty and share it with others.



Prompt: The capital of France is
Generated text:  famous for its art museums, historical landmarks, fashion, and gourmet cuisine. The Eiffel Tower is the most iconic landmark in Paris. This 324-meter-tall iron lattice tower was built in 1889 as a symbol of engineering and technological progress. Visitors can ride the elevator to the top for a panoramic view of the city.
The Louvre Museum is another must-visit attraction in Paris. With a collection of over 550,000 works of art, including the Mona Lisa, the Louvre is one of the world's largest and most visited museums. The museum's stunning glass pyramid entrance, designed by I.M. Pe



Prompt: The future of AI is
Generated text:  in data, and the future of data is in the cloud. To support the growing demand for cloud-based AI, Google Cloud has announced a range of new and enhanced services for machine learning (ML) and artificial intelligence (AI). These services are designed to make it easier for developers and data scientists to build and deploy AI applications at scale.
Here are some of the new and enhanced services announced by Google Cloud:
1. Cloud AI Platform (CAIP): CAIP is a fully-managed platform for building, deploying, and managing machine learning models. It provides a range of pre-built templates and tools for popular frameworks like TensorFlow and PyT
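
If the full completion is needed once streaming finishes, the chunks can be accumulated while they are printed. This is a small sketch under the assumption, suggested by the output above, that each chunk's `text` field holds only the newly generated piece. If a given SGLang version returns cumulative text instead, this join would duplicate content. `stream_and_collect` and `fake_stream` are hypothetical names.

```python
import io
import sys


def stream_and_collect(chunks, out=sys.stdout):
    """Write each chunk as it arrives and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        piece = chunk["text"]
        out.write(piece)
        out.flush()
        pieces.append(piece)
    return "".join(pieces)


def fake_stream():
    """Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True)."""
    for piece in [" Paris", ".", " It", " is"]:
        yield {"text": piece}


buffer = io.StringIO()
full_text = stream_and_collect(fake_stream(), out=buffer)
```

With a real engine, the first argument would be the iterator returned by the streaming `generate` call shown in the cell above.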




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Mark and I am a bit of a curious and adventurous person who loves nothing more than exploring new places and meeting new people. In this blog, I will be sharing my experiences, thoughts and opinions on a variety of topics including travel, food, culture, and more.
As I have traveled extensively throughout my life, I have developed a passion for experiencing different cultures and trying new foods. From the bustling streets of Tokyo to the beautiful beaches of Bali, I have been fortunate enough to have visited many amazing places and I am always on the lookout for my next adventure.
In this blog, I will be sharing some of my favorite destinations, restaurants

Prompt: The capital of France is
Generated text:  the city of Paris, which is the most populous city in the country. The city is known for its stunning architecture, vibrant arts and culture scene, and romantic atmosphere. The Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral are
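
Because `async_generate` is awaitable, independent requests with different sampling parameters can be issued concurrently with `asyncio.gather`. The sketch below illustrates only the asyncio pattern: `SleepyEngine` is a hypothetical stand-in that sleeps instead of running a model, and `generate_concurrently` is not an SGLang function.

```python
import asyncio


class SleepyEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # placeholder for real inference latency
        temperature = (sampling_params or {}).get("temperature")
        return {"text": f" [{prompt} @ T={temperature}]"}


async def generate_concurrently(engine, requests):
    """Run one async_generate call per (prompt, sampling_params) pair.

    asyncio.gather returns results in the same order as its inputs.
    """
    calls = [engine.async_generate(p, params) for p, params in requests]
    return await asyncio.gather(*calls)


requests = [
    ("Hello, my name is", {"temperature": 0.2}),
    ("The capital of France is", {"temperature": 0.8}),
]
outputs = asyncio.run(generate_concurrently(SleepyEngine(), requests))
texts = [o["text"] for o in outputs]
```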

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  and i'm an engineer

Hello, my name is [Your Name] and I'm an engineer.
Note: Replace [Your Name] with your actual name.
Hello, my name is and I'm an addict

Hello, my name is [Your Name] and I'm an addict.
Note: Replace [Your Name] with your actual name.
Hello, my name is and I'm a computer user

Hello, my name is [Your Name] and I'm a computer user.
Note: Replace [Your Name] with your actual name.
Hello, my name is and I'm an Internet user

Hello, my name is



Prompt: The capital of France is
Generated text:  Paris. It is the most populated city of the country with a population of more than 2.1 million. The city is a world famous tourist destination due to its rich history, art museums, fashion, cuisine, and beautiful architecture. The city is home to the Eiffel Tower which is one of the most iconic landmarks in the world.
France has a diverse geography with mountains, hills, and coastal regions. The country has a rich history and a unique culture. The official language of France is French but many people also speak English. The country has a high standard of living and is considered to be one of the most developed countries



Prompt: The future of AI is
Generated text:  bright, but it also raises fundamental questions about the nature of intelligence, consciousness, and the human condition. From the development of machines that can learn and adapt to the prospect of superintelligent AI that surpasses human intelligence, we are faced with both incredible opportunities and daunting challenges. Here are some potential future developments in AI that will shape our world and our lives:

1.  **Advancements in Natural Language Processing (NLP)**: AI systems will become increasingly adept at understanding, generating, and processing human language, enabling seamless communication between humans and machines. This could revolutionize customer service, translation services, and content creation.

2.
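
The loop in the cell above streams one prompt at a time. To stream several prompts at once, each stream can be consumed in its own coroutine and the coroutines gathered. This sketch assumes, as in that cell, that awaiting `async_generate(..., stream=True)` returns an async iterator of chunks whose `text` field holds new text. `TokenEngine`, `collect_stream`, and `stream_all` are hypothetical names.

```python
import asyncio


class TokenEngine:
    """Hypothetical stand-in for sgl.Engine that streams whitespace-split words."""

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for word in prompt.split():
                await asyncio.sleep(0)  # hand control back to the event loop
                yield {"text": " " + word}

        return chunks()


async def collect_stream(engine, prompt, sampling_params):
    """Consume one stream and return its concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def stream_all(engine, prompts, sampling_params):
    """Consume all streams concurrently; results keep the prompt order."""
    return await asyncio.gather(
        *(collect_stream(engine, p, sampling_params) for p in prompts)
    )


results = asyncio.run(
    stream_all(TokenEngine(), ["a b", "c d e"], {"temperature": 0.8})
)
```

Collecting each stream into a string avoids interleaving tokens from different prompts on the terminal. Printing inside `collect_stream` would show them as they arrive, at the cost of mixed output.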




In [6]:
llm.shutdown()
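
In a standalone script, it can be convenient to guarantee that `shutdown()` runs even if generation raises. One way to sketch this is a small context manager. `engine_session` and `DummyEngine` are hypothetical, and only the `shutdown()` call is taken from the cell above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, yield it, and always shut it down afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine that records shutdown."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


try:
    with engine_session(DummyEngine) as engine:
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

was_closed = engine.closed
```

With a real engine, `make_engine` could be a `lambda` that calls `sgl.Engine` with the same `model_path` used at the top of this document.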