# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
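
The four modes differ mainly in what the call returns and how the caller consumes it. The sketch below illustrates those call shapes with a hypothetical `StubEngine` that stands in for `sgl.Engine`, so it runs without a model or GPU. The stub, its canned text, and the helpers `run_sync` and `run_async` are assumptions made for illustration only. The stub yields incremental pieces when streaming, mirroring the output shown later in this document; the contents of real chunks may differ.

```python
import asyncio
from typing import AsyncIterator, Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the call shapes."""

    def _pieces(self, prompt: str) -> List[str]:
        # Canned "tokens" instead of real model output.
        return [" [stub", " reply", " to:", f" {prompt}]"]

    def _stream(self, prompt: str) -> Iterator[Dict[str, str]]:
        for piece in self._pieces(prompt):
            yield {"text": piece}

    async def _astream(self, prompt: str) -> AsyncIterator[Dict[str, str]]:
        for piece in self._pieces(prompt):
            await asyncio.sleep(0)  # give control back to the event loop
            yield {"text": piece}

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            return self._stream(prompt)  # single prompt -> iterator of chunks
        return [{"text": "".join(self._pieces(p))} for p in prompt]  # list of prompts

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            return self._astream(prompt)  # awaiting yields an async iterator
        return [{"text": "".join(self._pieces(p))} for p in prompt]


def run_sync(engine, prompts, params):
    # Non-streaming synchronous: one dict per prompt.
    texts = [out["text"] for out in engine.generate(prompts, params)]
    # Streaming synchronous: iterate over chunks as they arrive.
    streamed = "".join(c["text"] for c in engine.generate(prompts[0], params, stream=True))
    return texts, streamed


async def run_async(engine, prompts, params):
    # Non-streaming asynchronous: await the whole batch.
    texts = [out["text"] for out in await engine.async_generate(prompts, params)]
    # Streaming asynchronous: await to get the generator, then `async for` over it.
    generator = await engine.async_generate(prompts[0], params, stream=True)
    streamed = "".join([c["text"] async for c in generator])
    return texts, streamed
```

With the stub, `run_sync(StubEngine(), ["Hello"], {})` and `asyncio.run(run_async(StubEngine(), ["Hello"], {}))` return the same texts; with a real engine, sampled outputs would differ from call to call.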

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lisa and I am the Community Manager for our school's Parent-Teacher Association. I am excited to connect with you and welcome you to our PTA community!
We are a group of dedicated parents, teachers, and staff who work together to support our school and make it an amazing place for our students to learn and grow. Our PTA is committed to building a strong and inclusive community that values diversity, creativity, and kindness.
Throughout the year, we host various events and activities that promote parent engagement, student well-being, and school spirit. Some of our popular events include:
PTA Meetings: We meet monthly to discuss school initiatives,
Prompt: The president of the United States is
Generated text:  the head of the executive branch of the federal government. The president serves a four-year term and is directly elected by the people through the Electoral College. The president's responsibilities include serving as commander-in-chief 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Hanna and I am a skilled social media manager with 5+ years of experience in creating engaging content for various brands and industries. I specialize in developing and implementing effective social media strategies that drive results and boost brand awareness. With a strong focus on visual storytelling and community engagement, I have a proven track record of increasing followers, likes, and shares across multiple platforms.
I have worked with various clients, from small businesses to large corporations, helping them to establish a strong online presence and build a loyal community of followers. My expertise includes:
Creating and curating high-quality, engaging content that resonates with the target audience
Developing and implementing

Prompt: The capital of France is
Generated text:  often referred to as the most romantic city in the world.  Paris is known for its stunning architecture, world-class museums, and charming streets lined with cafes and boutiques. Whether you’re looking for history, art, fashion, or food, Paris has something for everyone.
Montmartre, a historic and artistic neighborhood, is famous for its bohemian vibe, street artists, and stunning views of the city. Visitors can explore the cobblestone streets, visit the Basilica of the Sacré-Cœur, and enjoy the charming cafes and restaurants.
The Eiffel Tower, an iconic symbol of Paris, is a must-

Prompt: The future of AI is
Generated text:  being shaped by advances in multiple technologies, including deep learning, natural language processing, computer vision, and human-computer interaction. Here are some key areas to watch:
1. **Explainable AI (XAI)**: As AI becomes more pervasive, there's a growing need to understand how decisions are made. XAI aims to provide insights into AI decision-making processes, enabling greater transparency and trust.
2. **Edge AI**: With the proliferation of IoT devices, edge AI focuses on processing data closer to the source, reducing latency and improving real-time decision-making.
3. **Transfer Learning**: Transfer learning enables AI models to leverage pre

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Laura and I'm a 25-year-old free spirit. I live in a small village in the countryside surrounded by rolling hills and woodland. I have a big garden where I grow my own fruits and vegetables, a beehive where I harvest honey and a small orchard where I keep a few chickens for eggs. I'm a bit of a country girl at heart and I love nothing more than spending time outdoors, surrounded by nature.
I work as a florist, creating beautiful arrangements for special occasions and events. I find it really fulfilling to be able to bring a little bit of joy and beauty into people's lives through my work

Prompt: The capital of France is
Generated text:  famous for its beautiful and historic buildings, art museums, and romantic atmosphere. Here are some things to do and see in Paris:
1. Visit the Eiffel Tower: The Eiffel Tower is an iconic symbol of Paris and one of the most visited attractions in the world. Visitors can take a lift to the top for stunning vi

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Chris and I'm a relatively new developer to the community. I've been tinkering with various programming languages and frameworks, including Node.js, React, and Electron. I'm excited to share my experiences and learn from the community.
My current project is a desktop application built with Electron and React, and I'm facing a challenge with implementing a smooth and efficient file upload process. I'd love to hear any suggestions or best practices from fellow developers who have tackled similar issues.
Here's a brief overview of my current setup:
* Frontend: React
* Backend: Node.js (Express)
* File upload: I'm using the `mul

Prompt: The capital of France is
Generated text:  a city of many things – history, culture, fashion, food, art, and romance. Paris, the city of love, is a place where you can find something for everyone. From the iconic Eiffel Tower to the beautiful gardens of the Luxembourg Palace, Paris is a city that will leave you in awe. Whether you are interested in history, art, fashion, or food, Paris has something to offer.
The Eiffel Tower is a must-see attraction in Paris. Built for the 1889 World's Fair, the Eiffel Tower is an engineering marvel and a symbol of Paris. You can take the elevator

Prompt: The future of AI is
Generated text:  bright, but are we prepared?
In recent years, Artificial Intelligence (AI) has made tremendous progress, transforming the way we live and work. From virtual assistants like Siri and Alexa to self-driving cars and personalized medicine, AI has become an integral part of our daily lives.
However, as AI continues to advance, it raises important questions about its impact on society, the economy, and our future. Are we prepared for the benefits and challenges that AI will bring?
The benefits of AI are undeniable:
1. Increased productivity: AI can automate repetitive and mundane tasks, freeing up human resources for more strategic and creative work.

In [6]:
llm.shutdown()
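
If generation raises an exception before `llm.shutdown()` is reached, the engine is never shut down. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is a minimal illustration: `managed_engine` is a hypothetical helper, not part of SGLang, and `StubEngine` is a stand-in so the example runs without a model. It assumes only that the engine object exposes `shutdown()` as used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine, used only to keep the sketch runnable."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        # Canned output in the same list-of-dicts shape used in this document.
        return [{"text": f" (stub output for {p!r})"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() on exit, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


def demo():
    engine = StubEngine()
    with managed_engine(engine) as llm:
        outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})
    return outputs, engine.is_shut_down
```

With a real engine, the same pattern would wrap the object returned by `sgl.Engine(...)` in place of `StubEngine()`.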