# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
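
As a rough illustration of the custom-server use case, the sketch below wraps an engine in a plain request handler. It is not the linked `custom_server.py`: `handle_request`, the JSON payload fields, and `StubEngine` are hypothetical names introduced here, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU. The only engine behavior assumed is the one shown later in this document, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the output shape used in this document: one dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response (hypothetical schema)."""
    payload = json.loads(body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [out["text"] for out in outputs]})


request_body = json.dumps(
    {"prompts": ["Hello, my name is"], "sampling_params": {"temperature": 0.8}}
)
response = handle_request(StubEngine(), request_body)
print(response)
```

Keeping the handler independent of any web framework means the same function could sit behind whichever transport the custom server uses.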

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Erika and I am an animal lover. I have been working with animals for over 20 years and I have experience in many different fields of animal care, including veterinary clinics, animal shelters and pet sitting. I am a certified dog trainer and behaviorist and I also have certifications in pet first aid and CPR.
I have a passion for helping animals and people come together and I love teaching people how to understand and communicate with their pets. I believe that animals are a vital part of our families and they deserve love, care and respect.
As a pet sitter, I offer a variety of services, including overnight stays, daytime visits and
Prompt: The president of the United States is
Generated text:  elected through a process called the Electoral College. Here’s how it works: the president is not elected directly by the people. Instead, each state is allocated a certain number of electoral votes based on its population. Candidates for president cam
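
For offline batch jobs it is often convenient to keep each prompt next to its completion and persist the results. The following is a minimal sketch of that pattern, assuming only the output shape shown above (a list of dicts with a `"text"` key, in prompt order). `StubEngine`, `run_batch`, and `write_jsonl` are hypothetical helpers introduced for illustration; swap the stub for the real `llm` object to use it with SGLang.

```python
import io
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def run_batch(engine, prompts, sampling_params):
    """Run one batch and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def write_jsonl(records, stream):
    """Write one JSON object per line to any text stream (file, buffer, ...)."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


prompts = ["Hello, my name is", "The capital of France is"]
records = run_batch(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})

buffer = io.StringIO()  # stands in for an open file
write_jsonl(records, buffer)
print(buffer.getvalue(), end="")
```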

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Chris and I am an IT professional. I am currently working as a network engineer for a company in the telecommunications industry. My experience spans over 10 years, working with a variety of technologies including routing, switching, and security.
I have been working with Cisco products for over 10 years now, and I have a deep understanding of their features and capabilities. I have experience in designing, implementing, and troubleshooting complex networks, as well as providing technical support to customers.
I am also familiar with other networking vendors such as Juniper, Arista, and Brocade, and I have experience in designing and implementing multi-vendor networks.
In

Prompt: The capital of France is
Generated text:  a place you can fall in love with at first sight. From the stunning landmarks like the Eiffel Tower to the charming streets of Montmartre, there's something for everyone in Paris. Whether you're looking for romantic getaways, historical explorations, or simply a taste of the French culture, Paris has it all.
Whether you're a history buff, an art lover, or simply looking for a romantic getaway, Paris is the perfect destination for you. Explore the historic landmarks, indulge in the local cuisine, and soak up the city's vibrant atmosphere.
Montmartre is a charming neighborhood in the heart of Paris, known for

Prompt: The future of AI is
Generated text:  being shaped by innovation in software development, machine learning, and human-computer interaction. As AI becomes more ubiquitous, it will continue to transform various industries and aspects of our lives.
AI advancements in software development, such as automation, continuous integration, and DevOps, will continue to drive efficiency and productivity in the software development lifecycle. Machine learning algorithms will become increasingly sophisticated, enabling AI systems to learn from experience and improve performance over time.
Human-computer interaction will also undergo significant changes, with the emergence of more intuitive and natural interfaces, such as voice assistants, augmented reality, and gesture recognition. These advancements will make AI more accessible and user
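
When the full completion is needed after streaming, the chunks can be accumulated as they arrive. The sketch below assumes each chunk carries only the newly generated text, which is what the streamed output above suggests; if a given SGLang version instead returns the full text so far in every chunk, the last chunk alone would be the result. `StubEngine` and `collect_stream` are hypothetical names used so the snippet runs without a model.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine that streams fixed pieces."""

    def generate(self, prompt, sampling_params=None, stream=False):
        # Yields incremental pieces, one dict per chunk, like the output above.
        pieces = [" Paris", ",", " of", " course", "."]
        return ({"text": piece} for piece in pieces)


def collect_stream(chunks):
    """Join incremental chunk texts into the full completion."""
    return "".join(chunk["text"] for chunk in chunks)


full_text = collect_stream(
    StubEngine().generate("The capital of France is", {"temperature": 0.8}, stream=True)
)
print(full_text)
```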




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Mathew and I am a senior at Lincoln High School. I am writing to ask for help in completing a community service project. I am involved with the National Honor Society and we are organizing a food drive to collect non-perishable items for the local food bank. We are planning to collect items from now until December 15th and then deliver them to the food bank. I am reaching out to ask if your company would be willing to donate a few non-perishable items to support our cause. Any items that your company can donate would be greatly appreciated. If you are unable to donate items, perhaps you could consider making a

Prompt: The capital of France is
Generated text:  Paris, which is the city of love, fashion, and art. Paris is home to famous landmarks such as the Eiffel Tower and Notre Dame Cathedral. The city is also known for its beautiful gardens, museums, and galleries.
Paris is the most populous city in France, with over 2.1 million people livi
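
Because `async_generate` is awaitable, several independent batches can in principle be awaited together with `asyncio.gather`. The sketch below shows that pattern with a hypothetical `StubAsyncEngine` in place of the real engine; the helper name `generate_batches_concurrently` is also an assumption, and whether concurrent batches bring any benefit with the real engine is not something this document establishes.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable generate call."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_batches_concurrently(engine, batches, sampling_params):
    """Await all batches together and flatten the results in submission order."""
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*pending)  # gather preserves input order
    return [output for batch_outputs in results for output in batch_outputs]


batches = [
    ["Hello, my name is", "The capital of France is"],
    ["The future of AI is"],
]
outputs = asyncio.run(
    generate_batches_concurrently(
        StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95}
    )
)
for output in outputs:
    print(output["text"])
```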

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Derek, and I'm a High School Diploma Graduate with a strong background in Computer Science and a passion for Web Development. I'm excited to bring my skills and experience to a dynamic team and contribute to the success of your organization.
About my skills and experience

- Proficient in programming languages such as Java, Python, and JavaScript, with a strong understanding of data structures and algorithms.
- Experience with web development frameworks such as Spring, Django, and React, and have worked on various projects involving user interfaces, databases, and APIs.
- Strong understanding of computer systems, including operating systems, networks, and databases, with experience in designing and

Prompt: The capital of France is
Generated text:  a city that has it all – stunning architecture, world-class museums, fashion, fine dining, and of course, wine. From the Eiffel Tower to the Louvre, there are countless ways to spend your time in Paris. But with so many options, it can be overwhelming to plan your trip. Here are the top 10 things to do in Paris:
1. Visit the Eiffel Tower: The iconic Eiffel Tower is a must-visit attraction in Paris. Take the stairs or elevator to the top for breathtaking views of the city.
2. Explore the Louvre Museum: The Louvre is one of the

Prompt: The future of AI is
Generated text:  here, and it’s in the cloud

The future of AI is here, and it’s in the cloud

As artificial intelligence (AI) becomes more prevalent in various industries, the cloud is playing a significant role in its growth and development. Cloud-based AI solutions are becoming increasingly popular, offering flexibility, scalability, and cost-effectiveness. In this article, we will explore the future of AI and its connection to the cloud.
Why is the cloud important for AI?
The cloud provides several advantages for AI, including:
1. Scalability: Cloud-based AI solutions can scale up or down as needed, making it easier to handle large amounts
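
The same accumulation idea works for the asynchronous stream. The sketch below mirrors the call shape used in the cell above (awaiting `async_generate(..., stream=True)` yields an async generator) and assumes, as the output suggests, that each chunk holds only new text. `StubAsyncEngine` and `collect_async_stream` are hypothetical names so the snippet can run without a model.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in: awaiting async_generate returns an async generator."""

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for piece in [" here", ",", " and", " now"]:
                await asyncio.sleep(0)  # hand control back to the event loop
                yield {"text": piece}

        return chunks()


async def collect_async_stream(engine, prompt, sampling_params):
    """Consume the async stream and return the concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    parts = []
    async for chunk in generator:
        parts.append(chunk["text"])
    return "".join(parts)


full_text = asyncio.run(
    collect_async_stream(StubAsyncEngine(), "The future of AI is", {"temperature": 0.8})
)
print(full_text)
```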




In [6]:
llm.shutdown()
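
In a script, an exception raised mid-batch would skip a trailing `shutdown()` call. One way to guard against that is a small context manager, sketched below. `managed_engine` is a hypothetical helper rather than part of SGLang, and `StubEngine` stands in for the real engine; the only engine method relied on is `shutdown()`, as used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit, even after errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure mid-batch")
except RuntimeError:
    pass

print(engine.is_shut_down)
```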