# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
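
The four modes differ mainly in what the call hands back: a list of results, an iterator of chunks, an awaitable, or an async iterator. The sketch below illustrates those shapes with `StubEngine`, a hypothetical stand-in written for this example (it is not SGLang's implementation and only echoes the prompt word by word). It also assumes each streamed chunk carries only newly generated text, as the streaming output in this document suggests.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in mimicking the call shapes used in this document."""

    def _chunks(self, prompt):
        # One chunk per word, each shaped like {"text": ...}.
        return [{"text": word + " "} for word in prompt.split()]

    def _full(self, prompt):
        return {"text": "".join(c["text"] for c in self._chunks(prompt))}

    def generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            return iter(self._chunks(prompt))  # iterator of partial results
        return [self._full(p) for p in prompt]  # list in, list out

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        if stream:
            chunks = self._chunks(prompt)

            async def gen():
                for chunk in chunks:
                    await asyncio.sleep(0)  # yield control between chunks
                    yield chunk

            return gen()  # awaiting the call returns an async iterator
        return [self._full(p) for p in prompt]


def run_all_modes(engine, prompt):
    results = {}
    # 1. Non-streaming synchronous: a list of finished results.
    results["sync"] = engine.generate([prompt])[0]["text"]
    # 2. Streaming synchronous: iterate and join the chunks.
    results["sync_stream"] = "".join(
        chunk["text"] for chunk in engine.generate(prompt, stream=True)
    )

    async def run_async():
        # 3. Non-streaming asynchronous: await the finished results.
        outputs = await engine.async_generate([prompt])
        # 4. Streaming asynchronous: await, then consume with `async for`.
        stream = await engine.async_generate(prompt, stream=True)
        pieces = [chunk["text"] async for chunk in stream]
        return outputs[0]["text"], "".join(pieces)

    results["async"], results["async_stream"] = asyncio.run(run_async())
    return results
```

All four entries of the returned dictionary hold the same text here, because the stub is deterministic. With a real engine and a non-zero temperature, each call would sample a different completion.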

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.36it/s]

100%|██████████| 23/23 [00:05<00:00,  4.33it/s]


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  René. I am a Dutch man living in the Netherlands. I am very interested in mathematics, computers and technology in general. I am also a bit of a puzzle nut, so I love solving all sorts of puzzles and brain teasers. I am a member of the Dutch Sudoku Society and have solved many Sudoku puzzles. I have even been featured in the Dutch newspaper “Het Algemeen Dagblad” as the best Sudoku solver in the Netherlands.
In my free time I like to solve mathematical problems and riddles. I also like to tinker with my old computer, a 486DX-25 which I have upgraded to
Prompt: The president of the United States is
Generated text:  not, of course, the leader of the country in any formal sense; that is the function of the Vice President. The president is, however, the leader of the executive branch of the government, and thus has a great deal of power and influence. In addition, the president is often considered the leader of the country as a whole, particularly
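
For offline batch jobs it is often convenient to persist results instead of printing them. The helper below is a minimal sketch, not part of SGLang: `to_jsonl` and the `prompt`/`completion` field names are chosen for illustration. It assumes each output is a dictionary with a `"text"` key and that outputs come back in the same order as the prompts, which is what the `zip` in the cell above relies on.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return "\n".join(
        # ensure_ascii=False keeps non-ASCII completions readable in the file.
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    )
```

With the engine above, the call would look like `to_jsonl(prompts, llm.generate(prompts, sampling_params))`, and the resulting string can be written to a file in one step.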

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()


=== Testing synchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Wesley and I am a professional gamer. I love playing video games and streaming them online. I'm currently a part of a gaming team and we compete in various tournaments and events.
I'm also very interested in technology and innovation, and I enjoy keeping up with the latest advancements in the gaming industry. I'm always looking for new ways to improve my gameplay and stay ahead of the competition.
In my free time, I enjoy watching movies and TV shows, reading books, and spending time with friends and family. I'm a bit of a foodie and I love trying out new restaurants and cuisines.
I'm a bit of a intro

Prompt: The capital of France is
Generated text:  Paris. This is the most visited city in the world. People come here from all over the world to see the Eiffel Tower, Louvre Museum, Notre Dame Cathedral, and many other historical and cultural landmarks. Paris is a romantic city, where love is in the air.
Paris, the capital of France, is one of the world’s most beautiful and romantic cities. It is known for its stunning architecture, art museums, fashion, and cuisine. Here are some interesting facts about Paris:
1. Paris is home to many world-famous landmarks, such as the Eiffel Tower, the Louvre Museum, Notre Dame Cathedral

Prompt: The future of AI is
Generated text:  inherently tied to human values, ethics, and societal needs. The potential of AI is vast, and while it has the potential to greatly benefit humanity, it also raises concerns about bias, accountability, privacy, and job displacement.
AI is a rapidly evolving field, and its impact will be felt across various sectors, including healthcare, finance, education, transportation, and more. As AI becomes increasingly integrated into our daily lives, it's essential to consider the ethics and values that underpin its development and deployment.
Some of the key challenges and concerns surrounding AI include:
1. **Bias and fairness**: AI systems can perpetuate and even amplify existing

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Hello, my name is
Generated text:  Nicole and I'm a Nutella fanatic. I'm a self-proclaimed Italian food expert, I've lived in Italy for years, and have become addicted to all things hazelnut and chocolate. I also love traveling, cooking, and sharing my passion for food with others. I'm excited to be here sharing my adventures and recipes with you!

Here are some popular posts to get you started:

*   [Nutella Stuffed French Toast](https://www.tasteofhome.com/recipes/nutella-stuffed-french-toast/): A decadent breakfast treat that combines the richness of Nutella with the fluffiness of

Prompt: The capital of France is
Generated text:  one of the most beautiful and romantic cities in the world. Paris is known for its stunning architecture, world-class museums, and charming neighborhoods. In this article, we will explore some of the best places to visit in Paris, including iconic landmarks, cultural attractions, and hidden gems.
The Eiffel Tower (La Tour Eiffel) is an iconic symb
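
When requests arrive independently rather than as one batch (for example in a custom server), several awaitable calls can be overlapped with `asyncio.gather`. The sketch below is a generic asyncio pattern, not an SGLang API: `generate_concurrently`, `generate_one`, and `max_in_flight` are hypothetical names, and `generate_one` stands for any coroutine function that takes a prompt and returns a result. It assumes the engine tolerates concurrent calls.

```python
import asyncio


async def generate_concurrently(generate_one, prompts, max_in_flight=2):
    """Run generate_one over prompts with at most max_in_flight calls active."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(prompt):
        # The semaphore caps how many requests are awaited at the same time.
        async with semaphore:
            return await generate_one(prompt)

    # gather returns results in the order of its arguments,
    # so outputs line up with prompts.
    return await asyncio.gather(*(bounded(p) for p in prompts))
```

With the engine from this document, `generate_one` could plausibly be `lambda p: llm.async_generate(p, sampling_params)`. For a fixed list of prompts, the single batched call shown above is simpler and leaves scheduling to the engine.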

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())


=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is
Generated text:  Tim and I am the owner and operator of this blog. I'm a bit of a nerd, and I love all things technology. I'm a software developer by trade, and I've been working in the field for over 10 years. I've worked on a wide range of projects, from simple web applications to complex enterprise software systems.

In my free time, I enjoy reading about the latest developments in technology, and I'm always on the lookout for new and interesting things to learn. I'm particularly interested in artificial intelligence, machine learning, and data science. I find these fields to be fascinating, and I think they have the

Prompt: The capital of France is
Generated text:  Paris, which is the most visited city in the world. This year, the city will welcome more than 23 million tourists. Paris is the hub of French culture, fashion, and cuisine. The Eiffel Tower is the iconic symbol of Paris. The city has a rich history, architecture, and art. Some of the most famous landmarks in Paris include the Louvre Museum, Notre Dame Cathedral, Arc de Triomphe, and Montmartre. Paris is also known for its fashion and shopping, with famous fashion houses such as Chanel and Louis Vuitton.
France is a country located in Western Europe. It is the largest

Prompt: The future of AI is
Generated text:  bright, and it will bring significant benefits to various industries and aspects of our lives. However, it also raises several concerns and challenges that need to be addressed. The following are some potential risks and challenges associated with the development and deployment of AI:
Potential Risks:
1. Job displacement: AI and automation could displace certain jobs, especially those that involve repetitive tasks.
2. Bias and discrimination: AI systems can perpetuate and amplify existing biases and discrimination if they are trained on biased data or designed with a particular worldview.
3. Cybersecurity: AI systems can be vulnerable to cyber attacks and data breaches, which could compromise sensitive information and

In [6]:
llm.shutdown()
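
In a script, as opposed to a notebook, an exception raised during generation would skip the explicit `shutdown()` call. One way to guard against that is a small context manager. This is a sketch under the assumption that the engine object exposes `shutdown()` as used above; `managed_engine` and `make_engine` are hypothetical names, not part of SGLang.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(make_engine):
    """Create an engine and guarantee shutdown() runs, even on errors."""
    engine = make_engine()
    try:
        yield engine
    finally:
        # Runs whether the body finished normally or raised.
        engine.shutdown()
```

A possible use is `with managed_engine(lambda: sgl.Engine(model_path=...)) as llm:` followed by the generation calls, where the model path is whatever was passed to `sgl.Engine` earlier.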