<a href="https://colab.research.google.com/github/rahiakela/genai-research-and-practice/blob/main/hands-on-llm-serving-and-optimization/04_llm_streaming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [None]:
!pip install --quiet vllm transformers tiktoken

In [None]:
import asyncio
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams

## Load Model

In [None]:
# Initialize the engine arguments
engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2.5-0.5B",
    dtype="float16",
    tensor_parallel_size=1,        # Number of GPUs to shard the model across
    gpu_memory_utilization=0.9,    # Fraction of GPU memory the engine may use
    max_num_batched_tokens=32768,  # Maximum number of tokens processed per scheduler step
    max_num_seqs=256,              # Maximum number of sequences processed per scheduler step
    disable_log_requests=True,     # Disable request logging
    disable_log_stats=True,        # Disable stats logging
)

# Create the vLLM async streaming engine
engine = AsyncLLMEngine.from_engine_args(engine_args)


## LLM Streaming

An LLM server can return generated text in two ways:
1. Non-streaming: return the result only after generation completes.
2. Streaming: return each token as soon as it is generated.

The code below is for illustration only. To see streaming actually execute, run streaming.py (downloaded at the end of this notebook).
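
The difference between the two modes can be sketched without a GPU. The example below is a minimal illustration, not vLLM code: `fake_token_stream` is a hypothetical stand-in for a model, and the token list and delay are made up for the demo.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str], delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a model: yields one token at a time."""
    for token in tokens:
        await asyncio.sleep(delay)  # Simulate per-token decoding latency
        yield token


async def generate_blocking(tokens: List[str]) -> str:
    """Mode 1: collect every token and return the full text once at the end."""
    parts = [tok async for tok in fake_token_stream(tokens)]
    return "".join(parts)


async def generate_streaming(tokens: List[str]) -> List[str]:
    """Mode 2: show each token to the caller as soon as it is available."""
    received = []
    async for tok in fake_token_stream(tokens):
        print(tok, end="", flush=True)  # The user sees text grow token by token
        received.append(tok)
    print()
    return received


# In a notebook:  await generate_streaming(["Wash", "ington", ",", " D.C."])
# In a script:    asyncio.run(generate_streaming(["Wash", "ington", ",", " D.C."]))
```

Both functions do the same total work. The streaming version only changes when the caller first sees output, which is why it improves perceived latency rather than throughput.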

In [4]:
# Define the prompt.
prompt = """You are an expert AI historian writing a detailed chapter for a book titled "The Evolution of Human-AI Collaboration."

Begin by summarizing the early stages of artificial intelligence in the 1950s, touching on symbolic logic and rule-based systems. Then transition into the rise of machine learning, particularly deep learning in the 2010s.

Afterward, describe how large language models like GPT transformed human-computer interaction, enabling applications in education, creative writing, customer support, and software development.

Finally, reflect on the societal and ethical implications of AI, such as misinformation, bias, and the alignment problem.

Write in a formal tone, with rich detail and examples in each era."""

In [5]:
async def generate_text(prompt: str, max_tokens: int = 100, request_id="id"):
  try:
    # Define sampling parameters
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=max_tokens,
        stop=["\n"],  # Stop at newline
    )

    # Generate text in async and streaming fashion
    results_generator = engine.generate(
        prompt=prompt,
        sampling_params=sampling_params,
        request_id=request_id
    )

    # Process the results
    final_output = None
    async for request_output in results_generator:
        final_output = request_output
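        # Each update carries the text generated so far (cumulative by default),
        # so every chunk below repeats the earlier text plus the newest tokens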
        print("chunk \n")
        for output in request_output.outputs:
            print(output.text, end="", flush=True)
        print()
    print()  # Newline at the end

    # This will only be reached if all tokens are generated
    print("\nGeneration completed successfully")

    return final_output
  except asyncio.CancelledError:
    print("\nGeneration was cancelled")
    return None
  finally:
    # Always clean up; ignore errors raised while aborting the request
    try:
      await engine.abort(request_id)
    except Exception:
      pass
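
If each update from the engine carries the full text generated so far (which I believe is vLLM's default output mode, but verify this against your installed version), then printing only the new part of each update avoids repeating earlier text. The helper below is a minimal sketch of that idea on plain strings. `iter_deltas` and the sample snapshots are hypothetical names and data, not part of vLLM.

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative_texts: Iterable[str]) -> Iterator[str]:
    """Yield only the newly added suffix of each cumulative snapshot."""
    seen = 0  # Number of characters already emitted
    for text in cumulative_texts:
        delta = text[seen:]
        seen = len(text)
        if delta:  # Skip updates that add no new characters
            yield delta


# Hypothetical snapshots, shaped like successive `output.text` values
snapshots = ["The", "The capital", "The capital is", "The capital is Washington"]
print(list(iter_deltas(snapshots)))
# → ['The', ' capital', ' is', ' Washington']
```

Inside the `async for` loop of `generate_text`, the same slicing (`output.text[seen:]`) could replace the full-text print, under the same assumption about cumulative output.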

In [6]:
# Example of streaming usage.
# asyncio.run() works in a standalone script, but it raises the error below in
# Jupyter/Colab, where an event loop is already running; use `await` there instead.
prompt = "What is the capital of US?"
res = asyncio.run(generate_text(prompt))
print(res.outputs[0].text)

RuntimeError: asyncio.run() cannot be called from a running event loop

In [None]:
# Jupyter already runs an event loop, so await the coroutine directly
request_id = "any_id"
await generate_text(prompt, 10000, request_id)


If you have problems running the code above because of the Jupyter event loop, try running the example (https://github.com/orca3/llm-model-serving/blob/main/ch02/streaming.py) in a terminal.
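
The event loop problem can be checked for directly. The sketch below uses the standard library's `asyncio.get_running_loop()`, which raises `RuntimeError` when no loop is running. `has_running_loop` and `run_in_script` are hypothetical helper names introduced only for this illustration.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from code that runs inside an active event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def run_in_script(coro_factory):
    """Hypothetical helper: run a coroutine with asyncio.run(), but fail with a
    clear message when a loop is already running (as in Jupyter/Colab)."""
    if has_running_loop():
        raise RuntimeError(
            "An event loop is already running; use `await` instead of asyncio.run()."
        )
    return asyncio.run(coro_factory())
```

In a terminal script, `run_in_script(lambda: generate_text(prompt))` would behave like `asyncio.run(...)`. In a notebook cell it raises immediately with a message that points to `await`.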

In [None]:
!wget https://github.com/orca3/llm-model-serving/raw/main/ch02/streaming.py

In [None]:
!python streaming.py