# Quick Start

## Launch a Server

The cell below uses `subprocess.Popen` to start an SGLang server process, which is equivalent to running

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --log-level error
```
in your terminal. It then polls the server until it is ready.

In [8]:
import subprocess
import time
import requests
import os

server_process = subprocess.Popen(
    [
        "python",
        "-m",
        "sglang.launch_server",
        "--model-path",
        "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "--port",
        "30000",
        "--host",
        "0.0.0.0",
        "--log-level",
        "error",
    ],
    text=True,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

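# Poll the models endpoint once per second until the server answers with HTTP 200.
while True:
    # Server output is discarded, so stop waiting if the process has already exited.
    if server_process.poll() is not None:
        raise RuntimeError(
            f"Server exited early with code {server_process.returncode}."
        )
    try:
        response = requests.get(
            "http://localhost:30000/v1/models",
            headers={"Authorization": "Bearer None"},
        )
        if response.status_code == 200:
            break
    except requests.exceptions.RequestException:
        pass  # The server is not accepting connections yet.
    time.sleep(1)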

print("Server is ready. Proceeding with the next steps.")

Server is ready. Proceeding with the next steps.
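
The readiness loop above has no upper bound on how long it waits. A minimal sketch of a bounded wait follows; `wait_until` is a hypothetical helper (not part of SGLang), and the default timeout and interval are assumptions you would tune for your model and hardware.

```python
import time


def wait_until(check, timeout=300.0, interval=1.0):
    """Call check() until it returns a truthy value or `timeout` seconds elapse.

    Hypothetical helper, not part of SGLang. Returns True on success and
    False once the deadline has passed. Exceptions raised by check() are
    treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # Treat any failure as "not ready" and retry.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the server from the cell above, the check could be `lambda: requests.get("http://localhost:30000/v1/models", headers={"Authorization": "Bearer None"}).status_code == 200`, and a `False` return would signal that startup took longer than expected.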


## Send a Request

Once the server is running, you can send test requests using curl.

In [9]:
!curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is a LLM?"}]}'

{"id":"d85b69cf1f274f628c3288478ff6d5cf","object":"chat.completion","created":1729811858,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"LLM stands for Large Language Model. It's a type of artificial intelligence (AI) designed to process and generate human-like language. LLMs are trained on vast amounts of text data, which allows them to learn patterns, relationships, and structures of language.\n\nLarge language models are typically characterized by:\n\n1. **Training data:** They're trained on massive datasets, often billions of words or more, which enables them to understand a wide range of language styles, genres, and domains.\n2. **Model size:** They consist of complex neural networks with millions or even billions of parameters, allowing them to capture subtle nuances of language.\n3. **Generative capabilities:** LLMs can generate text based on input prompts, which can range from simple sentences to longer passages or 
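
The same request can be built in Python. The sketch below mirrors the JSON body of the curl command and shows how to pull the assistant text out of a `chat.completion` response like the one above. `build_chat_payload` and `extract_reply` are hypothetical helper names introduced for illustration only.

```python
import json


def build_chat_payload(model, user_message, system_message="You are a helpful assistant."):
    """Build the JSON body that the curl command above sends."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }


def extract_reply(body):
    """Return the assistant text from a chat.completion response body (a dict)."""
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload("meta-llama/Meta-Llama-3.1-8B-Instruct", "What is a LLM?")
print(json.dumps(payload, indent=2))
```

To send it with the `requests` library, something like `requests.post("http://localhost:30000/v1/chat/completions", headers={"Authorization": "Bearer None"}, json=payload).json()` should return a dict that `extract_reply` can read, assuming the server is still running.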

## Using the OpenAI-Compatible API

SGLang exposes OpenAI-compatible APIs, so the `openai` Python client can talk to the server directly. The example below sends a chat completion request:

In [10]:
import openai

# The OpenAI client requires an api_key, so always pass one, even if the
# server was launched without an API key.
# Setting an API key during server initialization is strongly recommended.

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="None"
)

# Chat completion example

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

ChatCompletion(id='53a386a5b50942f58637488a6f965d79', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. **Country:** Japan\n**Capital:** Tokyo\n\n2. **Country:** Australia\n**Capital:** Canberra\n\n3. **Country:** Brazil\n**Capital:** Brasília', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)], created=1729811859, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, prompt_tokens_details=None))
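
Printing the whole `ChatCompletion` object is noisy. The sketch below reads the fields most callers need: the reply text, the finish reason, and the token usage. `summarize` is a hypothetical helper, and the `SimpleNamespace` stand-in only imitates the attribute layout shown in the output above so that the snippet runs without a server. Its reply text is a placeholder, and its token counts are copied from that output.

```python
from types import SimpleNamespace


def summarize(response):
    """Return the reply text, finish reason, and token counts of a chat completion."""
    choice = response.choices[0]
    return {
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
    }


# Stand-in with the same attribute layout as the printed ChatCompletion.
# With a live server, pass the real `response` object instead.
stub = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(content="placeholder reply text"),
        )
    ],
    usage=SimpleNamespace(prompt_tokens=49, completion_tokens=46, total_tokens=95),
)
print(summarize(stub))
```

In the notebook, `summarize(response)["text"]` would give just the generated answer.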


In [11]:
import signal
import gc
import torch

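def terminate_process(process):
    """Stop the server process, escalating from terminate() to a forced kill."""
    try:
        process.terminate()
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            if os.name != 'nt':
                try:
                    pgid = os.getpgid(process.pid)
                    if pgid != os.getpgid(0):
                        # The server has its own process group: signal the whole group.
                        os.killpg(pgid, signal.SIGTERM)
                        time.sleep(1)
                        if process.poll() is None:
                            os.killpg(pgid, signal.SIGKILL)
                    else:
                        # The server shares this process's group (the Popen default),
                        # so killpg would also signal this notebook. Kill only the child.
                        process.kill()
                except ProcessLookupError:
                    pass
            else:
                process.kill()
            process.wait()
    except Exception as e:
        print(f"Warning: {e}")
    finally:
        # Release cached GPU memory held by this Python process, if any.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()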

terminate_process(server_process)
time.sleep(2)
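
To confirm that the shutdown worked, one option is to check that nothing is listening on the server port any more. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and the one-second timeout is an assumption.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

After `terminate_process(server_process)`, `port_is_open("127.0.0.1", 30000)` should return `False`, provided no other process is using that port.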