# 🚀 LMFast: High-Performance Inference Server

**Deploy your small language model (SLM) behind an OpenAI-compatible API!**

## What You'll Learn
- Start a high-throughput inference server
- Use vLLM acceleration (2-5x faster)
- Query the API using the standard OpenAI Python client
- Benchmark inference speed (Tokens per Second)

## Why Use `SLMServer`?
- **Easy**: 1 line to start.
- **Compatible**: Works with LangChain, AutoGen, CrewAI.
- **Fast**: Optimized for T4 GPUs with batching.

**Time to complete:** ~10 minutes

## 1️⃣ Setup

In [None]:
!pip install -q "lmfast[all]" vllm openai

import lmfast
lmfast.setup_colab_env()

import torch
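# Fail gently when no GPU runtime is attached
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected. Switch to a GPU runtime before continuing.")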

## 2️⃣ Initialize Server

The server can load a model from the Hugging Face Hub by ID or from a local path.

In [None]:
from lmfast.inference import SLMServer

# Initialize server with efficient serving backend
server = SLMServer(
    "HuggingFaceTB/SmolLM-360M-Instruct",
    use_vllm=True  # Acceleration (if available)
)

print("✅ Server Initialized (in-process)")
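
The `use_vllm=True` flag above is commented as applying only "if available". A minimal sketch for checking up front whether the `vllm` package can be imported in the current runtime. `has_module` is a hypothetical helper, and how `SLMServer` itself detects vLLM or falls back is not shown in this notebook.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical pre-flight check before passing use_vllm=True
print(f"vLLM importable: {has_module('vllm')}")
```

This only confirms that the package is installed; it does not prove that vLLM supports the attached GPU.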

## 3️⃣ Direct Generation (Python API)

Calling the server object directly suits scripts running on the same machine, with no HTTP layer involved.

In [None]:
prompt = "Write a haiku about speed."

print(f"📝 Prompt: {prompt}")
output = server.generate(prompt, max_new_tokens=50, temperature=0.7)
print(f"🤖 Output: {output}")

## 4️⃣ Run as HTTP Server (OpenAI API)

Serving over HTTP lets external tools connect.
*Note: `server.serve()` blocks the cell it runs in, so this demo starts it in a background thread.*

In [None]:
import threading
import time
import requests

# Start server in a background thread
def start_server():
    # This blocks, so we run it in thread
    server.serve(host="127.0.0.1", port=8000)

thread = threading.Thread(target=start_server, daemon=True)
thread.start()

# Wait for server to start
print("⏳ Waiting for server to start...")
time.sleep(10)  # Give it a few seconds
print("✅ Server should be running on http://127.0.0.1:8000")
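
A fixed `time.sleep(10)` can be too short for a slow model load and wastefully long for a fast one. A minimal sketch that polls the TCP port instead, using only the standard library. `wait_for_port` is a hypothetical helper, not part of LMFast, and the timeout values are illustrative.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry
            time.sleep(interval)
    return False
```

For example, `wait_for_port("127.0.0.1", 8000)` returns `True` as soon as the port accepts connections. An open port shows that a socket is listening, not necessarily that the model is ready to answer requests.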

## 5️⃣ Connect with OpenAI Client

Now we can use the standard OpenAI library!

In [None]:
from openai import OpenAI

# Point client to local server
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="lmfast-key"  # Dummy key
)

response = client.chat.completions.create(
    model="smollm",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"}
    ],
    max_tokens=100
)

print(f"🤖 OpenAI Client Response:\n{response.choices[0].message.content}")
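
The OpenAI client is a thin wrapper over plain HTTP, so tools without an SDK can talk to the server too. A sketch of the equivalent request using only the standard library, assuming the server exposes the standard `POST /v1/chat/completions` route and response shape that the client call above relies on. `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lmfast-key",  # same dummy key as above
        },
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("http://127.0.0.1:8000/v1", "smollm", "Why is the sky blue?"))` should return the same kind of text as the client call.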

## 6️⃣ Benchmark Performance

Let's see how fast it is.

In [None]:
# perf_counter() is a monotonic clock intended for measuring elapsed time
start_time = time.perf_counter()
tokens = 0
N = 5

print(f"🏎️ Benchmarking {N} requests...")

# Requests are sent one at a time, so this measures sequential throughput
for _ in range(N):
    resp = client.chat.completions.create(
        model="smollm",
        messages=[{"role": "user", "content": "Count to 20."}],
        max_tokens=50
    )
    tokens += resp.usage.completion_tokens

duration = time.perf_counter() - start_time
tps = tokens / duration

print(f"⚡ Speed: {tps:.2f} tokens/sec")
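
The loop above sends one request at a time, so it does not exercise the batching mentioned under "Why Use `SLMServer`?". A sketch of measuring aggregate throughput with several requests in flight at once. `measure_throughput` is a hypothetical helper, and whether the server actually batches concurrent requests depends on its backend and is not confirmed here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(request_fn: Callable[[], int], n_requests: int, concurrency: int) -> float:
    """Run `request_fn` n_requests times across `concurrency` threads.

    `request_fn` must return the number of completion tokens it produced.
    Returns aggregate tokens per second over the whole run.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(request_fn) for _ in range(n_requests)]
        total_tokens = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Usage with the client from this notebook (left as comments; needs a live server):
# def one_request() -> int:
#     resp = client.chat.completions.create(
#         model="smollm",
#         messages=[{"role": "user", "content": "Count to 20."}],
#         max_tokens=50,
#     )
#     return resp.usage.completion_tokens
#
# print(f"{measure_throughput(one_request, n_requests=16, concurrency=4):.2f} tokens/sec")
```

Comparing `concurrency=1` against a higher value gives a rough sense of how much concurrent load helps on a given GPU.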

## 🎉 Summary

You've learned how to:
- ✅ Serve SLMs with `SLMServer`
- ✅ Enable vLLM speedups
- ✅ Use the server as a drop-in replacement for the OpenAI API in your apps

### Compatibility
Because the server is OpenAI-compatible, you can use it with:
- **LangChain / LlamaIndex**
- **AutoGen / CrewAI**
- **Cursor / VS Code extensions**
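
One common way to point OpenAI-based code at a local server is through environment variables. Recent 1.x versions of the official OpenAI Python client read `OPENAI_BASE_URL` and `OPENAI_API_KEY` when no explicit arguments are given. Whether a particular framework or editor extension honors the same variables is an assumption to verify in its own documentation.

```python
import os

# Values from this notebook; the key is a dummy placeholder
os.environ["OPENAI_BASE_URL"] = "http://127.0.0.1:8000/v1"
os.environ["OPENAI_API_KEY"] = "lmfast-key"

# With these set, the official client needs no explicit arguments:
# from openai import OpenAI
# client = OpenAI()
```

Set the variables before the client or framework is constructed, since they are typically read at construction time.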

### Next Steps
- `15_browser_deployment.ipynb`: No server needed!