# 📱 LMFast: Edge Deployment

**Run small language models (SLMs) on Raspberry Pi, Android, and other edge devices!**

## What You'll Learn
- Export models to GGUF format
- Run models with `llama.cpp`
- Optimize for low-RAM devices (RPi 4/5)
- Build a simple terminal chat app for edge devices

## Supported Hardware
- **Raspberry Pi 4/5** (4GB+ RAM)
- **Android Phones** (via UserLAnd or Termux)
- **NVIDIA Jetson Nano**
- **Laptops** (Mac M1/M2/M3, Windows, Linux)

**Time to complete:** ~15 minutes

## 1️⃣ Setup

In [None]:
# Install standard tools
!pip install -q lmfast[all]

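# Install Python bindings for llama.cpp
# The CUDA flag only helps on a GPU runtime such as Colab. Recent llama-cpp-python
# releases use -DGGML_CUDA=on; older releases used -DLLAMA_CUBLAS=on.
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# On Raspberry Pi: pip install llama-cpp-python (no CMAKE_ARGS needed usually)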

import lmfast
lmfast.setup_colab_env()

## 2️⃣ Export to GGUF

`llama.cpp` loads models in the GGUF format, which packs quantized weights into a single file suited to CPU and edge inference.

In [None]:
from lmfast.inference import export_gguf

model_id = "HuggingFaceTB/SmolLM-135M-Instruct"

# Export to 4-bit quantized GGUF (LMFast handles the llama.cpp setup)
# This shrinks the model from ~270MB (fp16) to ~100MB
export_gguf(
    model_path=model_id,
    output_path="./smollm-135m-q4.gguf",
    quantization="q4_k_m"
)

print("✅ Exported: ./smollm-135m-q4.gguf")
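The size figures in the export cell can be sanity-checked with simple arithmetic: parameter count times bits per weight. The sketch below assumes a nominal 135M parameters (taken from the model name) and an average of about 4.85 bits per weight for `q4_k_m`, which is an assumed figure and not something LMFast reports. The helper name `estimate_model_size_mb` is hypothetical.

```python
def estimate_model_size_mb(n_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6

# Assumption: nominal parameter count taken from the model name
N_PARAMS = 135_000_000

fp16_mb = estimate_model_size_mb(N_PARAMS, 16)
# Assumption: ~4.85 bits/weight as an average for q4_k_m
q4_mb = estimate_model_size_mb(N_PARAMS, 4.85)

print(f"fp16:   ~{fp16_mb:.0f} MB")
print(f"q4_k_m: ~{q4_mb:.0f} MB")
```

The estimate for `q4_k_m` lands below the ~100MB quoted above. A real GGUF file is usually somewhat larger than this arithmetic suggests, since some tensors tend to be kept at higher precision and the file also carries the tokenizer and metadata.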

## 3️⃣ Run on "Edge" (Simulated)

We'll use the `Llama` class from `llama_cpp` to load the GGUF model.

In [None]:
from llama_cpp import Llama

# Load model (n_gpu_layers=0 keeps every layer on the CPU, simulating an edge device)
llm = Llama(
    model_path="./smollm-135m-q4.gguf",
    n_ctx=2048,
    n_gpu_layers=0,  # Run purely on CPU
    verbose=False
)

print("🤖 Model Loaded on CPU")
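On a low-RAM device, `n_ctx` is the main memory knob in the cell above, because the key/value cache grows linearly with context length. Here is a rough sketch of that relationship, assuming a 16-bit cache and architecture values (30 layers, 3 KV heads, head dimension 64) that should be confirmed against the model's `config.json`. The helper `kv_cache_bytes` is hypothetical, not part of `llama_cpp`.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """Approximate KV-cache size: keys and values for every layer and position."""
    # Factor 2 = one key tensor plus one value tensor per layer
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value

# Assumed architecture values; check them against the model's config.json
N_LAYERS, N_KV_HEADS, HEAD_DIM = 30, 3, 64

for n_ctx in (512, 1024, 2048):
    mib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx) / 2**20
    print(f"n_ctx={n_ctx:>4}: ~{mib:.1f} MiB KV cache")
```

Under these assumptions, halving `n_ctx` halves the cache, so a smaller context is a cheap first step when memory is tight and the conversations are short.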

## 4️⃣ Edge Inference Loop

A single-turn chat function that streams tokens as they are generated, which lowers perceived latency.

In [None]:
def chat_with_edge_model(prompt):
    # Format prompt for SmolLM/ChatML
    # <|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n
    full_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    
    stream = llm(
        full_prompt,
        max_tokens=100,
        stop=["<|im_end|>"],
        echo=False,
        stream=True  # Stream for perceived speed!
    )
    
    print("EdgeAI: ", end="", flush=True)
    response = ""
    for chunk in stream:
        text = chunk['choices'][0]['text']
        print(text, end="", flush=True)
        response += text
    print("\n")
    return response

chat_with_edge_model("What is the best way to save energy?")
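The function above hard-codes a single user turn. If you want a system message or earlier turns in the prompt, the same ChatML layout can be assembled from a list of messages. This is a minimal sketch using only the tags already shown above; `build_chatml_prompt` is a hypothetical helper, not an LMFast or `llama_cpp` function.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Join role/content messages into one ChatML-formatted prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the best way to save energy?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The resulting string can be passed to `llm(...)` in place of `full_prompt`, keeping `stop=["<|im_end|>"]` as before.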

## 5️⃣ Building a Standalone Edge App

The cell below writes the script to `edge_app.py`, ready to copy to your Pi!

In [None]:
app_code = """
from llama_cpp import Llama

print("Loading model...")
llm = Llama(model_path="./smollm-135m-q4.gguf", verbose=False)

print("Ready! Type 'exit' to quit.")
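while True:
    try:
        user_input = input("User: ").strip()
    except (EOFError, KeyboardInterrupt):
        # Exit cleanly on Ctrl-D or Ctrl-C instead of printing a traceback
        break
    if user_input.lower() == "exit":
        break
    if not user_input:
        continue
    
    prompt = f"<|im_start|>user\\n{user_input}<|im_end|>\\n<|im_start|>assistant\\n"
    output = llm(prompt, max_tokens=128, stop=["<|im_end|>"])
    print(f"AI: {output['choices'][0]['text']}")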
"""

with open("edge_app.py", "w") as f:
    f.write(app_code)

print("✅ Created edge_app.py")
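Note that the script sends each input as an independent prompt, so the model has no memory of earlier turns. One way to add memory without overflowing a small context window is to keep a rolling history and drop the oldest messages. The sketch below uses a character budget as a crude stand-in for a token budget; `trim_history` and `max_chars` are hypothetical names chosen for illustration.

```python
def trim_history(history, max_chars=2000):
    """Keep the most recent messages whose combined length fits the budget."""
    kept, used = [], 0
    # Walk backwards so the newest messages are kept first
    for msg in reversed(history):
        cost = len(msg["content"])
        # Always keep the latest message, even if it exceeds the budget alone
        if kept and used + cost > max_chars:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "a" * 50},
    {"role": "assistant", "content": "b" * 50},
    {"role": "user", "content": "c" * 50},
]
recent = trim_history(history, max_chars=100)
print([m["role"] for m in recent])
```

In the app, you would append each user input and model reply to `history`, trim it, and build the prompt from what remains. A token count from the model's tokenizer would be more precise than characters.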

## 🎉 Summary

You've learned how to:
- ✅ Convert models for edge usage
- ✅ Run inference on CPU
- ✅ Create a standalone script for Raspberry Pi

### Tips for Raspberry Pi
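- Use a 64-bit OS (Raspberry Pi OS 64-bit).
- A modest overclock can speed up generation (roughly 10-20%, depending on the board and cooling).
- Use a cooling fan to avoid thermal throttling during long generations!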
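To see whether a tweak such as overclocking or better cooling actually helps on your board, measure generation speed before and after. Here is a small sketch that times any stream of chunks shaped like the ones `llm(..., stream=True)` yields in this notebook. The `measure_throughput` helper and the `fake_stream` stand-in are hypothetical, and the chunk rate is only an approximation of tokens per second.

```python
import time

def measure_throughput(stream):
    """Consume a chunk stream; return (text, chunk_count, chunks_per_second)."""
    start = time.perf_counter()
    pieces = []
    for chunk in stream:
        pieces.append(chunk["choices"][0]["text"])
    elapsed = time.perf_counter() - start
    rate = len(pieces) / elapsed if elapsed > 0 else float("inf")
    return "".join(pieces), len(pieces), rate

# Stand-in generator so the sketch runs without a model loaded
def fake_stream(words, delay=0.01):
    for word in words:
        time.sleep(delay)
        yield {"choices": [{"text": word}]}

text, n_chunks, rate = measure_throughput(fake_stream(["Turn ", "off ", "lights."]))
print(f"{n_chunks} chunks at {rate:.1f} chunks/s: {text}")
```

With the real model, pass `llm(full_prompt, max_tokens=100, stream=True)` instead of `fake_stream(...)`. The measured time includes prompt processing, so use the same prompt for each comparison.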

### Next Steps
- Copy `edge_app.py` and `smollm-135m-q4.gguf` into the same directory on your device, then run `python edge_app.py`.