<a href="https://colab.research.google.com/github/xouyang1/project-notes/blob/main/colab_serve_vllm_cloudflared_tunnel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [59]:
%%bash
nvidia-smi -L || true
python -V
pip -q install "vllm>=0.6.1" "transformers>=4.44" --upgrade

python - <<'PY'
import torch, sys
print("python", sys.version.split()[0],
      "cuda_ok", torch.cuda.is_available(),
      "torch_cuda", torch.version.cuda)
try:
    import vllm
    print("vLLM", vllm.__version__)
except Exception as e:
    print("vLLM import error:", e)
PY

# These exports only affect this %%bash cell's shell; later cells do not inherit them.
export VLLM_USE_RAY=0
export CUDA_VISIBLE_DEVICES=0
export NCCL_P2P_DISABLE=1

GPU 0: NVIDIA L4 (UUID: GPU-93683824-e52f-dcad-50b6-17ca2ba08203)
Python 3.12.11
python 3.12.11 cuda_ok True torch_cuda 12.6
vLLM 0.10.1.1
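
Because each `%%bash` cell runs in its own shell, the three `export` lines above do not carry over to the cell that launches the server. A minimal sketch of one way to persist them, assuming the server is later started from this same kernel (the helper name `apply_env` is hypothetical; the variable names and values are copied from the cell above):

```python
import os

# Same variables and values as the exports in the %%bash cell above.
SERVER_ENV = {
    "VLLM_USE_RAY": "0",
    "CUDA_VISIBLE_DEVICES": "0",
    "NCCL_P2P_DISABLE": "1",
}


def apply_env(overrides: dict) -> dict:
    """Set each variable in os.environ and return what is now in effect."""
    for name, value in overrides.items():
        os.environ[name] = value
    return {name: os.environ[name] for name in overrides}


applied = apply_env(SERVER_ENV)
print(applied)
```

Processes started afterwards with `!` or `subprocess` inherit the kernel's environment, so they should see these values.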


In [60]:
MODEL="Qwen/Qwen3-0.6B"
PORT=8000

# Useful memory flags for small/Colab GPUs:
# --dtype float16 : load weights in fp16
# --gpu-memory-utilization 0.60 : fraction of VRAM vLLM may claim (vLLM's default is 0.90)
# --max-model-len 4096 : cap context length (lower it if you're tight on VRAM)
# (If a model requires custom code, add: --trust-remote-code)
!setsid nohup python -m vllm.entrypoints.openai.api_server \
    --model $MODEL \
    --host 0.0.0.0 --port $PORT \
    --dtype float16 \
    --gpu-memory-utilization 0.60 \
    --max-model-len 4096 \
    --download-dir /root/.cache/ \
    >server.log 2>&1 & echo $! > server.pid

In [64]:
!lsof -i:8000 || ss -lntp | grep 8000 || true
!curl -sS http://0.0.0.0:8000/v1/models | sed -n '1,120p'
!tail -n 60 server.log

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python3 13268 root   33u  IPv4 289752      0t0  TCP *:8000 (LISTEN)
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.86it/s]
(EngineCore_0 pid=13488)
(EngineCore_0 pid=13488) INFO 09-06 05:21:13 [default_loader.py:262] Loading weights took 0.57 seconds
(EngineCore_0 pid=13488) INFO 09-06 05:21:13 [gpu_model_runner.py:2007] Model loading took 1.1201 GiB and 1.720256 seconds
(EngineCore_0 pid=13488) INFO 09-06 05:21:24 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/611660d274/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=13488) INFO 09-06 05:21:24 [backends.py:559] Dynamo bytecode transform time: 10.34 s
(EngineCore_0 pid=13488) INFO 09-06 05:21:32 [backends.py:161] Directly load the compiled grap

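The log above shows the engine still compiling after the port is already listening, so a request sent too early can fail. A minimal sketch of a readiness poll, assuming `/v1/models` answering with HTTP 200 is a good enough signal; `http_ok` and `wait_until_ready` are hypothetical helper names, and the timeouts are arbitrary defaults:

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In the notebook this could be called as `wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))` before starting the tunnel. Passing the probe as a callable keeps the polling loop independent of the network, which makes it easy to try with a stub.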
In [None]:
# The cloudflared PyPI package has import issues on Python 3.12, so download the binary manually.
!uname -m  # should print x86_64 on Colab
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
!chmod +x cloudflared-linux-amd64
!mv cloudflared-linux-amd64 /usr/local/bin/cloudflared
!cloudflared --version

x86_64
cloudflared version 2025.8.1 (built 2025-08-21-1535 UTC)


In [65]:
%%bash
# Launch tunnel in background and log output
nohup cloudflared tunnel --url http://localhost:8000 --no-autoupdate \
  > /tmp/cloudflared.log 2>&1 &
echo $! > /tmp/cloudflared.pid

# Poll the log for the trycloudflare URL (up to ~30s)
for i in {1..30}; do
  URL="$(grep -o 'https://[a-z0-9-]*\.trycloudflare\.com' -m1 /tmp/cloudflared.log || true)"
  if [ -n "$URL" ]; then
    echo "TUNNEL_URL=$URL"
    exit 0
  fi
  sleep 1
done

echo "Tunnel is starting... check logs: tail -f /tmp/cloudflared.log"
exit 0

TUNNEL_URL=https://enclosed-oscar-cherry-postage.trycloudflare.com
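
Later cells paste the tunnel URL by hand. As a sketch, the same pattern the `grep` above uses can be applied from Python so the URL lands in a variable instead; `find_tunnel_url` is a hypothetical helper, and the sample log line below is illustrative rather than real cloudflared output:

```python
import re
from typing import Optional

# Same shape as the grep pattern in the cell above.
TUNNEL_RE = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(log_text: str) -> Optional[str]:
    """Return the first trycloudflare.com URL in the log text, or None."""
    match = TUNNEL_RE.search(log_text)
    return match.group(0) if match else None


# Illustrative log line; only the URL itself is taken from the output above.
sample_log = "INF |  https://enclosed-oscar-cherry-postage.trycloudflare.com  |"
tunnel_url = find_tunnel_url(sample_log)
print(tunnel_url)
```

In the notebook, `find_tunnel_url(open("/tmp/cloudflared.log").read())` would read the log file written by the cell above.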


In [69]:
!lsof -i:8000
!curl https://enclosed-oscar-cherry-postage.trycloudflare.com/v1/models

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python3 13268 root   33u  IPv4 289752      0t0  TCP *:8000 (LISTEN)
{"object":"list","data":[{"id":"Qwen/Qwen3-0.6B","object":"model","created":1757136192,"owned_by":"vllm","root":"Qwen/Qwen3-0.6B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-fe76005467a349b1853b04aaf9f9d96f","object":"model_permission","created":1757136192,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
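
The `/v1/models` response is dense when printed raw. A small sketch that pulls out only the model id and `max_model_len`, using a trimmed copy of the response above as input (`summarize_models` is a hypothetical helper name):

```python
import json
from typing import List, Optional, Tuple


def summarize_models(payload: str) -> List[Tuple[str, Optional[int]]]:
    """Return (id, max_model_len) for each entry in a /v1/models response."""
    body = json.loads(payload)
    return [(m["id"], m.get("max_model_len")) for m in body.get("data", [])]


# Trimmed copy of the response shown above; the permission list is omitted.
sample = (
    '{"object":"list","data":[{"id":"Qwen/Qwen3-0.6B","object":"model",'
    '"owned_by":"vllm","root":"Qwen/Qwen3-0.6B","max_model_len":4096}]}'
)
models = summarize_models(sample)
print(models)
```

Checking that `max_model_len` matches the `--max-model-len` passed at launch is a quick way to confirm the tunnel reaches the server started earlier.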

In [76]:
%%bash
curl -sS https://enclosed-oscar-cherry-postage.trycloudflare.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"Qwen/Qwen3-0.6B",
    "messages":[{"role":"user","content":"Explain LoRA in one sentence."}],
    "max_tokens":128,
    "temperature":0.7
  }' | jq .

{
  "id": "chatcmpl-dfe763e3fed84483b864cf11e85d3e11",
  "object": "chat.completion",
  "created": 1757136311,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user is asking for a one-sentence explanation of LoRA. Let me start by recalling what LoRA stands for. It's a method in the context of fine-tuning models for large language models, right? So, the key points here are that LoRA allows for small-scale modifications to the model parameters.\n\nWait, I should make sure not to mix up the terms. LoRA is a technique where the fine-tuning is done on a smaller dataset, not a large one. The model's parameters are modified using a small linear transformation. That's important because it makes the model more efficient without requiring massive",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "