<a href="https://colab.research.google.com/github/sandeepkhamari/llm_deploment_using_jupiter_notebook/blob/main/llm_deployment_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Deployment with HuggingFace

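This notebook walks through a minimal LLM deployment with HuggingFace Transformers: it loads `distilgpt2` on a Colab GPU, serves it through a FastAPI endpoint with uvicorn, exposes that endpoint publicly through a Cloudflare quick tunnel, and queries it with `requests`.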

In [7]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

CUDA available: True
GPU: Tesla T4


In [8]:
!pip install -q torch transformers fastapi uvicorn pydantic

In [9]:
%%writefile app.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from fastapi import FastAPI
from pydantic import BaseModel

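MODEL_NAME = "distilgpt2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model once at import time so every request reuses them.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.to(DEVICE)
model.eval()  # inference mode: disables dropout

def generate_text(prompt, max_new_tokens=50):
    # Tokenize the prompt and move the input tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.no_grad():  # no gradients are needed for inference
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
            # GPT-2 models define no pad token; reuse EOS to avoid the generate() warning.
            pad_token_id=tokenizer.eos_token_id,
        )
    # outputs[0] holds the prompt tokens followed by the generated continuation.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)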

app = FastAPI(title="LLM Inference API")

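class PromptRequest(BaseModel):
    prompt: str
    max_tokens: int = 50  # forwarded to generate_text as max_new_tokens

@app.post("/generate")
def generate(req: PromptRequest):
    # The JSON body is validated against PromptRequest before this function runs.
    return {"response": generate_text(req.prompt, req.max_tokens)}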

Overwriting app.py
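
In `app.py`, `max_tokens` is passed straight through to `model.generate`, so a client can request an arbitrarily long generation. One way to guard against that is to clamp the value before generating. This is a minimal sketch: the helper name `clamp_max_tokens` and the bounds of 1 and 256 are assumptions for illustration, not values from the notebook.

```python
def clamp_max_tokens(requested: int, lower: int = 1, upper: int = 256) -> int:
    """Clamp a requested token budget into [lower, upper] (illustrative bounds)."""
    return max(lower, min(requested, upper))


# An oversized request is reduced to the assumed upper bound.
print(clamp_max_tokens(10_000))
```

Inside the `generate` handler this could be applied as `generate_text(req.prompt, clamp_max_tokens(req.max_tokens))`.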


In [10]:
!ls

app.py	__pycache__  sample_data


In [5]:
!uvicorn app:app --host 0.0.0.0 --port 8000

tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 113kB/s]
config.json: 100% 762/762 [00:00<00:00, 5.97MB/s]
vocab.json: 100% 1.04M/1.04M [00:00<00:00, 20.6MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 59.1MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 30.6MB/s]
2025-12-14 06:04:48.092652: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765692288.110352     632 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765692288.115341     632 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1765692288.130265     632 computation_placer.cc:177] computation placer already registered. Please
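
Because `!uvicorn ...` runs in the foreground, that cell keeps running until it is interrupted, so later cells cannot execute while the server is up. One possible workaround is to start the server as a background process and poll the port until it accepts connections. The sketch below makes several assumptions: the helper names `start_server` and `wait_for_port`, the log file name `uvicorn.log`, and the 60-second default timeout are all hypothetical and not part of the original notebook.

```python
import socket
import subprocess
import sys
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # server is not listening yet; retry shortly
    return False


def start_server(port: int = 8000) -> subprocess.Popen:
    """Launch `uvicorn app:app` in the background, logging to uvicorn.log (assumed name)."""
    log = open("uvicorn.log", "w")
    return subprocess.Popen(
        [sys.executable, "-m", "uvicorn", "app:app",
         "--host", "0.0.0.0", "--port", str(port)],
        stdout=log,
        stderr=subprocess.STDOUT,
    )
```

In a notebook cell this might be used as `server = start_server()` followed by `wait_for_port("127.0.0.1", 8000)`; calling `server.terminate()` stops the server again.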

In [11]:
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
!chmod +x cloudflared-linux-amd64
!mv cloudflared-linux-amd64 /usr/local/bin/cloudflared

In [12]:
!cloudflared --version

cloudflared version 2025.11.1 (built 2025-11-07-16:59 UTC)


In [None]:
!cloudflared tunnel --url http://localhost:8000

2025-12-14T06:16:21Z INF Thank you for trying Cloudflare Tunnel. Doing so, without a Cloudflare account, is a quick way to experiment and try it out. However, be aware that these account-less Tunnels have no uptime guarantee, are subject to the Cloudflare Online Services Terms of Use (https://www.cloudflare.com/website-terms/), and Cloudflare reserves the right to investigate your use of Tunnels for violations of such terms. If you intend to use Tunnels in production you should use a pre-created named tunnel by following: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps
2025-12-14T06:16:21Z INF Requesting new quick Tunnel on trycloudflare.com...
2025-12-14T06:16:24Z INF +--------------------------------------------------------------------------------------------+
2025-12-14T06:16:24Z INF |  Your quick Tunnel has been created! Visit it at (it may take some time to be reachable):  |
2025
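
The quick-tunnel hostname has to be read from the cloudflared log before the API can be called. A minimal sketch of extracting it with a regular expression follows. The function name `find_tunnel_url` and the sample log line are hypothetical, and the pattern assumes the URL appears in the log as `https://<name>.trycloudflare.com`.

```python
import re
from typing import Optional

# Assumed shape of a quick-tunnel URL: lowercase letters, digits, and hyphens.
TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(log_text: str) -> Optional[str]:
    """Return the first trycloudflare.com URL found in the log text, or None."""
    match = TUNNEL_URL.search(log_text)
    return match.group(0) if match else None


# Hypothetical log line, modeled on the hostname used later in this notebook.
sample = "INF |  https://translation-lewis-context-plastics.trycloudflare.com  |"
print(find_tunnel_url(sample))
```

Lines that mention `trycloudflare.com` without a full `https://` URL, such as the "Requesting new quick Tunnel" message, do not match and yield `None`.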

In [None]:
# FastAPI serves interactive API docs at the tunnel URL plus /docs, for example:
# https://xxxxx.trycloudflare.com/docs

In [None]:
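import requests

# Replace the hostname with the quick-tunnel URL that cloudflared printed for your session.
url = "https://translation-lewis-context-plastics.trycloudflare.com/generate"
payload = {
    "prompt": "Explain MPI parallel programming in simple terms",
    "max_tokens": 50
}

# Set a timeout so the cell cannot hang indefinitely if the tunnel is unreachable.
response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()  # surface HTTP errors before parsing the body
response.json()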
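
Since `generate_text` decodes the whole output sequence, the `response` field begins with the prompt itself, followed by the continuation. If only the newly generated text is wanted, the prompt can be trimmed on the client side. The helper name `strip_prompt` is an assumption for illustration; it falls back to the full text when the decoded output does not start with the exact prompt string.

```python
def strip_prompt(prompt: str, response_text: str) -> str:
    """Remove a leading copy of the prompt from the generated text, if present."""
    if response_text.startswith(prompt):
        return response_text[len(prompt):].lstrip()
    # Decoding may not reproduce the prompt character for character; return the text unchanged.
    return response_text


# Hypothetical response text, for illustration only.
print(strip_prompt("Hello", "Hello there, world"))
```

With the request above, this might be called as `strip_prompt(payload["prompt"], response.json()["response"])`.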

In [None]:
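# `model` is defined in app.py (the uvicorn process), not in this notebook kernel.
# Importing it here loads a second copy of the model into the notebook process.
from app import model
next(model.parameters()).device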