Local, enterprise-style Retrieval-Augmented Generation (RAG) app with NVIDIA NIM inference, a FastAPI backend, and a minimal web UI.
+---------------------+ +---------------------+ +---------------------+
| UI | HTTP | app | HTTP | nim |
| NGINX + static +-------->+ FastAPI + RAG +-------->+ NVIDIA NIM LLM |
| http://:3000 | /v1/* | http://:8080 | /v1/* | http://:8000 |
+---------------------+ +---------------------+ +---------------------+
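Requests flow left to right: the browser only talks to the FastAPI app, which relays chat requests to NIM's OpenAI-compatible API. A minimal sketch of that relay, assuming the service hostname `nim` from the diagram (the helper names are illustrative, not the repo's actual code):

```python
# Sketch of how the app service might forward a chat request to NIM.
# NIM exposes an OpenAI-compatible /v1/chat/completions endpoint; the
# base URL and function names here are assumptions for illustration.
import json
import urllib.request

NIM_BASE_URL = "http://nim:8000"  # the nim service from the diagram


def build_chat_request(messages, model="nim"):
    """Build the JSON body relayed to NIM's OpenAI-compatible API."""
    return {"model": model, "messages": messages}


def chat(messages):
    """POST the chat request to NIM and return the parsed JSON response."""
    body = json.dumps(build_chat_request(messages)).encode()
    req = urllib.request.Request(
        f"{NIM_BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In the real app, retrieved document chunks would be prepended to `messages` as context before forwarding.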
Copy .env.example to .env and fill in values (it is gitignored):
NGC_API_KEY=...
NIM_MODEL_PROFILE=
NIM_MAX_MODEL_LEN=4096
NIM_MAX_NUM_SEQS=1
LOCAL_NIM_CACHE=/home/your-wsl-user/.cache/nim
Then run in the foreground (you will see logs in your terminal):
docker compose up --build
Notes:
- NIM_MODEL_PROFILE can be left blank to let NIM auto-select a compatible profile.
- NIM_MAX_MODEL_LEN=4096 is a safe default for 8 GB GPUs (e.g., RTX 4070 Laptop).
- If you run Docker from Windows PowerShell, set LOCAL_NIM_CACHE to a Windows path (for example C:\Users\<you>\.cache\nim).
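Assuming the three services from the diagram, the compose wiring might look roughly like this sketch (service names, build contexts, and ports are taken from the diagram; everything else is illustrative, not the repo's actual docker-compose.yml):

```yaml
# Hedged sketch of a compose file for the ui/app/nim services.
services:
  ui:
    image: nginx:alpine
    ports: ["3000:80"]
  app:
    build: ./app
    ports: ["8080:8080"]
    depends_on: [nim]
  nim:
    image: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
    environment:
      - NGC_API_KEY=${NGC_API_KEY}
      - NIM_MODEL_PROFILE=${NIM_MODEL_PROFILE}
      - NIM_MAX_MODEL_LEN=${NIM_MAX_MODEL_LEN}
      - NIM_MAX_NUM_SEQS=${NIM_MAX_NUM_SEQS}
    volumes:
      - ${LOCAL_NIM_CACHE}:/opt/nim/.cache
    ports: ["8000:8000"]
    shm_size: 16gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```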
The UI is a lightweight NGINX-served page for quick testing:
- Chat window + input + send
- "Ingest docs" button to load data from ./data/
- Uses relative /v1/* and /ingest paths (no CORS needed)
To run the NIM container on its own (from PowerShell):
docker run -it --rm `
--gpus all `
--shm-size=16GB `
-e NGC_API_KEY="$env:NGC_API_KEY" `
-e NIM_MODEL_PROFILE="$env:NIM_MODEL_PROFILE" `
-e NIM_MAX_MODEL_LEN=4096 `
-e NIM_MAX_NUM_SEQS=1 `
-p 8000:8000 `
-v "$env:LOCAL_NIM_CACHE:/opt/nim/.cache" `
nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
Use the UI button (top right) or call the endpoint directly:
curl -X POST http://localhost:8080/ingest
By default, it loads .txt and .md files from ./data/.
To refresh the knowledge base, drop new files into ./data/ and re-run /ingest.
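A hedged sketch of what the ingest step might do under the hood: scan ./data/ for .txt and .md files and split each one into overlapping chunks for retrieval (function names and chunk sizes are assumptions, not the repo's actual code):

```python
# Illustrative sketch of an ingest pipeline: load .txt/.md files from
# ./data/ and split them into overlapping character chunks.
from pathlib import Path


def load_documents(data_dir="./data", exts=(".txt", ".md")):
    """Collect matching files as {source, text} records."""
    docs = []
    for path in sorted(Path(data_dir).glob("*")):
        if path.suffix in exts:
            docs.append({"source": path.name,
                         "text": path.read_text(encoding="utf-8")})
    return docs


def chunk(text, size=500, overlap=50):
    """Naive fixed-size character chunking with overlap between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

In a full pipeline, each chunk would then be embedded and written to the vector store so /v1/chat/completions can retrieve it as context.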
From WSL or another bash shell:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nim",
    "messages": [
      {"role": "user", "content": "Summarize the sample document."}
    ]
  }'
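The endpoint returns an OpenAI-style chat completion object, so pulling out the assistant's reply follows the standard OpenAI chat schema (the sample response below is illustrative, not captured from a real run):

```python
# Extract the assistant's reply from an OpenAI-style chat completion.
def extract_answer(response: dict) -> str:
    return response["choices"][0]["message"]["content"]


# Illustrative response shape, not actual model output.
sample = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "The sample document covers..."}}
    ]
}
print(extract_answer(sample))
```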