---

# What RAG is (and why)

Retrieval-augmented generation (RAG) injects *fresh, external* knowledge into the model’s prompt at runtime: you retrieve relevant passages from your data, then ask ChatGPT to answer *using those passages*. This improves factual accuracy and grounds responses in your own docs, wikis, tickets, PDFs, and so on. ([OpenAI Platform][1], [OpenAI Help Center][2])

---

# The minimal architecture

1. **Ingest & chunk** your content
2. **Embed** chunks → vectors (use OpenAI Embeddings)
3. **Index** vectors in a store (e.g., FAISS, pgvector, Pinecone)
4. At query time: **embed the query → retrieve top-k** chunks
5. **Augment the prompt** with retrieved chunks
6. **Generate** the final answer with ChatGPT
7. (Optional) **Re-rank**, **cite sources**, and **guardrail** (answer-only / refuse)

OpenAI provides an embeddings API (e.g., `text-embedding-3-large` for maximum quality, or `text-embedding-3-small` for lower cost). ([OpenAI Platform][3], [OpenAI][4])

---

# Quick-start: Python (vanilla, no frameworks)

> Installs: `pip install openai faiss-cpu tiktoken` (or use pgvector/Weaviate/etc. instead of FAISS)


```python
# Notebook cell: install into the same environment the kernel is running in
import sys
!{sys.executable} -m pip install openai faiss-cpu tiktoken
```


```python
# 1) Setup
from openai import OpenAI
import faiss
import numpy as np
import textwrap

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EMBED_MODEL = "text-embedding-3-large"  # high accuracy; use text-embedding-3-small to save cost

def embed_texts(texts):
    """Embed a list of strings; returns a float32 array of shape (len(texts), dim)."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# 2) Ingest & chunk (toy example)
docs = {
    "doc1.md": "RAG augments LLMs with retrieval from external knowledge bases...",
    "doc2.md": "Use embeddings to index text chunks; query → retrieve → prompt → answer."
}

def chunk(text, max_chars=800):
    # simple splitter; replace with sentence-aware splitter in production
    return textwrap.wrap(text, max_chars)

chunks, meta = [], []
for path, text in docs.items():
    for i, ch in enumerate(chunk(text)):
        chunks.append(ch)
        meta.append({"source": path, "chunk_id": i})

# 3) Build vector index
vecs = embed_texts(chunks)
index = faiss.IndexFlatIP(vecs.shape[1])
# normalize in place so inner product equals cosine similarity
faiss.normalize_L2(vecs)
index.add(vecs)

# 4) Query → retrieve top-k
def retrieve(query, k=4):
    qv = embed_texts([query])
    faiss.normalize_L2(qv)
    # FAISS pads missing results with index -1, so never ask for more than is indexed
    k = min(k, index.ntotal)
    scores, idxs = index.search(qv, k)
    return [
        {"text": chunks[i], "score": float(s), **meta[i]}
        for i, s in zip(idxs[0], scores[0])
        if i >= 0
    ]

# 5) Augment prompt & 6) Generate with ChatGPT
CHAT_MODEL = "gpt-4o-mini"  # pick your chat model

def answer(query):
    ctx = retrieve(query)
    context_block = "\n\n".join(
        f"[{c['source']}#{c['chunk_id']}] {c['text']}" for c in ctx
    )
    system = "You are a helpful assistant. Use ONLY the provided context. If missing, say you don't know."
    user = f"Question: {query}\n\nContext:\n{context_block}\n\nAnswer with citations like [source#chunk]."
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
        temperature=0.2,
    )
    return resp.choices[0].message.content, ctx

print(answer("What is RAG and how does it help?")[0])
```


**Why this works:** OpenAI embeddings turn text into vectors; nearest-neighbor search finds relevant chunks; you pass those chunks to ChatGPT in a tight prompt so it answers with grounded citations. ([OpenAI Platform][3])
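
To make the nearest-neighbor step concrete, here is a minimal NumPy sketch of what an inner-product search over L2-normalized vectors computes. The 3-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `normalize_rows` and `top_k` are hypothetical helper names, not part of FAISS or the OpenAI SDK.

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against division by zero

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Return (row index, cosine score) pairs for the k most similar rows."""
    q = normalize_rows(query_vec.reshape(1, -1))
    d = normalize_rows(doc_vecs)
    scores = (d @ q.T).ravel()        # inner product of unit vectors = cosine similarity
    order = np.argsort(-scores)[:k]   # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy "embeddings" (assumed values, for illustration only)
doc_vecs = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [3.0, 3.0, 0.0]],
    dtype="float32",
)
query_vec = np.array([2.0, 0.1, 0.0], dtype="float32")

hits = top_k(query_vec, doc_vecs, k=2)
print(hits)
```

Row 0 ranks first because it points in almost the same direction as the query, and vector length plays no part once the rows are normalized. A flat index does this same brute-force comparison; approximate indexes trade some exactness for speed on large corpora.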

---

# Quick-start: Node.js

```js
import OpenAI from "openai";
const client = new OpenAI();

const EMBED_MODEL = "text-embedding-3-small";

async function embed(texts) {
  const res = await client.embeddings.create({ model: EMBED_MODEL, input: texts });
  return res.data.map(d => d.embedding);
}

// Use your vector DB client (e.g., pgvector) for upsert/query...
// Then, at query time:

async function ragAnswer(query, retrievedChunks) {
  const context = retrievedChunks.map(c => `[${c.source}#${c.id}] ${c.text}`).join("\n\n");
  const system = "Use only the provided context. If insufficient, say you don't know.";
  const user = `Q: ${query}\n\nContext:\n${context}\n\nAnswer with citations.`;
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "system", content: system }, { role: "user", content: user }],
    temperature: 0.2,
  });
  return res.choices[0].message.content;
}
```

---


# Good defaults & tips

* **Chunking:** Aim for \~200–500 tokens per chunk (very roughly 800–2,000 characters of English text, at about 4 characters per token); overlap 10–20% to keep context intact. Sentence-aware splitting usually improves retrieval quality.
* **Metadata:** Keep `source`, `section`, `created_at`, `tags`. You can filter by metadata before similarity search (e.g., only “policies”).
* **Top-k:** Start with `k=4`–`8`. Tune via evaluation.
* **Model choices:** Use `text-embedding-3-large` when quality matters, `text-embedding-3-small` for scale/cost. Any ChatGPT family chat model can do the generation step. ([OpenAI Platform][3], [OpenAI][4])
* **Prompting:** Tell the model to (a) *quote or cite* chunks, (b) *refuse* when context is missing, and (c) *avoid adding facts not in context*. This aligns with OpenAI’s RAG best-practices. ([OpenAI Platform][1])
* **Evaluation:** Create a small question set with gold answers; run A/B on chunk sizes, `k`, models, and prompts. OpenAI’s guide discusses accuracy optimization levers. ([OpenAI Platform][1])
* **Security/PII:** Pre-filter or redact sensitive text before indexing. Add allow-lists by source.
* **Caching:** Cache embeddings, retrieval results, and even full answers for repeated queries.
* **Reranking (optional):** After initial vector search, call ChatGPT (or a smaller reranker) to score snippets *just* for relevance, then pass only the best 3–5 to the final prompt (improves precision on long corpora). ([OpenAI Platform][1])
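
As one way to act on the chunking tip above, here is a rough sketch of a sentence-aware splitter with overlap. It is an illustration under assumptions, not a production splitter: the regex-based sentence detection is naive, overlap is measured in whole sentences rather than as an exact 10–20%, and `split_sentences` and `chunk_with_overlap` are hypothetical names.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_with_overlap(text: str, max_chars: int = 800, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks of about max_chars characters.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one. A single sentence longer than max_chars is kept
    whole, so that chunk can exceed the budget.
    """
    chunks, current = [], []
    for sent in split_sentences(text):
        if current and len(" ".join(current + [sent])) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop carried-over sentences that would push the next chunk over budget.
            while current and len(" ".join(current + [sent])) > max_chars:
                current.pop(0)
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Tiny budget so the overlap is visible on a toy input
demo_chunks = chunk_with_overlap("A one. B two. C three. D four.", max_chars=16)
for c in demo_chunks:
    print(repr(c))
```

On this toy input the call yields three chunks, each sharing one sentence with its neighbor. For real documents, a tokenizer-based budget and a proper sentence segmenter would likely work better than character counts and a regex.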

---

# Variants with OpenAI features

* **Assistants + Files / “file search”:** If you prefer a managed approach, the Assistants API has built-in retrieval on uploaded files—handy for quick prototypes and internal tools. (See “retrieval/file search” features in the OpenAI docs.) ([OpenAI Platform][1])
* **Structured outputs:** Ask the model for JSON schemas (e.g., FAQ extraction) to power downstream UIs.
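
For the structured-outputs variant, a hedged sketch of what the FAQ-extraction request and parsing might look like. The schema, the `faq_list` name, and the helpers `build_faq_request` and `parse_faqs` are assumptions made for illustration. The `response_format` shape follows the Chat Completions JSON-schema option as I understand it, so check it against the current API reference before relying on it. The code only builds the request arguments and parses a sample reply; it makes no API call.

```python
import json

# Hypothetical schema for FAQ extraction (not from any official example)
FAQ_ITEM_KEYS = ["question", "answer", "source"]
FAQ_SCHEMA = {
    "type": "object",
    "properties": {
        "faqs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {key: {"type": "string"} for key in FAQ_ITEM_KEYS},
                "required": FAQ_ITEM_KEYS,
                "additionalProperties": False,
            },
        }
    },
    "required": ["faqs"],
    "additionalProperties": False,
}

def build_faq_request(context_block: str, model: str = "gpt-4o-mini") -> dict:
    """Build keyword arguments intended for client.chat.completions.create(**kwargs)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Extract FAQs using ONLY the provided context."},
            {"role": "user", "content": f"Context:\n{context_block}"},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "faq_list", "schema": FAQ_SCHEMA, "strict": True},
        },
    }

def parse_faqs(raw: str) -> list[dict]:
    """Parse the model's JSON reply and check that each item has the expected keys."""
    faqs = json.loads(raw)["faqs"]
    for item in faqs:
        missing = set(FAQ_ITEM_KEYS) - item.keys()
        if missing:
            raise ValueError(f"FAQ item is missing keys: {sorted(missing)}")
    return faqs

request_kwargs = build_faq_request("[doc1.md#0] RAG augments LLMs with retrieval.")
# Stand-in for resp.choices[0].message.content (made-up reply, for illustration)
sample_reply = '{"faqs": [{"question": "What is RAG?", "answer": "Retrieval plus generation.", "source": "doc1.md#0"}]}'
print(parse_faqs(sample_reply))
```

Validating the reply on your side is still worthwhile even with a schema in the request, since downstream UIs tend to fail badly on a missing field.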

---

# When NOT to use RAG

* Your task is *pure reasoning* on fully provided input (no external knowledge).
* You control the schema tightly and need deterministic outputs → consider tool calls/functions or classical search.

---

If you want, tell me:

* your data sources (PDFs, Confluence, Git repos, Tickets),
* your stack preference (FAISS/pgvector/Pinecone),
* constraints (cost, latency, privacy),

…and I’ll tailor a production-grade RAG plan (indexing scripts, schemas, eval harness) for your setup.

[1]: https://platform.openai.com/docs/guides/optimizing-llm-accuracy/retrieval-augmented-generation-rag?utm_source=chatgpt.com "OpenAI Guide: Optimizing LLM Accuracy with RAG"
[2]: https://help.openai.com/en/articles/8868588-retrieval-augmented-generation-rag-and-semantic-search-for-gpts?utm_source=chatgpt.com "Retrieval Augmented Generation (RAG) and Semantic ..."
[3]: https://platform.openai.com/docs/guides/embeddings?utm_source=chatgpt.com "OpenAI Embeddings Guide"
[4]: https://openai.com/index/new-embedding-models-and-api-updates/?utm_source=chatgpt.com "New embedding models and API updates"
