# RAG Quick-start

*A minimal, fully-working Retrieval-Augmented Generation demo.*

We will:

1. **Explain embeddings & RAG** (2 short theory cells)  
2. **Install / configure** required libraries  
3. **Index every `.txt`** file under `./data/` into an in-memory vector store  
4. **Ask the LLM** a question with retrieved context  
5. Show how to **extend / modify** the pipeline


### Word Embeddings: A Compact Overview

Embeddings represent text as numerical vectors, which makes it possible to compare texts by semantic similarity. A well-trained embedding model maps words with similar meanings to similar vectors. 
**Word embeddings** are vector representations of individual words.  

- **Embedding = vector** that captures semantics.
- Similar meanings ⇒ nearby vectors.
- Works for words *and* larger chunks (sentences / paragraphs).

Links for the interested: 
- [What are word embeddings? (YouTube)](https://www.youtube.com/watch?v=wgfSDrqYMJ4)
- [A Crash Course on Building RAG Systems](https://www.dailydoseofds.com/a-crash-course-on-building-rag-systems-part-1-with-implementations/)

<img src="https://www.nlplanet.org/course-practical-nlp/_images/word_embeddings.png" alt="Word Embeddings" width="600"/>


### Chunk embeddings:

Instead of embedding individual **words**, we can embed larger **text chunks** (phrases, sentences, or paragraphs). A well-trained model maps semantically related sentences 
- “I am happy”
- “I am glad”

to nearby vectors, while an unrelated sentence such as 
- “The dog ate the homework”

lies far away in the embedding space.

#### **Key rule:** similar meaning ⇒ similar vectors.
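
To make "nearby vectors" concrete: a common measure of closeness is cosine similarity. The sketch below uses made-up 3-dimensional vectors for the three sentences above. The values are hypothetical, chosen only for illustration, and do not come from any real embedding model (real embeddings have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for this example.
toy_vectors = {
    "I am happy": [0.9, 0.1, 0.0],
    "I am glad": [0.8, 0.2, 0.0],
    "The dog ate the homework": [0.0, 0.1, 0.9],
}

happy = toy_vectors["I am happy"]
for sentence, vector in toy_vectors.items():
    print(f"{sentence!r}: {cosine_similarity(happy, vector):.3f}")
```

With these toy values, "I am glad" scores close to 1 against "I am happy", while the homework sentence scores close to 0.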

# Retrieval-Augmented Generation (RAG) Workflow

The RAG workflow is built around supplying an LLM with relevant context. Source documents are split into chunks of text, and each chunk is embedded as discussed above. When a user sends a query (prompt) to the RAG system, the query itself is embedded. The chunks whose vectors are most similar to the embedded query are _retrieved_ and placed in the context sent to the LLM together with the original query.

<img src="https://embed.filekitcdn.com/e/k7YHPN24SoxyM8nGKZnDxa/miU72TZBCNDc2wBTyruAC/email" alt="RAG workflow" width="1200"/>

## Numbered Steps
| # | Phase | Action |
|---|-------|--------|
| **1** | Ingestion | **Encode docs** → split into chunks (≈200-1,000 tokens with 10-20% overlap; for English text, 1 token ≈ ¾ of a word) and embed each chunk. |
| **2** | Ingestion | **Index vectors** → store in a vector DB with metadata (source, page, pos). |
| **3** | Retrieval | **Encode query** → generate the embedding of user query/prompt (what you write to the model). |
| **4** | Retrieval | **Similarity search** → find top-k nearest chunks (assumed to be top-k most relevant chunks). |
| **5** | Retrieval | **Return similar docs** → pass back the retrieved chunks. |
| **6** | Augmentation | **Build prompt** = {retrieved chunks + user query}. |
| **7** | Generation | **LLM response** grounded in the supplied context. |
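
The steps in the table can be sketched end to end without any API calls. This is a toy illustration, not the pipeline used in the demo below: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and the three example chunks are invented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Steps 1-2: embed each chunk and index it (here: a plain list of pairs).
chunks = [
    "game theory studies strategic interaction between players",
    "quantum mechanics describes matter at atomic scales",
    "artificial intelligence is intelligence exhibited by machines",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]


def toy_retrieve(query, k=1):
    # Steps 3-5: embed the query, rank chunks by similarity, return the top k.
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, retrieved):
    # Step 6: combine the retrieved chunks and the user query into one prompt.
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


question = "what is artificial intelligence"
print(build_prompt(question, toy_retrieve(question, k=1)))
# Step 7 would send this prompt to an LLM.
```

Word counts only match on exact word overlap, which is exactly the limitation that learned embeddings are meant to overcome.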

### Typical Applications
- semantic search / “chat with docs”  
- personalized recommendations  
- anomaly / fraud detection  
- domain-aware summarization  
- multi-step agents & tool-calling


# Demo starts here:
If you have not yet looked at `llm_quickstart.ipynb`, it may be worth doing so first.

### Data: 
I have three `.txt` files in the `data` folder, sourced from the Wikipedia pages with the same names. These are the documents that will be embedded and stored in the vector database in this demo. 

You can change, add or remove .txt files as you wish - just remember to reinitialize or update the vector store if you do. 

### Note: 
You will need the same `.env` file, `.gitignore`, etc. as before. The project should look something like this:
```
.
├── .env
├── README.md
├── data
│   ├── .DS_Store
│   ├── Artificial_intelligence.txt
│   ├── Game_theory.txt
│   └── Quantum_mechanics.txt
├── llm_quickstart.ipynb
├── rag_quickstart.ipynb
└── venv

```
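
A missing key usually only surfaces later as an authentication error, so a quick check up front can save time. A minimal sketch, assuming the key is named `OPENAI_API_KEY` as in the setup cell comments; `missing_keys` is a hypothetical helper, not part of any library.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Run this after load_dotenv() so values from .env are already in os.environ.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```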

## Setup:

In [1]:
%pip install -qU langchain-text-splitters langchain-community langgraph langchain-openai langchain-core\
   "langchain[openai]"

Note: you may need to restart the kernel to use updated packages.


#### Load API Keys & Initialize Components

- `llm`: the red brain at the bottom center of the figure above.
- `embeddings`: the brain at the top center of the figure above.
- `vector_store`: the vector store (also called a vector database) at the top right of the figure above.

In [2]:
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
load_dotenv()

llm = init_chat_model("gpt-4o-mini", model_provider="openai")  # expects OPENAI_API_KEY in .env
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")  # expects OPENAI_API_KEY in .env
vector_store = InMemoryVectorStore(embeddings)  # in-memory store; contents are lost when the kernel restarts

#### Encode and Index source data
This cell corresponds to steps 1 and 2 in the figure above. 
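
Before running the real splitter, it may help to see what chunk size and overlap mean. The sketch below is a deliberately naive fixed-width window over characters. `naive_chunks` is a hypothetical helper, and it is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph and line breaks).

```python
def naive_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
        start += step
    return chunks


for chunk in naive_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each printed chunk starts with the last three characters of the previous one, which is the role `chunk_overlap` plays in the cell below.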

In [3]:
# RAG setup for *all* .txt files under ./data/
from pathlib import Path
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import hub  # prompt loader

# Load text files
data_path = Path("data")
docs = [
    Document(page_content=fp.read_text(encoding="utf-8"), metadata={"source": fp.name})
    for fp in data_path.glob("*.txt") # any .txt files you add to data will be ingested
]

# Chunk → Embed → Index
splits = RecursiveCharacterTextSplitter(
    chunk_size=1_000,  # maximum chunk length, measured in characters by default (length_function=len)
    chunk_overlap=200,  # characters shared between neighboring chunks, to avoid cutting important sentences in half
).split_documents(docs)

_ = vector_store.add_documents(splits)  # embeds and indexes the chunks with the embedding model attached to the vector store


#### Define a retrieve function
Retrieve corresponds to steps 3, 4 and 5 in the figure above.


In [4]:
def retrieve(query: str, k: int = 5):
    # Return the k chunks in the vector store that are most similar to the query
    retrieved_docs = vector_store.similarity_search(query=query, k=k)
    return retrieved_docs

#### Define a prompt template
The prompt template corresponds to step 6 in the figure above. 


This time we pull a prompt template from LangChain's prompt hub. 
Look at the output of this cell and compare it with `llm_quickstart.ipynb` for additional clarity.

In [6]:
prompt = hub.pull("rlm/rag-prompt")
example_messages = prompt.invoke(
    {"context": "(context goes here)", "question": "(question goes here)"}
).to_messages()
print(example_messages)

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: (question goes here) \nContext: (context goes here) \nAnswer:", additional_kwargs={}, response_metadata={})]


#### Define a generate function
Generate corresponds to step 7 in the figure above. 

In [72]:
def generate(context, query: str):
    # build context string
    docs_content = "\n\n".join(doc.page_content for doc in context)
    messages = prompt.invoke({"question": query, "context": docs_content}).to_messages()
    response = llm.invoke(messages)
    return response, messages


In [73]:
# Define an example question and number of documents to use in context
example_question = "What is AI?"
num_docs_in_context = 5

# Retrieve context
example_context = retrieve(
    query=example_question,
    k=num_docs_in_context)

# Generate LLM response based on context and example question
# We capture example_msg as well to get better insight into the generate function and the prompt template
example_response, example_msg = generate(
    context=example_context,
    query=example_question)

#### Inspecting the query, prompt and response

Note that there are a couple of sentences before the question. We did not write those; they come from the prompt template pulled from LangChain's prompt hub. 

The question is the example question we defined a couple of lines above.

The context is the retrieved chunks, provided by the retrieve function.

The answer is the LLM-generated answer. 

In [74]:
# Inspecting how things work. 
print("\n> ## Prompt to LLM: ## ")
for msg in example_msg:
    role = msg.__class__.__name__.replace("Message", "").upper()
    print(f"\n[{role}]\n{msg.content.strip()}\n{'─'*60}")
print("\n> ## Response: ## ") 
print(example_response.content)



> ## Prompt to LLM: ## 

[HUMAN]
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is AI? 
Context: Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs.

Artificial intelligent (AI) agents are software entities designed to perceive their environment, make decisions, and take actions autonomously to achieve specific goals. These agents can interact with users, their environment, or other agents. AI agents are used in various applications, i

## Simple Batch Work Example
The cell below produced `rag_responses.json`. Inspect it and think about what you see.

Each entry has three keys:

```json
{
  "question": "...",
  "context": [ { "text": "...", "source": "..." }, … ],
  "response": "..."
}
```
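
A few structural checks on these entries can be automated, for example source diversity and verbatim duplication among the retrieved chunks. A minimal sketch, assuming entries shaped like the JSON above; `summarize_entry` is a hypothetical helper and the sample entry is invented for illustration.

```python
def summarize_entry(entry):
    """Report source diversity and verbatim chunk duplication for one result entry."""
    texts = [chunk["text"] for chunk in entry["context"]]
    sources = sorted({chunk["source"] for chunk in entry["context"]})
    return {
        "question": entry["question"],
        "num_chunks": len(texts),
        "sources": sources,
        "duplicate_chunks": len(texts) - len(set(texts)),
    }


# Invented sample entry with the same keys as rag_responses.json.
sample = {
    "question": "What is a Nash equilibrium?",
    "context": [
        {"text": "chunk A", "source": "Game_theory.txt"},
        {"text": "chunk A", "source": "Game_theory.txt"},
        {"text": "chunk B", "source": "Artificial_intelligence.txt"},
    ],
    "response": "...",
}
print(summarize_entry(sample))
```

Checks such as relevance and faithfulness still need a human (or a separate evaluation step); this only covers what can be counted.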

### What to Inspect

| Element        | Why it matters                                              | Checks & insights                                                                                                                                                                                           |
| -------------- | ----------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`question`** | Question you want answered.                                         | ✓ Is it answered directly and fully?                                                                                                                                                                        |
| **`context`**  | Chunks retrieved by the vector search (*steps 3-5 in RAG*). | • Relevance: do all chunks actually address the question? <br>• Sufficiency: is key info missing? <br>• Duplication: repeated text may waste tokens. <br>• Source diversity: are multiple docs represented? |
| **`response`** | LLM answer (*steps 6-7*).                                   | • Faithfulness: only uses facts from `context`? <br>• Conciseness: short, direct, no padding. <br>• Overconfidence: admits “don’t know” when context is thin.                                               |

### Reliability Signals

* **Hallucination risk** increases if the context lacks the needed fact. See the “King Blatand” question: a historical question whose answer is not represented in the vector store. The model defaults to its learned knowledge. In expert domains this may be undesirable. We would prefer the model to say "The relevant information is not available in context" or something similar. 
* **Over- or under-retrieval**: too many chunks → dilution and increased noise; too few → missing facts.
* **Source ambiguity**: similar passages from the same file can bias the answer.
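
One possible guard against over-retrieval is to drop weak matches instead of always passing `k` chunks. A minimal sketch, assuming the store exposes `similarity_search_with_score` returning `(document, score)` pairs where a higher score means more similar, as LangChain's `InMemoryVectorStore` does with cosine similarity. Other stores may return distances, where lower is better. `retrieve_with_min_score`, `StubStore`, and the cutoff `0.3` are hypothetical placeholders.

```python
def retrieve_with_min_score(store, query, k=5, min_score=0.3):
    """Return up to k documents whose similarity score is at least min_score."""
    scored = store.similarity_search_with_score(query, k=k)
    return [doc for doc, score in scored if score >= min_score]


class StubStore:
    """Stand-in for a vector store so the example runs without API calls."""

    def similarity_search_with_score(self, query, k=4):
        # Invented hits; a real store would return Document objects.
        hits = [("relevant chunk", 0.82), ("weak match", 0.12)]
        return hits[:k]


print(retrieve_with_min_score(StubStore(), "What is AI?", k=2))
```

If nothing passes the cutoff, the function returns an empty list, which the caller can treat as "no usable context" rather than sending unrelated chunks to the LLM.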

> **Takeaway:** a trustworthy RAG answer should be *verifiable* (fact present in context) and *non-speculative* (no claims beyond context). This simple RAG makes claims beyond its context, which is a real danger. 
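
One way to reduce answers that go beyond the context is a stricter prompt. A minimal sketch in plain Python: `STRICT_TEMPLATE` and `build_strict_prompt` are hypothetical names, the wording is only a starting point, and nothing here guarantees that the model will comply.

```python
STRICT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, reply exactly: "
    '"The relevant information is not available in context."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_strict_prompt(question, chunks):
    """Fill the template with the retrieved chunk texts and the user question."""
    context = "\n\n".join(chunks) if chunks else "(no context retrieved)"
    return STRICT_TEMPLATE.format(context=context, question=question)


print(build_strict_prompt("Who was king Blatand?", ["Game theory is the study of ..."]))
```

The resulting string could be passed to `llm.invoke(...)` in place of the messages built from the hub prompt, and the batch cell below rerun to compare the answers.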



In [None]:
import json

questions = [
    "What is a nash equilibrium?",
    "Tell me about the fundamental concepts of Quantum Mechanics.",
    "Explain machine perception as if I'm 5 yo.",
    "Who was king blatand? It is very important that you answer me!"
]
k = 3
results = []

for q in questions:
    docs = retrieve(q, k)
    response, _ = generate(docs, q)
    results.append({
        "question": q,
        "context": [
            {"text": d.page_content, "source": d.metadata.get("source", "")}
            for d in docs
        ],
        "response": response.content.strip()
    })

# Uncomment to write the results to a JSON file
# with open("rag_responses.json", "w", encoding="utf-8") as f:
#     json.dump(results, f, ensure_ascii=False, indent=2)
