## Data Ingestion — Preparing Documents for RAG

***Goal:***  
Learn how to load, clean, and normalize documents from various sources (PDFs, text files, APIs, databases) for downstream chunking and embedding.

---

***Why this step matters:***
- **Garbage in → Garbage out:** Quality of retrieved answers depends on quality of source text.  
- **Format diversity:** Sources can be PDFs, HTML, JSON, CSV, DBs.  
- **Preprocessing cost:** Better to normalize early than patch later.


### Common Sources for RAG Pipelines

***Local Files:***
- PDFs
- Word docs (.docx)
- Plain text (.txt)
- Markdown (.md)

***Web Sources:***
- HTML pages
- API responses (JSON/XML)

***Databases:***
- SQL (MySQL, PostgreSQL, SQLite)
- NoSQL (MongoDB, Elasticsearch)

***Other:***
- Email archives
- Spreadsheets (Excel, CSV)
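
For the text-like formats listed above, a minimal loader can be sketched with the standard library alone. `load_document` and `_TextExtractor` are hypothetical names introduced for illustration, and the sketch assumes every file is UTF-8 encoded. PDFs, .docx, Excel files, and databases need dedicated libraries and are not covered here.

```python
import csv
import io
import json
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def load_document(path):
    """Return the text content of a file, chosen by its extension (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    raw = path.read_text(encoding="utf-8")  # assumption: UTF-8 input
    if suffix in (".txt", ".md"):
        return raw
    if suffix == ".json":
        # Re-serialize so every record is rendered in a consistent layout
        return json.dumps(json.loads(raw), ensure_ascii=False, indent=2)
    if suffix == ".csv":
        rows = csv.reader(io.StringIO(raw))
        return "\n".join(" | ".join(row) for row in rows)
    if suffix in (".html", ".htm"):
        parser = _TextExtractor()
        parser.feed(raw)
        return "\n".join(parser.parts)
    raise ValueError(f"Unsupported file type: {suffix}")
```

Dispatching on the extension keeps each format's quirks in one place, so the later cleaning and chunking steps only ever see plain strings.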


### Standard Data Ingestion Pipeline

***1. Load:*** Read the raw document from source.  
***2. Parse:*** Extract text and basic structure (headings, tables, metadata).  
***3. Clean:*** Remove noise (extra spaces, headers/footers, HTML tags).  
***4. Normalize:*** Convert to UTF-8, unify newline style, lowercase (if applicable).  
***5. Store:*** Save in a standardized internal format (JSON or plain text).
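
The Normalize step can be sketched as follows. `normalize_text` is a hypothetical helper, and the sketch assumes the raw bytes are already UTF-8; a source in another encoding would need that encoding passed to `decode` instead. The NFKC Unicode normalization is an addition beyond the steps listed above, included because PDF extraction often yields ligature characters.

```python
import unicodedata


def normalize_text(raw_bytes, lowercase=False):
    """Decode bytes to str and normalize newlines and Unicode (hypothetical helper)."""
    # Decode as UTF-8; undecodable bytes become U+FFFD instead of raising
    text = raw_bytes.decode("utf-8", errors="replace")
    # Unify Windows (\r\n) and old Mac (\r) newlines to \n
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NFKC folds compatibility characters, e.g. the "ﬁ" ligature becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text


sample = "Eﬃcient ﬁne-tuning\r\nLine two\rLine three".encode("utf-8")
print(repr(normalize_text(sample)))
# 'Efficient fine-tuning\nLine two\nLine three'
```

Lowercasing is off by default because many embedding models are case-aware, and names or acronyms can lose meaning when folded.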


### Your Source Documents Determine What Your RAG Can Answer

Think of it like this:

The retrieval step can only find information that exists in your document store.

The generation step (LLM) bases its answer on the retrieved text.

So if your PDF is a research paper about transformers, your RAG system will be good at answering:

- "What is the architecture of the proposed model?"

but useless for:

- "What was Microsoft's revenue in 2024?"



In [6]:
import fitz  # PyMuPDF (pip install pymupdf)

def load_pdf(path):
    """Extract text from a PDF file."""
    # The context manager closes the file even if extraction fails
    with fitz.open(path) as doc:
        pages = [page.get_text() for page in doc]
    return "\n".join(pages).strip()

# Load your research paper
pdf_path = "E:/Sujal/Machine learning projects/Rag impletation/data/RAW.pdf"

pdf_text = load_pdf(pdf_path)

# Preview first 500 characters
print(pdf_text[:500])


Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev


### Clean the PDF Text

In [9]:
import re

def clean_text(text):
    """Clean the extracted text."""
    # Remove page numbers (standalone digits on a line) first, so the blank
    # lines they leave behind are collapsed by the steps below
    text = re.sub(r'^[ \t]*\d+[ \t]*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'[ \t]+', ' ', text)         # collapse spaces/tabs
    text = re.sub(r' *\n *', '\n', text)        # remove spaces around newlines
    text = re.sub(r'\n+', '\n', text)           # collapse multiple newlines
    # Strip leading/trailing whitespace
    return text.strip()

# Clean your extracted text
cleaned_text = clean_text(pdf_text)

# Preview first 5000 characters
print(cleaned_text[:5000])


Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov

### Chunking
#### Why We Chunk

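- **Smaller chunks tend to retrieve better:** each embedding covers one focused passage, so a query can match the relevant text instead of a diluted average of a whole document.  
- **Overlap preserves context:** repeating the tail of one chunk at the head of the next keeps a sentence that straddles a boundary intact in at least one chunk.  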
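
The arithmetic behind fixed-size chunking with overlap can be checked on placeholder text. This is a minimal sketch: `sliding_windows` is a hypothetical helper, and the 2,000-character string is made up for illustration. Each window starts `chunk_size - overlap` characters after the previous one, so the chunk count is the text length divided by that step, rounded up.

```python
import math


def sliding_windows(text, chunk_size, overlap):
    """Yield (start, chunk) pairs; each window starts chunk_size - overlap after the last."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        yield start, text[start:start + chunk_size]


text = "".join(str(i % 10) for i in range(2000))  # 2,000 placeholder characters
windows = list(sliding_windows(text, chunk_size=800, overlap=100))

print([start for start, _ in windows])        # [0, 700, 1400]
print(len(windows) == math.ceil(2000 / 700))  # True
# The last 100 characters of one chunk are the first 100 of the next
print(windows[0][1][-100:] == windows[1][1][:100])  # True
```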



In [10]:
# --- Cell 6: Chunk text ---
def chunk_text(text, chunk_size=500, overlap=50):
    """
    Split text into fixed-size character chunks for retrieval.

    chunk_size: number of characters per chunk
    overlap: number of characters shared by consecutive chunks
    """
    if not 0 <= overlap < chunk_size:
        # A step of zero or less would make the loop below run forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size].strip())
    return chunks

# Create chunks from cleaned text
chunks = chunk_text(cleaned_text, chunk_size=800, overlap=100)

print(f"Total chunks created: {len(chunks)}")
print("--- Preview first chunk ---")
print(chunks[0])


Total chunks created: 376
--- Preview first chunk ---
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqi


In [12]:
# --- Cell 7: Embed and store chunks ---
from sentence_transformers import SentenceTransformer
import numpy as np

# Load an embedding model (small & fast for learning)
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate embeddings for each chunk
embeddings = embed_model.encode(chunks, convert_to_numpy=True)

# In-memory "vector store"
vector_store = {
    "chunks": chunks,
    "embeddings": embeddings
}

print(f"Stored {len(chunks)} chunks with embeddings.")
print(f"Embedding shape: {embeddings.shape}")


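ModuleNotFoundError: No module named 'sentence_transformers'

This error means the package is not installed in the notebook's environment. Install it with `pip install sentence-transformers`, then rerun the cell.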

In [13]:
# --- Cell 8: Simple retrieval (cosine similarity) ---
import numpy as np

def normalize(v):
    """Return an L2-normalized copy of a 2D numpy array of row vectors."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    # avoid division by zero
    norms[norms == 0] = 1.0
    return v / norms

# normalize embeddings once for cosine similarity
embeddings_norm = normalize(vector_store['embeddings'].astype(float))

def retrieve(query, top_k=5):
    """
    Return the top_k chunks most similar to query.
    Uses the sentence-transformers model to embed the query.
    """
    q_emb = normalize(embed_model.encode([query], convert_to_numpy=True).astype(float))
    # Both sides are unit length, so the dot product is the cosine similarity
    sims = embeddings_norm @ q_emb[0]  # shape: (num_chunks,)
    idx = np.argsort(-sims)[:top_k]  # top-k descending
    results = [
        {"score": float(sims[i]), "chunk": vector_store['chunks'][i], "index": int(i)}
        for i in idx
    ]
    return results

# Demo retrieval
q = "Who are the authors and contributors of LLaMA 2?"
res = retrieve(q, top_k=3)
for i, r in enumerate(res, 1):
    print(f"Result {i} — score: {r['score']:.4f}\n{r['chunk'][:400]}\n---\n")


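NameError: name 'vector_store' is not defined

This error follows from the failed embedding cell: `vector_store` was never created. Once the embedding cell runs successfully, rerun this cell.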
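
The ranking inside `retrieve` comes down to cosine similarity, which can be checked without loading a model. The sketch below uses plain Python and made-up 3-dimensional vectors; `cosine_similarity`, `toy_store`, and `query_vec` are hypothetical names, and the numbers are illustrative rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions
toy_store = {
    "authors": [1.0, 0.0, 0.0],
    "training data": [0.0, 1.0, 0.0],
    "author affiliations": [0.8, 0.0, 0.6],
}
query_vec = [1.0, 0.0, 0.0]

ranked = sorted(
    toy_store,
    key=lambda name: cosine_similarity(toy_store[name], query_vec),
    reverse=True,
)
print(ranked)  # ['authors', 'author affiliations', 'training data']
```

Because cosine similarity ignores vector length, normalizing every vector once up front lets a single dot product stand in for the full formula at query time.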