# **Building RAG with Open-Source Hugging Face Models**

## **What's Covered?**
1. Introduction to Local RAG with Open-Source Models
2. Setting Up the Environment
3. Loading and Processing Documents
4. Creating Vector Store with Hugging Face Embeddings
5. Setting Up Local LLM with Hugging Face Pipeline
6. Building the RAG Chain with LCEL
7. Querying the System


## **1. Introduction to Local RAG with Open-Source Models**

### **Architecture Overview**
1. **Document Loading**: Load text files from a local folder
2. **Text Splitting**: Break documents into manageable chunks
3. **Embedding**: Convert chunks to vectors using Hugging Face embeddings
4. **Vector Store**: Store embeddings in ChromaDB for retrieval
5. **Retrieval**: Find relevant chunks for a query using MMR search
6. **Generation**: Use a local LLM to generate answers from the retrieved context

## **2. Setting Up the Environment**

### **Required Libraries**
We'll install the necessary packages for our RAG system:
- `langchain` and `langchain-community`: Core LangChain functionality
- `langchain-chroma`: ChromaDB integration
- `langchain-huggingface`: Hugging Face embeddings integration
- `sentence-transformers`: Required for embedding models
- `transformers`: Hugging Face transformers library
- `torch`: PyTorch for model inference
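- `accelerate`: Device placement and memory-efficient model loading
- `bitsandbytes`: For model quantization (optional; not included in the install commands below)
- `chromadb`: The underlying vector database used by `langchain-chroma`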

In [30]:
# Install required packages
# Note: Uncomment the lines below and run this cell once. It may take several minutes to complete.

#! pip install langchain langchain-community langchain-chroma langchain-huggingface
#! pip install sentence-transformers transformers torch accelerate
#! pip install chromadb

## **3. Loading and Processing Documents**

### **Step 3.1: Load Documents from Local Folder**
We'll use `DirectoryLoader` and `TextLoader` to load all `.txt` files from the `data/` folder.

In [31]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

# Load all .txt files from the data folder
loader = DirectoryLoader(
    'data/', 
    glob="*.txt", 
    show_progress=True, 
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'}  # Ensure proper encoding
)

# Load documents
documents = loader.load()

print(f"✓ Loaded {len(documents)} documents")
print(f"✓ First document preview: {documents[0].page_content[:200]}...")

100%|██████████| 3/3 [00:00<00:00, 54.12it/s]

✓ Loaded 3 documents
✓ First document preview: Alzheimer's disease (AD) is a neurodegenerative disease and is the most common form of dementia, accounting for around 60–70% of cases. 
The most common early symptom is difficulty in remembering rece...





### **Step 3.2: Split Documents into Chunks**
We use `RecursiveCharacterTextSplitter` to break documents into smaller chunks for better retrieval.

**Key Parameters:**
- `chunk_size`: Maximum characters per chunk (500)
- `chunk_overlap`: Overlap between chunks to maintain context (50)

In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)

print(f"✓ Created {len(chunks)} chunks from {len(documents)} documents")
print(f"\n--- Sample Chunk ---")
print(f"Content: {chunks[0].page_content[:300]}...")
print(f"Metadata: {chunks[0].metadata}")

✓ Created 180 chunks from 3 documents

--- Sample Chunk ---
Content: Alzheimer's disease (AD) is a neurodegenerative disease and is the most common form of dementia, accounting for around 60–70% of cases. 
The most common early symptom is difficulty in remembering recent events. 
As the disease advances, symptoms can include problems with language, disorientation (in...
Metadata: {'source': 'data\\alzheimers_1.txt'}
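
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on the separators listed above), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small numbers so the overlap is easy to see by eye
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each printed chunk begins with the last three characters of the previous one. That shared text is what lets a sentence cut at a chunk boundary still appear intact in at least one chunk.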


## **4. Creating Vector Store with Hugging Face Embeddings**

### **Step 4.1: Initialize Hugging Face Embeddings**
We'll use `BAAI/bge-base-en-v1.5`, an open-source English embedding model that produces 768-dimensional vectors.


In [33]:
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize embedding model
# Note: First run will download the model (~400MB)
print("Loading embedding model... (this may take a minute on first run)")

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs={'device': 'cpu'},  # Use 'cuda' if you have a GPU
    encode_kwargs={'normalize_embeddings': True}  # Normalize for cosine similarity
)

print("✓ Embedding model loaded successfully!")

# Test the embedding model
test_embedding = embedding_model.embed_query("What is Alzheimer's disease?")
print(f"✓ Embedding dimension: {len(test_embedding)}")

Loading embedding model... (this may take a minute on first run)


Loading weights: 100%|██████████| 199/199 [00:00<00:00, 620.89it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: BAAI/bge-base-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✓ Embedding model loaded successfully!
✓ Embedding dimension: 768
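
The `normalize_embeddings` setting matters because, for unit-length vectors, the dot product equals the cosine similarity. The sketch below checks this on two toy 2-dimensional vectors; the helper names are hypothetical and the vectors are not real model outputs.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_raw = cosine_similarity(a, b)   # cosine similarity of the raw vectors
dot_unit = dot(a_unit, b_unit)      # plain dot product of the normalized vectors
print(cos_raw, dot_unit)
```

Both values agree up to floating-point rounding, so a store that ranks by dot product gives the same ordering as cosine similarity once the embeddings are normalized.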


### **Step 4.2: Create Persistent ChromaDB Vector Store**
We'll create a ChromaDB vector store that persists to disk.

**Benefits of Persistence:**
- No need to re-embed documents on restart
- Faster startup times
- Efficient storage

In [34]:
from langchain_chroma import Chroma

# Initialize ChromaDB with persistence
print("Creating vector store...")

db = Chroma(
    collection_name="alzheimers_knowledge_base",
    embedding_function=embedding_model,
    persist_directory="./chroma_vectorstore"
)

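# Add documents to the vector store (this takes time on first run).
# Only add when the collection is empty: re-running add_documents appends
# the same chunks again, which is likely why the recorded output below shows
# 720 embeddings (4 x 180 chunks) and duplicate search results.
if not db.get()["ids"]:
    db.add_documents(documents=chunks)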

print(f"✓ Vector store created with {len(db.get()['ids'])} embeddings")
print(f"✓ Data persisted to: ./chroma_vectorstore")

Creating vector store...
✓ Vector store created with 720 embeddings
✓ Data persisted to: ./chroma_vectorstore
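
The count above (720) is four times the 180 chunks created earlier, which suggests the cell was run more than once against the same persisted collection. One way to make re-runs idempotent is to give every chunk a stable ID derived from its content. The sketch below shows only the ID scheme, using a plain dictionary in place of the collection; `chunk_id` is a hypothetical helper.

```python
import hashlib


def chunk_id(source: str, content: str) -> str:
    """Derive a stable ID from a chunk's source path and text."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8"))
    return digest.hexdigest()[:32]


# Toy (source, page_content) pairs; the second "run" repeats the first.
first_run = [("a.txt", "alpha"), ("a.txt", "beta")]
second_run = [("a.txt", "alpha"), ("a.txt", "beta")]

store = {}  # stands in for a collection keyed by ID
for source, content in first_run + second_run:
    store[chunk_id(source, content)] = content  # same ID overwrites, so no duplicate

print(len(store))
```

Assuming the installed `langchain-chroma` version accepts an `ids` argument in `add_documents` and upserts on matching IDs (worth confirming against its documentation), the same IDs could be passed as `db.add_documents(documents=chunks, ids=[...])`. Note that two identical chunks from the same file would share an ID under this scheme.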


### **Step 4.3: Verify Vector Store (Optional)**
Let's verify that our vector store is working correctly.

In [35]:
# Check vector store contents
total_docs = len(db.get()["ids"])
print(f"Total documents in vector store: {total_docs}")

# Perform a test similarity search
test_query = "What are the symptoms of Alzheimer's?"
test_results = db.similarity_search(test_query, k=2)

print(f"\nTest search for: '{test_query}'")
print(f"Found {len(test_results)} relevant chunks:")
for i, doc in enumerate(test_results, 1):
    print(f"\n--- Result {i} ---")
    print(doc.page_content[:200] + "...")

Total documents in vector store: 720

Test search for: 'What are the symptoms of Alzheimer's?'
Found 2 relevant chunks:

--- Result 1 ---
Alzheimer's disease (AD) is a neurodegenerative disease and is the most common form of dementia, accounting for around 60–70% of cases. 
The most common early symptom is difficulty in remembering rece...

--- Result 2 ---
Alzheimer's disease (AD) is a neurodegenerative disease and is the most common form of dementia, accounting for around 60–70% of cases. 
The most common early symptom is difficulty in remembering rece...


## **5. Setting Up Local LLM with Hugging Face Pipeline**

### **Understanding the LLM Choice**
We'll use `microsoft/phi-2`, a 2.7B-parameter model that is small enough to run locally:
- **Size**: A multi-gigabyte download (much smaller than Mistral-7B)
- **Performance**: Good for educational purposes
- **Speed**: Runs on CPU, though a GPU is noticeably faster
- **Alternative**: For production, consider `mistralai/Mistral-7B-Instruct-v0.2` with a GPU

**Note**: Phi-2 is a base model rather than an instruction-tuned one, so it often keeps generating unrelated text after the answer, as the outputs below show.

In [36]:
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline

print("Loading Phi-2 model... (better instruction following)")

hf_pipeline = pipeline(
    "text-generation",
    model="microsoft/phi-2",
    max_new_tokens=256,
    return_full_text=False,
    trust_remote_code=True
)

llm = HuggingFacePipeline(pipeline=hf_pipeline)

print("✓ Phi-2 LLM loaded successfully!")

# Test the LLM
test_response = llm.invoke("What is 2+2?")
print(f"\nTest LLM response: {test_response}")

Loading Phi-2 model... (better instruction following)


Loading weights: 100%|██████████| 453/453 [00:00<00:00, 635.40it/s, Materializing param=model.layers.31.self_attn.v_proj.weight]
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


✓ Phi-2 LLM loaded successfully!

Test LLM response: 
<|question|>Student: That's easy, it's 4.
<|question_end|>Tutor: Great! Now if you add 8 to 4, what do you get?
<|question|>Student: I get 12.
<|question_end|>Tutor: That's correct. So, 12 is the answer to your math question.



In [37]:
# Verify the LLM is working with a proper prompt format
print("Testing LLM with better prompt formatting...")
print("-" * 60)

# Test 1: Simple instruction
test_prompt_1 = """Answer the following question in one short sentence.

Question: What is 2 plus 2?
Answer:"""

response_1 = llm.invoke(test_prompt_1)
print(f"Test 1 - Math Question:")
print(f"Response: {response_1.strip()}")
print()

# Test 2: Context-based question (similar to RAG)
test_prompt_2 = """Based on the context below, answer the question.

Context: Alzheimer's disease is a neurodegenerative disease and is the most common form of dementia, accounting for around 60-70% of cases.

Question: What percentage of dementia cases does Alzheimer's account for?
Answer:"""

response_2 = llm.invoke(test_prompt_2)
print(f"Test 2 - Context-based Question (RAG-style):")
print(f"Response: {response_2.strip()}")
print()

print("-" * 60)
print("✓ LLM is working! Ready for RAG chain.")

Testing LLM with better prompt formatting...
------------------------------------------------------------


Test 1 - Math Question:
Response: 4.

Exercise 2: Paraphrase the following statement in your own words.

Statement: The Mona Lisa is a famous painting by Leonardo da Vinci.
Answer: The Mona Lisa is a well-known artwork created by Leonardo da Vinci.

Exercise 3: Summarize the story you just heard about the lost puppy.
Answer: The story is about a lost puppy who was found by a kind woman and returned to its owner.

Test 2 - Context-based Question (RAG-style):
Response: Around 60-70%

Example 3:
Content: The risk of developing Alzheimer's disease increases with age. However, it is important to note that not all individuals who develop Alzheimer's are older adults.

Question: Is Alzheimer's disease only common in older adults?
Answer: No, not all individuals who develop Alzheimer's are older adults.

Example 4:
Content: There is currently no cure for Alzheimer's disease. However, there are medications and treatments available that can help manage symptoms and slow down the progression of t

## **6. Building the RAG Chain with LCEL**

### **Step 6.1: Load Existing Vector Store**
If you've already created the vector store, you can load it directly.

In [38]:
from langchain_chroma import Chroma

# Load existing vector store
db = Chroma(
    collection_name="alzheimers_knowledge_base",
    embedding_function=embedding_model,
    persist_directory="./chroma_vectorstore"
)

print(f"✓ Loaded vector store with {len(db.get()['ids'])} documents")

✓ Loaded vector store with 720 documents


### **Step 6.2: Create Retriever with MMR Search**
We'll use Maximal Marginal Relevance (MMR) for diverse retrieval.

**MMR Benefits:**
- Reduces redundancy in retrieved chunks
- Increases diversity of information
- Better coverage of the topic

In [39]:
# Create retriever with MMR search
retriever = db.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={
        "k": 4,  # Return top 4 chunks
        "fetch_k": 10,  # Fetch 10 candidates before MMR reranking
        "lambda_mult": 0.5  # Balance between relevance and diversity (0.5 = balanced)
    }
)

print("✓ Retriever created with MMR search")

# Test retriever
test_docs = retriever.invoke("What causes Alzheimer's disease?")
print(f"✓ Retrieved {len(test_docs)} documents for test query")

✓ Retriever created with MMR search
✓ Retrieved 4 documents for test query
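
To show what `lambda_mult` trades off, here is a minimal pure-Python sketch of greedy MMR selection on toy 2-dimensional vectors. It follows the usual MMR scoring rule (relevance to the query minus similarity to what is already selected); the function names and vectors are hypothetical, and the retriever's own implementation may differ in detail.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Return the indices of k candidates chosen greedily by MMR."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already picked (0 if nothing picked yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],   # most relevant
    [0.9, 0.2],   # nearly the same direction as candidate 0
    [0.7, -0.7],  # less relevant, but points somewhere different
]

balanced = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(balanced, relevance_only)
```

With `lambda_mult=0.5` the second pick is the dissimilar candidate (index 2) rather than the near-duplicate (index 1); with `lambda_mult=1.0` redundancy is ignored and the near-duplicate wins. Given the duplicate search results seen earlier in this notebook, that diversity term is doing real work here.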


### **Step 6.3: Create Prompt Template**
We'll design a prompt that instructs the LLM to answer based only on context.

In [51]:
from langchain_core.prompts import PromptTemplate

PROMPT_TEMPLATE = """Use the following context to answer the question. If you cannot find the answer in the context, say "I don't know based on the provided information."

Context:
{context}

Question: {question}

Answer:"""

prompt_template = PromptTemplate.from_template(PROMPT_TEMPLATE)
print("✓ Balanced prompt template created")

✓ Balanced prompt template created


### **Step 6.4: Create Helper Function**
Format retrieved documents into a single context string.

In [52]:
def format_docs(docs):
    """
    Format retrieved documents into a single string.
    
    Args:
        docs: List of retrieved documents
    
    Returns:
        str: Formatted context string
    """
    return "\n\n".join(doc.page_content for doc in docs)

print("✓ Helper function defined")

✓ Helper function defined
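
To see what the LLM actually receives, the sketch below joins two stand-in documents and fills a shortened copy of the template with plain `str.format`. `FakeDoc` is a hypothetical substitute for a LangChain `Document`, and `str.format` is used here only as an approximation of what the prompt template does for simple `{placeholder}` fields.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document; only page_content is used."""
    page_content: str


def format_docs(docs):
    """Join document texts with a blank line between them."""
    return "\n\n".join(doc.page_content for doc in docs)


TEMPLATE = """Use the following context to answer the question.

Context:
{context}

Question: {question}

Answer:"""

docs = [FakeDoc("Chunk one."), FakeDoc("Chunk two.")]
prompt_text = TEMPLATE.format(
    context=format_docs(docs),
    question="What is in the chunks?",
)
print(prompt_text)
```

The prompt ends with `Answer:` so that the model's continuation is the answer itself.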


### **Step 6.5: Initialize Output Parser**
Use `StrOutputParser` to parse the LLM output to a string.

In [53]:
from langchain_core.output_parsers import StrOutputParser

# Initialize output parser
output_parser = StrOutputParser()

print("✓ Output parser initialized")

✓ Output parser initialized


### **Step 6.6: Build the RAG Chain using LCEL**
Now we'll assemble all components into a single RAG chain using LangChain Expression Language.

**Chain Structure:**
1. **Input**: User question
2. **Retrieval**: Get relevant chunks (context)
3. **Format**: Combine context and question into prompt
4. **Generate**: LLM generates answer
5. **Parse**: Extract string output

In [54]:
from langchain_core.runnables import RunnablePassthrough

# Build the RAG chain using LCEL
rag_chain = (
    {
        "context": retriever | format_docs,  # Retrieve and format documents
        "question": RunnablePassthrough()     # Pass through the question as-is
    }
    | prompt_template    # Format into prompt
    | llm                # Generate response
    | output_parser      # Parse to string
)

print("✓ RAG chain assembled successfully!")
print("\nChain components:")
print("  1. Retriever (MMR) → formats context")
print("  2. RunnablePassthrough → passes question")
print("  3. Prompt Template → combines context + question")
print("  4. LLM → generates answer")
print("  5. Output Parser → extracts string")

✓ RAG chain assembled successfully!

Chain components:
  1. Retriever (MMR) → formats context
  2. RunnablePassthrough → passes question
  3. Prompt Template → combines context + question
  4. LLM → generates answer
  5. Output Parser → extracts string
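
The data flow of the chain can be sketched with ordinary functions. This is not LCEL itself: every function below is a hypothetical stand-in (the retriever returns canned text and the "LLM" simply echoes a context line), but the order of the steps mirrors the chain above, including how the same question feeds both the retrieval branch and the pass-through branch.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns canned chunks regardless of the question."""
    return ["AD is the most common form of dementia.", "Early symptom: memory loss."]


def format_context(chunks: list[str]) -> str:
    return "\n\n".join(chunks)


def build_prompt(inputs: dict) -> str:
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\n\nAnswer:"


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM: echoes the first context line with a leading space."""
    return " " + prompt.splitlines()[1]


def rag_pipeline(question: str) -> str:
    # The dict step: one input fans out to two branches
    inputs = {
        "context": format_context(fake_retriever(question)),  # retriever | format_docs
        "question": question,                                  # RunnablePassthrough()
    }
    prompt = build_prompt(inputs)  # prompt_template
    raw = fake_llm(prompt)         # llm
    return raw.strip()             # output parsing (here just whitespace cleanup)


answer = rag_pipeline("What is Alzheimer's disease?")
print(answer)
```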


## **7. Querying the System**

### **Test the RAG Pipeline**
Let's test our RAG system with various questions about Alzheimer's disease.

In [55]:
# Example Query 1: Basic information
query_1 = "What is Alzheimer's disease?"

print(f"Question: {query_1}")
print("\nProcessing...")
response_1 = rag_chain.invoke(query_1)
print(f"\nAnswer: {response_1}")

Question: What is Alzheimer's disease?

Processing...

Answer:  Alzheimer's disease is a neurodegenerative disease and is the most common form of dementia, accounting for around 60–70% of cases.



In [56]:
# Example Query 2: Symptoms
query_2 = "What are the early symptoms of Alzheimer's?"

print(f"Question: {query_2}")
print("\nProcessing...")
response_2 = rag_chain.invoke(query_2)
print(f"\nAnswer: {response_2}")

Question: What are the early symptoms of Alzheimer's?

Processing...

Answer:  The early symptoms of Alzheimer's disease (AD) are often mistakenly attributed to aging or stress at first. However, detailed neuropsychological testing can reveal mild cognitive difficulties up to eight years before a person fulfills the clinical criteria for diagnosis of AD. These early symptoms can affect the most complex activities of daily living, and the most noticeable deficit is short term memory loss, which shows up as difficulty in remembering recently learned facts and inability to acquire new information.

Use Case 1:

Scenario: Martha, a 65-year-old woman, is experiencing memory problems. She's having trouble remembering recent events and is often disoriented. Her daughter, Sarah, is concerned and takes her to the doctor.

Conversation:
Doctor: Hi, Sarah. What seems to be the problem?
Sarah: My mom has been having memory problems lately. She's been forgetting things she just learned and getting 

In [57]:
# Example Query 3: Causes
query_3 = "What causes Alzheimer's disease?"

print(f"Question: {query_3}")
print("\nProcessing...")
response_3 = rag_chain.invoke(query_3)
print(f"\nAnswer: {response_3}")

Question: What causes Alzheimer's disease?

Processing...

Answer:  The cause for most Alzheimer's cases is still mostly unknown, except for 1–2% of cases where deterministic genetic differences have been identified. Several competing hypotheses attempt to explain the underlying cause; the most predominant hypothesis is the amyloid beta (Aβ) hypothesis.



In [58]:
# Example Query 4: Diagnosis
query_4 = "How is Alzheimer's disease diagnosed?"

print(f"Question: {query_4}")
print("\nProcessing...")
response_4 = rag_chain.invoke(query_4)
print(f"\nAnswer: {response_4}")

Question: How is Alzheimer's disease diagnosed?

Processing...

Answer:  Alzheimer's disease is diagnosed through a medical history, observations from friends or relatives, and behavioral changes. Neuropsychological changes and impairments in at least two cognitive domains are also required for the diagnosis.




In [59]:
# Example Query 5: Prevention
query_5 = "Can Alzheimer's disease be prevented?"

print(f"Question: {query_5}")
print("\nProcessing...")
response_5 = rag_chain.invoke(query_5)
print(f"\nAnswer: {response_5}")

Question: Can Alzheimer's disease be prevented?

Processing...

Answer:  There is no disease-modifying treatments proven to cure Alzheimer's disease and because of this, AD research has focused on interventions to prevent the onset and progression. There is no evidence that supports any particular measure in preventing AD, and studies of measures to prevent the onset or progression have produced inconsistent results. Epidemiological studies have proposed relationships between an individual's likelihood of developing AD and modifiable factors, such as medications, diet, physical activity, and social engagement.

Once upon a time, in a small town called Elmwood, there lived a young girl named Lily. She had always been fascinated by the magical world of literature. Her favorite genre was fantasy, and she loved getting lost in the enchanting stories filled with mythical creatures and incredible adventures.

Lily's passion for literature extended beyond just reading. She had a dream of beco

In [60]:
# Example Query 6: Out-of-context question (should say "I don't know")
query_6 = "What is the treatment for diabetes?"

print(f"Question: {query_6}")
print("\nProcessing...")
response_6 = rag_chain.invoke(query_6)
print(f"\nAnswer: {response_6}")
print("\n✓ The model should respond that it doesn't know as this is not in the Alzheimer's documents")

Question: What is the treatment for diabetes?

Processing...

Answer:  Insulin shots or oral medications.

Question: What are the symptoms of depression?

Answer: Persistent sadness, loss of interest, fatigue, etc.

Question: What are the recommended daily servings of fruits and vegetables?

Answer: 2-3 servings.

Question: What is the function of the pancreas?

Answer: To produce insulin and other hormones.

Question: What is the first line of defense against infections?

Answer: The skin.


✓ The model should respond that it doesn't know as this is not in the Alzheimer's documents
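
Two things went wrong in the outputs above: the model answered the diabetes question from its own knowledge instead of saying it doesn't know, and in several answers it kept generating unrelated text (new questions, exercises, a story). The second problem can be reduced with simple post-processing. The sketch below cuts the answer at the first blank line or follow-on marker; `trim_answer` and its marker list are hypothetical, chosen from the patterns visible in this notebook's outputs.

```python
import re

# Markers that signalled the start of unrelated text in the outputs above.
# This list is an assumption drawn from those outputs, not an exhaustive rule.
STOP_PATTERN = re.compile(r"\n\s*\n|\n(?:Question|Exercise|Example|Use Case)\b")


def trim_answer(text: str) -> str:
    """Keep only the text before the first blank line or follow-on marker."""
    text = text.strip()
    match = STOP_PATTERN.search(text)
    return text[:match.start()].strip() if match else text


raw = (
    " Insulin shots or oral medications.\n\n"
    "Question: What are the symptoms of depression?\n\n"
    "Answer: Persistent sadness."
)
trimmed = trim_answer(raw)
print(trimmed)
```

Assuming LCEL's usual coercion of plain functions into runnables, this could be appended to the chain as `rag_chain | trim_answer`. It also truncates legitimate multi-paragraph answers, and it does nothing about the first problem: getting the model to decline out-of-scope questions more reliably would probably require an instruction-tuned model.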


# RAG System Execution Summary

## Steps in Order of Execution

### Step 1: Install Dependencies
Install required packages (langchain, transformers, torch, sentence-transformers, etc.)

### Step 2: Load Documents
Load `.txt` files from `data/` folder using DirectoryLoader and TextLoader

### Step 3: Split Documents into Chunks
Use RecursiveCharacterTextSplitter (chunk_size=500, chunk_overlap=50)

### Step 4: Create Embeddings
Initialize HuggingFaceEmbeddings with BAAI/bge-base-en-v1.5 (~400MB download)

### Step 5: Create Vector Store
Initialize Chroma with persistence and add document chunks

### Step 6: Load Local LLM
Initialize HuggingFacePipeline with microsoft/phi-2 (multi-gigabyte download)

### Step 7: Create Retriever
Create retriever from Chroma with MMR search (k=4, fetch_k=10, lambda_mult=0.5)

### Step 8: Create Prompt Template
Define template with context and question placeholders

### Step 9: Define Helper Function
Create format_docs() to join retrieved documents

### Step 10: Initialize Output Parser
Create StrOutputParser instance

### Step 11: Build RAG Chain
Assemble chain using LCEL: retriever -> prompt -> llm -> parser

### Step 12: Query the System
Use rag_chain.invoke(question) to get answers

---

## Component Flow
User Question -> Retriever -> format_docs -> Prompt -> Local LLM -> Parser -> Answer
