# Problem Statement

Build a beginner-level Retrieval-Augmented Generation (RAG) system that retrieves relevant information from a text-based knowledge source and answers user queries using semantic similarity search.

The system should:
- Load a dataset
- Split text into chunks
- Convert text into embeddings
- Store embeddings in a vector database
- Retrieve relevant chunks for a given query


In [1]:
!pip install langchain
!pip install langchain-community
!pip install sentence-transformers
!pip install faiss-cpu
!pip install openai



# Dataset / Knowledge Source

Type of Data: TXT file  
Data Source: Self-created AI notes  

The dataset contains basic information about:
- Artificial Intelligence
- Machine Learning
- Deep Learning
- NLP
- Computer Vision
- Applications of AI
- Limitations of AI


In [2]:
text = """
Artificial Intelligence (AI) is the simulation of human intelligence in machines.

Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.

Deep Learning is a type of Machine Learning that uses neural networks with multiple layers.

Natural Language Processing (NLP) helps machines understand, interpret and generate human language.

Computer Vision allows machines to understand and analyze images and videos.

Applications of AI include healthcare, finance, education, autonomous vehicles and robotics.

Limitations of AI include bias in data, high computational cost and lack of human emotions.
"""

# Write the notes to disk so they can be loaded like any external text file.
with open("ai_notes.txt", "w", encoding="utf-8") as f:
    f.write(text)

print("Dataset created successfully!")


Dataset created successfully!


In [6]:
from langchain_community.document_loaders import TextLoader


loader = TextLoader("ai_notes.txt", encoding="utf-8")
documents = loader.load()

print(documents)


[Document(metadata={'source': 'ai_notes.txt'}, page_content='\nArtificial Intelligence (AI) is the simulation of human intelligence in machines.\n\nMachine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.\n\nDeep Learning is a type of Machine Learning that uses neural networks with multiple layers.\n\nNatural Language Processing (NLP) helps machines understand, interpret and generate human language.\n\nComputer Vision allows machines to understand and analyze images and videos.\n\nApplications of AI include healthcare, finance, education, autonomous vehicles and robotics.\n\nLimitations of AI include bias in data, high computational cost and lack of human emotions.\n')]


In [8]:
!pip install langchain-text-splitters




# Text Chunking Strategy

Chunk Size: 150 characters  
Chunk Overlap: 30 characters  

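Reason:
Chunking divides large text into smaller pieces so that retrieval can return only the passages relevant to a query.
Overlap repeats the end of one chunk at the start of the next, so context is not lost at chunk boundaries.
Every paragraph in this dataset is shorter than 150 characters, so each one becomes its own chunk and the overlap does not come into play here.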
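The effect of chunk size and overlap can be seen in a minimal sketch that uses plain fixed-size character windows. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as blank lines); the function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 150, chunk_overlap: int = 30) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample: 400 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(400))
chunks = chunk_text(sample)

print("Number of chunks:", len(chunks))
# The last 30 characters of one chunk reappear at the start of the next.
print(chunks[0][-30:] == chunks[1][:30])
```

With a step of 120 characters (150 minus 30), neighbouring windows share 30 characters, which is the property the overlap setting is meant to provide.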


In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=30
)

docs = text_splitter.split_documents(documents)

print("Number of chunks created:", len(docs))

for i, doc in enumerate(docs):
    print(f"\nChunk {i+1}:")
    print(doc.page_content)



Number of chunks created: 7

Chunk 1:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.

Chunk 2:
Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.

Chunk 3:
Deep Learning is a type of Machine Learning that uses neural networks with multiple layers.

Chunk 4:
Natural Language Processing (NLP) helps machines understand, interpret and generate human language.

Chunk 5:
Computer Vision allows machines to understand and analyze images and videos.

Chunk 6:
Applications of AI include healthcare, finance, education, autonomous vehicles and robotics.

Chunk 7:
Limitations of AI include bias in data, high computational cost and lack of human emotions.


# RAG Architecture

Pipeline Flow:

User Query
    |
Convert Query to Embedding
    |
FAISS Similarity Search
    |
Retrieve Top-K Relevant Chunks
    |
Return Retrieved Content as Answer
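The flow above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: word counts stand in for the embedding model, and an exhaustive cosine-similarity ranking stands in for FAISS. The helper names `embed`, `cosine` and `retrieve` are hypothetical and are not part of the notebook's pipeline.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, rank every chunk by similarity, return the top k."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.",
    "Computer Vision allows machines to understand and analyze images and videos.",
    "Limitations of AI include bias in data, high computational cost and lack of human emotions.",
]

top = retrieve("What is Machine Learning?", chunks, k=1)
print(top[0])
```

Word counts only match exact words, whereas a sentence-embedding model can also match paraphrases; the structure of the pipeline is the same in both cases.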


# Embedding Details

Embedding Model Used:
sentence-transformers/all-MiniLM-L6-v2

Reason for Selection:
- Lightweight: the weights file is about 91 MB
- Fast
- Produces 384-dimensional vectors with good semantic similarity performance
- Suitable for beginner-level RAG implementation


In [10]:
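# Note: this import path is deprecated in recent langchain-community releases;
# the langchain-huggingface package provides the same class.
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)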

print("Embedding model loaded successfully!")


Embedding model loaded successfully!


In [11]:
sample_vector = embedding_model.embed_query("What is AI?")
print("Vector length:", len(sample_vector))
print(sample_vector[:10])


Vector length: 384
[-0.024964462965726852, -0.009133681654930115, -0.007461548317223787, 0.01500906702131033, 0.013310413807630539, -0.010036050342023373, 0.07456009835004807, 0.04267149046063423, 0.016988538205623627, 0.05595098063349724]


# Vector Database

Vector Store Used: FAISS

Reason:
FAISS allows efficient similarity search over high-dimensional vectors.
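As a rough illustration of what a similarity search computes, the sketch below performs an exhaustive nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance as the metric, which the notebook does not state, and it makes no attempt at the optimisations FAISS provides. The names `l2_distance`, `nearest` and the 2-dimensional example vectors are hypothetical.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec: list[float], vectors: list[list[float]], k: int = 2):
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical 2-dimensional vectors; real embeddings here have 384 dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 4.0]]
hits = nearest([0.9, 0.1], stored, k=2)

for index, distance in hits:
    print(index, round(distance, 3))
```

This loop compares the query with every stored vector, which is fine for seven chunks but becomes slow as the collection grows; that scaling problem is what a dedicated vector index is meant to address.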


In [12]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(docs, embedding_model)

print("FAISS vector store created successfully!")


FAISS vector store created successfully!


In [14]:
# Return the 2 most similar chunks for each query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

query = "What is Machine Learning?"

results = retriever.invoke(query)

for i, doc in enumerate(results):
    print(f"\nResult {i+1}:")
    print(doc.page_content)




Result 1:
Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.

Result 2:
Deep Learning is a type of Machine Learning that uses neural networks with multiple layers.


In [15]:
!pip install transformers




In [16]:
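from transformers import pipeline

# Load GPT-2 as a local text-generation model.
# Note: rag_answer below returns the retrieved context directly and does not call this generator.
generator = pipeline("text-generation", model="gpt2")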



In [19]:
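def rag_answer(query):
    """Retrieve the top chunks for a query and format them as the answer."""
    retrieved_docs = retriever.invoke(query)

    # Join the retrieved chunks into a single context string.
    context = " ".join(doc.page_content for doc in retrieved_docs)

    # Retrieval-only pipeline: the retrieved context is returned as the final answer.
    final_answer = f"""
    Question: {query}

    Retrieved Context:
    {context}

    Final Answer:
    {context}
    """

    return final_answer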
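The function above returns the retrieved context unchanged. If a generation step were added, one possible shape is sketched below; this is an assumption about how the pipeline could be extended and not something the notebook does. The names `build_prompt`, `generate_answer` and `fake_generate` are hypothetical, and the text generator is passed in as a plain callable so the sketch runs without downloading a model.

```python
from typing import Callable


def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into a single prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


def generate_answer(query: str, chunks: list[str], generate_fn: Callable[[str], str]) -> str:
    """Build the prompt and pass it to any callable that maps a prompt to text."""
    prompt = build_prompt(query, chunks)
    return generate_fn(prompt).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a language model, used only for this demo."""
    return " ML lets systems learn from data. "


chunks = ["Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming."]
print(build_prompt("What is Machine Learning?", chunks))
print(generate_answer("What is Machine Learning?", chunks, fake_generate))
```

With the `generator` pipeline loaded earlier, `generate_fn` could be something like `lambda p: generator(p, max_new_tokens=40, return_full_text=False)[0]["generated_text"]`. GPT-2 is a small model that is not tuned to follow instructions, so its answers may not stay close to the supplied context.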



In [20]:
print(rag_answer("What is Machine Learning?"))




    Question: What is Machine Learning?
    
    Retrieved Context:
    Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming. Deep Learning is a type of Machine Learning that uses neural networks with multiple layers.
    
    Final Answer:
    Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming. Deep Learning is a type of Machine Learning that uses neural networks with multiple layers.
    


In [21]:
test_queries = [
    "What is Artificial Intelligence?",
    "Applications of AI?",
    "What are limitations of AI?"
]

for q in test_queries:
    print("\n==============================")
    print("Question:", q)
    print("Answer:", rag_answer(q))



Question: What is Artificial Intelligence?
Answer: 
    Question: What is Artificial Intelligence?
    
    Retrieved Context:
    Artificial Intelligence (AI) is the simulation of human intelligence in machines. Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.
    
    Final Answer:
    Artificial Intelligence (AI) is the simulation of human intelligence in machines. Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.
    

Question: Applications of AI?
Answer: 
    Question: Applications of AI?
    
    Retrieved Context:
    Applications of AI include healthcare, finance, education, autonomous vehicles and robotics. Limitations of AI include bias in data, high computational cost and lack of human emotions.
    
    Final Answer:
    Applications of AI include healthcare, finance, education, autonomous vehicles and robotics. Limitations of AI include bias in data, h