# Problem Statement

Build a beginner-level Retrieval-Augmented Generation (RAG) system that retrieves relevant passages from a text-based knowledge source using semantic similarity search and returns them as the answer to a user query.

The system should:
- Load a dataset
- Split text into chunks
- Convert text into embeddings
- Store embeddings in a vector database
- Retrieve relevant chunks for a given query


In [None]:
# %pip installs into the environment of the running kernel.
%pip install -q langchain langchain-community sentence-transformers faiss-cpu openai


# Dataset / Knowledge Source

Type of Data: TXT file  
Data Source: Self-created AI notes  

The dataset contains basic information about:
- Artificial Intelligence
- Machine Learning
- Deep Learning
- NLP
- Computer Vision
- Applications of AI
- Limitations of AI


In [None]:
text = """
Artificial Intelligence (AI) is the simulation of human intelligence in machines.

Machine Learning (ML) is a subset of AI that enables systems to learn from data without explicit programming.

Deep Learning is a type of Machine Learning that uses neural networks with multiple layers.

Natural Language Processing (NLP) helps machines understand, interpret and generate human language.

Computer Vision allows machines to understand and analyze images and videos.

Applications of AI include healthcare, finance, education, autonomous vehicles and robotics.

Limitations of AI include bias in data, high computational cost and lack of human emotions.
"""

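# Write the notes to disk with an explicit encoding so the file reads back identically on any platform.
with open("ai_notes.txt", "w", encoding="utf-8") as f:
    f.write(text)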

print("Dataset created successfully!")


In [None]:
from langchain_community.document_loaders import TextLoader


# Load the text file as a list of LangChain Document objects.
loader = TextLoader("ai_notes.txt", encoding="utf-8")
documents = loader.load()

print(documents)


In [None]:
!pip install langchain-text-splitters


# Text Chunking Strategy

Chunk Size: 150 characters  
Chunk Overlap: 30 characters  

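Reason:
Chunking divides the text into smaller pieces so that retrieval can return only the passages relevant to a query instead of the whole document.
Overlap lets consecutive chunks share up to 30 characters, which reduces the chance that a sentence cut at a chunk boundary loses its context.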
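
To make the effect of chunk size and overlap concrete, here is a minimal sketch of a fixed-window character chunker. It is a simplification written for illustration: `chunk_text` is a hypothetical helper, and it does not reproduce the separator-aware logic of `RecursiveCharacterTextSplitter`, which prefers to split on paragraph and word boundaries.

```python
def chunk_text(text: str, chunk_size: int = 150, chunk_overlap: int = 30) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not LangChain's algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 400-character dummy string makes the window arithmetic easy to follow.
sample = "".join(str(i % 10) for i in range(400))
chunks = chunk_text(sample)
print("Number of chunks:", len(chunks))
print("Chunk lengths:", [len(c) for c in chunks])
print("Overlap shared:", chunks[0][-30:] == chunks[1][:30])
```

With a chunk size of 150 and an overlap of 30, each window starts 120 characters after the previous one, so the last 30 characters of one chunk reappear at the start of the next.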


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=30
)

docs = text_splitter.split_documents(documents)

print("Number of chunks created:", len(docs))

for i, doc in enumerate(docs):
    print(f"\nChunk {i+1}:")
    print(doc.page_content)



# RAG Architecture

Pipeline Flow:

User Query
    |
Convert Query to Embedding
    |
FAISS Similarity Search
    |
Retrieve Top-K Relevant Chunks
    |
Return Retrieved Content as Answer
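
The flow above can be sketched end to end without any external libraries. In this illustration the embedding step is replaced by a toy word-count vector (an assumption made only to keep the example self-contained; it is not a neural embedding), and the similarity search is a plain sort by cosine similarity. The names `toy_embed`, `cosine` and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (NOT a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, score every chunk, and return the top-k chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Machine Learning (ML) is a subset of AI that enables systems to learn "
    "from data without explicit programming.",
    "Computer Vision allows machines to understand and analyze images and videos.",
    "Limitations of AI include bias in data, high computational cost and lack "
    "of human emotions.",
]

top = retrieve("What is Machine Learning?", chunks, k=1)
print(top[0])
```

A word-count vector only matches exact words, which is why a real embedding model is used in the notebook: it can also match text that is related in meaning but phrased differently.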


# Embedding Details

Embedding Model Used:
sentence-transformers/all-MiniLM-L6-v2

Reason for Selection:
- Lightweight
- Fast
- Good semantic similarity performance
- Suitable for beginner-level RAG implementation


In [None]:
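# HuggingFaceEmbeddings wraps a sentence-transformers model behind LangChain's embedding interface.
# Newer LangChain releases also ship this class in the separate langchain-huggingface package,
# so the community import used here may emit a deprecation warning.
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)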

print("Embedding model loaded successfully!")


In [None]:
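sample_vector = embedding_model.embed_query("What is AI?")
# all-MiniLM-L6-v2 maps each input to a 384-dimensional vector.
print("Vector length:", len(sample_vector))
# Show the first 10 components as a quick sanity check.
print(sample_vector[:10])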


# Vector Database

Vector Store Used: FAISS

Reason:
FAISS allows efficient similarity search over high-dimensional vectors.
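
Conceptually, the simplest kind of vector index compares the query against every stored vector and keeps the closest ones. The sketch below shows that idea in plain Python using squared Euclidean (L2) distance. `FlatL2Index` is a hypothetical class written for illustration, not the FAISS API. It assumes the LangChain wrapper uses a flat L2 index by default, and FAISS implements the same idea in optimized native code.

```python
class FlatL2Index:
    """Exhaustive nearest-neighbour search using squared Euclidean distance.

    Hypothetical illustration; this is not the FAISS API.
    """

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.vectors: list[list[float]] = []

    def add(self, vector: list[float]) -> None:
        """Store one vector; its position in the list acts as its id."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))

    def search(self, query: list[float], k: int) -> list[tuple[int, float]]:
        """Return (position, squared distance) pairs for the k closest vectors."""
        scored = [
            (i, sum((q - v) ** 2 for q, v in zip(query, vec)))
            for i, vec in enumerate(self.vectors)
        ]
        # Smaller distance means more similar.
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


index = FlatL2Index(dim=3)
for vec in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]):
    index.add(vec)

hits = index.search([1.0, 0.0, 0.0], k=2)
print("Closest positions:", [position for position, _ in hits])
```

The exhaustive scan is exact but its cost grows linearly with the number of stored vectors, which is acceptable for a handful of chunks like the ones in this notebook.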


In [None]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(docs, embedding_model)

print("FAISS vector store created successfully!")


In [None]:
# Return the 2 most similar chunks for each query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

query = "What is Machine Learning?"

results = retriever.invoke(query)

for i, doc in enumerate(results):
    print(f"\nResult {i+1}:")
    print(doc.page_content)



In [None]:
!pip install transformers


In [None]:
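from transformers import pipeline

# Load GPT-2 as a text generator.
# Note: rag_answer below does not call it; the retrieved context is returned directly.
generator = pipeline("text-generation", model="gpt2")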


In [None]:
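def rag_answer(query):
    """Retrieve the top-k chunks for `query` and return them as the answer.

    No language model is involved: the "Final Answer" is the retrieved
    context itself, matching the pipeline flow described earlier.
    """
    retrieved_docs = retriever.invoke(query)

    # Join the retrieved chunks into a single context string.
    context = " ".join(doc.page_content for doc in retrieved_docs)

    # Build the output piece by piece so it carries no stray indentation.
    return (
        f"Question: {query}\n\n"
        f"Retrieved Context:\n{context}\n\n"
        f"Final Answer:\n{context}\n"
    )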



In [None]:
print(rag_answer("What is Machine Learning?"))



In [None]:
test_queries = [
    "What is Artificial Intelligence?",
    "Applications of AI?",
    "What are limitations of AI?"
]

for q in test_queries:
    print("\n==============================")
    print("Question:", q)
    print("Answer:", rag_answer(q))
