<a href="https://colab.research.google.com/github/vineela-2315/chatbot_llm-IBM/blob/main/RAG_CHATBOT_FLANT5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Loading the data

After completing this lab you will be able to:

 - Understand how to use `TextLoader` to load **text files.**
 - Learn how to load **PDFs** using `PyPDFLoader` and `PyMuPDFLoader`.
 - Use `UnstructuredMarkdownLoader` to load **Markdown files** (`.md` files, in which headings, lists, links, and code blocks are marked up with plain-text symbols).
 - Load **JSON** files with `JSONLoader` using jq schemas.
 - Process **CSV** files with `CSVLoader` and `UnstructuredCSVLoader`.
 - Load **Webpage content** using `WebBaseLoader`.
 - Load **Word documents** using `Docx2txtLoader`.
 - Utilize `UnstructuredFileLoader` for **various file types.**
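
The file-type-to-loader pairing in the list above can be sketched as a simple lookup. This is only an illustration: `LOADER_BY_EXTENSION` and `pick_loader_name` are hypothetical names, and the loaders are stored as strings so the snippet runs without LangChain installed.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> name of a loader class
# from the list above. Stored as strings so no LangChain import is needed.
LOADER_BY_EXTENSION = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".json": "JSONLoader",
    ".csv": "CSVLoader",
    ".docx": "Docx2txtLoader",
}


def pick_loader_name(path: str) -> str:
    """Return the loader name for a file, falling back to the generic loader."""
    suffix = Path(path).suffix.lower()
    return LOADER_BY_EXTENSION.get(suffix, "UnstructuredFileLoader")


for example in ("resume.docx", "notes.TXT", "slides.pptx"):
    print(example, "->", pick_loader_name(example))
```

Falling back to `UnstructuredFileLoader` mirrors its role in the list as the loader for various file types.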

In [5]:
'''
The LangChain team split the project in 2024 so that:
langchain → only has core logic (chains, agents, prompts).
langchain-community → has integrations (PDF loaders, SQL loaders, Hugging Face embeddings, Pinecone, etc.).
'''
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.29-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-core<2.0.0,>=0.3.75 (from langchain_community)
  Downloading langchain_core-0.3.75-py3-none-any.whl.metadata (5.7 kB)
Collecting requests<3,>=2.32.5 (from langchain_community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7,>=0.6.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.6.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.6.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.6.7->langchain_community)
  Downloading mypy_extensions-1.1.0-py3-none-any.w

In [6]:
from langchain_community.document_loaders import TextLoader,PyPDFLoader,PyMuPDFLoader,UnstructuredMarkdownLoader,JSONLoader,CSVLoader,UnstructuredCSVLoader,WebBaseLoader,Docx2txtLoader,UnstructuredFileLoader



In [7]:
%pip install pypdf

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.0.0


In [8]:
%pip install docx2txt

Collecting docx2txt
  Downloading docx2txt-0.9-py3-none-any.whl.metadata (529 bytes)
Downloading docx2txt-0.9-py3-none-any.whl (4.0 kB)
Installing collected packages: docx2txt
Successfully installed docx2txt-0.9


In [11]:
'''
Check whether the expected files exist before loading them.
Summary:
os → builds file paths with the correct separator for the current OS.
glob → finds files by pattern (like *.txt, *.pdf).
Together → a reliable way to check, count, and list files before loading them in LangChain.

txt_files = glob.glob(os.path.join(folder_path, "*.txt"))  # portable across operating systems
txt_files = glob.glob("data/*.txt")  # hard-coded separator; usually works, but os.path.join is the safer habit
'''

import os
import glob

from langchain_community.document_loaders import (
    TextLoader,
    CSVLoader,
    JSONLoader,
    PyPDFLoader,
    PyMuPDFLoader, # Added PyMuPDFLoader based on the first markdown cell
    UnstructuredMarkdownLoader,
    UnstructuredHTMLLoader,
    UnstructuredFileLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredImageLoader,
    WebBaseLoader, # Added WebBaseLoader based on the first markdown cell
    Docx2txtLoader # Added Docx2txtLoader based on the first markdown cell
)
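def loaders(p):
  """Load every supported file found under folder `p` into a list of Documents."""
  docs = []
  # Word documents (.docx), searched recursively
  for file_path in glob.glob(os.path.join(p, "**/*.docx"), recursive=True):
      loader = Docx2txtLoader(file_path)
      docs.extend(loader.load())

  # Optional: plain-text files (uncomment to enable)
  # for file_path in glob.glob(os.path.join(p, "**/*.txt"), recursive=True):
  #     loader = TextLoader(file_path)
  #     docs.extend(loader.load())

  # Optional: web pages (uncomment to enable)
  # urls = ["https://en.wikipedia.org/wiki/Generative_artificial_intelligence",
  #         "https://huggingface.co/docs/transformers/index"]
  # for url in urls:
  #     loader = WebBaseLoader(url)
  #     docs.extend(loader.load())
  return docs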

path = "/content/"
d=loaders(path)

print(f"Total text documents loaded: {len(d)}")
if d:
    for x, c in enumerate(d):
        print(f"\n--- Document {x+1} ---")
        print(c.page_content)
    # To print only the first document:
    # print(f"content:{d[0].page_content}")
else:
    print("no documents")

Total text documents loaded: 1

--- Document 1 ---
VINEELA ANUGURU

GenAI Engineer 

      Email: vineela.anuguru@gmail.com | Phone: +91-9182985631
       LinkedIn: linkedin.com/in/Vineela-anuguru-156893337  



PROFESSIONAL SUMMARY

Results-driven AI & Data professional with around 4 years of hands-on experience across Testing, Data Analytics, Machine Learning, and Generative AI. Proven track record of transforming raw data into actionable insights, building predictive models, and deploying LLM-powered applications to solve real-world business challenges. Skilled in Python, SQL, Scikit-learn, Power BI, and modern GenAI frameworks like LangChain and OpenAI. Passionate about leveraging AI to drive innovation, automation, and business impact.



CORE SKILLS

Programming: Python, SQL

Data science: Data Preprocessing, Machine learning, Deep Learning basics, NLP

Gen AI: Transformers, LLMs, Vector DBs, RAG, Finetuning, Prompt Engineering

Libraries: Pandas, NumPy, Scikit-learn, Matplotlib,
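
The "check, count, and list files before loading" step described in the cell above can be tried in isolation. The sketch below is a minimal, self-contained version that builds a throwaway folder with `tempfile`; `find_files` and the file names are hypothetical and only serve the illustration.

```python
import glob
import os
import tempfile


def find_files(folder: str, pattern: str) -> list[str]:
    """Return sorted paths under `folder` (searched recursively) matching `pattern`."""
    return sorted(glob.glob(os.path.join(folder, "**", pattern), recursive=True))


with tempfile.TemporaryDirectory() as folder:
    # Create a few placeholder files, one of them in a sub-folder.
    os.makedirs(os.path.join(folder, "nested"))
    for name in ("a.txt", "b.docx", os.path.join("nested", "c.docx")):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("placeholder")

    docx_files = find_files(folder, "*.docx")
    found_names = [os.path.relpath(path, folder) for path in docx_files]
    print(f"Found {len(docx_files)} .docx files: {found_names}")
```

With `recursive=True`, the `**` segment matches the folder itself as well as any sub-folders, which is why both the top-level and the nested `.docx` file are found.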

## 2. Text Splitting

### Why do we need text splitting?

1. Large documents (PDFs, webpages, Word files, etc.) can be too big for an embedding model or LLM input.
2. Embedding models (e.g., OpenAI’s ada, Cohere, Hugging Face models) have token limits.
3. If you embed a whole document as a single vector, the embedding becomes too general and retrieval quality suffers.

Splitting into smaller chunks:

- Keeps semantic meaning
- Fits into the LLM token window
- Makes search results more relevant

### Key Concepts of Text Splitting

1. Chunking

Splitting long text into smaller chunks.

Example:
Doc = "Python is a programming language. It is widely used in AI. LangChain makes LLM apps easy."

If you split into chunks of 20 characters:

- Chunk1 → "Python is a programm"
- Chunk2 → "ing language. It is "

2. Chunk Size

Maximum length of each split (in characters or tokens).

- Too small → context is lost.
- Too big → exceeds model limits, or the embeddings become vague.
- A common starting point is 500–1000 tokens; the right value depends on the task and the model.

3. Chunk Overlap

Adds an overlap between consecutive chunks to preserve context across boundaries.
Example: chunk size = 100, overlap = 20.

- Chunk1 = words 0–99
- Chunk2 = words 80–179

This way, a sentence that is cut at a chunk boundary still appears whole in at least one chunk, provided it is not longer than the overlap.
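
The chunk size and overlap arithmetic above can be sketched in a few lines of plain Python. This is a minimal character-based illustration, not the splitter used later in the notebook; `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


doc = "Python is a programming language. It is widely used in AI."
chunks = chunk_with_overlap(doc, chunk_size=20, chunk_overlap=5)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk!r}")
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 5 characters of one chunk are repeated at the start of the next.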

4. Splitter Strategies

LangChain provides multiple text splitters:

- `CharacterTextSplitter`: splits on a single separator and measures chunks by character length. Simple, but may cut sentences abruptly.
- `RecursiveCharacterTextSplitter` (most common): tries separators in order, Paragraph → Line → Word → Character (default separators `\n\n`, `\n`, space, empty string), so chunks follow natural boundaries where possible. A good default for RAG.
- `TokenTextSplitter`: splits by token count (more precise for LLM token limits).
- `MarkdownTextSplitter`: splits while respecting Markdown structure (good for docs and READMEs).
- Language-specific splitters:
  - `PythonCodeTextSplitter` → tries to keep classes and functions together.
  - `HTMLHeaderTextSplitter` → splits on HTML heading tags.
  - `LatexTextSplitter` → splits on LaTeX sectioning commands.
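
The recursive strategy can be sketched roughly as follows: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This is a simplified stand-in, not LangChain's implementation. The separator order (`\n\n`, `\n`, space, then a hard character cut) is an assumption modeled on the library's defaults, overlap is omitted, and separators at chunk boundaries are dropped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters,
    preferring the coarsest separator that keeps chunks small enough."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(separator):
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the first one."
)
pieces = recursive_split(sample, chunk_size=40)
for piece in pieces:
    print(repr(piece))
```

The short first paragraph survives intact, while the longer second paragraph is broken at word boundaries rather than mid-word.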

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [23]:
def text_splitter(d,chunk_size=500,chunk_overlap=10):
  s=RecursiveCharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=chunk_overlap)
  chunks=s.split_documents(d)
  print(f"Total chunks:{len(chunks)}")
  for i,c in enumerate(chunks):
    print(f"\nchunks{i+1}:")
    print(c.page_content)
  # Return the chunks list
  return chunks

# Call the function and store the returned chunks in a variable
text_chunks = text_splitter(d)

Total chunks:11

chunks1:
VINEELA ANUGURU

GenAI Engineer 

      Email: vineela.anuguru@gmail.com | Phone: +91-9182985631
       LinkedIn: linkedin.com/in/Vineela-anuguru-156893337  



PROFESSIONAL SUMMARY

chunks2:
Results-driven AI & Data professional with around 4 years of hands-on experience across Testing, Data Analytics, Machine Learning, and Generative AI. Proven track record of transforming raw data into actionable insights, building predictive models, and deploying LLM-powered applications to solve real-world business challenges. Skilled in Python, SQL, Scikit-learn, Power BI, and modern GenAI frameworks like LangChain and OpenAI. Passionate about leveraging AI to drive innovation, automation, and

chunks3:
and business impact.

chunks4:
CORE SKILLS

Programming: Python, SQL

Data science: Data Preprocessing, Machine learning, Deep Learning basics, NLP

Gen AI: Transformers, LLMs, Vector DBs, RAG, Finetuning, Prompt Engineering

Libraries: Pandas, NumPy, Scikit-learn, Matplo

## 3. Embedding

### Why do we need embeddings in RAG?

In a QA chatbot with RAG:

1. We use an embedding model to convert each chunk into a vector.
2. When a user asks a question, we also convert the question into a vector.
3. We search for similar vectors (chunks) using FAISS/Pinecone/Chroma.
4. The most relevant chunks are retrieved and given to the LLM to answer.

### How?

The problem:

- Computers operate on numbers, not text.
- But words like “king”, “queen”, “man”, “woman” have meanings and relationships.
- We want a way to map words to vectors of numbers so that similar words get similar vectors.

### Approaches, from traditional to modern

1. One-hot encoding: each word gets a vector with a single 1, e.g. (1, 0, 0, 0). It carries no notion of similarity.
2. Word2Vec: each word gets a dense vector (typically 100 to 300 dimensions), and related words such as king/queen or man/woman end up close together.
3. Modern embeddings (Transformers): BERT, Sentence Transformers, and OpenAI embeddings go beyond word level. They embed sentences, paragraphs, and documents into vectors, and they capture context, e.g., the word “bank” in “river bank” vs “money bank” gets different embeddings.

### Analogy

Think of embeddings like GPS coordinates for words:

- Each word gets an “address” in a semantic space.
- Words with similar meanings are neighbors (close coordinates).
- Example: doctor and nurse will be close, but doctor and banana will be far apart.
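
The "close neighbors" idea is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (real embedding models output hundreds of dimensions, and these numbers do not come from any model).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example only.
toy_embeddings = {
    "doctor": [0.9, 0.8, 0.1],
    "nurse": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.05, 0.95],
}

doctor_nurse = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["nurse"])
doctor_banana = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["banana"])
print(f"doctor vs nurse:  {doctor_nurse:.3f}")
print(f"doctor vs banana: {doctor_banana:.3f}")
```

The doctor/nurse pair scores much higher than the doctor/banana pair, which is the numeric version of "close" and "far apart" in the analogy.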

### Model used: `sentence-transformers/all-MiniLM-L6-v2`

1️⃣ What it is

- Type: Transformer-based model
- Library: sentence-transformers (built on top of Hugging Face Transformers)
- Purpose: Generate dense vector embeddings for sentences or text snippets.
- Model size: Small and efficient (MiniLM) → faster than BERT while still accurate.

2️⃣ Architecture

- MiniLM: a miniature version of BERT-like transformers. Lightweight but retains semantic understanding; fewer parameters → faster inference and lower memory usage.
- L6: 6 Transformer layers (BERT-base has 12 layers).
- v2: Improved version with better embedding quality.

3️⃣ Use Cases

- Semantic Search / RAG: convert documents and queries into embeddings, then find documents similar to a query by comparing vectors.
- Clustering: group similar sentences or documents.
- Paraphrase Detection: compare sentence embeddings for similarity.
- Text Classification: use embeddings as input features to ML classifiers.

4️⃣ Input & Output

- Input: a single sentence or a batch of sentences, e.g. `sentences = ["I love AI", "I enjoy machine learning"]`
- Output: a fixed-size vector of 384 dimensions for each sentence, e.g. `[0.12, -0.34, 0.87, ..., 0.45]`
- Vectors can then be used in cosine similarity, clustering, or ML models.

5️⃣ Why it’s popular

- Efficient: smaller than BERT → fast embedding generation
- Accurate: good semantic understanding for sentences
- Easy integration: works with sentence-transformers and LangChain RAG pipelines

In [26]:
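# text_chunks already holds the list of Document objects produced by the splitter
list_of_documents = text_chunks

# Embedding
# The integrations live in langchain_community (installed above); the older
# langchain.embeddings / langchain.vectorstores import paths are deprecated.
# FAISS additionally needs the faiss-cpu (or faiss-gpu) package.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS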

embeddings=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
v=FAISS.from_documents(list_of_documents,embeddings)

print("FAISS vector store created successfully!")

FAISS vector store created successfully!


In [43]:
#view embeddings
doc_text = list_of_documents[10].page_content
print("Document Text:", doc_text[:2000], "...")

embedding_vector = embeddings.embed_query(doc_text)
print("Embedding length:", len(embedding_vector))
print("First 10 numbers of embedding:", embedding_vector[:10])  # slice now matches the label

Document Text: GenAI with Azure (Coursera)

Introduction to LLMs (Udemy)



AWARDS & ACHIEVEMENTS

Successfully led GenAI PoC for internal QA Chatbot, adopted by test team in Wipro

Maintained 100% on-time delivery record for client AI/ML deliverables over 2+ years

Delivered 87 % accurate Models deployed into production



DECLARATION

                  I hereby declare that the information furnished above is true to the best of my knowledge and belief. ...
Embedding length: 384
First 10 numbers of embedding: [-0.07766377925872803, -0.04160761833190918, -0.024703199043869972, 0.011875772848725319, -0.03259938955307007, -0.02960122376680374, 0.02360340766608715, -0.026876408606767654, 0.015337335877120495, 0.07143209874629974]

## 4. Query the Vector Store (Similarity Search)

When a user asks a question, we:

1. Embed the query using the same embedding model.
2. Compare it against the stored vectors.
3. Retrieve the most relevant chunks.
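
The three steps above amount to a nearest-neighbor search. The sketch below does it by brute force in plain Python, assuming cosine similarity as the measure. The vectors and chunk labels are invented for illustration; FAISS uses optimized indexes and, depending on configuration, may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, stored, k=3):
    """Return the texts of the k stored (text, vector) pairs most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical stored chunks with made-up 3-dimensional embeddings.
store = [
    ("chunk about GenAI experience", [0.9, 0.1, 0.0]),
    ("chunk about contact details", [0.1, 0.9, 0.1]),
    ("chunk about testing work", [0.6, 0.2, 0.5]),
    ("chunk about awards", [0.0, 0.3, 0.9]),
]
query_vector = [1.0, 0.0, 0.1]  # stands in for the embedded user question

results = top_k(query_vector, store, k=2)
for rank, text in enumerate(results, start=1):
    print(f"Result {rank}: {text}")
```

Comparing against every stored vector is fine for a handful of chunks; vector stores exist to make the same lookup fast at much larger scale.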


In [45]:
query = "What experience does Vineela have in Generative AI?"
#1.convert query to embedding
query_vector = embeddings.embed_query(query)

print("Query vector (first 10 numbers):")
for i in query_vector[:10]:
    print(i)

#2. do similarity search (similarity_search embeds the query itself; query_vector above is only for inspection)
docs = v.similarity_search(query, k=3)   # k = top 3 most similar chunks

for i, doc in enumerate(docs, start=1):
    print(f"\n--- Result {i} ---")
    print(doc.page_content)

Query vector (first 10 numbers):
-0.06450623273849487
-0.03404312953352928
-0.021131932735443115
0.04558712616562843
-0.08666884899139404
0.05452815070748329
0.021416565403342247
0.013445291668176651
-0.05660019814968109
-0.004076216369867325

--- Result 1 ---
Results-driven AI & Data professional with around 4 years of hands-on experience across Testing, Data Analytics, Machine Learning, and Generative AI. Proven track record of transforming raw data into actionable insights, building predictive models, and deploying LLM-powered applications to solve real-world business challenges. Skilled in Python, SQL, Scikit-learn, Power BI, and modern GenAI frameworks like LangChain and OpenAI. Passionate about leveraging AI to drive innovation, automation, and

--- Result 2 ---
VINEELA ANUGURU

GenAI Engineer 

      Email: vineela.anuguru@gmail.com | Phone: +91-9182985631
       LinkedIn: linkedin.com/in/Vineela-anuguru-156893337  



PROFESSIONAL SUMMARY

--- Result 3 ---
Identified, logged, a

## 5. Use an LLM to generate the output

### Why do we still need an LLM?

1. Chunks ≠ Final Answer

FAISS + embeddings only give you the most relevant text pieces.
Example: for the query "Summarize Vineela’s skills.", FAISS might return chunks like:

- "Skilled in Python, SQL, Scikit-learn…"
- "Experience in Data Analytics, Machine Learning…"
- "Worked with LangChain, OpenAI…"

These are raw pieces, not a proper summary.

2. LLM = Synthesizer / Composer

The LLM’s job is to read those 3 chunks and generate a coherent, natural-language answer.
Without the LLM, the user would see disjointed raw chunks. With the LLM, they get something like:

“Vineela is skilled in Python, SQL, Scikit-learn, and Power BI, with experience in data analytics, machine learning, and GenAI frameworks like LangChain and OpenAI.”

3. Handling User Queries

Users don’t always ask “give me the chunk”. They may ask:

- “Summarize Vineela’s skills” (needs summarization).
- “What GenAI tools has she worked on?” (needs filtering).
- “Write this in 2 sentences” (needs paraphrasing).

Embeddings + FAISS only find where the information is; the LLM decides how to express it.

✅ Analogy:
Think of FAISS as a librarian who fetches the right 3 books for your question.
The LLM is the teacher who reads those books and gives you a clear, direct answer instead of dumping the books on your desk.
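
The hand-off from librarian to teacher is, in practice, a prompt: the retrieved chunks are concatenated and placed in front of the question. The sketch below shows the idea with a hypothetical `build_stuff_prompt` helper; the prompt wording is an assumption for illustration and is not the template LangChain uses internally.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Skilled in Python, SQL, Scikit-learn…",
    "Worked with LangChain, OpenAI…",
]
prompt = build_stuff_prompt("Summarize Vineela's skills.", retrieved)
print(prompt)
```

Because every chunk goes into a single prompt, the number and size of retrieved chunks must stay within the model's input limit.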

### Model used: Flan-T5

Flan-T5 (Base / Large / XL) ✅ (a good fit for CPU / lightweight setups)

- Model: `"google/flan-t5-base"` or `"google/flan-t5-large"`
- Type: Instruction-tuned text2text model.
- Pros:
  - Small, fast, and free.
  - Suited to short summarization and Q&A tasks.
  - Independent of the embedding model, so it pairs fine with the MiniLM embeddings stored in FAISS.
- Cons:
  - Not conversational, just single-turn Q&A.

In [48]:
%pip install langchain-huggingface transformers

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
Downloading langchain_huggingface-0.3.1-py3-none-any.whl (27 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.3.1


In [50]:
# 1. Create the LLM
from langchain_huggingface import HuggingFacePipeline  # installed in the previous cell
from transformers import pipeline

generator=pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_length=512,
    temperature=0  # ignored here: sampling is off by default, hence the "not valid" warning in the output
)
# 2. Wrap the pipeline so that the Hugging Face model (generator) behaves like a LangChain LLM.
# It can then be used inside LangChain chains (RetrievalQA, ConversationalRetrievalChain, etc.).

llm = HuggingFacePipeline(pipeline=generator)

# 3. Now create the retrieval chain
'''
llm=llm → the Flan-T5 model we wrapped above.
retriever=v.as_retriever() → turns the FAISS vector store into a retriever.
When you ask a question, it fetches the top-k most relevant chunks using embeddings (k defaults to 4).
chain_type="stuff" → tells LangChain to stuff (concatenate) all retrieved documents into a single prompt and pass it to the LLM.
Other options are "map_reduce" and "refine"; "stuff" is the simplest and most common.
So this chain = Retriever + LLM → Final Answer.
'''
from langchain.chains import RetrievalQA

qa=RetrievalQA.from_chain_type(
    llm=llm,
    retriever=v.as_retriever(),
    chain_type="stuff"
)

Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [51]:
query = "Summarize Vineela's skills."
answer = qa.invoke({"query": query})["result"]  # qa.run() is deprecated in favor of invoke()
print("Answer:", answer)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer: Identified, logged, and tracked defects using JIRA, collaborating closely with development teams to ensure timely resolution and high-quality releases. PROJECTS Q & A chatbot for QA Team Jan 2024-Aug 2025 Developed a GenAI-based intelligent assistant that automates QA documentation analysis, reducing manual effort and improving test planning efficiency. CORE SKILLS Programming: Python, SQL Data science: Data Preprocessing, Machine learning, Deep Learning basics, NLP Gen AI: Transformers, LLMs, Vector DBs, RAG, Finetuning, Prompt Engineering Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, pyTorch, TensorFlow Tools: Power BI, Git, Jupyter Platforms: Azure ML studio, Azure OpenAI studio, Azure Data bricks, Azure Data Lake


In [52]:
query="In year 2024,vineela worked on which project,give me the project name"
answer=qa.invoke({"query": query})["result"]
print("Answer:",answer)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer: Q & A chatbot for QA


In [53]:
query="In year 2022,vineela worked on which project,give me the project name"
answer=qa.invoke({"query": query})["result"]
print("Answer:",answer)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer: Q & A chatbot for QA Team Jan 2024-Aug 2025 Developed a GenAI-based intelligent assistant that automates QA documentation analysis, reducing manual effort and improving test planning efficiency. GenAI with Azure (Coursera) Introduction to LLMs (Udemy) AWARDS & ACHIEVEMENTS Successfully led GenAI PoC for internal QA Chatbot, adopted by test team in Wipro Maintained 100% on-time delivery record for client AI/ML deliverables over 2+ years Delivered 87 % accurate Models deployed into production
