# **Project 02: LangChain RAG Project**


---


####  **Task**
Create a Google Colab notebook that implements a retrieval-augmented generation (RAG) workflow, using the Google Gemini Flash API for generation and Pinecone for vector storage and retrieval. Your system should:
- **Load and Chunk Documents:** Demonstrate how to load a document (e.g., `documents.txt`), split it into smaller chunks, and embed these chunks using a Gemini embedding model.
- **Store and Retrieve from Pinecone:** Set up Pinecone, create an index, and store the embeddings. Show how your system retrieves context for user queries.
- **Integrate Gemini Flash LLM:** Use the Gemini Flash model in a Retrieval QA chain to answer user questions based on the retrieved context.
- **Experiment with Parameters:** Tune your RAG system by adjusting chunk size, overlap, temperature, or other relevant parameters.

##  **Installing Required Libraries**

This cell installs the Python libraries the project needs:

- `langchain-pinecone`: Provides integration between LangChain and Pinecone.
- `pinecone-notebooks`: Contains utilities for working with Pinecone in notebooks.
- `langchain-google-genai`: For integrating LangChain with Google Gemini's API.
- `langchain-community`: Adds community-supported integrations.
- `pypdf`: For working with PDFs, including text extraction.

In [1]:
%pip install -qU langchain-pinecone pinecone-notebooks langchain-google-genai langchain-community pypdf



##  **Load Environment Variables**

- This cell sets up the credentials required for authentication:
  - `GOOGLE_API_KEY`: For authenticating with the Google Gemini API.
  - `PINECONE_API_KEY`: For accessing Pinecone services.
- `userdata.get()` reads each key from Colab's Secrets panel, so the keys never appear in the notebook source.



In [2]:
import os
import time

from google.colab import userdata
from pinecone import Pinecone, ServerlessSpec

# Read both API keys from Colab Secrets
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
pinecone_api_key = userdata.get("PINECONE_API_KEY")

# Create the Pinecone client used by the cells that follow
pc = Pinecone(api_key=pinecone_api_key)
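If a key is missing, the failure can surface much later as an opaque authentication error. As a minimal sketch for environments that read keys from plain environment variables, the hypothetical helper `require_key` (not part of the original notebook or of any library) fails early with a readable message:

```python
import os


def require_key(name, env=None):
    """Return the value of `name` from `env`, or raise a clear error if it is missing.

    `require_key` is a hypothetical helper for illustration; `env` defaults
    to the process environment.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to Colab Secrets or the environment."
        )
    return value
```

Passing a plain dictionary as `env` makes the helper easy to try without touching real credentials.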

##  **Initialize Pinecone**

- Purpose: Set up the Pinecone index used for vector storage.
- Checks whether an index named `project-2-index` exists.
  - If not, creates a serverless index (AWS, `us-east-1`) with:
    - `dimension=768`: Matches the length of the vectors produced by `models/embedding-001`.
    - `metric="cosine"`: Ranks vectors by cosine similarity.
- Polls once per second until the index reports that it is ready.


In [3]:
# Initialize Pinecone
import time

index_name = "project-2-index"

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
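The readiness loop above polls indefinitely, so a stalled index creation would hang the notebook. One possible variant, sketched here with a hypothetical helper `wait_until` (its name and parameters are assumptions, not part of the Pinecone client), gives up after a timeout:

```python
import time


def wait_until(is_ready, timeout_s=60.0, poll_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `is_ready()` until it returns True or `timeout_s` seconds elapse.

    Returns True if the condition was met, False on timeout.
    `sleep` and `clock` are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout_s
    while not is_ready():
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


# Possible usage with the objects defined in this notebook:
# ready = wait_until(lambda: pc.describe_index(index_name).status["ready"])
```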

##  **Embedding Initialization**

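- `GoogleGenerativeAIEmbeddings`: Sets up a Google Gemini embedding model. This is a separate model from the Gemini Flash LLM used later to answer questions.
- The model is identified by `model="models/embedding-001"`. The cell embeds a sample query and shows the first five values of the resulting vector.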

In [9]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector = embeddings.embed_query("hello, world!")
vector[:5]

[0.05168594419956207,
 -0.030764883384108543,
 -0.03062233328819275,
 -0.02802734263241291,
 0.01813093200325966]
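The index was created with `metric="cosine"`, so retrieval compares embedding vectors like the one above by cosine similarity. The following plain-Python sketch shows the computation on small made-up vectors; it is for illustration only, since Pinecone performs this comparison server-side.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors; real embeddings here have 768 values.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

Because only the direction of each vector matters, scaling a vector does not change its similarity to another vector.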

##  **Creating the Vector Store**

- Uses LangChain's `PineconeVectorStore` to integrate Pinecone and the Gemini embedding model.
- This vector store will be used to add and retrieve embedded documents.

In [5]:
# Initialize the vector store and embeddings
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(index=index, embedding=embeddings)

##  **Loading the PDF**

- **Purpose**: Load a PDF document for processing.
- Uses `PyPDFLoader` to extract text from `/content/Q1 2024 Report.pdf`.

In [6]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF
loader = PyPDFLoader("/content/Q1 2024 Report.pdf")
documents = loader.load()

ValueError: File path /content/Q1 2024 Report.pdf is not a valid file or url
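The `ValueError` above indicates that the PDF was not present in the Colab session when the cell ran, so the file has to be uploaded to `/content` first. A minimal sketch of a pre-flight check, using a hypothetical helper `resolve_pdf` (an illustration, not part of LangChain), reports the problem before the loader is constructed:

```python
from pathlib import Path


def resolve_pdf(path):
    """Return `path` as a Path if it is an existing .pdf file, else raise."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found; upload it to the Colab session first."
        )
    if p.suffix.lower() != ".pdf":
        raise ValueError(f"{p} does not have a .pdf extension.")
    return p


# Possible usage with the loader from the cell above:
# loader = PyPDFLoader(str(resolve_pdf("/content/Q1 2024 Report.pdf")))
```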

##  **Splitting Documents into Chunks**


- Uses `RecursiveCharacterTextSplitter` to split the loaded document:
  - `chunk_size=800`: Each chunk contains at most 800 characters.
  - `chunk_overlap=100`: Consecutive chunks share up to 100 characters, so text cut at a boundary still appears whole in one chunk.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
len(docs)
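To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below implements a plain fixed-width sliding window. This is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than the limit. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=100):
    """Split `text` into fixed-width windows that overlap by `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A larger overlap lowers the chance of cutting a relevant sentence in half, at the cost of more chunks to embed and store.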

##  **Assigning Unique IDs**

- Assigns a unique identifier to each document chunk using `uuid4`.

In [None]:
from uuid import uuid4

# Generate Unique IDs for Each Chunk
uuids = [str(uuid4()) for _ in range(len(docs))]
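Because `uuid4` IDs are random, re-running the ingestion cells stores a second copy of every chunk under new IDs. One possible alternative, sketched here as an assumption rather than the notebook's method, derives a deterministic ID with `uuid5` so that a repeated upsert overwrites the existing vector. The helper name `stable_chunk_id` and the ID format are hypothetical; the `source` and `page` values are assumed to come from the chunk metadata that `PyPDFLoader` attaches.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_chunk_id(source, page, chunk_index):
    """Return the same UUID string every time for the same (source, page, chunk_index)."""
    return str(uuid5(NAMESPACE_URL, f"{source}#page={page}#chunk={chunk_index}"))


# Possible usage with the chunks from the previous cell:
# uuids = [
#     stable_chunk_id(doc.metadata.get("source"), doc.metadata.get("page"), i)
#     for i, doc in enumerate(docs)
# ]
```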

##  **Adding Chunks to the Vector Store**

- Adds each document chunk to the vector store along with its unique ID.

In [None]:
# Add all chunks to the vector store in one call, pairing each chunk with its ID
vector_store.add_documents(documents=docs, ids=uuids)
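For larger documents it can help to send chunks in fixed-size groups, which keeps each request small and makes it easy to retry a failed group. The generator below is a minimal sketch; the name `batched` and the batch size are assumptions chosen for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Possible usage with the objects defined in this notebook:
# for doc_batch, id_batch in zip(batched(docs, 50), batched(uuids, 50)):
#     vector_store.add_documents(documents=doc_batch, ids=id_batch)
```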


##  **Configuring the Retriever**

- Configures a retriever to fetch relevant chunks from the vector store:
  - **Search Type**: `similarity_score_threshold`.
  - **Search Parameters**: Retrieves up to `k=5` chunks whose relevance score is at least `0.7`. If no chunk reaches the threshold, the retriever returns nothing.

In [None]:
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.7},
)
# retriever.invoke("earning per share")
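The effect of `k` and `score_threshold` can be modelled on a list of already-scored chunks. The sketch below is a simplified illustration with made-up IDs and scores, not LangChain's implementation; the function name `select_hits` is hypothetical.

```python
def select_hits(scored, k=5, score_threshold=0.7):
    """Keep pairs scoring at least `score_threshold`, best first, at most `k` of them.

    `scored` is a list of (chunk_id, score) pairs.
    """
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Made-up scores for illustration only.
scored = [("a", 0.91), ("b", 0.65), ("c", 0.78), ("d", 0.70)]
print(select_hits(scored))        # "b" falls below the threshold
print(select_hits(scored, k=2))   # only the two best remain
```

A high threshold improves precision but can leave the LLM with no context at all, so it is worth checking how many chunks come back while tuning.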

##  **Initializing Google Gemini LLM**

- Purpose: Sets up the Gemini Flash language model for answering queries.
- Parameters:
  - `model='gemini-1.5-flash'`: Specifies the Gemini model.
  - `temperature=0.7`: Controls randomness in responses; lower values make answers more deterministic.

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Configure the Google Gemini
llm = ChatGoogleGenerativeAI(
    model='gemini-1.5-flash',
    temperature=0.7,
)
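As a rough illustration of what temperature does in sampling-based text generation, the sketch below applies a temperature-scaled softmax to made-up token scores. It shows the general technique only and makes no claim about how Gemini applies the parameter internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.2))  # sharply favors the top token
print(softmax_with_temperature(logits, 1.5))  # spreads probability more evenly
```

For question answering over a financial report, a lower temperature is a reasonable starting point because factual consistency matters more than variety.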

##  **Setting Up the Retrieval QA Chain**

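- Creates a Retrieval QA chain:
  - Combines the LLM and retriever.
  - Uses `map_reduce` as the chain type: the LLM processes each retrieved chunk separately, then combines the partial results into a final answer.
  - Sets `return_source_documents=True` so the response includes the chunks that were used.
  - Enables verbose output for debugging.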

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    verbose=True,
)
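The `map_reduce` strategy can be sketched in plain Python with a stand-in for the LLM. This is a simplified illustration of the pattern, not the chain's actual code: the real chain uses its own prompt templates, and the function name `map_reduce_answer`, the `ask` callable, and the prompt wording are all assumptions.

```python
def map_reduce_answer(question, chunks, ask):
    """Answer `question` by querying each chunk separately, then combining the results.

    `ask` is any callable that takes a prompt string and returns a string;
    it stands in for a call to the LLM.
    """
    # Map step: one call per retrieved chunk.
    partials = [ask(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks]
    # Reduce step: one final call over the partial answers.
    combined = "\n".join(partials)
    return ask(f"Partial answers:\n{combined}\n\nQuestion: {question}")
```

With `k` retrieved chunks this pattern makes `k + 1` LLM calls, which is slower and costlier than placing all chunks in a single prompt but keeps each prompt small.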

##  **User Query and Response Display**


- **User Query**: Asks a question about "earnings per share in 2024."
- The chain retrieves relevant document chunks, generates a response, and displays:
  - The query.
  - The model's answer.
  - The source documents used to answer the query.

In [None]:
from IPython.display import display, Markdown

# User Query
query = "What is the earnings per share in 2024?"
response = qa_chain.invoke(query)

# Display query
print("---> User Query....")
print(response.get("query"))

# Display response
print("---> Answer....")
display(Markdown(response.get("result")))
print("---> Source Documents....")
for document in response.get("source_documents"):
    print(document.metadata)
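Several retrieved chunks often come from the same page, so printing raw metadata can repeat itself. The sketch below condenses metadata dictionaries into unique labels. It assumes each dictionary may carry the `source` and `page` keys that `PyPDFLoader` sets (with `page` counted from 0); the helper name `summarize_sources` is hypothetical.

```python
def summarize_sources(metadatas):
    """Return sorted, de-duplicated 'source (page N)' labels from metadata dicts."""
    labels = set()
    for meta in metadatas:
        source = meta.get("source", "unknown source")
        page = meta.get("page")
        labels.add(f"{source} (page {page})" if page is not None else source)
    return sorted(labels)


# Possible usage with the response from the cell above:
# for label in summarize_sources(d.metadata for d in response.get("source_documents", [])):
#     print(label)
```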
