<a href="https://colab.research.google.com/github/raheelam98/generative_ai/blob/main/RAG/pinecone/pinecone_vector_db.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Project: LangChain RAG with Google Gemini Flash and Pinecone**

This project builds a Retrieval-Augmented Generation (RAG) system using LangChain, Google Gemini Flash, and Pinecone. The system retrieves relevant context from a vector database and passes that context to the Gemini model so that it can generate a more accurate, better-informed response.



#### **Implementation of RAG with Pinecone Vector DB**

[Class-05: AI-201- Fundamentals of Agentic AI: Implementation of RAG with Pinecone Vector DB](https://www.youtube.com/watch?v=xQojOkqRbsU)

Model: Google

Vector DB: Pinecone

#### **Tutorials**

[LangChain Pinecone](https://python.langchain.com/docs/integrations/vectorstores/pinecone/)

In [10]:
%%capture --no-stderr
# Install the required packages (the %%capture cell magic must be the first line of the cell):
%pip install -qU langchain-pinecone langchain-google-genai

**Credentials**

In [11]:
from google.colab import userdata

from pinecone import Pinecone, ServerlessSpec

# Read the Pinecone API key from Colab secrets
pinecone_api_key = userdata.get("PINECONE_API_KEY")

# Initialize the Pinecone client
pc = Pinecone(api_key=pinecone_api_key)


In [None]:
print(pc)

<pinecone.control.pinecone.Pinecone object at 0x7dfd3b719fd0>


### **Indexing**

**Connect Vector Store to a Pinecone index**

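The index dimension is 768 because the Google embedding model used in this notebook (`models/embedding-001`) produces 768-dimensional vectors. The index dimension must match the length of the vectors stored in it.

Create the index using the Pinecone client initialized above.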

In [12]:
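import time

# Define the index name
index_name = "rag-pinecone-project-2"

# Create the index only if it does not already exist
# (calling create_index again for an existing name raises an error)
existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=768,  # must match the embedding model's output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Wait until the new index is ready to accept requests
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

# Get a handle to the index
index = pc.Index(index_name)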

In [13]:
print(index)

<pinecone.data.index.Index object at 0x7e08ccbabb10>


### **Embedding**

Use Google Gemini embeddings to vectorize documents.

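Embeddings are vector representations of text. LangChain's embedding classes produce them so that text can be compared and retrieved by meaning rather than by exact wording.

"Semantic" refers to the meaning of words, phrases, or symbols in a specific context. Texts with similar meanings are mapped to vectors that lie close together.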
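To make "close together" concrete, here is a minimal sketch of cosine similarity, the metric chosen with `metric="cosine"` when the index was created. The three-dimensional vectors and their names are hypothetical toy values, not real Gemini embeddings, which have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values chosen for illustration only)
pancakes = [0.9, 0.1, 0.0]
breakfast = [0.8, 0.2, 0.1]
stocks = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(pancakes, breakfast))
# Vectors pointing in very different directions score close to 0
print(cosine_similarity(pancakes, stocks))
```

A score near 1 means the two vectors point in almost the same direction, which is how the vector store ranks stored documents against a query.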

**LangChain Google Embeddings**

[Google Generative AI Embeddings](https://python.langchain.com/docs/integrations/text_embedding/google_generative_ai/)

In [14]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")


In [15]:
vector = embeddings.embed_query("Pinecone Vector DB!")
vector[:5]  # show the first 5 values

[0.012879692018032074,
 -0.06462990492582321,
 -0.05366327241063118,
 -0.021048594266176224,
 0.035343751311302185]
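Since the index was created with `dimension=768`, it can be worth confirming that the embedding length matches before adding any documents. The sketch below uses a hypothetical helper, `check_dimension`, and a stand-in list of zeros so that it runs without an API key; in the notebook, the `vector` returned by `embed_query` would be passed instead.

```python
EXPECTED_DIMENSION = 768  # the value passed to pc.create_index() earlier


def check_dimension(vector, expected=EXPECTED_DIMENSION):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, index expects {expected}"
        )
    return True


# Stand-in for the output of embeddings.embed_query(...)
sample_vector = [0.0] * 768
print(check_dimension(sample_vector))
```

A mismatch here usually means the embedding model was changed after the index was created, in which case a new index with the right dimension is needed.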

In [16]:
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(index=index, embedding=embeddings)

### **Manage vector store**

**Add items to vector store**

Add items to the vector store with the **`add_documents`** method.

In [17]:
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocalate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

In [18]:
print(document_1)

page_content='I had chocalate chip pancakes and scrambled eggs for breakfast this morning.' metadata={'source': 'tweet'}


In [21]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocalate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
# uuids = [str(uuid4()) for _ in range(len(documents))]

# vector_store.add_documents(documents=documents, ids=uuids)

In [22]:
len(documents)

10

In [27]:
from uuid import uuid4

uuid4()

UUID('58063efb-fe70-4f67-ab14-bdfd491e34e1')

#### Add data to the vector database using **`add_documents()`**

**`from langchain_pinecone import PineconeVectorStore`**

**`vector_store = PineconeVectorStore(index=index, embedding=embeddings)`**

**`vector_store.add_documents(documents=documents, ids=uuids)`**


In [28]:
uuids = [str(uuid4()) for _ in range(len(documents))]

In [33]:
vector_store.add_documents(documents=documents, ids=uuids)

['139c3175-2dcb-41db-bf54-c6a40b6d9068',
 '90f1ea7d-104e-4194-8652-3cdc0360abee',
 '3bb149e5-2572-45cf-aed5-1cd000f4cc44',
 'be7c528b-63f0-4da2-a600-d604944f535e',
 'a50e03d2-9859-4dd6-95f2-0257eb4fda7e',
 '8d1f908d-61ca-4485-b089-bd9e388943b2',
 '99a76ddc-2352-42cf-abd8-624d38e60a0d',
 'ff8ad444-77bb-42a6-9df1-82e1f638c38d',
 'bac5c206-4abc-45ee-aedf-bd28b0be98f7',
 '3d972d51-a226-45d9-807b-80850f5b4655']

#### Data Retrieval

**`similarity_search()`**

In [39]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=3,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
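The repeated LangGraph line above most likely means the documents were added more than once, each time with fresh `uuid4()` IDs, so Pinecone stored separate copies. One possible way to avoid this, shown here as a sketch rather than the notebook's own method, is to derive each ID from the document itself so that re-adding the same document overwrites the existing record. The helper `stable_id` is hypothetical.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_id(text, source):
    """Derive a repeatable ID from a document's source and content."""
    return str(uuid5(NAMESPACE_URL, f"{source}:{text}"))


# (source, text) pairs taken from the documents defined earlier
texts = [
    ("tweet", "Building an exciting new project with LangChain - come check it out!"),
    ("tweet", "LangGraph is the best framework for building stateful, agentic applications!"),
]

ids_first_run = [stable_id(text, source) for source, text in texts]
ids_second_run = [stable_id(text, source) for source, text in texts]

# The same content always maps to the same IDs
print(ids_first_run == ids_second_run)
```

In the notebook, the IDs could be built with `[stable_id(d.page_content, d.metadata["source"]) for d in documents]` and passed as `ids=` to `add_documents`.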


**`similarity_search_with_score()`**

In [40]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?"
)
for res, score in results:
    print(f"* [SIM={score:.6f}] {res.page_content} [{res.metadata}]")

* [SIM=0.668031] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
* [SIM=0.667716] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
* [SIM=0.577411] I have a bad feeling I am going to get deleted :( [{'source': 'tweet'}]
* [SIM=0.577374] I have a bad feeling I am going to get deleted :( [{'source': 'tweet'}]
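When the index contains duplicate copies of a document, as in the output above, the scored results can be collapsed before use. The following sketch keeps the best score for each distinct text. The helper `dedupe_by_content` is hypothetical, and it works on plain `(text, score)` tuples copied from the output above so that it runs without a Pinecone connection.

```python
def dedupe_by_content(scored_results):
    """Keep the highest score per distinct text, ordered from best to worst."""
    best = {}
    for text, score in scored_results:
        if text not in best or score > best[text]:
            best[text] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


weather = "The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees."
deleted = "I have a bad feeling I am going to get deleted :("

# (text, score) pairs as printed by similarity_search_with_score above
scored = [
    (weather, 0.668031),
    (weather, 0.667716),
    (deleted, 0.577411),
    (deleted, 0.577374),
]

for text, score in dedupe_by_content(scored):
    print(f"* [SIM={score:.6f}] {text}")
```

In the notebook, the input could be built from the real results with `[(doc.page_content, score) for doc, score in results]`.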


In [41]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=2, filter={"source": "news"}
)
for res, score in results:
    print(f"* [SIM={score:.6f}] {res.page_content} [{res.metadata}]")

* [SIM=0.668031] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
* [SIM=0.668031] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
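The retrieved documents can then serve as context for the Gemini model, which is the generation step of RAG. Below is a minimal sketch of that hand-off. The helper `build_rag_prompt` and the prompt wording are assumptions, and the commented-out Gemini call uses an assumed model name and needs a valid `GOOGLE_API_KEY`.

```python
def build_rag_prompt(question, context_passages):
    """Combine retrieved passages and the user question into one prompt string."""
    context = "\n".join(f"- {passage}" for passage in context_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Passage copied from the retrieval output above
passages = [
    "The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees."
]
prompt = build_rag_prompt("Will it be hot tomorrow?", passages)
print(prompt)

# In the notebook, the prompt could be sent to Gemini, for example
# (assumed model name; requires GOOGLE_API_KEY to be set):
# from langchain_google_genai import ChatGoogleGenerativeAI
# llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
# print(llm.invoke(prompt).content)
```

With live results, the passages would come from `[doc.page_content for doc in vector_store.similarity_search(question, k=2)]`.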
