Links

- [LangChain and Pinecone](https://python.langchain.com/docs/integrations/vectorstores/pinecone/)
- [Tokenizer used for OpenAI `text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/how-can-i-tell-how-many-tokens-a-string-has-before-i-embed-it#faq)

### Using Pinecone as a vector store

In [None]:
import os

from dotenv import load_dotenv
from pinecone import Pinecone

# Load variables from a local .env file into the process environment
load_dotenv()
pinecone_api_key = os.environ.get("PINECONE_API_KEY")

# Create the Pinecone client
pc = Pinecone(api_key=pinecone_api_key)
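
If `PINECONE_API_KEY` is not set, `os.environ.get` returns `None` and the problem only surfaces later, when the client is created or first used. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of the notebook or of any library):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper used only for illustration.
    """
    value = os.environ.get(name)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

Under that assumption, the cell above could call `pinecone_api_key = require_env("PINECONE_API_KEY")` right after `load_dotenv()`.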

In [None]:
os.environ["LANGSMITH_PROJECT"] = "llm-training-05-rag-p3"


In [None]:
index_name = "rag-class"
index = pc.Index(index_name)

In [6]:
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

In [10]:
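# Load the exported Confluence pages from a JSON file
import json

confluence_file = "resources_rag/cf_bts_pages.json"
# Explicit UTF-8 so non-ASCII characters (e.g. emoji in titles) load on any platform
with open(confluence_file, "r", encoding="utf-8") as file:
    confluence_data = json.load(file)
print("total items", len(confluence_data))
print("first item", confluence_data[0])
print("keys", confluence_data[0].keys())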


total items 53
keys dict_keys(['page_id', 'title', 'content', 'source_url'])


In [40]:
# Split each Confluence page into chunks and wrap every chunk in a Document
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk size is measured in tokens of the embedding model's tokenizer, not characters
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="text-embedding-3-small",
    chunk_size=500,
    chunk_overlap=0,
)

chunks_by_page = {}
for item in confluence_data:
    content = item.get("content", "")
    if not content:
        continue  # skip pages without content
    metadata = {
        "source": "confluence",
        "title": item.get("title", ""),
        "page_id": item.get("page_id", ""),
        "source_url": item.get("source_url", ""),
    }
    # Keep only non-blank chunks; each Document gets its own copy of the metadata
    chunks_by_page[item["page_id"]] = [
        Document(page_content=chunk, metadata=dict(metadata))
        for chunk in text_splitter.split_text(content)
        if chunk.strip()
    ]
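
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size windowing over a list of tokens. It is a stand-in for illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and lines), and `window_chunks` is a hypothetical name:

```python
def window_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int = 0) -> list[list[str]]:
    """Split tokens into windows of at most chunk_size, sharing chunk_overlap tokens.

    Hypothetical helper for illustration; plain words stand in for real tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


words = "a b c d e f g h i j".split()
no_overlap = window_chunks(words, chunk_size=4, chunk_overlap=0)
with_overlap = window_chunks(words, chunk_size=4, chunk_overlap=1)
```

With `chunk_overlap=0`, as in the cell above, no token appears in two chunks. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing more vectors.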


In [42]:
# List the pages that were split into more than one chunk
for page_id, docs in chunks_by_page.items():
    if len(docs) > 1:
        print("page_id:", page_id, "chunks:", len(docs))

page_id: 84181086 chunks: 4
page_id: 3781820418 chunks: 5
page_id: 3732996097 chunks: 5
page_id: 3640098830 chunks: 3
page_id: 3639377935 chunks: 2
page_id: 3305373697 chunks: 4
page_id: 3305177089 chunks: 2
page_id: 3227648001 chunks: 7
page_id: 3226894344 chunks: 4
page_id: 3157917717 chunks: 4
page_id: 3014721557 chunks: 3
page_id: 2606006486 chunks: 2
page_id: 232652946 chunks: 9
page_id: 2095841353 chunks: 2
page_id: 206321 chunks: 3
page_id: 205857 chunks: 4
page_id: 205508 chunks: 2
page_id: 205288 chunks: 2
page_id: 1615757337 chunks: 2
page_id: 1612087419 chunks: 8
page_id: 1599930482 chunks: 4
page_id: 1536688133 chunks: 2
page_id: 1475608680 chunks: 8
page_id: 2993979393 chunks: 10
page_id: 3158540324 chunks: 4


In [44]:
# Flatten the per-page chunk lists into a single list of Documents
documents = [doc for docs in chunks_by_page.values() for doc in docs]
documents

 Document(metadata={'source': 'confluence', 'title': '🗂 The Mentoring Program', 'page_id': '84181086', 'source_url': 'https://bluetrailsoft.atlassian.net/wiki/spaces/BTS/pages/84181086/The+Mentoring+Program'}, page_content='Mentoring Program Presentation\n------------------------------\n\n* Technical Mentoring Program.pdf\n\nFirst Mentoring Meeting Agenda\n------------------------------\n\n1. What do you wish to be mentored in?\n2. How do you wish to be mentored?\n3. When do you wish to start?\n4. What days and times of the week you could spend on mentoring tasks?\n5. Anything else you wish to discuss?\n\nCurrent Mentees\n---------------'),
 Document(metadata={'source': 'confluence', 'title': '🗂 The Mentoring Program', 'page_id': '84181086', 'source_url': 'https://bluetrailsoft.atlassian.net/wiki/spaces/BTS/pages/84181086/The+Mentoring+Program'}, page_content='| Mentee | Mentors | Start Date |\n| --- | --- | --- |\n| [Delgado, Miguel](https://bluetrailsoft.atlassian.net/wiki/spaces/BTS

In [46]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store.add_documents(documents=documents, ids=uuids)

['d8dcc470-0e7f-4c21-a5f7-0cd4ad765df9',
 'cea2ff6f-d2eb-439c-9e08-5d419094219a',
 'fce0a370-65d9-44c4-be6d-84054da90269',
 'a0f4326f-0ae4-4269-8347-393f4008f128',
 '0d4256ba-5666-4d25-af23-f46ea6ad373c',
 '5f61b838-f958-48bb-8001-5aec88338a49',
 '20dc6feb-2f0d-4374-80ed-ffd0f02ff577',
 'f164c5b0-d36b-428f-8f1b-512e4d85bf59',
 '3952446d-7b10-4d2f-9f7a-ec48a54dec63',
 'c40ee5dd-084e-4f72-8bf7-2f7843c20b68',
 '5ff0d691-e058-4607-9df1-3843196c127f',
 '8cd4f493-b269-4728-bd35-b7d0e0a12c91',
 '6d9dce46-d486-44f4-90c2-9ccf5b9e6266',
 'a8c9eb3f-5d89-4dc9-bc74-2c2c1bf09af9',
 'fd5d8830-9bf9-49ee-afe1-49ce02b5d9f4',
 '49e9224a-00fa-4810-9efb-05d43862e8a2',
 '81784d4c-32df-4ed4-b10d-9cac11df72a4',
 '22d4120d-e90d-4a90-baf1-1f24eb5a840e',
 'ba22d7b4-3c23-454c-88cd-8c417a16578c',
 'ec893c0e-88d0-41ff-bf70-ff4b51bf68e2',
 'bb9a4cbf-a6fe-4d62-8280-cd1819207f3b',
 'be0df98e-8168-4ae4-8e28-b6946ccf9607',
 '25897968-1167-40c8-b3c2-f921a55a54c0',
 '788ba767-b7d4-40bf-b05a-4600390727c6',
 '4b78cc77-3058-
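
Because `uuid4` produces fresh random IDs on every run, re-running the cell above adds a second copy of every chunk to the index. One possible alternative, sketched here under the assumption that a page's URL plus the chunk's position identifies a chunk well enough, is to derive deterministic IDs with `uuid5` so that a re-run overwrites the existing vectors instead. `chunk_id` is a hypothetical helper, not part of the notebook:

```python
from uuid import NAMESPACE_URL, uuid5


def chunk_id(source_url: str, chunk_index: int) -> str:
    """Build a stable ID from the page URL and the chunk's position on that page.

    Hypothetical helper for illustration: the same inputs always give the same UUID.
    """
    return str(uuid5(NAMESPACE_URL, f"{source_url}#chunk-{chunk_index}"))
```

The IDs could then be built in the same order as `documents`, for example `[chunk_id(doc.metadata["source_url"], i) for docs in chunks_by_page.values() for i, doc in enumerate(docs)]`. One caveat of this design: if a page later shrinks to fewer chunks, the vectors for the old trailing chunks stay in the index until they are deleted explicitly.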

#### Testing Pinecone

In [53]:
# Sample questions
q = [
    "What are the holidays in Peru?",
    "How do I request vacation time?",
    "Where can I place an order?",
]

In [54]:
question = q[0]

Testing with similarity search

In [None]:
results = vector_store.similarity_search(
    query=question,
    k=2,
    filter={"source": "confluence"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* | **2025** | | |
| --- | --- | --- |
| **Date** | **Day** | **Holiday** |
| January 1st | Wednesday | New Year |
| April 17th | Thursday | Holy Thursday |
| April 18th | Friday | Holy Friday |
| May 1st | Thursday | Labour Day |
| June 7th | Thursday | Arica´s Battle & Flag´s Day |
| July 28th | Monday | National Holidays |
| July 29th | Tuesday | National Holidays |
| October 8th | Wednesday | Combate de Angamos |
| December 9th | Tuesday | Commemoration of the Battle of Ayacucho |
| December 25th | Thursday | Christmas |

| **2024** | | |
| --- | --- | --- |
| **Date** | **Day** | **Holiday** |
| January 1st | Monday | New Year |
| March 28th | Thursday | Holy Thursday |
| March 29th | Friday | Holy Friday |
| May 1st | Wednesday | Worker´s Day |
| July 29th | Monday | National Holidays |
| August 30th | Friday | Santa Rosa de Lima |
| October 8th | Tuesday | Combate de Angamos |
| November 1st | Friday | All Saints' Day |
| December 9th | Monday | Commemoration of the Battle of Ay

Testing similarity search with score

In [57]:
results = vector_store.similarity_search_with_score(
    question, k=1, filter={"source": "confluence"}
)
for res, score in results:
    print(f"* [SIM={score:.6f}] [{res.metadata}]")

* [SIM=0.597108] [{'page_id': '3221258329', 'source': 'confluence', 'source_url': 'https://bluetrailsoft.atlassian.net/wiki/spaces/BTS/pages/3221258329/Approved+Holiday+List+Peruvians+Consultants', 'title': 'Approved Holiday List: Peruvians Consultants'}]
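
How to read the score depends on the similarity metric the index was created with, which the notebook does not state. Assuming the index uses cosine similarity, the score is the cosine of the angle between the query embedding and the chunk embedding, so higher means more similar. A minimal sketch of that computation on toy vectors (real embeddings have many more dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.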
