<a href="https://colab.research.google.com/github/siddhesh1503/NLP/blob/main/Document_Retrieval_HuggingFace_Pinecone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📝 NLP Experiment: Document Retrieval with Pinecone & HuggingFace Embeddings

This experiment demonstrates how to load documents, preprocess them, generate embeddings using **HuggingFace models**, and store/retrieve them using **Pinecone Vector Database**.

## Step 1: Install Dependencies

In [None]:

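!pip install --upgrade pip
# langchain-pinecone installs the Pinecone client it needs; python-dotenv and
# langchain-huggingface are imported in later cells.
!pip install langchain-community langchain-text-splitters langchain-huggingface langchain-pinecone sentence-transformers pypdf python-dotenv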




## Step 2: Import Required Libraries

In [None]:

import os
from dotenv import load_dotenv

# Load variables from a local .env file, if one exists
load_dotenv()

# Document loaders live in langchain_community; the splitter has its own package
from langchain_community.document_loaders import PyPDFDirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pinecone import Pinecone as PineconeClient


## Step 3: Load Documents (PDF/TXT)

In [None]:
def load_documents(directory):
    """Load every PDF and TXT file found directly inside `directory`."""
    docs = []
    try:
        pdf_loader = PyPDFDirectoryLoader(directory)
        docs.extend(pdf_loader.load())
    except Exception as e:
        print(f"Error loading PDF documents: {e}")

    # TextLoader reads a single file and does not expand "*.txt",
    # so list the directory and load each text file individually.
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".txt"):
            continue
        try:
            docs.extend(TextLoader(os.path.join(directory, name)).load())
        except Exception as e:
            print(f"Error loading {name}: {e}")

    print(f"✅ Loaded {len(docs)} documents")
    return docs

import requests

url = "https://arxiv.org/pdf/1706.03762.pdf"  # example PDF
os.makedirs("/content/documents", exist_ok=True)

r = requests.get(url, timeout=60)
r.raise_for_status()  # stop early if the download failed
with open("/content/documents/paper.pdf", "wb") as f:
    f.write(r.content)

docs = load_documents("/content/documents/")

✅ Loaded 15 documents


## Step 4: Split Documents into Chunks

In [None]:

def chunk_data(docs, chunk_size=500, chunk_overlap=100):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(docs)
    print(f"✅ Split into {len(chunks)} chunks")
    return chunks

chunks = chunk_data(docs)


✅ Split into 0 chunks
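
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences). The helper name `sliding_chunks` and the sample string are hypothetical and exist only for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample input: small sizes make the overlap easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks repeat the last `chunk_overlap` characters. That repetition helps keep a sentence that straddles a boundary retrievable from at least one chunk.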


## Step 5: Initialize HuggingFace Embeddings & Pinecone

In [None]:
# Initialize HuggingFace embeddings
from langchain_huggingface import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}  # change to 'cuda' if you have a GPU
)

# Initialize Pinecone client
from google.colab import userdata
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
pc = PineconeClient(api_key=PINECONE_API_KEY)

index_name = "huggingface-doc-index"

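# Create the index if it does not exist yet.
# The cloud and region below are assumed values; adjust them to your Pinecone project.
from pinecone import ServerlessSpec

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # output size of all-MiniLM-L6-v2
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )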

# Connect LangChain Pinecone wrapper
from langchain_pinecone import PineconeVectorStore

# The wrapper looks up the API key in the environment
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embedding_model,
    index_name=index_name,
    namespace="exp10",
)
print("✅ Pinecone index ready")
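
The index is created with `dimension=384` because `all-MiniLM-L6-v2` produces 384-dimensional vectors, and an index will not accept vectors of any other length. The sketch below shows one way to check that before uploading. The helper `check_dimensions` and the placeholder vectors are hypothetical and are not part of Pinecone or LangChain.

```python
EXPECTED_DIM = 384  # must equal the `dimension` passed to create_index


def check_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Placeholder vectors: the second one mimics output from a 768-dimensional model
fake_vectors = [[0.0] * 384, [0.0] * 768, [0.0] * 384]
bad = check_dimensions(fake_vectors)
print(bad)
```

In this notebook the same check could be run on the output of `embedding_model.embed_documents(...)` for a few chunks before calling `from_documents`.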

## Step 6: Query the Vector Database

In [None]:

query = "Explain the key idea of the documents."
results = vectorstore.similarity_search(query, k=5)

print("🔎 Top Retrieved Chunks:")
for i, res in enumerate(results):
    print(f"--- Result {i+1} ---")
    print(res.page_content[:300], "...\n")
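
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it under the index metric (cosine here). The toy sketch below illustrates that idea with a brute-force scan in plain Python. The three-dimensional vectors and their labels are made up for illustration, and a hosted vector database typically relies on approximate nearest-neighbour indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=5):
    """Score every (text, vector) pair against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real ones in this notebook have 384 dimensions
toy_store = [
    ("attention mechanisms", [0.9, 0.1, 0.0]),
    ("recurrent networks", [0.1, 0.9, 0.0]),
    ("training schedule", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_store, k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

The query vector points almost exactly along the first stored vector, so that entry ranks first, while the entry orthogonal to the query scores zero and is cut by `k=2`.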


## Step 7: Wrap Up

In [None]:

print("Experiment Completed ✅ - You can now test with different queries.")
