# **Semantic Document Analysis with OpenAI, Pinecone, and LangChain**  
## **Overview**  

This Jupyter notebook provides a tool for performing **semantic search** across a collection of **PDF documents**.
It is designed to help researchers, analysts, and students efficiently find conceptually related information without relying on exact keyword matches.
Using **Pinecone** for vector storage and **OpenAI embeddings** for generating dense semantic vectors, the notebook extracts, indexes, and queries document content.
The result is a searchable knowledge base built from your static PDFs that supports **semantic reading**: retrieving passages by meaning rather than by exact wording.

---

### **Technical Approach**

#### **Dense Vector Semantic Search**
- Uses the OpenAI **text-embedding-3-large** model (3072 dimensions)
- Performs **nearest neighbor search** in high-dimensional vector space
- Enables **similarity search** based on semantic meaning and context
- Well suited to finding conceptually related content even when the wording differs
- **Limitation**: May miss exact keyword matches, especially domain-specific terms
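
As a rough illustration of the nearest neighbor idea, the sketch below ranks a few made-up three-dimensional vectors by cosine similarity using a brute-force scan. The helper names and the toy vectors are assumptions for illustration only; real embeddings have 3072 dimensions, and a vector database such as Pinecone uses its own indexing rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, indexed_vecs, top_k=2):
    """Return the top_k (id, score) pairs, most similar first."""
    scored = [
        (vec_id, cosine_similarity(query_vec, vec))
        for vec_id, vec in indexed_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up toy "embeddings" (hypothetical values, not model output).
toy_index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
top_matches = nearest_neighbors(query, toy_index, top_k=2)
print(top_matches)
```

With these toy values, `chunk-a` and `chunk-c` rank ahead of `chunk-b`, because their vectors point in nearly the same direction as the query.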

### **Key Features**  

#### **Document Processing**  
- Extracts text from **PDF documents** while maintaining **page-level references**.  
- Handles files with **year-based naming conventions** (e.g., "1946-document-name.pdf").  
- Skips unwanted pages (e.g., covers, introductions) for focused analysis.  
- Splits text into **manageable chunks** for efficient indexing and search.  
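
The naming and page-skipping conventions above can be sketched with two small helpers. The function names, the `FRONT_MATTER_PAGES` table, and the sample filename are hypothetical; the year check here is also slightly stricter than a plain "first four characters are digits" test, since it requires a non-digit after the year.

```python
import re

# Hypothetical per-file count of front-matter pages to skip.
FRONT_MATTER_PAGES = {"1946-example-yearbook.pdf": 5}

def publish_year(filename):
    """Return the leading four-digit year of a filename, or 'Unknown'."""
    match = re.match(r"(\d{4})\D", filename)
    return match.group(1) if match else "Unknown"

def content_page_number(filename, pdf_page_index):
    """Map a 0-based PDF page index to a 1-based content page.

    Returns None for pages that belong to the skipped front matter.
    """
    skipped = FRONT_MATTER_PAGES.get(filename, 0)
    if pdf_page_index < skipped:
        return None
    return pdf_page_index - skipped + 1

print(publish_year("1946-example-yearbook.pdf"))
print(content_page_number("1946-example-yearbook.pdf", 5))
```

Keeping the offset table separate from the extraction code means a new document only needs one extra dictionary entry.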

#### **Semantic Search**  
- Uses **OpenAI embeddings** to enable **semantic understanding** of queries.  
- Stores processed documents in **Pinecone**, a vector database, for fast and scalable search.  
- Supports **customizable queries** to find relevant passages based on meaning, not just keywords.  

#### **Analysis & Output**  
- Displays each retrieved chunk as a **contextual excerpt**, with its central portion highlighted.  
- Preserves **metadata** such as source file, page number, and publication year.  
- Outputs results in a **readable format** for both console and file saving.  
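
If an excerpt centred on a specific term is wanted, one possible approach is sketched below. `excerpt_around` is a hypothetical helper, not part of the notebook; it marks the first case-insensitive occurrence of a term with `<<< >>>` and keeps a fixed window of characters on either side.

```python
def excerpt_around(text, term, window=40):
    """Return the first occurrence of term with surrounding context, or None."""
    # str.lower() keeps the string length for most scripts, so offsets found
    # in the lowercased copy are reused on the original text. This is an
    # assumption that can fail for a few characters (for example "İ").
    position = text.lower().find(term.lower())
    if position == -1:
        return None
    match_end = position + len(term)
    start = max(0, position - window)
    end = min(len(text), match_end + window)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return (
        f"{prefix}{text[start:position]}"
        f"<<< {text[position:match_end]} >>>"
        f"{text[match_end:end]}{suffix}"
    )

sample = "The church council met in spring."
print(excerpt_around(sample, "council", window=4))
```

Because semantic search can return passages that never contain the query words, a helper like this should be expected to return `None` for some results.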

---

### **How It Works**  
1. **Document Extraction**: Text is extracted from PDFs, skipping specified pages and splitting content into chunks.  
2. **Indexing**: Each text chunk is vectorized and stored in Pinecone for fast retrieval.  
3. **Querying**: Users can search for specific terms, phrases, or concepts, and the tool retrieves semantically similar results.  
4. **Context Preservation**: Matching passages are displayed with surrounding context for better understanding.  
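
The four steps can be sketched end to end with an in-memory stand-in. Everything here is a simplification under stated assumptions: `toy_embed` is a word-count vector rather than a real embedding (so it only matches shared words, not meaning), the list `index` stands in for Pinecone, and the fixed-size splitter is much cruder than a recursive text splitter. The sample sentences are made up.

```python
import math
import re
from collections import Counter

def split_into_chunks(text, chunk_size=80, overlap=20):
    """Fixed-size character windows that overlap by `overlap` characters."""
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def build_index(pages):
    """pages maps page number to text; returns records with vector and metadata."""
    records = []
    for page_number, text in pages.items():
        for chunk_number, chunk in enumerate(split_into_chunks(text), start=1):
            records.append({
                "vector": toy_embed(chunk),
                "text": chunk,
                "page": page_number,
                "chunk": chunk_number,
            })
    return records

def search(records, query, top_k=3):
    """Rank all records against the query and keep the best top_k."""
    query_vec = toy_embed(query)
    ranked = sorted(records, key=lambda rec: cosine(query_vec, rec["vector"]), reverse=True)
    return ranked[:top_k]

pages = {
    1: "The synod discussed the rebuilding of parish schools after the war.",
    2: "Harvest festivals were celebrated in every village that autumn.",
}
index = build_index(pages)
best = search(index, "parish schools", top_k=1)[0]
print(best["page"], best["text"])
```

Each record carries its page and chunk number alongside the vector, which is what lets a result be traced back to its place in the source document.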

---

### **Getting Started**  
1. Install the required Python libraries.  
2. Upload your **PDF documents** to the specified directory.  
3. Configure your **API keys** for Pinecone and OpenAI when prompted.  
4. Run the notebook to process, index, and query your documents.  

---

# **Pinecone and OpenAI Embeddings**

This Jupyter notebook demonstrates how to use **Pinecone** and **OpenAI embeddings** to process and search PDF documents. Students will learn how to:

1. Set up Pinecone and OpenAI APIs.
2. Extract and split text from PDFs.
3. Store and query embeddings in Pinecone.
4. Modify parameters like `top_k` and query text to experiment with semantic search results.
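
When experimenting with `top_k`, it helps to remember that duplicate passages are dropped after retrieval, so fewer than `top_k` results may be shown. The sketch below imitates that behaviour on made-up `(text, score)` pairs; `top_unique_matches` is a hypothetical helper, not a Pinecone or LangChain function.

```python
import heapq

def top_unique_matches(scored_chunks, top_k=10):
    """Keep the top_k highest-scoring chunks, then drop repeated texts."""
    best = heapq.nlargest(top_k, scored_chunks, key=lambda item: item[1])
    seen, unique = set(), []
    for text, score in best:
        key = text.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append((text, score))
    return unique

# Made-up scores; "alpha" appears twice to simulate a duplicated chunk.
scored = [("alpha", 0.91), ("beta", 0.42), ("alpha", 0.88), ("gamma", 0.67)]
for k in (1, 3, 4):
    print(k, top_unique_matches(scored, top_k=k))
```

Here `top_k=3` yields only two results, because two of the three best-scoring entries share the same text.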


## **Step 1: Setup**

First, install the required libraries. API keys are configured in Step 3.

In [None]:
!pip install pinecone langchain-pinecone langchain-openai langchain langchain-text-splitters openai pymupdf

## Step 2: Import Libraries

In [None]:
import os
import datetime
import hashlib
import re
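import fitz  # PyMuPDF, used for PDF text extraction
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Newer LangChain releases expose these from langchain_core / langchain_text_splitters;
# fall back to the legacy import paths on older installations.
try:
    from langchain_core.documents import Document
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter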

print("✅ All libraries imported successfully!")


## **Step 3: API Key Configuration**

This step configures the API keys required for Pinecone (vector database) and OpenAI (embeddings). Both services require an account and an API key. Pinecone offers a free tier, while OpenAI requires a billing method to be set up.

### **🔐 Required API Keys**

You'll need API keys from two services:
1. **Pinecone** - For vector database storage and retrieval
2. **OpenAI** - For text embedding generation
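
One way to avoid typing keys into visible notebook output is to read them from environment variables and fall back to a hidden prompt. The sketch below is an assumption-laden example: `read_secret` is a hypothetical helper, and the variable names `PINECONE_API_KEY` and `OPENAI_API_KEY` follow the names this notebook reads.

```python
import getpass
import os

def read_secret(env_name, prompt, ask=getpass.getpass):
    """Return the environment variable if set, otherwise prompt without echoing."""
    value = os.getenv(env_name, "").strip()
    if not value:
        value = ask(prompt).strip()  # getpass hides the typed characters
    if not value:
        raise ValueError(f"{env_name} is required")
    return value

# Example usage (commented out so the cell does not block on input):
# PINECONE_API_KEY = read_secret("PINECONE_API_KEY", "Pinecone API key: ")
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY", "OpenAI API key: ")
```

The `ask` parameter exists only so the prompt can be replaced in tests or scripts.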

### **📋 Pinecone API Key Setup**

#### **What is Pinecone?**
Pinecone is a vector database service that stores and searches high-dimensional vectors efficiently, which makes it well suited to semantic search applications.

#### **How to get your Pinecone API Key:**

1. **Sign Up for Pinecone**
   - Visit: [https://www.pinecone.io/](https://www.pinecone.io/)
   - Click **"Sign Up"** (free tier available)
   - Create your account with email verification

2. **Access API Keys**
   - Log into your [Pinecone Console](https://app.pinecone.io/)
   - Navigate to **"API Keys"** in the left sidebar
   - Click **"Create API Key"**

3. **Copy Your API Key**
   - Copy the generated API key (the prefix varies by key version)
   - **⚠️ Important**: Save it securely - you won't see it again!

4. **Choose Your Environment (Region)**
   - Default region: `us-east-1` (recommended for beginners)
   - Other options: `us-west-2`, `eu-west-1`, etc.
   - The notebook passes this value as the AWS region when it creates a serverless index

#### **💡 Pinecone Free Tier Includes:**
- 1 project
- 1 index
- 5M vector dimensions
- Enough for this tutorial and small projects

---

### **🤖 OpenAI API Key Setup**

#### **What is the OpenAI API?**
The OpenAI API provides access to language models, including embedding models that convert text into numerical vectors for semantic search.

#### **How to get your OpenAI API Key:**

1. **Sign Up for OpenAI**
   - Visit: [https://platform.openai.com/](https://platform.openai.com/)
   - Click **"Sign up"**
   - Create an account or sign in

2. **Access API Section**
   - Go to [OpenAI API Platform](https://platform.openai.com/api-keys)
   - Or navigate to **"API"** → **"API keys"**

3. **Create New API Key**
   - Click **"+ Create new secret key"**
   - Give it a descriptive name (e.g., "Text Analysis Tool")
   - Copy the key (starts with `sk-...`)
   - **⚠️ Critical**: Store securely - this is shown only once!

4. **Set Up Billing (Required)**
   - Go to [Billing Settings](https://platform.openai.com/account/billing)
   - Add a payment method
   - Set usage limits to control costs
   - **💰 Cost**: Embedding API is very


In [None]:
# API Key Configuration with User Input
def get_api_keys():
    """Get API keys from environment variables or user input."""
    api_keys = {}
    
    # Get Pinecone API key
    # Get your free API key from: https://www.pinecone.io/
    pinecone_api_key = os.getenv('PINECONE_API_KEY')
    if not pinecone_api_key:
        pinecone_api_key = input("Enter your Pinecone API key: ").strip()
    
    # Get OpenAI API key  
    # Get your API key from: https://platform.openai.com/api-keys
    openai_api_key = os.getenv('OPENAI_API_KEY')
    if not openai_api_key:
        openai_api_key = input("Enter your OpenAI API key: ").strip()
    
    # Get Pinecone Environment (optional, defaults to us-east-1)
    pinecone_env = os.getenv('PINECONE_ENV')
    if not pinecone_env:
        pinecone_env = input("Enter your Pinecone environment (default: us-east-1): ").strip()
        if not pinecone_env:
            pinecone_env = "us-east-1"
    
    # Validate API keys
    if pinecone_api_key and openai_api_key:
        api_keys['pinecone'] = pinecone_api_key
        api_keys['openai'] = openai_api_key
        api_keys['pinecone_env'] = pinecone_env
        print("🔑 All API keys configured!")
        return api_keys
    else:
        print("❌ Need both Pinecone and OpenAI API keys to continue.")
        print("Get Pinecone API key from: https://www.pinecone.io/")
        print("Get OpenAI API key from: https://platform.openai.com/api-keys")
        return None

# Get API keys
api_keys = get_api_keys()

if api_keys:
    PINECONE_API_KEY = api_keys['pinecone']
    OPENAI_API_KEY = api_keys['openai']
    PINECONE_ENV = api_keys['pinecone_env']
    print("🚀 Text Pattern Analysis Tool ready!")
else:
    print("❌ Cannot continue without API keys.")
    raise RuntimeError("API keys required to continue")


## Step 4: Initialize Pinecone and OpenAI embeddings
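
The dimension of a Pinecone index has to match the output size of the embedding model, so it can help to keep the two together. The lookup below is a minimal sketch: `EMBEDDING_DIMENSIONS` and `index_dimension_for` are hypothetical names, and the values are the default output sizes of these OpenAI models.

```python
# Default output sizes of some OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

def index_dimension_for(model_name):
    """Return the vector size an index needs for the given embedding model."""
    if model_name not in EMBEDDING_DIMENSIONS:
        raise ValueError(f"Unknown embedding model: {model_name!r}")
    return EMBEDDING_DIMENSIONS[model_name]

print(index_dimension_for("text-embedding-3-large"))
```

Switching to a different model later would mean creating a new index with the matching dimension, since vectors of different sizes cannot share an index.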


In [None]:
try:
    # Initialize Pinecone
    pc = Pinecone(api_key=PINECONE_API_KEY)
    
    # Initialize OpenAI embeddings
    model_name = 'text-embedding-3-large'  # 3072 dimensions
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model=model_name)
    
    # Set up the index name
    index_name = "tdm-test3"
    
    print("🔧 Initializing Pinecone and OpenAI embeddings...")
    print(f"📊 Using OpenAI model: {model_name}")
    print(f"🗂️ Pinecone index: {index_name}")
    print("✅ Initialization successful!")
    
except Exception as e:
    print(f"❌ Error initializing services: {str(e)}")
    print("Please check your API keys and try again.")
    raise


## Step 5: Process PDFs
This section extracts text from PDFs, splits it into chunks, and prepares it for embedding.
- **⚠️ PDF Size Warning**: Indexing large PDF files may take a while. For testing purposes, please upload smaller files (preferably under **50 MB**).

PDFs are processed and indexed only when the index is first created; if the index already exists, the cell connects to it without reprocessing. When prompted, enter the directory path containing your PDFs (e.g., **./rag-input**).
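
Two details of the indexing step are worth isolating: chunk identifiers that stay the same between runs, and uploading in batches. The sketch below uses hypothetical helper names. It relies on `hashlib` because Python's built-in `hash()` is salted per process for strings, so its values differ from one session to the next.

```python
import hashlib

def stable_chunk_id(filename, page, chunk_number, text):
    """Build an ID that is identical across runs for identical input."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{filename}_{page}_{chunk_number}_{digest}"

def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

chunk_ids = [stable_chunk_id("example.pdf", 1, n, f"text {n}") for n in range(250)]
print([len(batch) for batch in batched(chunk_ids)])
```

Stable IDs make it possible to recognise a chunk that was already uploaded in an earlier session, which a per-process hash cannot do.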

In [None]:
# Check if the index exists and create if necessary
try:
    existing_indexes = pc.list_indexes().names()
    
    if index_name in existing_indexes:
        print(f"✅ Index '{index_name}' already exists. Connecting to existing index...")
        # Initialize vector store directly with existing index
        vector_store = PineconeVectorStore(
            index_name=index_name,
            embedding=embeddings,
            pinecone_api_key=PINECONE_API_KEY
        )
        # Get index reference for direct querying
        index = pc.Index(index_name)
        print(f"✅ Connected to existing index '{index_name}'")
        
    else:
        print(f"🆕 Index '{index_name}' does not exist. Creating new index...")
        pc.create_index(
            name=index_name,
            dimension=3072,
            metric='cosine',
            spec=ServerlessSpec(
                cloud='aws',
                region=PINECONE_ENV
            )
        )
        
        print(f"✅ Index '{index_name}' created successfully!")
        print("⏳ Waiting for index to be ready...")
        
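        # Poll until Pinecone reports the index as ready instead of sleeping a fixed time
        import time
        while not pc.describe_index(index_name).status['ready']:
            time.sleep(1)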
        
        # Initialize Pinecone vector store with the new approach
        vector_store = PineconeVectorStore(
            index_name=index_name,
            embedding=embeddings,
            pinecone_api_key=PINECONE_API_KEY
        )
        
        # Get index reference for direct querying
        index = pc.Index(index_name)

        # Initialize text splitter
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1500,
            chunk_overlap=500
        )

        # Define the starting pages offsets for specific filenames
        file_start_pages = {
            '1946-jahrbuch-des-lutherbundes.pdf': 5,  # Skip first 5 pages
            '1947-jahrbuch-des-lutherbundes.pdf': 4,
            '1948-jahrbuch-des-lutherbundes.pdf': 5,
        }

        def extract_text_from_pdf(pdf_path):
            """Extract text and basic page numbers from PDF."""
            pdf_document = fitz.open(pdf_path)
            pdf_text = {}
            page_numbers = {}
            
            filename = os.path.basename(pdf_path)
            skip_pages = file_start_pages.get(filename, 0)
            
            # Iterate only over the pages that follow the skipped front matter
            for page_num in range(skip_pages, pdf_document.page_count):
                page = pdf_document.load_page(page_num)
                text = page.get_text().strip()

                if text:
                    # Renumber so the first kept page becomes page 1
                    adjusted_page_num = page_num - skip_pages + 1
                    pdf_text[adjusted_page_num] = text
                    page_numbers[adjusted_page_num] = adjusted_page_num
            
            pdf_document.close()
            return pdf_text, page_numbers

        def process_pdfs_in_directory(directory_path):
            """Process all PDFs in directory with duplicate prevention."""
            documents = []
            processed_chunks = set()
            
            pdf_files = [f for f in os.listdir(directory_path) if f.lower().endswith(".pdf")]
            
            if not pdf_files:
                print(f"⚠️ No PDF files found in {directory_path}")
                return documents
                
            for filename in pdf_files:
                pdf_path = os.path.join(directory_path, filename)
                print(f"📄 Processing {filename}...")
                
                try:
                    pdf_text, page_numbers = extract_text_from_pdf(pdf_path)
                    
                    if not pdf_text:
                        print(f"⚠️ No text extracted from {filename}")
                        continue
                    
                    chunk_count = 0
                    for page_num, text in pdf_text.items():
                        chunks = text_splitter.split_text(text)
                        
                        for i, chunk in enumerate(chunks):
                            chunk_content = chunk.strip()
                            if chunk_content and chunk_content not in processed_chunks:
                                processed_chunks.add(chunk_content)
                                
                                year = filename[:4] if filename[:4].isdigit() else "Unknown"
                                
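                                doc = Document(
                                    page_content=chunk_content,
                                    metadata={
                                        "source": filename,
                                        "page": page_num,
                                        "chunk": i + 1,
                                        "publishYear": year,
                                        # hashlib gives a digest that is stable across runs,
                                        # unlike the built-in hash(), which is salted per process
                                        "chunk_id": f"{filename}_{page_num}_{i}_{hashlib.md5(chunk_content.encode('utf-8')).hexdigest()}"
                                    }
                                )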
                                documents.append(doc)
                                chunk_count += 1
                    
                    print(f"✅ Processed {filename}: {chunk_count} unique chunks created")
                    
                except Exception as e:
                    print(f"❌ Error processing {filename}: {str(e)}")
                    continue
            
            return documents

        def process_and_index_pdfs():
            """Main function to process and index PDFs."""
            pdf_directory = input("Enter the path to your PDF directory (e.g. './rag-input'): ").strip()
            
            if not pdf_directory:
                pdf_directory = "./rag-input"
                print(f"Using default directory: {pdf_directory}")
            
            if not os.path.exists(pdf_directory):
                print(f"❌ Directory '{pdf_directory}' does not exist!")
                return False
            
            try:
                print("🔄 Processing PDFs...")
                documents = process_pdfs_in_directory(pdf_directory)
                
                if documents:
                    print(f"\n📤 Starting to add {len(documents)} documents to Pinecone...")
                    batch_size = 100
                    for i in range(0, len(documents), batch_size):
                        batch = documents[i:i + batch_size]
                        vector_store.add_documents(batch)
                        print(f"📦 Added batch {i//batch_size + 1}/{(len(documents)-1)//batch_size + 1} to Pinecone")
                    
                    print(f"\n✅ Successfully processed and stored {len(documents)} unique chunks in Pinecone.")
                    
                    print("\n📋 Sample of processed documents:")
                    for doc in documents[:2]:
                        print(f"\nMetadata: {doc.metadata}")
                        print(f"Content preview: {doc.page_content[:100]}...")
                    return True
                else:
                    print("⚠️ No documents were processed.")
                    return False
                    
            except Exception as e:
                print(f"❌ An error occurred: {str(e)}")
                return False

        # Run the processing for new index
        process_and_index_pdfs()

except Exception as e:
    print(f"❌ Error with Pinecone operations: {str(e)}")
    raise


## Step 6: Query Pinecone

In [None]:
def query_pinecone(query_text, top_k=10, output_dir="output"):
    """Query Pinecone index for similar results."""
    try:
        print(f"🔍 Searching for: '{query_text}'")
        
        # Use vector store similarity search for better integration
        results = vector_store.similarity_search_with_score(
            query=query_text,
            k=top_k
        )
        
        seen_results = set()
        unique_matches = []
        
        for doc, score in results:
            content = doc.page_content.strip()
            if content and content not in seen_results:
                seen_results.add(content)
                # Convert to match format for compatibility
                match = {
                    'score': score,
                    'metadata': {
                        'text': content,
                        **doc.metadata
                    }
                }
                unique_matches.append(match)
        
        # Ensure output directory exists
        os.makedirs(output_dir, exist_ok=True)
        
        # Save query results
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        query_results_file = os.path.join(output_dir, f"query_results_{timestamp}.txt")
        with open(query_results_file, 'w', encoding='utf-8') as f:
            f.write(f"Query: {query_text}\n")
            f.write(f"Top K: {top_k}\n")
            f.write(f"Results found: {len(unique_matches)}\n\n")
            for i, match in enumerate(unique_matches, 1):
                f.write(f"Result {i}:\n")
                f.write(f"Score: {match['score']}\n")
                f.write(f"Content: {match['metadata'].get('text', '')}\n")
                f.write(f"Metadata: {match['metadata']}\n")
                f.write("-" * 80 + "\n\n")
        
        print(f"💾 Query results saved to {query_results_file}")
        return unique_matches
        
    except Exception as e:
        print(f"❌ Error during query: {str(e)}")
        return []


## Step 7: Display Query Results

In [None]:
# ANSI escape codes for colors
RED = "\033[91m"
RESET = "\033[0m"
BOLD = "\033[1m"

def find_context(full_text, context_size=600):
    """Split a chunk into (before, highlighted middle, after) segments for display."""
    total_length = len(full_text)
    
    if total_length <= context_size * 3:
        part_size = total_length // 3
        return full_text[:part_size], full_text[part_size:2*part_size], full_text[2*part_size:]
    
    middle = total_length // 2
    match_start = max(0, middle - context_size // 2)
    match_end = min(total_length, match_start + context_size)
    matching = full_text[match_start:match_end]
    
    before_start = max(0, match_start - context_size)
    before = full_text[before_start:match_start]
    
    after_end = min(total_length, match_end + context_size)
    after = full_text[match_end:after_end]
    
    return before, matching, after

def format_context(before, matching, after):
    return f"...{RESET}{before}{BOLD}{RED}{matching}{RESET}{after}..."

def format_context_for_file(before, matching, after):
    return f"...{before}<<< {matching} >>>{after}..."

def display_query_results(results):
    """Display query results in a readable format."""
    if not results:
        print("❌ No results to display.")
        return None
        
    console_output = []
    file_output = []
    seen_results = set()
    
    print(f"\n📊 Displaying {len(results)} results:")
    print("=" * 80)
    
    for i, match in enumerate(results, 1):
        result_key = match['metadata'].get('text', '').strip()
        if result_key and result_key not in seen_results:
            seen_results.add(result_key)
            
            source = match['metadata'].get('source', 'Unknown')
            page = match['metadata'].get('page', 'Unknown')
            publish_year = match['metadata'].get('publishYear', 'Unknown')
            chunk = match['metadata'].get('chunk', 'Unknown')
            score = match['score']
            
            full_text = match['metadata'].get('text', '')
            before, matching, after = find_context(full_text)
            formatted_text = format_context(before, matching, after)
            
            console_output.append(f"🔎 Result {i}:")
            console_output.append(f"📈 Score: {score:.4f}")
            console_output.append(f"📄 Source: {source}")
            console_output.append(f"📋 Page: {page}")
            console_output.append(f"📅 Publish Year: {publish_year}")
            console_output.append(f"📦 Chunk: {chunk}")
            console_output.append(formatted_text)
            console_output.append("-" * 50 + "\n")
            
            file_output.append(f"Result {i}:")
            file_output.append(f"Score: {score:.4f}")
            file_output.append(f"Source: {source}")
            file_output.append(f"Page: {page}")
            file_output.append(f"Publish Year: {publish_year}")
            file_output.append(f"Chunk: {chunk}")
            file_text = format_context_for_file(before, matching, after)
            file_output.append(file_text)
            file_output.append("-" * 50 + "\n")
    
    print("\n".join(console_output))
    
    output_dir = "rag-output"
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"formatted_results_{timestamp}.txt"
    filepath = os.path.join(output_dir, filename)
    
    with open(filepath, "w", encoding="utf-8") as f:
        f.write("\n".join(file_output))
    
    print(f"💾 Formatted results saved to {filepath}")
    return filepath


## Step 8: Interactive Query Interface

In [None]:
def interactive_query():
    """Interactive function to query the database."""
    print("\n🎯 Interactive Query Mode")
    print("-" * 40)
    
    while True:
        query_text = input("\nEnter your search query (or 'quit' to exit): ").strip()
        
        if query_text.lower() in ['quit', 'exit', 'q']:
            print("👋 Goodbye!")
            break
            
        if not query_text:
            print("⚠️ Please enter a valid query.")
            continue
            
        try:
            top_k_input = input("Enter number of results (default 10): ").strip()
            top_k = int(top_k_input) if top_k_input else 10
        except ValueError:
            top_k = 10
            print("⚠️ Invalid number, using default (10)")
        
        try:
            print("\n🔄 Searching...")
            results = query_pinecone(query_text, top_k=top_k)
            if results:
                display_query_results(results)
            else:
                print("❌ No results found.")
        except Exception as e:
            print(f"❌ Error during search: {str(e)}")

# Run the interactive query
print("\n🚀 Setup complete! Ready to search your documents.")
interactive_query()
