# Load PDFs and create a vector database

This notebook processes the research-article PDFs stored in `../data/pdfs`, splits their text into manageable chunks, and embeds the chunks into local FAISS vector indexes, one per paper. These indexes give the Retrieval-Augmented Generation (RAG) Streamlit application (`app.py`) fast access to the passages it needs.

In [17]:
# !pip install langchain langchain-community pypdf faiss-cpu sentence-transformers langchain-huggingface
import warnings
warnings.filterwarnings("ignore")

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


- Load the PDF documents.
- Split the text into 1000-character chunks with a 200-character overlap. I chose these sizes so that equations and long, complex sentences are less likely to be separated from their context when they are passed to the LLM.
- Create a separate FAISS vector index for each PDF in the directory, using embeddings from `LangChain`'s `HuggingFaceEmbeddings`.
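
To illustrate what `chunk_size=1000` and `chunk_overlap=200` mean, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its actual chunks are usually shorter than 1000 characters and their overlaps are not exactly 200. The helper name `chunk_with_overlap` is hypothetical and not part of LangChain.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size sliding-window chunking (simplified: ignores separators)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one,
    # so consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 2500-character text, used only for illustration.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)

print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: the shared 200 characters
```

The overlap means a sentence or equation that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.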

In [18]:
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'  # hide ignorable warning

# Directory paths
pdf_dir = "../data/pdfs"
index_dir = "../data/faiss_index"

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Initialize the HuggingFace Embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith('.pdf')]
print(f"Found {len(pdf_files)} PDFs in {pdf_dir}.")

for i, f_name in enumerate(pdf_files, 1):
    # Create a folder for each paper
    paper_name = os.path.splitext(f_name)[0]
    index_path = os.path.join(index_dir, paper_name)
    
    if os.path.exists(index_path):
        print(f"Skipping file [{i}/{len(pdf_files)}]. Index already exists at {index_path}.")
        continue
        
    print(f"[{i}/{len(pdf_files)}] Processing: {f_name}...")
    
    try:
        pdf_path = os.path.join(pdf_dir, f_name)
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()
        
        try:
            print(f"    ... Paper Title: {documents[0].metadata['title']}")
        except (IndexError, KeyError):
            print(f"    ... Title not found for {f_name}")
        # Split the text
        chunks = text_splitter.split_documents(documents)
        print(f"    ... Created {len(chunks)} chunks.")
        
        if len(chunks) == 0:
            print("    ... Warning: No text extracted. Moving to the next paper.")
            continue
            
        # Create the vector index
        vector_idx = FAISS.from_documents(chunks, embeddings)
        
        # Save vector to its directory
        vector_idx.save_local(index_path)
        print(f"    ... Successfully saved FAISS index to {index_path}.")
        
    except Exception as e:
        print(f"    ... Error processing '{f_name}': {e}")
        print("    ... Moving to the next paper.")

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Found 9 PDFs in ../data/pdfs.
Skipping file [1/9]. Index already exists at ../data/faiss_index\aa52495-24.
Skipping file [2/9]. Index already exists at ../data/faiss_index\aa55599-25.
Skipping file [3/9]. Index already exists at ../data/faiss_index\Adhikari_2024_ApJ_965_124.
Skipping file [4/9]. Index already exists at ../data/faiss_index\PeB1il_2025_ApJ_980_38.
Skipping file [5/9]. Index already exists at ../data/faiss_index\PeB1il_2025_ApJ_985_199.
Skipping file [6/9]. Index already exists at ../data/faiss_index\staf1108.
Skipping file [7/9]. Index already exists at ../data/faiss_index\staf1598.
Skipping file [8/9]. Index already exists at ../data/faiss_index\staf482.
Skipping file [9/9]. Index already exists at ../data/faiss_index\staf783.
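
Once the indexes exist, they can be loaded back for retrieval. The sketch below shows one possible way to do this; how `app.py` actually loads them is not shown in this notebook, so the helper names `list_saved_indexes` and `search_paper` are hypothetical. It assumes the default file names written by `save_local` (`index.faiss` and `index.pkl`) and the same embedding model that was used to build the indexes.

```python
import os


def list_saved_indexes(index_dir: str) -> list[str]:
    """Return the names of sub-folders that contain a saved FAISS index."""
    if not os.path.isdir(index_dir):
        return []
    return sorted(
        name
        for name in os.listdir(index_dir)
        if os.path.isfile(os.path.join(index_dir, name, "index.faiss"))
    )


def search_paper(index_dir: str, paper_name: str, query: str, k: int = 4):
    """Load one paper's index and return its k most similar chunks."""
    # Imported lazily so that listing indexes works without these packages.
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings

    # Must match the model used when the index was built.
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vector_idx = FAISS.load_local(
        os.path.join(index_dir, paper_name),
        embeddings,
        # index.pkl is a pickle file: only enable this for indexes you created yourself.
        allow_dangerous_deserialization=True,
    )
    return vector_idx.similarity_search(query, k=k)


print(list_saved_indexes("../data/faiss_index"))

# Example call (the query string is illustrative):
# for doc in search_paper("../data/faiss_index", "staf783", "What data were used?"):
#     print(doc.metadata.get("page"), doc.page_content[:200])
```

Keeping one index per paper lets a query be restricted to a single article; searching across all papers would require loading each index or merging them.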
