## Step 1. Install and Import the Necessary Libraries
Install the required libraries first (remove the leading `#` to run the command in a notebook cell):

In [6]:
#!pip install langchain_community langchain_huggingface langchain_text_splitters faiss-cpu pypdf
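
Before moving on, it can help to confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name `find_missing_modules` and the module list are assumptions made for illustration (the list uses import names, and `pypdf` is included because `PyPDFLoader` relies on it).

```python
import importlib.util

# Import names (not pip names) used later in this walkthrough.
# This list is an assumption for illustration; adjust it to your environment.
REQUIRED_MODULES = [
    "langchain_community",
    "langchain_huggingface",
    "langchain_text_splitters",
    "faiss",
    "pypdf",
]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```

Any name reported as missing can be installed with `pip` before running the later cells.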


## Step 2. Extract Text from the PDF Document
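LangChain's `PyPDFLoader` (backed by the `pypdf` package) extracts the text of a PDF and, by default, returns one `Document` per page. The pages are then split into smaller, overlapping chunks for later embedding.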

In [7]:

from langchain_community.document_loaders import PyPDFLoader

def extract_text_from_pdf(pdf_path: str):
    """
    Extracts text from a PDF document and loads it into LangChain's document structure.
    :param pdf_path: Path to the PDF file
    :return: List of LangChain Document objects
    """
    loader = PyPDFLoader(file_path=pdf_path)
    docs = loader.load()
    return docs


# docs = extract_text_from_pdf("../data/Responsible_data_sharing.pdf")
# print(len(docs))
# print(f"{docs[0].page_content[:200]}\n")
# print(docs[0].metadata)

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from typing import List


def split_texts_to_chunks(
    docs: List[Document], chunk_size: int = 1000, chunk_overlap: int = 200
):
    """
    Splits the text from documents into smaller chunks for processing.
    :param docs: List of LangChain Document objects
    :param chunk_size: Size of each chunk
    :param chunk_overlap: Overlap between consecutive chunks
    :return: List of split documents
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, add_start_index=True
    )
    all_splits = text_splitter.split_documents(docs)
    return all_splits


# all_splits = split_texts_to_chunks(docs)
# len(all_splits)
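
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at separators such as paragraphs, lines, and spaces, so its chunks vary in length). The function `sliding_window_chunks` is a hypothetical helper that cuts fixed-width windows and records each chunk's start offset, similar in spirit to what `add_start_index=True` stores in the metadata.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int):
    """Return (start_index, chunk) pairs from a fixed-width sliding window.

    Simplified illustration only: consecutive chunks share `chunk_overlap`
    characters, so the window advances by chunk_size - chunk_overlap.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for start, chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(start, chunk)
```

With the defaults used above (`chunk_size=1000`, `chunk_overlap=200`), a window of this kind would advance 800 characters at a time, so text near a chunk boundary appears in two neighbouring chunks.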

## Step 3. Embed the Text with Hugging Face Transformers and Store It in a FAISS Vector Store
We use a Hugging Face sentence-transformers model (e.g., sentence-transformers/all-MiniLM-L6-v2) to convert each chunk into an embedding vector. FAISS then stores these vectors and supports efficient similarity search in high-dimensional spaces.

In [9]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
import faiss

def create_vector_store(all_splits: List[Document], model_name: str):
    """
    Creates a vector store from document chunks.
    :param all_splits: List of split LangChain Document objects
    :param model_name: Name of the embedding model
    :return: Initialized FAISS vector store
    """
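    if not all_splits:
        raise ValueError("all_splits is empty; no text was extracted to embed")
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    # Embed one chunk to discover the vector dimension the index must use.
    sample_vector = embeddings.embed_query(all_splits[0].page_content)
    # IndexFlatL2 performs exact (brute-force) search by L2 distance.
    index = faiss.IndexFlatL2(len(sample_vector))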
    vector_store = FAISS(
        embedding_function=embeddings,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    vector_store.add_documents(documents=all_splits)
    return vector_store
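
Conceptually, a flat L2 index compares the query vector against every stored vector and keeps the closest ones. The sketch below illustrates that idea in plain Python; it is not FAISS's implementation, and `squared_l2` and `flat_l2_search` are hypothetical names introduced here for illustration.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Exhaustively score every stored vector and return the k nearest.

    Returns (index, squared_distance) pairs, smallest distance first.
    """
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


if __name__ == "__main__":
    stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    print(flat_l2_search(stored, query=[1.0, 1.0], k=2))
```

Because every vector is scored, the result is exact, at the cost of search time that grows linearly with the number of chunks. A lower distance means a closer match.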

## Step 5. Create a RAG-based Retrieval System with LangChain
LangChain can combine document retrieval and text generation into a full RAG system. This step builds the retrieval half: a retriever over the vector store that returns the `k` most similar chunks for each query.

In [10]:

def retrieve_documents(vector_store, queries: List[str], k: int = 1):
    """
    Retrieves the most similar documents for each query.
    :param vector_store: FAISS vector store
    :param queries: List of queries
    :param k: Number of results to retrieve
    :return: One list of retrieved documents per query, in query order
    """
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k},
    )
    results = retriever.batch(queries)
    return results
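
When tuning `k` or comparing embedding models, it is often useful to see how close each hit is, not only which chunk was returned. The following sketch assumes the vector store exposes LangChain's `similarity_search_with_score(query, k=...)` method, which the FAISS vector store provides; the wrapper name `retrieve_with_scores` is hypothetical. With an L2 index such as the one built above, a lower score means a closer match.

```python
def retrieve_with_scores(vector_store, queries, k: int = 1):
    """Return, for each query, a list of (document, score) pairs.

    Assumes `vector_store` has a `similarity_search_with_score` method,
    as LangChain's FAISS vector store does.
    """
    return [
        vector_store.similarity_search_with_score(query, k=k)
        for query in queries
    ]


# Example usage (requires a vector store built as in Step 3):
# for query, hits in zip(queries, retrieve_with_scores(vector_store, queries, k=3)):
#     for doc, score in hits:
#         print(f"{score:.3f}  {doc.page_content[:80]!r}")
```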

## Step 6. Putting It All Together
Now, you can combine everything into a complete pipeline.

In [11]:
def main(
    pdf_path: str,
    queries: List[str],
    model_name: str = "sentence-transformers/all-mpnet-base-v2",
):
    """
    Main function to extract text from a PDF, split into chunks, create a vector store, and retrieve documents.
    :param pdf_path: Path to the PDF file
    :param queries: List of queries
    :param model_name: Embedding model name
    """
    # Step 1: Extract text from PDF
    docs = extract_text_from_pdf(pdf_path)

    # Step 2: Split documents into smaller chunks
    all_splits = split_texts_to_chunks(docs)

    # Step 3: Create a vector store using embeddings
    vector_store = create_vector_store(all_splits, model_name)

    # Step 4: Retrieve documents based on queries
    results = retrieve_documents(vector_store, queries)

    return results

In [12]:
# Example usage
if __name__ == "__main__":
    pdf_path = "../data/Responsible_data_sharing.pdf"
    queries = [
        "give the summary of this document",
        "what is the meaning of responsible data sharing?",
    ]
    retrieved_docs = main(pdf_path, queries)

    for query, docs in zip(queries, retrieved_docs):
        print(f"Query: {query}")
        for doc in docs:
            print(f"Document: {doc.page_content}\n")

Query: give the summary of this document
Document: request the information required to meet the specified purpose for which it is being requested and 
should indicate a timeline for destruction of the data. Humanitarian organisations should document 
all requests for data and ensure consistency in responding to these requests over time. 
• Investing in data management capacities of staff and organizations
Donors and humanitarian organizations should identify opportunities to invest in building data 
management expertise especially for non-technical staff. The donor community is uniquely positioned 
to encourage data responsibility by  providing additional  resources for training and capacity building. 
• Adopting common principles for donor data management
The sector already has a range of principles and commitments to inform different aspects of 
humanitarian donorship.15 However, these do not sufficiently address concerns related to data 
responsibility. Donors and partners should en
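
The output above shows only the chunk text. Each retrieved `Document` also carries metadata: `PyPDFLoader` records the `source` path and a zero-based `page` index, and the splitter adds `start_index` because `add_start_index=True` was set. The sketch below formats a hit with that context; `describe_hit` is a hypothetical helper, and the demo uses a `SimpleNamespace` stand-in rather than a real `Document`.

```python
from types import SimpleNamespace


def describe_hit(doc, preview_chars: int = 200) -> str:
    """Format a retrieved chunk with its source, page index, and start offset."""
    meta = doc.metadata
    source = meta.get("source", "unknown source")
    page = meta.get("page", "?")
    start = meta.get("start_index", "?")
    # Collapse line breaks and repeated spaces so the preview fits on one line.
    preview = " ".join(doc.page_content.split())[:preview_chars]
    return f"[{source}, page index {page}, start {start}] {preview}"


if __name__ == "__main__":
    # Stand-in object with the same attributes as a LangChain Document.
    fake_doc = SimpleNamespace(
        page_content="Hello\n  world",
        metadata={"source": "a.pdf", "page": 2, "start_index": 10},
    )
    print(describe_hit(fake_doc))
```

In the example loop, `print(describe_hit(doc))` could replace the `print(f"Document: ...")` line to show where each chunk came from.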