# Querying a PDF file

A small example of querying a PDF with Retrieval Augmented Generation (RAG) using LangChain and the OpenAI API.
The instructions were taken from the standard Langchain documentation at:
- https://python.langchain.com/docs/use_cases/question_answering/quickstart/
- https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf/#using-pypdf

Their example was modified here to query a PDF instead of a webpage.

Notes:
- To end the program, submit an empty string at the user prompt.
- Install the dependencies below before running the rest of the notebook.

In [1]:
%pip install --upgrade --quiet langchain langchain-community langchainhub langchain-openai langchain-chroma bs4 pypdf

Note: you may need to restart the kernel to use updated packages.


In [2]:
import getpass
import os

# Read secrets at runtime instead of hard-coding them in the notebook.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
if not os.environ.get("LANGCHAIN_API_KEY"):
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API key: ")

from langchain import hub
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Wells and Choi - 2019 - Transcriptional Profiling of Stem Cells Moving fr.pdf")
# Load one Document per page; the text is chunked once, in the next cell.
docs = loader.load()

In [4]:
# Split text into smaller fragments
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
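
The effect of `chunk_size` and `chunk_overlap` can be seen on a toy string. The sketch below is a deliberately simplified fixed-width sliding window, not the algorithm `RecursiveCharacterTextSplitter` actually uses (which prefers to break on separators such as paragraphs and sentences). The function name `sliding_window_chunks` and the sample values are assumptions made for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only; the real splitter respects separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.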

In [5]:
# Retrieve and generate using the relevant snippets of the PDF.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
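
Conceptually, the retriever embeds the question and returns the stored chunks whose embeddings are closest to it. The sketch below imitates that idea with a toy word-count "embedding" and cosine similarity. Everything in it (`embed`, `cosine`, `retrieve`, the sample chunks) is a hypothetical stand-in: the notebook itself uses `OpenAIEmbeddings` and Chroma, whose embedding model and distance metric differ from this toy version.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "stem cells can differentiate into many cell types",
    "rna sequencing measures gene expression",
    "the weather was sunny all week",
]
top = retrieve("how is gene expression measured", toy_chunks, k=1)
print(top)
```

Only the chunks returned by this step reach the model, so a question whose answer sits in a chunk that ranks poorly can go unanswered even though the text is in the PDF.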

In [6]:
def format_docs(docs):
    # Join the retrieved chunks into one context string.
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
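
To make the data flow through the chain easier to follow, the sketch below spells out the first two steps as plain Python functions. The dictionary step feeds the same question into both branches: one retrieves and formats context, the other passes the question through unchanged. `Doc`, `fake_retriever`, `fake_prompt`, and `run_chain` are hypothetical stand-ins, and the prompt wording is invented here; it is not the text of `rlm/rag-prompt`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def format_docs(docs):
    # Join the retrieved chunks into one context string.
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question: str) -> list[Doc]:
    """Stand-in retriever: returns fixed chunks regardless of the question."""
    return [Doc("Chunk one."), Doc("Chunk two.")]


def fake_prompt(inputs: dict) -> str:
    """Stand-in prompt template: fills context and question into one string."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def run_chain(question: str) -> str:
    # The dict step: the same question feeds both branches.
    inputs = {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }
    # The real chain would now send this prompt to the LLM and parse the reply.
    return fake_prompt(inputs)


rendered = run_chain("What is molecular profiling?")
print(rendered)
```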

In [7]:
# Prompt the user for input until they enter an empty string.
while True:
    query = input("How can I help you?\n")
    if query == "":
        break

    # Stream the answer as it is generated.
    print("\n")
    for chunk in rag_chain.stream(query):
        print(chunk, end="", flush=True)
    print("\n")

How can I help you?
 How has mixOmics been used?




MixOmics has been used to benchmark stem-like properties of cells in various studies. The tool has been utilized to evaluate the similarities and differences between different cell types in stem cell research. Despite its widespread adoption, there are challenges in reproducibility and data interpretation that may require computational interventions.



How can I help you?
 Who are the authors of this pdf?




The authors of this pdf are C.A.W. and J.C. as they shared discussions, coauthored, and edited the manuscript.



How can I help you?
 What are the full names of the authors of this pdf?




The full names of the authors mentioned in the context are Wells, C.A., Mosbergen, R., Korn, O., Choi, J., Seidenman, N., Matigian, N.A., Vitale, A.M., and Shepherd, J. Unfortunately, the full names of the other authors mentioned in the context are not provided.



How can I help you?
 What email can we contact in this document?




The email contact information is not provided in the given context.



How can I help you?
 What correspondence contact is available from this document?




The document provides the DOI links for further information on the studies mentioned in the references. These links can be used for correspondence contact or to access the full texts of the articles. However, specific contact information for correspondence is not directly available within the document.



How can I help you?
 What are the DOI links?




DOI links are unique digital object identifiers that provide a permanent link to a specific online document. They are commonly used in academic research to ensure reliable access to scholarly articles or publications.



How can I help you?
 Please print the exact DOI links




I don't know.



How can I help you?
 What is molecular profiling?




Molecular profiling involves comparing in-vitro and in-vivo cells through transcriptional profiling to identify key drivers of cell types and predict cell fates. RNA sequencing is commonly used for molecular profiling due to its scalability and reliability in annotating gene expression. This approach provides insights into molecular heterogeneity, though sparse single-cell transcriptome data and data integration challenges need to be considered.



How can I help you?
 
