# Querying a PDF file

A small example of querying a PDF with Retrieval-Augmented Generation (RAG) using LangChain and the OpenAI API.
The instructions were taken from the standard LangChain documentation at:
- https://python.langchain.com/docs/use_cases/question_answering/quickstart/
- https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf/#using-pypdf

The quickstart example is modified here to query a PDF instead of a web page.

Notes:
- To end the program, enter an empty string at the user prompt.
- The dependencies below must be installed before the notebook will run.

In [1]:
%pip install --upgrade --quiet langchain langchain-community langchainhub langchain-openai langchain-chroma bs4 pypdf

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os

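# Replace the placeholders with your own keys before running.
os.environ["OPENAI_API_KEY"] = "<INSERT OPENAI API KEY HERE>"
# These two settings enable LangSmith tracing of the chain runs.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<INSERT LANGCHAIN API KEY HERE>"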

from langchain import hub
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chat model used to generate the answers
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

In [4]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF as one document per page; the splitting happens in the next cell
loader = PyPDFLoader("Wells and Choi - 2019 - Transcriptional Profiling of Stem Cells Moving fr.pdf")
docs = loader.load()

In [5]:
# Split the text into smaller, overlapping fragments
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# Embed each fragment and index it in a Chroma vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
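
To make the `chunk_size` and `chunk_overlap` settings above concrete, here is a minimal sketch of fixed-width chunking with overlap in plain Python. It is an illustration only: `chunk_text` is a hypothetical helper, not part of LangChain, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which this sketch ignores.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Hypothetical helper: slice text into character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Small values so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each fragment repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one fragment.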

In [6]:
# Retrieve and generate using the relevant snippets of the PDF.
retriever = vectorstore.as_retriever()
# Pull a ready-made RAG prompt template from the LangChain hub
prompt = hub.pull("rlm/rag-prompt")
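
The retriever above returns the stored fragments whose embeddings are closest to the embedding of the question. The following toy sketch shows that idea with hand-written three-dimensional vectors. Everything in it is an assumption for illustration: the vectors, the `cosine_similarity` and `top_k` helpers, and the choice of cosine similarity itself (the metric actually used depends on how the vector store is configured).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Hypothetical helper: return the k texts most similar to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, embedding) pairs standing in for the indexed PDF fragments
toy_index = [
    ("RNA sequencing methods", [0.9, 0.1, 0.0]),
    ("Lineage tracing", [0.1, 0.9, 0.1]),
    ("Acknowledgements", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # made-up embedding of the user's question
results = top_k(query_vec, toy_index, k=2)
print(results)
```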

In [7]:
def format_docs(docs):
    # Join the retrieved fragments into one context string
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
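
The first step of the chain above builds a dictionary with two entries: the retrieved fragments joined into one string, and the question passed through unchanged. The sketch below mimics that step with plain functions so it can run without an API key. `FakeDoc`, `fake_retriever`, and `build_prompt_inputs` are hypothetical stand-ins, not LangChain objects.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str


def format_docs(docs):
    # Same helper as in the notebook: join fragments with blank lines
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Hypothetical retriever that ignores the question and returns fixed fragments
    return [FakeDoc("Chunk one."), FakeDoc("Chunk two.")]


def build_prompt_inputs(question):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }


inputs = build_prompt_inputs("What is transcriptional profiling?")
print(inputs["context"])
```

In the real chain, this dictionary fills the `context` and `question` slots of the prompt template before it is sent to the model.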

In [None]:
# Prompt the user for questions until an empty string is entered
while True:
    query = input("How can I help you?\n")
    if query == '':
        break

    print("\nQuery: " + query)
    # Stream the answer as it is generated (one model call per question)
    for chunk in rag_chain.stream(query):
        print(chunk, end="", flush=True)
    print("\n")

How can I help you?
 What is transcriptional profiling?



Query: What is transcriptional profiling?
Transcriptional profiling is a tool used to compare and analyze gene expression in stem cells and their progeny. It can be used to identify key transcriptional drivers of cell types and make predictions about cell fates. RNA sequencing is commonly used for transcriptional profiling due to its scalability and reliability.



How can I help you?
 What are the future opportunities and challenges?



Query: What are the future opportunities and challenges?
Future opportunities in the stem cell field include improved data analysis methods, enhanced lineage tracing techniques, and the potential for new insights into cell differentiation and reprogramming. Challenges include technical issues related to single-cell sequencing, genetic confounding factors, and the need for improved data integration and analysis methodologies. Overall, advancements in technology and data integration will likely lead to a better understanding of cell identity, activation, and transitions in the future.



How can I help you?
 How will data integration methods change in the future?



Query: How will data integration methods change in the future?
Data integration methods in the future are expected to rely more on exemplar atlases, improved data integration and comparison techniques, and move towards generalizable and reproducible observations. New methods for lineage tracing, integrated perturbation, chromatin profiling, and single-cell RNA sequencing will address molecular program questions in differentiation and reprogramming. Improvements in long-read sequencing will lead to better molecular resolution in stem cell pathways and data integration will increasingly use variable selection methodologies.

