# Data Extraction from PDF Documents Project

## Setting up Necessary Packages

In [1]:
!pip3 install --upgrade --quiet langchain langchain-community langchain-openai chromadb
!pip3 install --upgrade --quiet pypdf pandas streamlit python-dotenv


What these packages are for:
- **langchain**: A flexible framework for building applications with large language models.
- **langchain-community**: Contains third-party integrations for LangChain.
- **langchain-openai**: Connects LangChain with OpenAI models.
- **chromadb**: An open-source vector database.
- **pypdf**: A package for reading and parsing PDF documents in Python.
- **pandas**: For data wrangling.
- **streamlit**: For building quick web apps in Python.
- **python-dotenv**: For managing environment variables in Python applications.

In [2]:
# import LangChain modules
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field  # langchain_core.pydantic_v1 is deprecated

# other necessary modules and packages
import os
import tempfile
import streamlit as st  
import pandas as pd
from dotenv import load_dotenv




## Setting up the LLM

There are many models available, such as hosted APIs from Anthropic and Mistral AI, as well as local open-source LLMs like Llama 3. Locally, we can often run only the smaller versions of these models, which may compromise output quality.

A future project could compare these models on a complex document to see how accurate their outputs are. We could even go further and test whether some models are better suited to complex materials such as research papers, while others excel at analyzing contracts or literary works.

Now, back to the current project.

In [3]:
load_dotenv() # read all key value pairs from my .env file

True

In [4]:
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

In [5]:
llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAI_API_KEY) # integration with openai models
# llm.invoke("Tell me a joke about dogs") # testing the llm

## Process PDF Document

In [6]:
# loading the pdf document
pdf_path1 = "/Users/valesanchez/Documents/VS_Code/rag_llms/data/Capstone3_Skincare_Hybrid_Recommendation_System.pdf"
loader = PyPDFLoader(pdf_path1)
pages = loader.load()
pages # document objects represent a page in the pdf

[Document(metadata={'source': '/Users/valesanchez/Documents/VS_Code/rag_llms/data/Capstone3_Skincare_Hybrid_Recommendation_System.pdf', 'page': 0}, page_content='Hybrid\nSkincare\nRecommendation\nSystem\nCapstone\n3\nFinal\nReport\nBy:\nValentina\nSanchez\nIntroduction\nDataset\nOverview\nExploratory\nData\nAnalysis\nDistribution\nof\nProducts\nby\nthe\nPrimary\nand\nSecondary\nCategory\nDistribution\nof\nNumber\nof\nReviews\nper\nUnique\nAuthor\n(Threshold\n=\n20)\nPercentage\nof\nReviews\nby\nSkin\nType\nDistribution\nof\nAverage\nRating,\nAuthor\nRating\nand\nNumber\nof\nReviews\nHeatmap\nof\nCorrelations\nData\nPreprocessing\nModel\nDevelopment\n&\nEvaluation\nConclusion\nFurther\nWork'),
 Document(metadata={'source': '/Users/valesanchez/Documents/VS_Code/rag_llms/data/Capstone3_Skincare_Hybrid_Recommendation_System.pdf', 'page': 1}, page_content="Introduction\nUnderstanding\nthat\nskincare\nefficacy\nis\nhighly\nindividualized.\nThe\nimpact\nof\na\nproduct\nvaries\nsignificantly\n

## Creating a Vector Database

### Chunks of the Document

A single page can be very long, which makes it impractical to pass whole pages as context when the LLM answers a question. The model also has a token limit for each API request. For example, the "gpt-4o" model allows up to 128k tokens shared between the prompt and the chat completion. To put this into perspective, that is roughly equivalent to a 200-page book.

We therefore break the document into smaller chunks, because the answer to a prompt is usually found in specific sections of the document. By providing the LLM with only the most relevant chunks as context, we can improve both the efficiency and the accuracy of the process.
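To make the token limit concrete, here is a minimal sketch of a budget check. It assumes the common rule of thumb of about 4 characters per token for English text, which is only an approximation and not the model's real tokenizer; `estimate_tokens`, `fits_in_context`, and `reserved_for_answer` are hypothetical names introduced for illustration.

```python
import math

CONTEXT_WINDOW_TOKENS = 128_000  # the limit quoted in the text above
CHARS_PER_TOKEN = 4              # rough rule of thumb (assumption), not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the number of tokens in a piece of English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(texts, reserved_for_answer=1_000):
    """Check whether the texts plus room for the answer stay within the window."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_for_answer <= CONTEXT_WINDOW_TOKENS


# Four chunks of 1,500 characters each are far below the limit.
small_context = ["a" * 1500] * 4
print(estimate_tokens(small_context[0]), fits_in_context(small_context))
```

For exact counts you would use the tokenizer that matches the model; the estimate is only meant to show why a handful of chunks fits comfortably while a whole long document may not.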

The chunk size can be adjusted, and if we are familiar with the specific format of our PDF documents, we can target each paragraph more effectively. For instance, a research paper may have paragraphs around 500 to 1,000 words, allowing us to break the document into chunks that align with each paragraph. 

However, it's important to find the right balance. Chunks that are too large might include redundant or irrelevant information, while chunks that are too small could overlook crucial details within the document.
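The trade-off between chunk size and overlap can be seen in a simplified sketch. This is a plain fixed-width character window, not the separator-aware logic of `RecursiveCharacterTextSplitter`; `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.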

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, # how many characters we want a chunk to contain
                                            chunk_overlap=200, # overlap between the chunks
                                            length_function=len, # how we want to count the characters
                                            separators=["\n\n", "\n", " "]) # avoiding separation in the middle of a word
chunks = text_splitter.split_documents(pages)
chunks

[Document(metadata={'source': '/Users/valesanchez/Documents/VS_Code/rag_llms/data/Capstone3_Skincare_Hybrid_Recommendation_System.pdf', 'page': 0}, page_content='Hybrid\nSkincare\nRecommendation\nSystem\nCapstone\n3\nFinal\nReport\nBy:\nValentina\nSanchez\nIntroduction\nDataset\nOverview\nExploratory\nData\nAnalysis\nDistribution\nof\nProducts\nby\nthe\nPrimary\nand\nSecondary\nCategory\nDistribution\nof\nNumber\nof\nReviews\nper\nUnique\nAuthor\n(Threshold\n=\n20)\nPercentage\nof\nReviews\nby\nSkin\nType\nDistribution\nof\nAverage\nRating,\nAuthor\nRating\nand\nNumber\nof\nReviews\nHeatmap\nof\nCorrelations\nData\nPreprocessing\nModel\nDevelopment\n&\nEvaluation\nConclusion\nFurther\nWork'),
 Document(metadata={'source': '/Users/valesanchez/Documents/VS_Code/rag_llms/data/Capstone3_Skincare_Hybrid_Recommendation_System.pdf', 'page': 1}, page_content="Introduction\nUnderstanding\nthat\nskincare\nefficacy\nis\nhighly\nindividualized.\nThe\nimpact\nof\na\nproduct\nvaries\nsignificantly\n

### Embedding the Chunk Data

Another step to consider for providing better context to the LLM is to use text embeddings.

Text embeddings represent words or documents as numerical vectors that capture their meanings. By converting text data into this numerical format, computers can process and work with the information more effectively.

Embedding vectors are essentially lists of numbers, or coordinates, in a multi-dimensional space. While the individual values in a vector don’t have specific meanings, the relationships between vectors are crucial. Texts with similar meanings will have vectors that are close together, while texts with different meanings will have vectors that are farther apart.

The distance between these vectors can be measured using methods like cosine similarity or Euclidean distance.
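Both measures are short enough to write out by hand. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings from the model have many more dimensions, and the values here are not actual model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vectors, not real embeddings.
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(dog, puppy), cosine_similarity(dog, invoice))
print(euclidean_distance(dog, puppy), euclidean_distance(dog, invoice))
```

Note the opposite directions of the two scales: a higher cosine similarity means closer texts, while a lower Euclidean distance means closer texts.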

A well-designed embedding model is key to capturing the meaning of text accurately, so having high-quality embeddings is crucial for our retrieval system to perform effectively.

In [8]:
def get_embedding_func():
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002", openai_api_key=OPENAI_API_KEY
    )
    return embeddings
embedding_func = get_embedding_func()
testing_vec = embedding_func.embed_query("dog")
testing_vec

[-0.0034820924047380686,
 -0.01784995011985302,
 -0.01628131791949272,
 -0.017484836280345917,
 -0.01810687966644764,
 0.021933801472187042,
 -0.012501725926995277,
 -0.0227857306599617,
 -0.02147402986884117,
 -0.017958130687475204,
 0.012210988439619541,
 0.038891252130270004,
 0.0016049741534516215,
 -0.006984469015151262,
 -0.013766097836196423,
 0.02425970323383808,
 0.039810795336961746,
 0.0012745134299620986,
 0.00951321143656969,
 -0.01196081843227148,
 -0.020027102902531624,
 0.00603449996560812,
 0.011203547939658165,
 -0.025584926828742027,
 -0.007599750999361277,
 0.010284004732966423,
 0.009837755933403969,
 -0.008370544761419296,
 -0.005679529160261154,
 -0.009594348259270191,
 0.007518615107983351,
 -0.009175144135951996,
 -0.025260383263230324,
 -0.021244144067168236,
 -0.0058654663152992725,
 -0.019094036892056465,
 -0.007484808564186096,
 -0.016146089881658554,
 -0.011656558141112328,
 -0.0210142582654953,
 0.004476009868085384,
 0.01097366213798523,
 0.0117985457181

In [9]:
# calculate the distance between two pieces of text to test the embedding model
from langchain.evaluation import load_evaluator

# named `evaluator` rather than `eval` to avoid shadowing the Python builtin
evaluator = load_evaluator(evaluator="embedding_distance",
                           embeddings=embedding_func)
# evaluator.evaluate_strings(prediction="coffee", reference="fish")


In [10]:
# evaluator.evaluate_strings(prediction="coffee", reference="plant")

Splitting a document into chunks produces many embeddings, so these vectors need to be stored and searched efficiently. Vector databases such as Chroma store and index the embeddings for fast similarity search.

When you submit a query such as "What is the key information in this paper?", the question is embedded with the same model and compared against the stored vectors using a defined distance metric. The most relevant chunks are then combined and sent to the LLM as context for generating a response.
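The query flow described above can be sketched with a tiny in-memory store. This is an illustration of the idea only, not how Chroma is implemented (a real vector database uses an approximate nearest-neighbour index); the texts, vectors, and the `top_k` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical (text, embedding) pairs standing in for stored chunks.
store = [
    ("Title page of the report", [0.9, 0.1, 0.0]),
    ("Model evaluation results", [0.1, 0.9, 0.1]),
    ("Reference list", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.3, 0.0]  # pretend embedding of the user's question

best = top_k(query_vec, store, k=2)
context = "\n\n---\n\n".join(best)  # combined text that would be sent to the LLM
print(context)
```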

In [11]:
import uuid

def create_vector_library(chunks, embedding_function, vector_library_path):

    # create a deterministic id for each chunk based on its content
    ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in chunks]

    # keep only the first chunk seen for each id (avoiding duplicates);
    # ids are collected in a list so they stay aligned with their chunks
    seen_ids = set()
    unique_ids = []
    unique_chunks = []
    for chunk, chunk_id in zip(chunks, ids):
        if chunk_id not in seen_ids:
            seen_ids.add(chunk_id)
            unique_ids.append(chunk_id)
            unique_chunks.append(chunk)

    # create a new Chroma database from the documents
    vector_library = Chroma.from_documents(documents=unique_chunks,
                                        ids=unique_ids,
                                        embedding=embedding_function,
                                        persist_directory=vector_library_path)

    # on Chroma 0.4+ data is persisted automatically; this call is kept for older versions
    vector_library.persist()

    return vector_library
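The deduplication step relies on `uuid.uuid5` being deterministic: the same text always produces the same id. A minimal sketch with plain strings in place of `Document` objects shows the idea, and why ids should be collected in order rather than read back from a set; `dedupe_chunks` is a hypothetical helper for illustration.

```python
import uuid


def dedupe_chunks(texts):
    """Return (ids, unique_texts) with duplicates removed and order preserved."""
    seen = set()
    ids, unique_texts = [], []
    for text in texts:
        chunk_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, text))  # same text, same id
        if chunk_id not in seen:
            seen.add(chunk_id)
            ids.append(chunk_id)        # position i in ids ...
            unique_texts.append(text)   # ... matches position i in unique_texts
    return ids, unique_texts


ids, unique_texts = dedupe_chunks(["alpha", "beta", "alpha"])
print(unique_texts)  # ['alpha', 'beta']
```

Because a Python set has no guaranteed order, converting it to a list would not reliably pair each id with its own chunk; keeping a parallel list avoids that mismatch.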

In [12]:
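# create vector library (if "vectordb_chroma" already exists from an earlier run,
# the chunks stored there are kept alongside the new ones)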
vector_library = create_vector_library(chunks=chunks,
                                       embedding_function=embedding_func,
                                       vector_library_path="vectordb_chroma"
                                       )


## Querying data

In [13]:
# loading the database
vector_library = Chroma(persist_directory="vectordb_chroma", embedding_function=embedding_func)


In [14]:
# creating a retriever to retrieve the relevant chunks
# (nearest-neighbour search with the collection's distance metric; Chroma defaults to L2 unless configured for cosine)
retriever = vector_library.as_retriever(search_type="similarity")
relevant_chunks = retriever.invoke("What is the title of the paper?")
relevant_chunks # retrieves the most relevant chunks to the query

[Document(metadata={'page': 6, 'source': '/Users/valesanchez/Documents/VS_Code/rag_llms/data/Anomaly_detection_in_crowded_scenes.pdf'}, page_content='[19] S. Kullback. Information Theory and Statistics . Dover Pub-\nlications, New York, 1968.[20] V . Mahadevan and N. Vasconcelos. Background subtracti on\nin highly dynamic scenes. CVPR , 1, 2008.\n[21] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd be-\nhavior detection using social force model. In CVPR , pages\n935–942, 2009.\n[22] N. Siebel and S. Maybank. Fusion of multiple tracking al -\ngorithms for robust people tracking. In ECCV , page IV: 373\nff., 2002.\n[23] C. Stauffer and W. Grimson. Adaptive background mixtur e\nmodels for real-time tracking. In CVPR , volume 2, pages\n2246–2252, 1999.\n[24] T. Zhang, H. Lu, and S. Li. Learning semantic scene model s\nby object classiﬁcation and trajectory clustering. In CVPR ,\npages 1940–1947, 2009.\n1981\nAuthorized licensed use limited to: Jacobs University Bremen. Downloaded on May 1

In [15]:
# prompt template that tells the model to answer only from the retrieved context, to reduce hallucinations
PROMPT_TEMPLATE = """
You are an assistant for a question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you do not know the answer, say that you do not know, DO NOT MAKE UP ANYTHING.

{context}

---
Answer the question based on the above context: {question}
"""

In [16]:
# concatenate context text
context_text = "\n\n---\n\n".join([doc.page_content for doc in relevant_chunks])

# create prompt
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, 
                                question="What is the title of the paper?")
print(prompt)

Human: 
You are an assistant for a question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you do not know the answer, say that you do not know, DO NOT MAKE UP ANYTHING.

[19] S. Kullback. Information Theory and Statistics . Dover Pub-
lications, New York, 1968.[20] V . Mahadevan and N. Vasconcelos. Background subtracti on
in highly dynamic scenes. CVPR , 1, 2008.
[21] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd be-
havior detection using social force model. In CVPR , pages
935–942, 2009.
[22] N. Siebel and S. Maybank. Fusion of multiple tracking al -
gorithms for robust people tracking. In ECCV , page IV: 373
ff., 2002.
[23] C. Stauffer and W. Grimson. Adaptive background mixtur e
models for real-time tracking. In CVPR , volume 2, pages
2246–2252, 1999.
[24] T. Zhang, H. Lu, and S. Li. Learning semantic scene model s
by object classiﬁcation and trajectory clustering. In CVPR ,
pages 1940–1947, 2009.
1981
Authorized licensed use limited t

## Answering the Query

In [17]:
llm.invoke(prompt)

AIMessage(content='The title of the paper is "Hybrid Skincare Recommendation System."', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 1473, 'total_tokens': 1486, 'completion_tokens_details': {'audio_tokens': None, 'reasoning_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_f85bea6784', 'finish_reason': 'stop', 'logprobs': None}, id='run-e795e5b0-d2de-4a55-8a60-f6e8646a88eb-0', usage_metadata={'input_tokens': 1473, 'output_tokens': 13, 'total_tokens': 1486})

In [18]:
# langchain can help us chain everything together
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt_template
            | llm
        )
rag_chain.invoke("What's the title of this paper?")

AIMessage(content='The title of the paper is "Hybrid Skincare Recommendation System Capstone 3 Final Report."', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 1467, 'total_tokens': 1486, 'completion_tokens_details': {'audio_tokens': None, 'reasoning_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_f85bea6784', 'finish_reason': 'stop', 'logprobs': None}, id='run-382b0cf9-da70-42db-bb14-1fa990361031-0', usage_metadata={'input_tokens': 1467, 'output_tokens': 19, 'total_tokens': 1486})
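As a rough mental model, the chain behaves like a sequence of ordinary function calls: retrieve, format, fill the prompt, call the model. The sketch below is a plain-Python analogue with stand-in functions, not how LangChain implements the `|` operator; `run_chain` and the `fake_*` helpers are hypothetical.

```python
def format_docs(docs):
    return "\n\n".join(docs)


def run_chain(question, retrieve, build_prompt, call_llm):
    """Mimic the chain: the question feeds the retriever and is also passed through."""
    context = format_docs(retrieve(question))
    prompt = build_prompt(context=context, question=question)
    return call_llm(prompt)


# Stand-ins for the retriever, prompt template, and LLM (no API calls are made).
def fake_retrieve(question):
    return ["First chunk.", "Second chunk."]


def fake_build_prompt(context, question):
    return f"{context}\n\n---\nAnswer the question based on the above context: {question}"


def fake_llm(prompt):
    return "ECHO: " + prompt


reply = run_chain("What's the title of this paper?", fake_retrieve, fake_build_prompt, fake_llm)
print(reply)
```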

In [19]:
# Pydantic models describe the structure we want the LLM's answer to follow
class AnswerWithSources(BaseModel):
    """an answer to the question(specifying the data type), with sources and reasoning."""
    answer: str = Field(description="Answer to question") 
    sources: str = Field(description="Full direct text chunk from the context used to answer the question")
    reasoning: str = Field(description="Explain the reasoning of the answer based on the sources")
    
class ExtractedInfo(BaseModel):
    """extracted information about the research article"""
    paper_title: AnswerWithSources
    paper_summary: AnswerWithSources
    publication_year: AnswerWithSources
    paper_authors: AnswerWithSources
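Conceptually, a structured-output schema means the model's reply is parsed and checked against the expected fields before it reaches your code. The sketch below imitates that check with the standard library only; it is an illustration of the idea, not Pydantic's or LangChain's actual mechanism, and `AnswerWithSourcesSketch` and `parse_answer` are hypothetical names.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class AnswerWithSourcesSketch:
    answer: str
    sources: str
    reasoning: str


def parse_answer(raw_json: str) -> AnswerWithSourcesSketch:
    """Parse a JSON reply and fail loudly if a required field is missing."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(AnswerWithSourcesSketch)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return AnswerWithSourcesSketch(**{name: str(data[name]) for name in expected})


# A made-up reply used only to exercise the parser.
parsed = parse_answer('{"answer": "A", "sources": "S", "reasoning": "R"}')
print(parsed)
```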

In [20]:
rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt_template
            | llm.with_structured_output(ExtractedInfo, strict=True)
        )

rag_chain.invoke("Give me the title, summary, publication date, authors of the research paper.")

ExtractedInfo(paper_title=AnswerWithSources(answer='Hybrid Skincare Recommendation System', sources='Hybrid Skincare Recommendation System Capstone 3 Final Report By: Valentina Sanchez', reasoning='The title is explicitly mentioned at the beginning of the provided context.'), paper_summary=AnswerWithSources(answer='The report discusses the development of a skincare recommendation system that incorporates various dataset attributes and models to enhance personalization and user engagement.', sources='Further Work: For future enhancements of the Hybrid Skincare Recommendation System, several strategic improvements and explorations could be made...', reasoning='The summary is derived from the context discussing the recommendations for future improvements to the system.'), publication_year=AnswerWithSources(answer='2022', sources='Authorized licensed use limited to: Jacobs University Bremen. Downloaded on May 13,2022 at 15:04:38 UTC from IEEE Xplore.', reasoning='The publication year can b

## Query Answers as Table Format

In [21]:
structured_response = rag_chain.invoke("Give me the title, summary, publication date, authors of the research paper.")
df = pd.DataFrame([structured_response.dict()])

# transforming into a table with three rows: 'answer', 'source' and 'reasoning'
answer_row = []
source_row = []
reasoning_row = []

for col in df.columns:
    answer_row.append(df[col][0]['answer'])
    source_row.append(df[col][0]['sources'])
    reasoning_row.append(df[col][0]['reasoning'])

# create new dataframe with one row each for 'answer', 'source' and 'reasoning'
structured_response_df = pd.DataFrame([answer_row, source_row, reasoning_row], columns=df.columns, index=['answer', 'source', 'reasoning'])
structured_response_df

| | paper_title | paper_summary | publication_year | paper_authors |
|---|---|---|---|---|
| answer | Hybrid Skincare Recommendation System | The report discusses the development of a Hybr... | 2022 | Valentina Sanchez |
| source | Hybrid Skincare Recommendation System Capstone... | Further Work: For future enhancements of the H... | Authorized licensed use limited to: Jacobs Uni... | Hybrid Skincare Recommendation System Capstone... |
| reasoning | The title is explicitly mentioned at the begin... | The summary is derived from the mention of the... | The context indicates a download date in 2022,... | The author is explicitly mentioned in the cont... |
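The reshaping amounts to pivoting a nested dictionary: fields become columns and the keys `answer`, `sources`, `reasoning` become rows. A minimal plain-Python sketch of that pivot follows; `pivot_answers` is a hypothetical helper, and the example values are placeholders rather than real model output.

```python
def pivot_answers(nested, row_names=("answer", "sources", "reasoning")):
    """Turn {field: {row: value}} into {row: {field: value}}."""
    return {
        row: {field: parts[row] for field, parts in nested.items()}
        for row in row_names
    }


# Placeholder data shaped like the dictionary form of the structured response.
example = {
    "paper_title": {
        "answer": "Example Title",
        "sources": "title page text",
        "reasoning": "stated on page 1",
    },
    "paper_authors": {
        "answer": "A. Author",
        "sources": "byline text",
        "reasoning": "named in the byline",
    },
}

table = pivot_answers(example)
for row, cells in table.items():
    print(row, cells)
```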
