## PDF Document Loaders
- Load various kinds of documents from the web and from local files.
- Apply an LLM to the documents for summarization and question answering.

### Project 1: Question Answering from PDF Document
- We will load a document from a local file and apply an LLM to answer questions about it.
- We will use research papers on the misuse of health supplements for workouts.

rag-dataset: git@github.com:laxmimerit/rag-dataset.git

```bash
git clone git@github.com:laxmimerit/rag-dataset.git
```

***

In [None]:
# !git clone https://github.com/laxmimerit/rag-dataset.git

Cloning into 'rag-dataset'...


***

In [None]:
# !pip install pymupdf tiktoken 



***

In [4]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [1]:
import os
# Document loaders live in langchain_community in current LangChain releases
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from typing import List

In [2]:
loader = PyMuPDFLoader(file_path="./rag-dataset/health supplements/1. dietary supplements - for whom.pdf")

In [3]:
docs = loader.load()

In [4]:
docs

[Document(metadata={'source': './rag-dataset/health supplements/1. dietary supplements - for whom.pdf', 'file_path': './rag-dataset/health supplements/1. dietary supplements - for whom.pdf', 'page': 0, 'total_pages': 17, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'iLovePDF', 'creationDate': '', 'modDate': 'D:20241021113754Z', 'trapped': ''}, page_content='International  Journal  of\nEnvironmental Research\nand Public Health\nReview\nDietary Supplements—For Whom? The Current State of\nKnowledge about the Health Effects of Selected\nSupplement Use\nRegina Ewa Wierzejska\n\x01\x02\x03\x01\x04\x05\x06\x07\x08\x01\n\x01\x02\x03\x04\x05\x06\x07\nCitation: Wierzejska, R.E. Dietary\nSupplements—For Whom? The\nCurrent State of Knowledge about the\nHealth Effects of Selected Supplement\nUse. Int. J. Environ. Res. Public Health\n2021, 18, 8897. https://doi.org/\n10.3390/ijerph18178897\nAcademic Editor: Paul B. Tchounwou\nReceived: 15 

In [5]:
len(docs)

17

In [6]:
# print(docs[0].page_content)

In [60]:
def load_pdfs_from_directory(directory_path: str, verbose: bool = False) -> List[Document]:
    """
    Loads all PDF files from the specified directory and its subdirectories using LangChain's PyMuPDFLoader.

    Args:
        directory_path (str): Path to the directory containing PDF files or subdirectories.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        List[Document]: A list of Document objects with content and metadata extracted from the PDFs.

    Raises:
        FileNotFoundError: If the directory does not exist.
        ValueError: If the path is not a directory, or if no PDF files are found in it or its subdirectories.
        Exception: If none of the PDF files could be loaded (individual load errors are printed and skipped).
    """
    if not os.path.exists(directory_path):
        raise FileNotFoundError(f"The directory '{directory_path}' does not exist.")

    if not os.path.isdir(directory_path):
        raise ValueError(f"The path '{directory_path}' is not a directory.")

    # Recursively find all PDF files in the directory and its subdirectories
    pdf_files = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, file))

    if not pdf_files:
        raise ValueError(f"No PDF files found in the directory '{directory_path}' and its subdirectories.")

    if verbose:
        print(f"Found {len(pdf_files)} PDF file(s) in directory '{directory_path}' and its subdirectories.\n")

    documents = []
    for pdf_file in pdf_files:
        try:
            if verbose:
                print(f"Loading PDF: {pdf_file}...")
            loader = PyMuPDFLoader(pdf_file)
            documents.extend(loader.load())
            if verbose:
                print(f"Successfully loaded PDF: {pdf_file}\n")
        except Exception as e:
            print(f"Error loading PDF '{pdf_file}': {e}")

    if not documents:
        raise Exception("No documents were successfully loaded from the PDF files.")

    if verbose:
        print(f"\nLoaded {len(documents)} document(s) from all the PDFs")

    return documents


In [61]:
docs = load_pdfs_from_directory(directory_path="rag-dataset/gym supplements",verbose=True)

Found 2 PDF file(s) in directory 'rag-dataset/gym supplements' and its subdirectories.

Loading PDF: rag-dataset/gym supplements\1. Analysis of Actual Fitness Supplement.pdf...
Successfully loaded PDF: rag-dataset/gym supplements\1. Analysis of Actual Fitness Supplement.pdf

Loading PDF: rag-dataset/gym supplements\2. High Prevalence of Supplement Intake.pdf...
Successfully loaded PDF: rag-dataset/gym supplements\2. High Prevalence of Supplement Intake.pdf


Loaded 26 document(s) from all the PDFs
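
The recursive file discovery inside `load_pdfs_from_directory` can also be written with `pathlib`. The following is a minimal sketch and not part of the original notebook: `find_pdf_files` is a hypothetical helper name, and the sorting step is an addition so that files are visited in the same order on every run.

```python
from pathlib import Path
from typing import List


def find_pdf_files(directory_path: str) -> List[Path]:
    """Return every PDF path under directory_path, sorted for a stable order."""
    root = Path(directory_path)
    if not root.is_dir():
        raise FileNotFoundError(f"'{directory_path}' is not an existing directory.")
    # rglob("*") walks the tree recursively; the suffix check is case-insensitive.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to `PyMuPDFLoader(str(path))` in the same loop the function above already uses.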


In [62]:
def format_docs(docs: List[Document], verbose: bool = False) -> str:
    """
    Formats a list of Document objects into a single string with their content joined by double newlines.

    Args:
        docs (List[Document]): List of Document objects to format.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        str: A single string containing the page content of all documents, separated by double newlines.

    Raises:
        ValueError: If the input is not a list or contains non-Document objects.
        Exception: If the list is empty or no valid content is found.
    """
    if not isinstance(docs, list):
        raise ValueError("The input 'docs' must be a list of Document objects.")

    if not all(isinstance(doc, Document) for doc in docs):
        raise ValueError("All elements in 'docs' must be instances of the Document class.")

    if verbose:
        print(f"Received {len(docs)} document(s) to format.")

    content_list = [doc.page_content for doc in docs if doc.page_content]

    if not content_list:
        raise Exception("No valid content found in the provided documents.")

    result = "\n\n".join(content_list)

    return result


In [63]:
context = format_docs(docs=docs,verbose=True)

Received 26 document(s) to format.
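
The prompt used later asks the model to cite context references, but `format_docs` drops the metadata, so the model never sees file names or page numbers. The sketch below shows one possible variant that keeps them. It is an illustration only: `Page` is a stand-in for LangChain's `Document` (same two attributes) so the example runs without LangChain, and `format_docs_with_sources` is a hypothetical name. The `source` and `page` keys are the ones visible in the loader output above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Stand-in for LangChain's Document: page_content plus a metadata dict."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_docs_with_sources(pages: List[Page]) -> str:
    """Join page texts, prefixing each with its source file and page number."""
    blocks = []
    for page in pages:
        if not page.page_content:
            continue  # skip empty pages, as format_docs does
        source = page.metadata.get("source", "unknown")
        number = page.metadata.get("page", "?")
        blocks.append(f"[source: {source} | page: {number}]\n{page.page_content}")
    return "\n\n".join(blocks)


pages = [
    Page("First page text.", {"source": "a.pdf", "page": 0}),
    Page("", {"source": "a.pdf", "page": 1}),
    Page("Third page text.", {"source": "a.pdf", "page": 2}),
]
print(format_docs_with_sources(pages))
```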


In [27]:
import tiktoken

In [28]:
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

In [29]:
encoding

<Encoding 'o200k_base'>

In [30]:
encoding.encode("congratulations")

[542, 111291, 14571]

In [31]:
len(encoding.encode(docs[0].page_content))

969

In [32]:
len(encoding.encode(context))

23039
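
Note that these counts come from the `o200k_base` encoding that tiktoken maps to `gpt-4o-mini`. The model used below, `llama3.2:3b`, has its own tokenizer, so the 23039 figure is best treated as an estimate. A token count like this can be used to keep the context within a budget. The sketch below is a simplified illustration: `take_pages_within_budget` is a hypothetical helper, and the default whitespace counter is a crude placeholder so the example runs without tiktoken.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Crude placeholder tokenizer: counts whitespace-separated words."""
    return len(text.split())


def take_pages_within_budget(
    page_texts: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Keep leading pages until adding the next one would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for text in page_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept


sample_pages = ["one two three", "four five", "six seven eight nine"]
print(take_pages_within_budget(sample_pages, max_tokens=5))
```

To reuse the encoder created above, `count_tokens=lambda t: len(encoding.encode(t))` could be passed instead of the placeholder.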

## Question Answering with LLM

In [33]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser

In [34]:
base_url = "http://localhost:11434"
model = "llama3.2:3b"

llm = ChatOllama(
    base_url=base_url,
    model=model
)

In [72]:
system = SystemMessagePromptTemplate.from_template("""You are a precise and methodical AI assistant with the following key guidelines:

1. Answer ONLY based on the provided context
2. Maintain strict adherence to the specified word limit
3. If no relevant information exists in the context, clearly state "Insufficient context to answer"
4. Prioritize accuracy and relevance over verbosity
5. Structure your answer for maximum clarity:
   - Use direct language
   - Break down complex information
   - Highlight key points
   - Avoid speculation or external knowledge

Word limit: {words} words
Context-driven response is mandatory""")

human = """### Precise Context Analysis Instructions:
- Carefully read and analyze the provided context
- Extract ONLY relevant information
- Construct your answer using ONLY the information within this context
- If context is insufficient, write "Insufficient context to answer"

### Context:
```{context}```

### Specific Question:
```{question}```

### Structured Answer Requirements:
- Be concise
- Match the word limit exactly
- Use clear, direct language
- Cite context references if possible
- Format for readability

### Answer:"""

In [73]:
# system = SystemMessagePromptTemplate.from_template("""You are helpful AI assistant who answer user question based on the provided context. 
#                                                     Do not answer in more than {words} words""")

# prompt = """Answer user question based on the provided context ONLY! If you do not know the answer, just say "I don't know".
#             ### Context:
#             ```{context}```

#             ### Question:
#             ```{question}```

#             ### Answer:"""

human = HumanMessagePromptTemplate.from_template(human)

messages = [system, human]

template = ChatPromptTemplate(messages=messages)

qna_chain = template | llm | StrOutputParser()

In [74]:
qna_chain

ChatPromptTemplate(input_variables=['context', 'question', 'words'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['words'], input_types={}, partial_variables={}, template='You are a precise and methodical AI assistant with the following key guidelines:\n\n1. Answer ONLY based on the provided context\n2. Maintain strict adherence to the specified word limit\n3. If no relevant information exists in the context, clearly state "Insufficient context to answer"\n4. Prioritize accuracy and relevance over verbosity\n5. Structure your answer for maximum clarity:\n   - Use direct language\n   - Break down complex information\n   - Highlight key points\n   - Avoid speculation or external knowledge\n\nWord limit: {words} words\nContext-driven response is mandatory'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='###

In [75]:
response = qna_chain.invoke(
    {
        'context': context,
        'question': "Why Fitness supplements products are designed ?",
        'words': 50
    }
)

print(response)

**Design Purpose of Fitness Supplements**

Fitness supplements are designed to support athletic performance and muscle growth. The primary purpose is to provide additional nutrients that enhance training and recovery. However, the effectiveness of these products varies greatly.

Despite a desire for high-quality evidence-based information, many individuals misuse fitness supplements. This can lead to excessive consumption (e.g., 100 servings per week) and the use of ineffective products (e.g., fat-burning supplements). Conversely, some users rely on scientifically-backed supplements like creatine or beta-alanine only sporadically.

The widespread design of fitness supplements with variable effectiveness raises questions about their intended purpose and marketing. While they aim to support athletic performance, many fail to deliver results due to ineffective ingredients or misuse.
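
The response above is clearly longer than the 50 words requested in the prompt, so the word limit is not being enforced by the prompt alone. A simple check can flag such responses. This is a minimal sketch: `count_words` and `exceeds_word_limit` are hypothetical helpers, and the 10% tolerance is an arbitrary choice.

```python
import re


def count_words(text: str) -> int:
    """Count word-like tokens, ignoring Markdown markers such as ** and *."""
    return len(re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text))


def exceeds_word_limit(text: str, limit: int, tolerance: float = 0.1) -> bool:
    """True when the text is longer than the limit plus a relative tolerance."""
    return count_words(text) > limit * (1 + tolerance)


sample = "**Design Purpose** Fitness supplements support athletic performance."
print(count_words(sample))
print(exceeds_word_limit(sample, limit=5))
```

A check like this could be applied to `response` before printing, for example to retry with a stricter prompt or to truncate the text.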


In [76]:
def chat_with_bot(question,context=context, words=50):
    response =  qna_chain.invoke(
        {
            'context': context,
            'question': question,
            'words': words
        }
    )

    print(response)
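
One detail of the signature above is worth knowing: a default such as `context=context` is evaluated once, when the `def` statement runs, so reassigning the global variable later does not change the default. The sketch below demonstrates this with hypothetical names (`default_context`, `show_bound`, `show_late`) that are not part of the notebook.

```python
default_context = "first context"


def show_bound(context=default_context):
    """The default is bound to the value default_context had when def ran."""
    return context


def show_late(context=None):
    """Look up default_context at call time when no context is passed."""
    return default_context if context is None else context


default_context = "second context"

print(show_bound())  # first context
print(show_late())   # second context
```

Passing `context` explicitly on each call, as the report-generation cells later in this notebook do, avoids the issue.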

In [77]:
chat_with_bot(question="What is the full form of TPB, HBM and TEMPA")

The full forms are:

* **TPB**: Theory of Planned Behaviour
* **HBM**: Health Belief Model
* **TEMPA**: Theory of Efficacy Models in Physical Activity


In [78]:
chat_with_bot(question="What is widely known supplement for athletes and fitness enthusiasts?")

Protein supplements are widely used among athletes and fitness enthusiasts. This is in line with other studies that have reported a prevalence of protein intake among individuals who engage in physical activity. The high use of protein supplements can be attributed to their potential to enhance muscle mass, strength, and recovery after exercise. However, the use of protein supplements without adequate knowledge of their efficacy and safety has been linked to suboptimal results and potential health risks.


In [79]:
chat_with_bot(question="Which supplement is not known to increase fat mass and remains effective even when taken in recommended dose")

Protein supplements are not known to increase fat mass. In fact, adequate protein intake is essential for maintaining and losing body weight, as it helps preserve muscle mass during weight loss (1). Conversely, some supplements like Fat Burners or appetite suppressants may lead to unintended weight gain if taken excessively (2).

References:
(1) West, D. W., et al. "International Society of Sports Nutrition position stand: protein and exercise." Journal of the International Society of Sports Nutrition 11.1 (2014): 25.
(2) Hains, M. S., & Tarnopolsky, M. A. "Effects of a low-calorie diet with supplements on weight loss in obese subjects." Obesity Research 14.12 (2006): 1863-1870.


***

### Project 2: PDF Document Summarization

In [70]:
# system = SystemMessagePromptTemplate.from_template("""You are helpful AI assistant who works as document summarizer. 
#                                                    You must not hallucinate or provide any false information.""")

# prompt = """Summarize the given context in {words}.
#             ### Context:
#             {context}

#             ### Summary:"""

system = SystemMessagePromptTemplate.from_template("""

You are an advanced AI document summarization specialist with the following critical guidelines:

1. Absolute Accuracy Requirements:
   - Zero tolerance for hallucination
   - Use ONLY information present in the source document
   - Maintain factual integrity

2. Summarization Principles:
   - Capture core message and key insights
   - Preserve original meaning and context
   - Eliminate redundant or trivial information
   - Focus on substantive content

3. Structural Summarization Approach:
   - Extract primary arguments/points
   - Identify critical supporting details
   - Maintain logical flow of original text
   - Ensure coherence and readability

Constraints:
- Limit summary to {words} words
- Strictly fact-based
- No external information
- No personal interpretations""")

prompt = """### Comprehensive Summarization Instructions:

1. Analysis Phase:
   - Thoroughly examine entire context
   - Identify main thesis/central argument
   - Note critical supporting evidence
   - Understand overall document structure

2. Summary Construction Guidelines:
   - Distill document to its essence
   - Preserve original tone and intent
   - Use precise, clear language
   - Avoid unnecessary elaboration

3. Specific Requirements:
   - Exactly {words} words
   - Third-person perspective
   - Neutral, objective tone
   - Focus on substantive content

### Source Context:
```{context}```

### (Precise Summary):"""

human = HumanMessagePromptTemplate.from_template(prompt)

messages = [system, human]

template = ChatPromptTemplate(messages)

summary_chain = template | llm | StrOutputParser()

In [71]:
response = summary_chain.invoke({
    'context': context,
    'words': 50
})

print(response)

The study analyzed the supplement usage among Swiss fitness center users, finding:

1. **High prevalence**: 71% of participants used supplements.
2. **Low information quality**: Only 34% of participants reported using high-quality evidence-based supplements.
3. **Discrepancy between desire and use**: Participants expressed a strong desire for high-quality information, but many used ineffective or excessive amounts of supplements.

The study highlights the need for education on safe and effective supplement usage, as well as promoting evidence-based information dissemination among users.


In [80]:
def get_summary(directory_path: str, words=100):
    docs = load_pdfs_from_directory(directory_path=directory_path,verbose=False)
    context = format_docs(docs=docs,verbose=False)

    system = SystemMessagePromptTemplate.from_template("""You are a helpful AI assistant who works as a document summarizer. 
                                                   You must not hallucinate or provide any false information.""")

    prompt = """Summarize the given context in {words} words.
                ### Context:
                {context}

                ### Summary:"""

    human = HumanMessagePromptTemplate.from_template(prompt)

    messages = [system, human]

    template = ChatPromptTemplate(messages)

    summary_chain = template | llm | StrOutputParser()

    print("Generating the summary ...")
    response = summary_chain.invoke({
    'context': context,
    'words': words
    })

    return response

In [48]:
path = "rag-dataset/gym supplements"

summary = get_summary(
    directory_path=path
)

print(summary)

Generating the summary ...
This study investigated the supplement use habits of Swiss fitness center users, finding that 59% of respondents used supplements, with protein and creatine being the most commonly used. The results suggest that athletes often rely on inadequate or biased sources of information for making decisions about their supplement regimens, which may lead to safety risks.

Key findings:

* 59% of respondents use supplements
* Protein and creatine are the most common supplements used
* Athletes often seek information from unreliable sources, such as trainers, coaches, and online websites
* There is a significant disparity between the desire for high-quality evidence-based information and the actual quality of the information available

Limitations:

* The study only screened a limited number of fitness centers, which may not be representative of the broader population.
* The questionnaire was intentionally kept short and simple to focus on return rates rather than compr

In [49]:
path = "rag-dataset/health supplements"

summary = get_summary(
    directory_path=path
)

print(summary)

Generating the summary ...
This article discusses the potential interactions between herbal supplements and medications, specifically focusing on botanicals such as garlic and ginkgo biloba. The authors review various studies that have investigated the effects of these herbs on platelet aggregation, bleeding risk, and metabolism, highlighting potential herb-drug interactions and mechanisms involved. They also note that more research is needed to fully understand the risks and benefits of using herbal supplements in conjunction with medications.

The article highlights several key points:

1. Garlic and ginkgo biloba can increase the risk of bleeding due to their effects on platelet aggregation.
2. Aged garlic extract has been shown to inhibit platelet aggregation, which may contribute to an increased risk of bleeding.
3. Gingko biloba extracts have also been linked to spontaneous bleeding in some individuals.
4. The activation of certain enzymes, such as PXR and AhR, can affect the met

In [82]:
path = "rag-dataset"

summary = get_summary(
    directory_path=path
)

print(f"Here is the summary:\n {summary}")

Generating the summary ...
Here is the summary:
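
The summary for the whole `rag-dataset` directory came back empty. The notebook does not say why. One plausible cause, stated here as an assumption, is that the combined text of all PDFs is too long for the model's context window. A common workaround is to summarize the text in batches and then summarize the partial summaries. The sketch below illustrates the idea only: `summarize` is a caller-supplied function, batches are sized by characters rather than tokens for simplicity, and `fake_summarize` is a stub used purely for demonstration.

```python
from typing import Callable, List


def batch_by_chars(texts: List[str], max_chars: int) -> List[str]:
    """Greedily merge consecutive texts into batches of roughly max_chars.

    A single text longer than max_chars becomes its own oversized batch.
    """
    batches: List[str] = []
    current = ""
    for text in texts:
        candidate = f"{current}\n\n{text}" if current else text
        if current and len(candidate) > max_chars:
            batches.append(current)
            current = text
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def summarize_in_batches(
    texts: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 8000,
) -> str:
    """Summarize each batch, then summarize the joined partial summaries."""
    batches = batch_by_chars(texts, max_chars)
    if not batches:
        return ""
    partials = [summarize(batch) for batch in batches]
    if len(partials) == 1:
        return partials[0]
    return summarize("\n\n".join(partials))


def fake_summarize(text: str) -> str:
    """Stub summarizer for demonstration only: keeps the first 20 characters."""
    return text[:20]


print(summarize_in_batches(["a" * 30, "b" * 30, "c" * 30], fake_summarize, max_chars=40))
```

In the notebook, `texts` could be `[d.page_content for d in docs]` and `summarize` could wrap `summary_chain.invoke`. That wiring is not exercised here.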
 


***

### Project 3: Report Generation from PDF Document

In [64]:
chat_with_bot(context=context,
              question="Provide a detailed report from the provided context. Write answer in Markdown.",
              words=2000)

**Insufficient Safety Knowledge among Fitness Center Users**

A recent study conducted at fitness centers identified a significant demand for reliable information sources among users regarding dietary supplement use. However, the study found that many users were misinformed about risk factors associated with supplement intake.

**Key Findings:**

*   The majority of subjects (59%) reported being informed by non-medical professionals, such as family members, friends, or training peers.
*   These individuals often cited personal experience with a product as evidence-based information, despite lacking formal education or qualifications in nutrition.
*   The study also found that fitness instructors and physiotherapists were frequently cited as sources of information, despite being unqualified to provide substantiated nutrition advice.

**Limitations:**

*   The questionnaire was not designed to evaluate the appropriateness of supplement use, highlighting a need for broad information disse

In [65]:
chat_with_bot(context=context,
              question="How to repair a Laptop ?",
              words=2000)

**Repairing a Laptop**

Repairing a laptop can be challenging, but it's not impossible. Before starting, ensure you have backed up your data and identified the issue.

**Common Causes:**

* Battery damage or swelling
* Water damage
* Screen cracking or shattering
* Keyboard or trackpad issues
* Hard drive failure

**Basic Repairs:**

1. **Battery Replacement**: If the battery is swollen or damaged, replace it with a new one (approx. $50-$100).
2. **Water Damage**: Dry the laptop and inspect for corrosion. Use a desoldering wick to remove corrosion from components (context reference: iFixit's water damage repair guide).
3. **Screen Repair**: Replace the screen or use a screen protector to prevent further damage (approx. $100-$300).

**More Advanced Repairs:**

1. **Hard Drive Replacement**: Replace the hard drive with an SSD (Solid State Drive) for improved performance (approx. $200-$500).
2. **Motherboard Replacement**: Repair or replace the motherboard if damaged (approx. $300-$600). 
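
The laptop question has nothing to do with the supplement papers, yet the model answered from its own knowledge instead of replying "Insufficient context to answer". In this run the prompt instructions alone did not keep the answer grounded. One crude guard is to check, before calling the model, whether the question shares any vocabulary with the context. The sketch below is a lexical heuristic only: the stopword list, the 0.5 threshold, and the function names are illustrative assumptions, and a retrieval step based on embeddings would be a stronger filter.

```python
import re
from typing import Set

# Small illustrative stopword list; a real one would be longer.
STOPWORDS: Set[str] = {"the", "are", "how", "what", "why", "which", "for", "and"}


def keywords(text: str) -> Set[str]:
    """Lowercased words of three or more letters, minus stopwords."""
    return {w for w in re.findall(r"[a-z]{3,}", text.lower()) if w not in STOPWORDS}


def question_matches_context(question: str, context: str, min_overlap: float = 0.5) -> bool:
    """True when at least min_overlap of the question's keywords occur in the context."""
    q_words = keywords(question)
    if not q_words:
        return False
    hits = q_words & keywords(context)
    return len(hits) / len(q_words) >= min_overlap


sample_context = "Protein and creatine are common supplements among fitness center users."
print(question_matches_context("How to repair a Laptop ?", sample_context))
print(question_matches_context("Which supplements are common among fitness users?", sample_context))
```

When the check fails, `chat_with_bot` could return the fixed "Insufficient context to answer" message without invoking the chain.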