# Efficient Information Extraction: Q&A and Summarization over PDF Documents using LLM

# PDF Loading

In [2]:
# loading PDF file
from langchain_community.document_loaders import PyPDFLoader

In [3]:
pdf_load = PyPDFLoader('Jurnal_Machine learning.pdf', extract_images=True)
pdf_file = pdf_load.load()


In [4]:
# Check data type
type(pdf_file)

list

In [5]:
# Check the number of elements (one per PDF page) in the list
len(pdf_file)

21

In [6]:
# Check content in data list
print(pdf_file[2].page_content)

M. Bansal, A. Goyal and A. Choudhary Decision Analytics Journal 3 (2022) 100071
Fig. 1. Classification of Machine Learning Algorithms [2].
exact outputs with some random data. The theory of supervised type
of learning is centered on the word ‘supervision’, where it aims at
mapping the data associated with the input to that associated with
the output. This method undoubtedly needs a substantial amount of
human application to construct the model, but eventually leads to faster
performance of an otherwise tedious task. Supervised machine learning
is a widely adopted category of Machine learning. This is further
classified into Regression algorithms and Classification algorithms [1].
1.1.2. Unsupervised learning
Unsupervised learning enables the machine to learn without any
supervision. In unsupervised learning, an unsegregated and unlabeled
data set is provided to the machine, and the algorithm is supposed
to perform on the data without any supervision. This theory aims
at regrouping the 

# Splitting

Splitting breaks the previously extracted content into smaller chunks. Smaller chunks are easier to store, to retrieve, and for the LLM to process.

We use RecursiveCharacterTextSplitter() here, which is the recommended starting point for splitting large bodies of generic text. It tries the separators in the order given until the chunks are small enough.

Next, we pass the pdf_file list to the .split_documents() method, which returns the smaller chunks.
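
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (the real splitter also tries to break on separators); the function name `sliding_chunks` is made up for this illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=5)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last five characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is short enough.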

In [7]:
# To split the content of a PDF file into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # maximum number of characters per chunk
    chunk_overlap=250,  # maximum number of characters shared by consecutive chunks
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # zero-width space
        "\uff0c",  # full-width comma
        "\u3001",  # ideographic comma
        "\uff0e",  # full-width full stop
        "\u3002",  # ideographic full stop
        "",
    ],
)

splits = text_splitter.split_documents(pdf_file)

In [9]:
# Check the number of chunks after splitting
len(splits)

414

Splitting the 21 elements of pdf_file produced 414 chunks.

# Embedding and Storing

Embedding converts text into numerical vectors so that a computer can compare texts by meaning. LangChain provides several embedding options; we use GoogleGenerativeAIEmbeddings(), which calls an embedding model developed by Google.

The embedded content of the PDF file is then stored in a vector database. We use Chroma, a vector database that can store embeddings of unstructured data such as the content of PDF files.
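
As a rough illustration of what "numerical representation" means, the sketch below turns short texts into word-count vectors and keeps them in a dictionary. This is a toy bag-of-words scheme, not what the Google embedding model does (real embeddings are dense vectors learned by a neural network), and the names `build_vocabulary`, `toy_embed`, and `toy_store` are invented for this example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect every distinct lowercase word, sorted, to serve as vector dimensions."""
    return sorted({word for text in texts for word in text.lower().split()})


def toy_embed(text, vocabulary):
    """Map text to a list of word counts, one position per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


chunks = ["decision tree for classification", "genetic algorithm for optimization"]
vocabulary = build_vocabulary(chunks)

# A dictionary standing in for the vector database: chunk text -> vector
toy_store = {chunk: toy_embed(chunk, vocabulary) for chunk in chunks}
for chunk, vector in toy_store.items():
    print(vector, chunk)
```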

In [10]:
# embedding
from langchain_google_genai import GoogleGenerativeAIEmbeddings # embedding Google Generative AI

# save embedding to database
from langchain_community.vectorstores import Chroma



In [11]:
# Run once to build and persist the vector store, then keep it commented out.
# vectorstore_gemini = Chroma.from_documents(
#     documents=splits,
#     embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
#     persist_directory='data_input/chroma_gemini',
# )

In [12]:
# load the persisted vector store from its directory
vec_gemini = Chroma(persist_directory= 'data_input/chroma_gemini',
                    embedding_function= GoogleGenerativeAIEmbeddings(model="models/embedding-001"))

# Retrieving

Next we create a chain that lets us query the LLM, which answers based on the information stored in the vector database.

When we present a question (query), it is first converted into a numerical representation (vector). This query vector is then compared with the vectors stored in the vector database by calculating vector similarity. The chunks with the highest similarity are retrieved and passed to the LLM as context, and the LLM uses them to write the answer.
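
The similarity search described above can be sketched in plain Python with cosine similarity. The three-dimensional vectors and chunk labels below are invented for illustration, and `top_k` is a hypothetical helper; an actual vector database uses much longer vectors and may use a different distance measure or an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for embedded chunks
stored_vectors = {
    "chunk about SVM": [0.9, 0.1, 0.0],
    "chunk about LSTM": [0.1, 0.9, 0.2],
    "chunk about genetic algorithms": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # made-up vector for a question about SVM

results = top_k(query_vector, stored_vectors, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```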

In [13]:
# LLM
from langchain_google_genai import ChatGoogleGenerativeAI

# prompt template
from langchain_core.prompts import PromptTemplate
# pass the question through unchanged
from langchain_core.runnables import RunnablePassthrough
# convert the LLM output to a plain string
from langchain_core.output_parsers import StrOutputParser

import textwrap

In [14]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [15]:
qa_template = """
    You are the great assistant in understanding additional context

    Use the following pieces of context to answer the question at the end.
    Use the minimum of three sentences to answer the question. 
    Try your best to answer as complete as possible with easy style of English.
    Always say "thanks for asking!" at the end of the answer.

    {context}

    Question: {question}

    Helpful Answer:"""

custom_rag_prompt = PromptTemplate.from_template(qa_template)

In [16]:
def create_qa_chain(retriever, llm):

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | custom_rag_prompt  # fill the prompt template with the context and question
        | llm                # send the filled prompt to the LLM
        | StrOutputParser()  # extract the LLM response as a plain string
    )

    return rag_chain
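
The `|` operator in the chain feeds the output of one step into the next. The sketch below imitates that idea with a small hypothetical `Step` class; it is not LangChain's implementation, and `fake_retriever`, `fake_prompt`, and `fake_llm` are stand-ins that only return placeholder text.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # self | other: run self first, then pass its result to other
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for the retriever, the prompt, and the LLM (all hypothetical)
fake_retriever = Step(lambda question: {"context": "SVM is a classifier.", "question": question})
fake_prompt = Step(lambda d: f"{d['context']}\n\nQuestion: {d['question']}\n\nHelpful Answer:")
fake_llm = Step(lambda prompt: f"(answer based on a {len(prompt)}-character prompt)")

toy_chain = fake_retriever | fake_prompt | fake_llm
toy_answer = toy_chain.invoke("What is SVM?")
print(toy_answer)
```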

In [17]:
gemini_chain = create_qa_chain(retriever= vec_gemini.as_retriever(),
                               llm= ChatGoogleGenerativeAI(model= "gemini-pro"))

In [18]:
print(
    textwrap.fill(
        gemini_chain.invoke('What machine learning algorithms are discussed in this paper?'),
        width=90
    )
)

The paper discusses various machine learning algorithms including Decision Tree (DT),
Support Vector Machine (SVM), and many others. The DT algorithm is used for classification
and regression tasks, while the SVM algorithm is used for classification tasks. The paper
also highlights the future scope of ML algorithms and artificial intelligence in the
coming times and their roles in automation and holistic development. Thanks for asking!


In [19]:
print(
    textwrap.fill(
        gemini_chain.invoke('What is Genetic algorithm?'),
        width=90
    )
)

Genetic algorithm is mainly a probability-based optimization algorithm. Similar to
genetics from biology, here, the multiple solutions form a population. Each solution in
the population has a set of properties which can be mutated and altered. Genetic
algorithms have both advantages and drawbacks.  Thanks for asking!


# Make Chatbot-Like Interaction with `while` Loop

In [31]:
print("\033[1mAsk The PDF!\033[0m")
print('')
while True:

    print('\033[1mQuestion:\033[0m')

    query = input('')
    print(query)

    # To exit, type 'exit', 'quit', or 'q'.
    if query.lower() in ["exit", "quit", "q"]:
        print('Exiting')
        break  # leave the loop without raising SystemExit in the notebook

    print('\033[1mResponse:\033[0m')
    response = textwrap.fill(gemini_chain.invoke(query), width=90)
    print(response)
    print('')

Ask The PDF!

Question:
what is support vector machines?
Response:
Support vector machines (SVMs) are a powerful machine learning algorithm used for
classification and regression tasks. They construct a hyperplane or set of hyperplanes in
a high-dimensional space to separate different classes of data points. The goal is to find
the hyperplane that best separates the data while maximizing the margin, which is the
distance between the hyperplane and the closest data points of different classes. SVMs are
known for their ability to handle complex and non-linear data, making them a popular
choice for a wide range of applications, including image classification, text
classification, and bioinformatics. Thanks for asking!

Question:
exit
Exiting


# Summarization

Besides powering Q&A systems, LLMs are also frequently used for summarization. We will define a chain with LCEL (LangChain Expression Language) to summarize the file "Jurnal_Machine learning.pdf".
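
A summarization prompt receives the document as a single block of text. The sketch below shows one possible way to turn loaded pages into such a passage. The `Page` class is a hypothetical stand-in that only mimics the `page_content` attribute of the loader's documents, and `max_chars` is an assumed character budget whose right value depends on the model's context window.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is used here."""
    page_content: str


def pages_to_passage(pages, max_chars=None):
    """Join the text of all pages with blank lines, optionally cutting at max_chars."""
    passage = "\n\n".join(page.page_content.strip() for page in pages)
    return passage if max_chars is None else passage[:max_chars]


demo_pages = [Page("First page text. "), Page(" Second page text.")]
demo_passage = pages_to_passage(demo_pages)
print(demo_passage)
```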

In [37]:
# load file
pdf_load= PyPDFLoader('Jurnal_Machine learning.pdf')
Jurnal_content = pdf_load.load()

In [38]:
# define the prompt for summarization
summary_template = """
    You are the great assistant in summarizing the following passage.

    Provide a summary of the following passage. 
    The summary should be general and no longer than five sentences.

    Passage:
    ```{text}```
    
    Summary:
"""

custom_summary_prompt = PromptTemplate.from_template(summary_template)

In [39]:
# define a function that returns the summarization chain
def create_summary_chain(llm):
    summary_chain = {'text':RunnablePassthrough()} | custom_summary_prompt | llm | StrOutputParser()
    return summary_chain

In [40]:
# create chain
gemini_summarizer = create_summary_chain(llm= ChatGoogleGenerativeAI(model='gemini-pro'))

In [41]:
# running chain
print(
    textwrap.fill(
        gemini_summarizer.invoke(Jurnal_content)
    )
)

1. Machine learning algorithms are a new-age thriving technology,
which facilitates computers to read and interpret data automatically.
2. Five machine learning algorithms namely- K-Nearest Neighbor (K-NN),
Genetic Algorithm (GA), Support Vector Machine (SVM), Decision Tree
(DT), and Long Short-Term Memory (LSTM) algorithms in machine learning
are discussed in detail.  3. K-NN algorithm is an easy-to-use
algorithm that is tolerant and resistant to noise prevailing in the
data set used for training.  4. GA is a subset of a relatively much
larger domain of computation known as Evolutionary Computation.  5.
SVM algorithm is intended for regression and classification problems.
6. DT algorithm is mostly preferred for solving classification
problems but either way, it may be used in classifying as well as in
regressing cases.  7. LSTM algorithm is a special case recurrent
neural network (RNN) that is well equipped to handle long-term
dependencies by default.  8. LSTM network and the SVM algo

# Return Source of QnA

We will create a chain that, besides providing an answer, also returns the source documents used to derive it. This lets us verify the accuracy of the LLM's response.

In [20]:
from langchain_core.runnables import RunnableParallel

In [22]:
def create_qa_chain_with_source(retriever, llm):

    rag_chain_from_docs = (
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | custom_rag_prompt
        | llm
        | StrOutputParser()
    )

    rag_chain_with_source = RunnableParallel(
        {"context": retriever, "question": RunnablePassthrough()}
    ).assign(answer=rag_chain_from_docs)

    return rag_chain_with_source

In [23]:
qa_gemini_source = create_qa_chain_with_source(retriever = vec_gemini.as_retriever(),
                                               llm = ChatGoogleGenerativeAI(model = 'gemini-pro'))

In [24]:
qa_gemini_source.invoke('What is  Long Short-Term Memory (LSTM) algorithm?')

{'context': [Document(page_content='6. Long Short-Term Memory (LSTM) algorithm ........................................................................................................................................................ 14\n6.1. Structure of LSTM .................................................................................................................................................................................. 14', metadata={'page': 1, 'source': 'Jurnal_Machine learning.pdf'}),
  Document(page_content='6.4. Advantages of LSTM algorithm................................................................................................................................................................ 16\n6.5. Drawbacks of LSTM algorithm................................................................................................................................................................. 16', metadata={'page': 1, 'source': 'Jurnal_Machine learning.pdf'}),
  Docum
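
The chain returns a dictionary with the keys 'context', 'question', and 'answer', so the raw output is long. A minimal sketch of condensing it into a list of sources follows. The `Doc` class is a hypothetical stand-in mirroring the `page_content` and `metadata` fields shown above, `list_sources` is an invented helper, and the sample values are made up rather than taken from the real run.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two fields as the documents shown above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def list_sources(result):
    """Return the sorted, de-duplicated (source, page) pairs of the retrieved documents."""
    pairs = {(doc.metadata["source"], doc.metadata["page"]) for doc in result["context"]}
    return sorted(pairs)


# Made-up sample shaped like the chain output
demo_result = {
    "context": [
        Doc("first chunk", {"page": 1, "source": "example.pdf"}),
        Doc("second chunk", {"page": 1, "source": "example.pdf"}),
        Doc("third chunk", {"page": 14, "source": "example.pdf"}),
    ],
    "question": "sample question",
    "answer": "sample answer",
}

print(demo_result["answer"])
for source, page in list_sources(demo_result):
    print(f"source: {source}, page: {page}")
```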

# Evaluating Summarization using ROUGE

One metric we can use to evaluate LLM output is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures how much two texts overlap: here, the summary generated by Gemini and a reference text that serves as the standard for assessing the quality of the LLM output.
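
To show what the overlap measures, here is a simplified ROUGE-1 (single-word overlap) written in plain Python. It splits on whitespace and counts each shared word at most as often as it appears in both texts. A ROUGE library may tokenize and count differently, so its scores need not match this sketch; `rouge_1` is a name chosen for this example.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Recall, precision, and F-score of the word overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared words, clipped to the smaller count
    recall = overlap / sum(ref.values()) if ref else 0.0       # share of reference words found
    precision = overlap / sum(cand.values()) if cand else 0.0  # share of candidate words that match
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f_score}


demo_scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(demo_scores)
```

In this example five of the six words match in both directions, so recall, precision, and F-score are all 5/6.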


In [25]:
# install package ROUGE
!pip install rouge



In [26]:
from rouge import Rouge

In [27]:
abstract_text = '''
Transformers have achieved superior performances
in many tasks in natural language processing and
computer vision, which also triggered great interest in the time series community. 
Among multiple advantages of Transformers, the ability to capture
long-range dependencies and interactions is especially attractive for time series modeling, leading
to exciting progress in various time series applications. 
In this paper, we systematically review Transformer schemes for time series modeling by
highlighting their strengths as well as limitations.
In particular, we examine the development of time
series Transformers in two perspectives. From the
perspective of network structure, we summarize the
adaptations and modifcations that have been made
to Transformers in order to accommodate the challenges in time series analysis. From the perspective
of applications, we categorize time series Transformers based on common tasks including forecasting, anomaly detection, and classifcation. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to
study how Transformers perform in time series. Finally, we discuss and suggest future directions to
provide useful research guidance.
'''

In [28]:
llm_generated_summary = '''
Transformers have shown significant advancements in various domains,
including natural language processing and computer vision. They are
now being applied to time series modeling due to their ability to
capture long-range dependencies. Various adaptations and modifications
have been made to Transformers to address challenges in time series
analysis. These modifications include positional encodings, attention
modules, and architecture-level innovations. Transformers have been
successfully applied to tasks such as forecasting, anomaly detection,
and classification in time series data. Future research opportunities
include exploring inductive biases, combining Transformers with graph
neural networks, developing pre-trained models for time series,
designing architecture-level variants, and utilizing neural
architecture search for optimal Transformer design.Machine learning algorithms are a new-age thriving technology,
which facilitates computers to read and interpret data automatically.
2. Five machine learning algorithms namely- K-Nearest Neighbor (K-NN),
Genetic Algorithm (GA), Support Vector Machine (SVM), Decision Tree
(DT), and Long Short-Term Memory (LSTM) algorithms in machine learning
are discussed in detail.  3. K-NN algorithm is an easy-to-use
algorithm that is tolerant and resistant to noise prevailing in the
data set used for training.  4. GA is a subset of a relatively much
larger domain of computation known as Evolutionary Computation.  5.
SVM algorithm is intended for regression and classification problems.
6. DT algorithm is mostly preferred for solving classification
problems but either way, it may be used in classifying as well as in
regressing cases.  7. LSTM algorithm is a special case recurrent
neural network (RNN) that is well equipped to handle long-term
dependencies by default.  8. LSTM network and the SVM algorithm have
rendered one of the best results when it comes to predictive analytics
in real-time applications related to multidisciplinary spheres like
medicine, bank frauds, face detection, student performance prediction,
electricity usage prediction, etc. 9. The future scope highlights, the
expected demand and popularity of machine learning and artificial
intelligence in the future, which is anticipated to either support
humans in multiple fields or completely replace them and bring in
automation at a large scale and pace with the help of more advanced
and rigorous research.
'''

In [29]:
# Evaluation output with ROUGE
evaluator = Rouge()
evaluator.get_scores(llm_generated_summary, abstract_text)

[{'rouge-1': {'r': 0.4036697247706422,
   'p': 0.19213973799126638,
   'f': 0.2603550252160289},
  'rouge-2': {'r': 0.17647058823529413,
   'p': 0.08307692307692308,
   'f': 0.11297070694446892},
  'rouge-l': {'r': 0.3761467889908257,
   'p': 0.17903930131004367,
   'f': 0.24260354592608807}}]
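
As a quick consistency check on the scores above, each 'f' value is the harmonic mean of 'r' (recall) and 'p' (precision). The sketch below recomputes it from the reported numbers; `f_score` is a helper written for this check.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (0.0 when both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (r, p, f) copied from the output above
reported = {
    "rouge-1": (0.4036697247706422, 0.19213973799126638, 0.2603550252160289),
    "rouge-2": (0.17647058823529413, 0.08307692307692308, 0.11297070694446892),
    "rouge-l": (0.3761467889908257, 0.17903930131004367, 0.24260354592608807),
}

for name, (r, p, f) in reported.items():
    print(f"{name}: recomputed f = {f_score(r, p):.6f}, reported f = {f:.6f}")
```

For every metric, recall is roughly twice the precision. Because both share the same overlap count, this indicates that the generated text is about twice as long as the reference in the units being counted.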