# Week 3 - Loader & Splitter Test

References:


*   [Langchain PDF loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)
*   [Langchain PDF splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
*   [Langchain text splitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html#langchain.text_splitter.RecursiveCharacterTextSplitter)
*   [Camelot: PDF Table Extraction for Humans](https://camelot-py.readthedocs.io/en/master/)
*   [competition](https://tianchi.aliyun.com/competition/entrance/532126/information)
*   [competition sample answer](https://github.com/RonaldJEN/FinanceChatGLM/tree/main)



## 0. Installation and Setup

In [1]:
%%capture output
# hide output (the %%capture cell magic must be the first line of the cell)

! pip install pdfplumber
! pip install sentence-transformers
! pip install langchain
! pip install faiss-gpu
! pip install pypdf
! pip install layoutparser
! pip install pdfminer.six
! pip install unstructured
! pip install transformers
! pip install rapidocr-onnxruntime
! pip install pymupdf

In [2]:
import os
from google.colab import drive
# Access drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Capstone/'


# companies
companies = os.listdir(os.path.join(path, 'Company Reports'))
for i, comp in enumerate(companies):
    print(i, ": ", comp)


# get reports
def get_reports(comp, year:int, rep_type:int = 1):
    """
    comp:       company name (str) or index into `companies` (int)
    year:       a specific year, or N in 1..10 for the N most recent years; 0 for all years
    rep_type:   report type: 1 for annual report, 2 for sustainability report, 0 for both
    ret:        a list of report paths, or a single path if exactly one report matches
    """
    if type(comp) == str:
        if comp not in companies:
            print("Error: ", comp, " does not exist")
            return
    elif type(comp) == int:
        if comp not in range(len(companies)):
            print("Error: invalid index")
            return
        comp = companies[comp]
    else:
        print("Error: invalid company")
        return

    file_path = os.path.join(path, 'Company Reports', comp)
    files = os.listdir(file_path)
    files.sort(reverse=True)

    years = range(2013,2023)
    if year in range(11):
        if year:
            years = years[-year:]
    else:
        years = [year]

    if rep_type == 0:
        reps = ["", "_sus"]
    elif rep_type == 1:
        reps = [""]
    elif rep_type == 2:
        reps = ["_sus"]
    else:
        print("Error: invalid report type")
        return

    ret = []
    for year in years:
        for rep in reps:
            file = comp + '_' + str(year) + rep + '.pdf'
            if file in files:
                ret.append(file)

    ret_p = [os.path.join(file_path, file) for file in ret]
    if len(ret_p) == 1:
        return ret_p[0]
    else:
        return ret_p

Mounted at /content/drive
0 :  ExxonMobil
1 :  Shell plc
2 :  BP PLC
3 :  Saudi Aramco
4 :  Chevron
5 :  TotalEnergies
6 :  Valero Energy
7 :  Marathon Petroleum Corporation
8 :  Sinopec
9 :  PetroChina


In [3]:
file = get_reports(0, 2022)
file

'/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2022.pdf'
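The `year` argument of `get_reports` has three meanings, which is easy to misread. The sketch below isolates that logic and the file naming convention so they can be checked without Google Drive. The helper names `select_years` and `report_filename` are hypothetical and are not part of the notebook.

```python
def select_years(year, first=2013, last=2022):
    """Mirror the year handling in get_reports.

    0      -> every year from first to last
    1..10  -> the most recent `year` years
    other  -> treated as one specific calendar year
    """
    years = list(range(first, last + 1))
    if year in range(11):
        return years[-year:] if year else years
    return [year]


def report_filename(comp, year, sustainability=False):
    """Build the file name get_reports looks for, e.g. ExxonMobil_2022.pdf."""
    suffix = "_sus" if sustainability else ""
    return f"{comp}_{year}{suffix}.pdf"


print(select_years(3))
print(report_filename("ExxonMobil", 2022))
print(report_filename("ExxonMobil", 2022, sustainability=True))
```

With these rules, `get_reports(0, 2022)` asks for one specific year, while `get_reports(0, 3)` would ask for the three most recent years.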

## 1. Load Data
In LangChain, data is loaded with document loaders. Import the loader that matches the data source from `langchain.document_loaders`:
1. Folder: DirectoryLoader
2. Azure Blob Storage: AzureBlobStorageContainerLoader
3. CSV file: CSVLoader
4. Google Drive: GoogleDriveLoader
5. HTML: UnstructuredHTMLLoader
6. PDF: PyPDFLoader
7. YouTube: YoutubeLoader

For more data loaders, see:
https://python.langchain.com/docs/modules/data_connection/document_loaders.html

In [None]:
loaders = {}

### 1.1 Pypdf

In [None]:
from langchain.document_loaders import PyPDFLoader

loader_pypdf = PyPDFLoader(file)
loaders['PyPDFLoader'] = loader_pypdf

### 1.2 Unstructured File Loader

In [None]:
from langchain.document_loaders import UnstructuredFileLoader

loader_unf = UnstructuredFileLoader(file)
loaders['Unstructured_file'] = loader_unf

### 1.3 Unstructured PDF Loader

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

loader_unp = UnstructuredPDFLoader(file)
loaders['Unstructured_pdf'] = loader_unp

### 1.4 Unstructured PDF Loader, version 2 (`mode="elements"`)

In [None]:
from langchain.document_loaders import UnstructuredFileLoader

loader_unp2 = UnstructuredFileLoader(file, mode="elements")
loaders['Unstructured_pdf2'] = loader_unp2

### 1.5 PyPDFium2

In [None]:
from langchain.document_loaders import PyPDFium2Loader

loader_ium = PyPDFium2Loader(file)
loaders['PyPDFium2'] = loader_ium

### 1.6 PDFMiner

In [None]:
from langchain.document_loaders import PDFMinerLoader

loader_min = PDFMinerLoader(file)
loaders['PDFMiner'] = loader_min

In [None]:
# from pdfminer.high_level import extract_text
# from langchain.document_loaders import TextLoader

# text = extract_text(file)

# # write to a txt file
# file_txt = file[:-4] + '.txt'
# # with open(file_txt, 'w') as f:
# #     f.write(text)


# loader_txt = TextLoader(file_txt)
# loader_txt.load()

### 1.7 PyMuPDF

In [None]:
from langchain.document_loaders import PyMuPDFLoader

loader_mu = PyMuPDFLoader(file)
loaders['PyMuPDF'] = loader_mu

### 1.8 MathPix



*   Failed: an API key is required



In [None]:
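from langchain.document_loaders import MathpixPDFLoader

# Left commented out because no Mathpix API key is available.
# loader_mathpix = MathpixPDFLoader(file)
# loaders['Mathpix'] = loader_mathpix  # loaders is a dict, so assign by key (append would fail)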

## 2. Split the data
Once the documents are loaded, they need to be transformed to suit the application. The simplest case is splitting a long document into smaller chunks that fit into the model's context window. The most common splitters in LangChain are:

1. RecursiveCharacterTextSplitter()
2. CharacterTextSplitter()

The parameters of these splitters:
 - length_function: how the length of a chunk is calculated. Defaults to counting characters, but it is common to pass a token counter here.
 - chunk_size: the maximum size of a chunk (as measured by the length function).
 - chunk_overlap: the maximum overlap between consecutive chunks. Some overlap helps maintain continuity between chunks (e.g. a sliding window).
 - add_start_index: whether to include the starting position of each chunk within the original document in the metadata.
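To make `chunk_size`, `chunk_overlap`, and the start index concrete, here is a minimal sliding-window chunker in plain Python. It is only a sketch: it cuts at fixed character offsets, whereas the LangChain splitters cut at separators, so it does not reproduce their output. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Return (start_index, chunk) pairs of fixed-size, overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each window repeats the last `chunk_overlap` characters of the previous one, which is the continuity the overlap parameter is meant to provide.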

In [None]:
text_spliters = {}

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# baseline splitter: default separators
text_splitter0 = RecursiveCharacterTextSplitter(
    #separators = ["\n\n", "\n", " ", ""],
    chunk_size = 500,
    chunk_overlap = 100
)
text_spliters['rec_500_100_def'] = text_splitter0


# variant 1: default separators plus "."
# note: "." is listed after "", which already matches any text, so "." is likely never reached
text_splitter1 = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " ", "", "."],
    chunk_size = 500,
    chunk_overlap = 100
)
text_spliters['rec_500_100_sep1'] = text_splitter1


# variant 2: like variant 1, but without the single-newline separator
text_splitter2 = RecursiveCharacterTextSplitter(
    separators = ["\n\n", " ", "", "."],
    chunk_size = 500,
    chunk_overlap = 100
)
text_spliters['rec_500_100_sep2'] = text_splitter2

## 3. Vectorstore

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings()

data = {}
vss = {}

In [None]:
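import time

def transform(comprehensive = True):
    """Load, split, and embed the report for every loader/splitter pair.

    Results are stored in the global dicts `data` and `vss`, keyed by
    '<loader> + <splitter>'. With comprehensive=False only the first loader is run.
    """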
    print('-'*80)
    print('Performing data loading, text splitting, and vectorstore transforming')
    print(len(loaders), 'loaders,', len(text_spliters), 'splitters')
    print('-'*80)
    for name_l, loader in loaders.items():
        for name_s, ts in text_spliters.items():
            name = name_l + ' + ' + name_s
            print('|', name, '|')

            s1 = time.time()
            if name_l == 'PyPDFLoader':
                data_tmp = loader.load_and_split(ts)
            else:
                data_tmp = loader.load()
                data_tmp = ts.split_documents(data_tmp)

            t1 = round(time.time() - s1, 2)

            s2 = time.time()
            vs_faiss_tmp = FAISS.from_documents(data_tmp, embeddings)
            t2 = round(time.time() - s2, 2)

            print("         loading & splitting time: ", t1, 's')
            print("         transformation time:      ", t2, 's')
            data[name] = data_tmp
            vss[name] = vs_faiss_tmp
            print('-'*80)
        # quick mode: stop after the first loader (used for the splitter test)
        if not comprehensive:
            break
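The two timing blocks in `transform` follow the same start/stop pattern. One way to factor it out is a small context manager, sketched below with only the standard library. The name `timed` and the stand-in workloads are assumptions for illustration; in the notebook the bodies would be the load/split step and the `FAISS.from_documents` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = round(time.perf_counter() - start, 2)


timings = {}
with timed("loading & splitting", timings):
    sum(range(100_000))  # stand-in for loader.load() and ts.split_documents()
with timed("embedding", timings):
    sorted(range(1_000), reverse=True)  # stand-in for building the vectorstore

for label, seconds in timings.items():
    print(f"{label}: {seconds} s")
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic, which makes it the safer choice for measuring intervals.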

## 4. Testing

In [None]:
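def print_doc(idx, q, a=None):
    """Print the similarity-search hits for query q from the idx-th vectorstore.

    If an expected answer string a is given, each hit is flagged with whether it contains a.
    """
    s, vs = list(vss.items())[idx]
    print('-'*100)
    print('|', s, '|')
    print('-'*32)
    print(q)
    for i, d in enumerate(vs.similarity_search(q)):
        print('-'*100)
        cells = []
        if 'page' in d.metadata:
            # page metadata is 0-based here, so print a 1-based page number
            cells.append(str(i+1) + '. Page ' + str(d.metadata['page']+1))
        else:
            cells.append(str(i+1))
        if a is not None:
            # highlight a match in red
            hit = '\x1b[31mTrue\x1b[0m' if a in d.page_content else 'False'
            cells.append('Found: ' + hit)
        print('| ' + ' | '.join(cells) + ' |')
        print('-'*32)
        print(d.page_content)
    print('-'*100)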
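Reading the printed hits one by one gets tedious when many loader/splitter pairs are compared. The sketch below condenses the same substring check into one number per configuration: the rank of the first hit that contains the expected answer. The input is a toy stand-in for `[d.page_content for d in vs.similarity_search(q)]`, and the function names and the "loader A"/"loader B" labels are made up for illustration.

```python
def first_hit_rank(chunks, answer):
    """Return the 1-based rank of the first chunk containing answer, or None."""
    for rank, text in enumerate(chunks, start=1):
        if answer in text:
            return rank
    return None


def summarize(results, answer):
    """Map each configuration name to the rank of its first matching chunk."""
    return {name: first_hit_rank(chunks, answer) for name, chunks in results.items()}


# Toy data, not real notebook output
toy_results = {
    "loader A": ["... expenditures, were $5.7 billion, of which ...", "unrelated text"],
    "loader B": ["footnote text", "more footnote text"],
}
print(summarize(toy_results, "5.7"))
```

A plain substring test is crude: an answer such as `5.7` would also match `15.7`, so short numeric answers may need a stricter pattern.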

### 4.1 Splitter test

In [None]:
transform(False)

--------------------------------------------------------------------------------
Performing data loading, text splitting, and vectorstore transforming
8 loaders, 3 splitters
--------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100_def |
         loading & splitting time:  287.09 s
         transformation time:       10.45 s
--------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100_sep1 |
         loading & splitting time:  284.64 s
         transformation time:       1.16 s
--------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100_sep2 |
         loading & splitting time:  289.56 s
         transformation time:       1.38 s
--------------------------------------------------------------------------------


In [None]:
q = 'What is ExxonMobil’s worldwide environmental expenditures in 2022?'

In [None]:
print_doc(0, q)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100_def |
-----------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1. Page 13 |
--------------
reduce air, water, and waste emissions, and expenditures for asset retirement obligations. Using definitions and guidelines established 
by the American Petroleum Institute, ExxonMobil' s 2022 worldwide environmental expenditures for all such preventative and 
remediation steps, including ExxonMobil's share of equity company expenditures, were $5.7 billion, of which $3.8 billion were
----------------------------------------------------------------------------------------------------
| 2. Page 11 |
--------------
* Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website a

In [None]:
print_doc(1, q)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100_sep1 |
-----------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1. Page 13 |
--------------
reduce air, water, and waste emissions, and expenditures for asset retirement obligations. Using definitions and guidelines established 
by the American Petroleum Institute, ExxonMobil' s 2022 worldwide environmental expenditures for all such preventative and 
remediation steps, including ExxonMobil's share of equity company expenditures, were $5.7 billion, of which $3.8 billion were
----------------------------------------------------------------------------------------------------
| 2. Page 11 |
--------------
* Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website 

In [None]:
print_doc(2, q)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100_sep2 |
-----------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1. Page 13 |
--------------
and 
reduce air, water, and waste emissions, and expenditures for asset retirement obligations. Using definitions and guidelines established 
by the American Petroleum Institute, ExxonMobil' s 2022 worldwide environmental expenditures for all such preventative and 
remediation steps, including ExxonMobil's share of equity company expenditures, were $5.7 billion, of which $3.8 billion were 
included in expenses with the remainder in capital expenditures. As the Corporation progresses its
----------------------------------------------------------------------------------------------------
| 2. Page 11 |
--------------
* Not i

#### Conclusion
A workable separator setting was found. In a real application, however, the chunk overlap should be larger, and the chunk size should be tuned to the model's maximum input size.
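Tuning the chunk size to the model's input limit is simple arithmetic once the budget is written down. The sketch below assumes that `k` retrieved chunks share the context window with the prompt and the generated answer. All numbers are illustrative and are not measurements from this notebook, and `max_chunk_tokens` is a hypothetical helper.

```python
def max_chunk_tokens(context_window, k, prompt_tokens, answer_tokens):
    """Largest per-chunk token budget so that k chunks fit beside prompt and answer."""
    if k <= 0:
        raise ValueError("k must be positive")
    available = context_window - prompt_tokens - answer_tokens
    if available <= 0:
        raise ValueError("prompt and answer already exceed the context window")
    return available // k


# Illustrative numbers only: a 2048-token window, 4 retrieved chunks,
# 200 tokens of instructions and question, 256 tokens reserved for the answer.
budget = max_chunk_tokens(context_window=2048, k=4, prompt_tokens=200, answer_tokens=256)
print(budget)  # 398
```

Note that `chunk_size = 500` in the splitters above is measured in characters by default, so a token-based budget like this only applies directly when a token counter is passed as `length_function`.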

### 4.2 Loader test

In [None]:
text_spliters = {}

text_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", " ", "", "."],
    chunk_size = 500,
    chunk_overlap = 100
)
text_spliters['rec_500_100'] = text_splitter

vss = {}
data = {}

transform()

--------------------------------------------------------------------------------
Performing data loading, text splitting, and vectorstore transforming
7 loaders, 1 splitters
--------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
         loading & splitting time:  300.86 s
         transformation time:       1.45 s
--------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
         loading & splitting time:  499.64 s
         transformation time:       1.33 s
--------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
         loading & splitting time:  497.5 s
         transformation time:       1.33 s
--------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
         loading & splitting time:  496.66 s
         transformation time:       1.96 s
--------------

In [None]:
text_spliters = {}

text_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", " ", "", "."],
    chunk_size = 500,
    chunk_overlap = 100
)
text_spliters['rec_500_100'] = text_splitter

vss = {}

transform()

--------------------------------------------------------------------------------
Performing data loading, text splitting, and vectorstore transforming
7 loaders, 1 splitters
--------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
         loading & splitting time:  301.05 s
         transformation time:       1.41 s
--------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
         loading & splitting time:  512.55 s
         transformation time:       1.3 s
--------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
         loading & splitting time:  510.46 s
         transformation time:       1.34 s
--------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
         loading & splitting time:  509.28 s
         transformation time:       1.65 s
--------------

#### 4.2.1 Data query

In [None]:
# Answer at page 14
q = 'What is ExxonMobil’s worldwide environmental expenditures in 2022?'
a = '5.7'

In [None]:
print_doc(0, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1. Page 14 | Found: [31mTrue[0m |
--------------------------------
and 
reduce air, water, and waste emissions, and expenditures for asset retirement obligations. Using definitions and guidelines established 
by the American Petroleum Institute, ExxonMobil' s 2022 worldwide environmental expenditures for all such preventative and 
remediation steps, including ExxonMobil's share of equity company expenditures, were $5.7 billion, of which $3.8 billion were 
included in expenses with the remainder in capital expenditures. As the Corporation progresses its
----------------------------------------------------------------------------------------------------
| 2.

In [None]:
print_doc(1, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: 

In [None]:
print_doc(2, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: [

In [None]:
print_doc(3, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: F

In [None]:
print_doc(4, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFium2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1. Page 14 | Found: [31mTrue[0m |
--------------------------------
and 
reduce air, water, and waste emissions, and expenditures for asset retirement obligations. Using definitions and guidelines established 
by the American Petroleum Institute, ExxonMobil' s 2022 worldwide environmental expenditures for all such preventative and 
remediation steps, including ExxonMobil's share of equity company expenditures, were $5.7 billion, of which $3.8 billion were 
included in expenses with the remainder in capital expenditures. As the Corporation progresses its
----------------------------------------------------------------------------------------------------
| 

In [None]:
print_doc(5, q, a)

----------------------------------------------------------------------------------------------------
| PDFMiner + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022.
17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas
prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10-
year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019
baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--

In [None]:
print_doc(6, q, a)

----------------------------------------------------------------------------------------------------
| PyMuPDF + rec_500_100 |
--------------------------------
What is ExxonMobil’s worldwide environmental expenditures in 2022?
----------------------------------------------------------------------------------------------------
| 1. Page 14 | Found: [31mTrue[0m |
--------------------------------
and 
reduce air, water, and waste emissions, and expenditures for asset retirement obligations. Using definitions and guidelines established 
by the American Petroleum Institute, ExxonMobil' s 2022 worldwide environmental expenditures for all such preventative and 
remediation steps, including ExxonMobil's share of equity company expenditures, were $5.7 billion, of which $3.8 billion were 
included in expenses with the remainder in capital expenditures. As the Corporation progresses its
----------------------------------------------------------------------------------------------------
| 2. Pag



*   Poor retrieval: Unstructured_pdf2, PDFMiner



#### 4.2.2 Table query

In [None]:
# Answer at page 142
q = 'What is ExxonMobil’s Future production cost in Europe?'
a = '1,815'

In [None]:
print_doc(0, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1. Page 9 | Found: False |
--------------------------------
EXXON MOBIL CORPORATION  |  2022 ANNUAL REPORT
Our winning proposition
Upstream Low Carbon Solutions Product Solutions
~500K
40-50 %oil-equivalent barrels o f expected 
growth by 2027 versus 202 3
reduction in U pstream
greenhouse gas intensity
by 203 0182X
1Bvolume o f high-value products
with di fferentiated per formance
by 2027 versus 201 9
pounds per year of advanced 
recyclin g capacity expected 
by 202 6>10%
~1Boverall return on the port folio
of investments from 2022-202719
cubic feet o
----------------------------------------------------------------------------------------------------
| 2. Page 13 | Fou

In [None]:
print_doc(1, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
-----

In [None]:
print_doc(2, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
------

In [None]:
print_doc(3, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
-----

In [None]:
print_doc(4, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFium2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1. Page 9 | Found: False |
--------------------------------
EXXON MOBIL CORPORATION | 2022 ANNUAL REPORT
Our winning proposition
Upstream Low Carbon Solutions Product Solutions
~500K
40-50%
oil-equivalent barrels of expected 
growth by 2027 versus 2023
reduction in Upstream
greenhouse gas intensity
by 203018
2X
1B
volume of high-value products
with differentiated performance
by 2027 versus 2019
pounds per year of advanced 
recycling capacity expected 
by 2026
>10%
~1B
overall return on the portfolio
of investments from
----------------------------------------------------------------------------------------------------
| 2. Page 13 | Found: False |
---

In [None]:
print_doc(5, q, a)

----------------------------------------------------------------------------------------------------
| PDFMiner + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022.
17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas
prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10-
year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019
baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------

In [None]:
print_doc(6, q, a)

----------------------------------------------------------------------------------------------------
| PyMuPDF + rec_500_100 |
--------------------------------
What is ExxonMobil’s Future production cost in Europe?
----------------------------------------------------------------------------------------------------
| 1. Page 150 | Found: False |
--------------------------------
time of this report and we assume no duty to update these statements as of any future date. Unless 
otherwise specified, data shown is for 2022. Prior years’ data have been reclassified in certain cases to conform to the 2022 presentation basis. Unless 
otherwise stated, resources, production rates, and project capacities are gross. References to “emissions” refer to energy-related emissions.
Investor information
Shareholder services
1 3 9
Sign up to learn more about ExxonMobil
----------------------------------------------------------------------------------------------------
| 2. Page 13 | Found: False |
------

*look into page 142*

In [None]:
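def print_doc1(idx, page):
    """Print every chunk of the idx-th loader/splitter pair whose page metadata equals `page`.

    `page` is compared with the raw metadata value, which this notebook treats as 0-based
    (the printed header adds 1). Chunks without page metadata are skipped.
    """
    tmp = [d for d in list(data.items())[idx][1] if d.metadata.get('page') == page]
    for i, d in enumerate(tmp):
        print('-'*100)
        print('|', str(i+1)+'. Page', d.metadata['page']+1, '|')
        print('-'*14)
        print(d.page_content)
    print('-'*100)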
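When retrieval misses an answer, it helps to check whether the answer string survived loading and splitting at all, independent of the vectorstore. The sketch below scans a list of chunks and reports the pages on which the answer appears. `Doc` is a minimal stand-in for a LangChain document, the toy chunks are invented, and the page metadata is assumed to be 0-based as in `print_doc1`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document chunk with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_containing(docs, answer):
    """Return the sorted 1-based page numbers of chunks whose text contains answer."""
    pages = set()
    for d in docs:
        page = d.metadata.get("page")  # chunks without page metadata are skipped
        if page is not None and answer in d.page_content:
            pages.add(page + 1)
    return sorted(pages)


# Toy chunks, not taken from the report
toy_docs = [
    Doc("Future production costs ... 1,815 ...", {"page": 141}),
    Doc("unrelated text", {"page": 141}),
    Doc("1,815 but no page metadata"),
]
print(pages_containing(toy_docs, "1,815"))  # [142]
```

In the notebook this could be called as `pages_containing(list(data.values())[idx], a)`. An empty result would suggest that the loader or splitter lost the value, while a non-empty one would point to the embedding search instead.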

In [None]:
print_doc1(0, 142)

----------------------------------------------------------------------------------------------------


In [None]:
print_doc1(4, 142)

----------------------------------------------------------------------------------------------------


In [None]:
print_doc1(6, 142)

----------------------------------------------------------------------------------------------------


In [None]:
# Answer at page 142
q = 'How is ExxonMobil’s standadized measure prepared?'
a = 'average prices'

In [None]:
print_doc(0, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1. Page 12 | Found: False |
--------------------------------
* Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website at www.exxonmobil.com*
**
----------------------------------------------------------------------------------------------------
| 2. Page 14 | Found: False |
--------------------------------
"Item IA. Risk Factors" and "Item 2. Properties" in this report. 
ExxonMobil maintains a website at exxonmobil.com. Our annual report on Form 10-K, quarterly reports on Form 10-Q, current 
reports on Form 8-K, and any amendments to those reports filed or furnished pursuant to Section 13(a) of the Securities Exchange Ac

In [None]:
print_doc(1, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
118

108

89

116

137

129

141

98

192 Fiscal years ended December 31

133

135

DEFINITIONS

Listed below are deﬁnitions of several of ExxonMobil’s key business and ﬁnancial performance measures and other terms. These deﬁnitions are provided to facilitate understanding of the terms and their calculation.
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------------
and individual and emission-reduction

participants,

including

ExxonMobil,

energy

in lower-emission

and technologies. markets and international

The Corpor

In [None]:
print_doc(2, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
118

108

89

116

137

129

141

98

192 Fiscal years ended December 31

133

135

DEFINITIONS

Listed below are deﬁnitions of several of ExxonMobil’s key business and ﬁnancial performance measures and other terms. These deﬁnitions are provided to facilitate understanding of the terms and their calculation.
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------------
and individual and emission-reduction

participants,

including

ExxonMobil,

energy

in lower-emission

and technologies. markets and international

The Corpora

In [None]:
print_doc(3, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
Listed below are deﬁnitions of several of ExxonMobil’s key business and ﬁnancial performance measures and other terms. These deﬁnitions are provided to facilitate understanding of the terms and their calculation.
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------------
regulations, and affect the production With of ExxonMobil. and minimize the These regulations. and to monitor and guidelines and were of which $3.8 billion plans, expenditures $8.2 billion,
-------------------------------------------------------------------

In [None]:
print_doc(4, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFium2 + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1. Page 12 | Found: False |
--------------------------------
* Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website at www.exxonmobil.com
*
*
*
----------------------------------------------------------------------------------------------------
| 2. Page 14 | Found: False |
--------------------------------
expected to account for approximately 51 percent of the total. 
Information concerning the source and availability of raw materials used in the Corporation's business, the extent of seasonality in the 
business, the possibility of renegotiation of profits or termination of contracts at the election of governments,

In [None]:
print_doc(5, q, a)

----------------------------------------------------------------------------------------------------
| PDFMiner + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
66

310

2021

104

399

2022

195

327

98

133

192
Fiscal years ended December 31

DEFINITIONS

Listed below are deﬁnitions of several of ExxonMobil’s key business and ﬁnancial performance measures and other
terms. These deﬁnitions are provided to facilitate understanding of the terms and their calculation.
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------------
was 62 thousand, 

and are covered 

executive, 

plans and programs. 

at years ended 2022, 2021, and 2020, respectively. 

technical, 

and wage employees 

who wor

In [None]:
print_doc(6, q, a)

----------------------------------------------------------------------------------------------------
| PyMuPDF + rec_500_100 |
--------------------------------
How is ExxonMobil’s standadized measure prepared?
----------------------------------------------------------------------------------------------------
| 1. Page 12 | Found: False |
--------------------------------
* Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website at www.exxonmobil.com
*
*
*
----------------------------------------------------------------------------------------------------
| 2. Page 14 | Found: False |
--------------------------------
"Item IA. Risk Factors" and "Item 2. Properties" in this report. 
ExxonMobil maintains a website at exxonmobil.com. Our annual report on Form 10-K, quarterly reports on Form 10-Q, current 
reports on Form 8-K, and any amendments to those reports filed or furnished pursuant to Section 13(a) of the Securities Exchange Act 

In [None]:
# Answer at page 142
q = 'What is ExxonMobil’s Long-term debt in 2021?'
a = '43,428'

*None of the loaders parsed the table or found the answer.*
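
Since none of the loaders recovered the table, one option worth testing is a dedicated table extractor. The sketch below uses `pdfplumber` (installed in the setup cell) and its `page.extract_tables()` call. The helper names `table_to_text` and `extract_page_tables`, and the pipe-separated row format, are assumptions made for illustration and are not part of the notebook. It has not been run against the ExxonMobil report.

```python
from typing import List, Optional

Row = List[Optional[str]]


def table_to_text(rows: List[Row]) -> str:
    """Flatten one extracted table into pipe-separated lines, one per row."""
    lines = []
    for row in rows:
        # pdfplumber returns None for empty cells; render them as empty strings
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        if any(cells):  # skip rows that are completely empty
            lines.append(" | ".join(cells))
    return "\n".join(lines)


def extract_page_tables(pdf_path: str, page_index: int) -> List[str]:
    """Return every table found on one page (0-based index) as a text block."""
    import pdfplumber  # imported lazily so table_to_text can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return [table_to_text(table) for table in page.extract_tables()]
```

Each returned string keeps a row's label and its values on one line, so it could be indexed as its own chunk next to the text chunks. `page_index` is 0-based, so the right value for "page 142" depends on how that page number was counted.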

#### 4.2.3 image query

In [None]:
# Answer at page 147
q = 'What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?'
a = '92'

In [None]:
print_doc(0, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1. Page 147 | Found: True |
--------------------------------
2022
ExxonMobil 100 85 91 58 92 171
S&P 500 100 96 126 149 192 157
Industry Group 100 94 103 71 97 140
Fiscal years ended December 31
TEN-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS4
$400
300200
1000(value of $100 invested at year-end 2012)
ExxonMobil
Industry GroupS&P 500
2012
ExxonMobil 100 113 118 97 66 195
S&P 500 100 151 171 199 310 327
Industry Group 100 108 116 129 98 192
Fiscal years ended December 312014
120
132
1182013
99
153
892015
114
208
1372017
104
262
1412019
104
399
1332021 2016
----------------------------------------------------------------------------------------


*   need to further increase chunk size


In [None]:
print_doc(1, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: True |
--------------------------------
FIVE-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS 4 (value of $100 invested at year-end 2017)

$250

200

ExxonMobil

150

S&P 500

100

Industry Group

50

0

2017

2018

2019

2020

2021

2022

ExxonMobil

100

85

91

58

92

171

S&P 500

100

96

126

149

192

157

Industry Group

100

94

103

71

97

140 Fiscal years ended December 31

TEN-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS 4 (value of $100 invested at year-end 2012)

$400

300

S&P 500

ExxonMobil

200

Industry Group

100

0
------------------------------------------------------------------------------------------------

In [None]:
print_doc(2, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: True |
--------------------------------
FIVE-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS 4 (value of $100 invested at year-end 2017)

$250

200

ExxonMobil

150

S&P 500

100

Industry Group

50

0

2017

2018

2019

2020

2021

2022

ExxonMobil

100

85

91

58

92

171

S&P 500

100

96

126

149

192

157

Industry Group

100

94

103

71

97

140 Fiscal years ended December 31

TEN-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS 4 (value of $100 invested at year-end 2012)

$400

300

S&P 500

ExxonMobil

200

Industry Group

100

0
-------------------------------------------------------------------------------------------------

In [None]:
print_doc(3, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website at www.exxonmobil.com
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------------
Important shareholder information is available at exxonmobil.com/investors:
----------------------------------------------------------------------------------------------------
| 3 | Found: False |
--------------------------------
The annual total shareholder return (TSR) to ExxonMobil shareholders was 87.0 pe

In [None]:
print_doc(4, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFium2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1. Page 147 | Found: True |
--------------------------------
500
2017 2018 2019 2020 2021 2022
ExxonMobil 100 85 91 58 92 171
S&P 500 100 96 126 149 192 157
Industry Group 100 94 103 71 97 140
Fiscal years ended December 31
TEN-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS4
$400
300
200
100
0
(value of $100 invested at year-end 2012)
ExxonMobil
Industry Group
S&P 500
2012
ExxonMobil 100 113 118 97 66 195
S&P 500 100 151 171 199 310 327
Industry Group 100 108 116 129 98 192
Fiscal years ended December
----------------------------------------------------------------------------------------------------
| 2. Page 12 | Found: Fals

In [None]:
print_doc(5, q, a)

----------------------------------------------------------------------------------------------------
| PDFMiner + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: True |
--------------------------------
FIVE-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS 4
(value of $100 invested at year-end 2017)

$250

200

150

100

50

0

S&P 500

ExxonMobil

Industry Group

ExxonMobil

S&P 500

Industry Group

2017

100

100

100

2018

85

96

94

2019

91

126

103

2020

58

149

71

2021

92

192

2022

171

157

97

140
Fiscal years ended December 31

TEN-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS 4
(value of $100 invested at year-end 2012)

$400

300

200

100

0

2012

2013

2014

2015

2016

2017

2018
----------------------------------------------------------------------------------------------------

In [None]:
print_doc(6, q, a)

----------------------------------------------------------------------------------------------------
| PyMuPDF + rec_500_100 |
--------------------------------
What is ExxonMobil’s five-year cumulative total shareholder returns in 2021?
----------------------------------------------------------------------------------------------------
| 1. Page 147 | Found: True |
--------------------------------
500
100
96
126
149
192
157
Industry Group
100
94
103
71
97
140
Fiscal years ended December 31
TEN-YEAR CUMULATIVE TOTAL SHAREHOLDER RETURNS4
$400
300
200
100
0
(value of $100 invested at year-end 2012)
ExxonMobil
Industry Group
S&P 500
2012
ExxonMobil
100
113
118
97
66
195
S&P 500
100
151
171
199
310
327
Industry Group
100
108
116
129
98
192
Fiscal years ended December 31
2014
120
132
118
2013
99
153
89
2015
114
208
137
2017
104
262
141
2019
104
399
133
2021
2016
2018
2020
2022
----------------------------------------------------------------------------------------------------
| 2. P

In [None]:
# Answer at page 2
q = 'What is ExxonMobil’s ROCE in 2021?'
a = '23B'

In [None]:
print_doc(0, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFLoader + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1. Page 10 | Found: False |
--------------------------------
EXXON MOBIL CORPORATION  |  2022 ANNUAL REPORT
Our winning proposition
Upstream Low Carbon Solutions Product Solutions
~500K
40-50 %oil-equivalent barrels o f expected 
growth by 2027 versus 202 3
reduction in U pstream
greenhouse gas intensity
by 203 0182X
1Bvolume o f high-value products
with di fferentiated per formance
by 2027 versus 201 9
pounds per year of advanced 
recyclin g capacity expected 
by 202 6>10%
~1Boverall return on the port folio
of investments from 2022-202719
cubic feet o
----------------------------------------------------------------------------------------------------
| 2. Page 12 | Found: False |
-------

In [None]:
print_doc(1, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_file + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
-------------------------

In [None]:
print_doc(2, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------

In [None]:
print_doc(3, q, a)

----------------------------------------------------------------------------------------------------
| Unstructured_pdf2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022. 17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10- year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019 baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
-------------------------

In [None]:
print_doc(4, q, a)

----------------------------------------------------------------------------------------------------
| PyPDFium2 + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1. Page 10 | Found: False |
--------------------------------
EXXON MOBIL CORPORATION | 2022 ANNUAL REPORT
Our winning proposition
Upstream Low Carbon Solutions Product Solutions
~500K
40-50%
oil-equivalent barrels of expected 
growth by 2027 versus 2023
reduction in Upstream
greenhouse gas intensity
by 203018
2X
1B
volume of high-value products
with differentiated performance
by 2027 versus 2019
pounds per year of advanced 
recycling capacity expected 
by 2026
>10%
~1B
overall return on the portfolio
of investments from
----------------------------------------------------------------------------------------------------
| 2. Page 12 | Found: False |
----------------------



*   good



In [None]:
print_doc(5, q, a)

----------------------------------------------------------------------------------------------------
| PDFMiner + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1 | Found: False |
--------------------------------
16. Source: ExxonMobil analysis of EPA Facility Level Information on Greenhouse Gases Tool, 2019 data as of Feb. 15, 2022.
17. Statements of potential future earnings and cash ﬂow assume $60/bbl Brent crude prices and $3/mmbtu Henry Hub gas
prices, adjusted for inﬂation from 2022; Energy, Chemical, and Specialty Product margins at historical averages for the 10-
year period from 2010-2019; and before tax Corporate & Financing expenses between $2.3 and $2.5 billion annually. 2019
baseline
----------------------------------------------------------------------------------------------------
| 2 | Found: False |
--------------------------------
2

In [None]:
print_doc(6, q, a)

----------------------------------------------------------------------------------------------------
| PyMuPDF + rec_500_100 |
--------------------------------
What is ExxonMobil’s ROCE in 2021?
----------------------------------------------------------------------------------------------------
| 1. Page 12 | Found: False |
--------------------------------
* Not included with the 2022 Annual Report to Shareholders but available on the Investor section of our website at www.exxonmobil.com
*
*
*
----------------------------------------------------------------------------------------------------
| 2. Page 151 | Found: False |
--------------------------------
time of this report and we assume no duty to update these statements as of any future date. Unless 
otherwise specified, data shown is for 2022. Prior years’ data have been reclassified in certain cases to conform to the 2022 presentation basis. Unless 
otherwise stated, resources, production rates, and project capacities are gross. R



*   good: PyPDF, PyPDFium2



#### Conclusion

PyPDFium2 performs well on 2 out of 3 tasks, especially on extracting data from images.

*   data query: all except Unstructured_pdf2 and PDFMiner
*   table query: none
*   image query: PyPDF and PyPDFium2

Whichever we choose, PyPDF or PyPDFium2, the chunk overlap needs to be increased further to capture image content more accurately.
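
To illustrate why chunk size and overlap matter here, the sketch below uses a plain fixed-width character window. This is a deliberately simplified stand-in for `RecursiveCharacterTextSplitter`, which splits on separators rather than at fixed offsets, so the numbers will not match the `rec_500_100` runs above. `sliding_chunks` and `chunks_containing` are hypothetical helper names.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # last window reached the end
            break
    return chunks


def chunks_containing(chunks: List[str], *needles: str) -> List[str]:
    """Return the chunks that contain every one of the given substrings."""
    return [c for c in chunks if all(n in c for n in needles)]
```

For the image query, a useful check is whether any single chunk holds both the year header and the `ExxonMobil` row, for example `chunks_containing(chunks, "2021", "ExxonMobil")`. If no chunk does, the retrieved text cannot tie the value to its year, and a larger `chunk_size` or `chunk_overlap` is worth trying.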