In [6]:
import os

from langchain.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.llms import OpenAI

In [14]:
loader = PyPDFLoader('https://arxiv.org/pdf/2110.07686.pdf')
pages = loader.load()
print(len(pages))
pages

15


[Document(page_content='Making Document-Level Information Extraction\nRight for the Right Reasons\nLiyan Tang1, Dhruv Rajan1, Suyash Mohan2, Abhijeet Pradhan3, R. Nick Bryan1, Greg Durrett1\n1The University of Texas at Austin\n2University of Pennsylvania\n3Galileo CDS Inc.\nlytang@utexas.edu, dhruv.rajan@utexas.edu\nSuyash.Mohan@pennmedicine.upenn.edu, ap@galileocds.com\nnick.bryan@austin.utexas.edu, gdurrett@cs.utexas.edu\nAbstract\nDocument-level models for information ex-\ntraction tasks like slot-ﬁlling are ﬂexible: they\ncan be applied to settings where information\nis not necessarily localized in a single sen-\ntence. For example, key features of a diag-\nnosis in a radiology report may not be ex-\nplicitly stated in one place, but nevertheless\ncan be inferred from parts of the report’s text.\nHowever, these models can easily learn spuri-\nous correlations between labels and irrelevant\ninformation. This work studies how to en-\nsure that these models make correct inferences\nfr

In [34]:
for page in pages:
    page.page_content = page.page_content.replace('-\n', '').replace('\n', ' ')
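
The cleanup above can be tried on a small string to see what it does. The following is a minimal sketch; the helper name `join_wrapped_lines` and the sample strings are made up for illustration and are not part of the notebook. Note that removing every `-\n` also drops genuine hyphens that happen to fall at the end of a line.

```python
def join_wrapped_lines(text: str) -> str:
    """Undo PDF line wrapping: rejoin hyphen-split words, then flatten newlines."""
    # Same two replacements as the notebook cell, applied to a plain string.
    return text.replace('-\n', '').replace('\n', ' ')


# Hypothetical sample text, wrapped the way PDF extraction wraps it.
sample = 'information ex-\ntraction tasks like slot-filling\nare flexible'
cleaned = join_wrapped_lines(sample)
print(cleaned)

# Caveat: a real hyphen at a line break is removed as well.
compound = join_wrapped_lines('document-\nlevel models')
print(compound)
```

A hyphen inside a line (as in `slot-filling`) is left alone, because only the two-character sequence `-\n` is removed.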

In [35]:
pages

[Document(page_content='Making Document-Level Information Extraction Right for the Right Reasons Liyan Tang1, Dhruv Rajan1, Suyash Mohan2, Abhijeet Pradhan3, R. Nick Bryan1, Greg Durrett1 1The University of Texas at Austin 2University of Pennsylvania 3Galileo CDS Inc. lytang@utexas.edu, dhruv.rajan@utexas.edu Suyash.Mohan@pennmedicine.upenn.edu, ap@galileocds.com nick.bryan@austin.utexas.edu, gdurrett@cs.utexas.edu Abstract Document-level models for information extraction tasks like slot-ﬁlling are ﬂexible: they can be applied to settings where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated in one place, but nevertheless can be inferred from parts of the report’s text. However, these models can easily learn spurious correlations between labels and irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inf

In [4]:
print(pages[0].page_content)

Making Document-Level Information Extraction
Right for the Right Reasons
Liyan Tang1, Dhruv Rajan1, Suyash Mohan2, Abhijeet Pradhan3, R. Nick Bryan1, Greg Durrett1
1The University of Texas at Austin
2University of Pennsylvania
3Galileo CDS Inc.
lytang@utexas.edu, dhruv.rajan@utexas.edu
Suyash.Mohan@pennmedicine.upenn.edu, ap@galileocds.com
nick.bryan@austin.utexas.edu, gdurrett@cs.utexas.edu
Abstract
Document-level models for information ex-
traction tasks like slot-ﬁlling are ﬂexible: they
can be applied to settings where information
is not necessarily localized in a single sen-
tence. For example, key features of a diag-
nosis in a radiology report may not be ex-
plicitly stated in one place, but nevertheless
can be inferred from parts of the report’s text.
However, these models can easily learn spuri-
ous correlations between labels and irrelevant
information. This work studies how to en-
sure that these models make correct inferences
from complex text and make those inferences
in a

In [36]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the pages into chunks of at most 2000 characters,
# with 100 characters of overlap between neighbouring chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
)
# Commented-out alternative using CharacterTextSplitter:
# from langchain.text_splitter import CharacterTextSplitter
# text_splitter = CharacterTextSplitter(
#     separator='\n\n',
#     chunk_size=15,  # minimum?
#     chunk_overlap=10,
#     length_function=len,
# )
split_text = text_splitter.split_documents(documents=pages)
print(len(split_text))
split_text

41


[Document(page_content='Making Document-Level Information Extraction Right for the Right Reasons Liyan Tang1, Dhruv Rajan1, Suyash Mohan2, Abhijeet Pradhan3, R. Nick Bryan1, Greg Durrett1 1The University of Texas at Austin 2University of Pennsylvania 3Galileo CDS Inc. lytang@utexas.edu, dhruv.rajan@utexas.edu Suyash.Mohan@pennmedicine.upenn.edu, ap@galileocds.com nick.bryan@austin.utexas.edu, gdurrett@cs.utexas.edu Abstract Document-level models for information extraction tasks like slot-ﬁlling are ﬂexible: they can be applied to settings where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated in one place, but nevertheless can be inferred from parts of the report’s text. However, these models can easily learn spurious correlations between labels and irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inf
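
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter in plain Python. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and spaces, which this version does not attempt. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = split_with_overlap('abcdefghij', chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one, so text cut at a chunk boundary still appears with some of its context in the next chunk.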

In [50]:
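from langchain.callbacks import get_openai_callback

# temperature=0 keeps the output as deterministic as possible.
model = OpenAI(temperature=0, max_tokens=1000, openai_api_key=os.environ['OPENAI_TOKEN'])
# map_reduce summarizes each chunk separately, then combines the partial summaries.
summary_chain = load_summarize_chain(llm=model, chain_type='map_reduce')

# The callback records token usage and cost for the OpenAI calls made inside the block.
with get_openai_callback() as cb:
    results = summary_chain.run(input_documents=split_text)

results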

' This paper examines how to ensure document-level models for information extraction tasks make correct inferences from complex text. It evaluates the performance of the model on two datasets and discusses various interpretation methods, attention regularization, entropy maximization, human rationales, axiomatic attribution, multi-instance multi-label learning, and adversarial multi-lingual neural relation extraction. Results show that DeepLIFT and Input Gradient perform slightly better than the other two methods.'
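
The `map_reduce` chain type follows a simple pattern: summarize each chunk on its own, then summarize the concatenated partial summaries. The sketch below shows only that pattern and is not LangChain's implementation. The `summarize` callable stands in for an LLM call, and `keep_first_words` is a made-up stub so the example runs without an API key.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = ' '.join(partial_summaries)
    return summarize(combined)  # reduce step


def keep_first_words(text: str, n: int = 3) -> str:
    """Hypothetical stand-in for an LLM summarizer: keep the first n words."""
    return ' '.join(text.split()[:n])


demo_chunks = ['alpha beta gamma delta', 'one two three four']
final_summary = map_reduce_summarize(demo_chunks, keep_first_words)
print(final_summary)
```

With a real model in place of the stub, the map step costs one prompt per chunk, which is why the number of chunks largely determines the token usage.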

In [51]:
cb

Tokens Used: 24471
	Prompt Tokens: 20411
	Completion Tokens: 4060
Successful Requests: 4
Total Cost (USD): $0.48941999999999997
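
The totals reported by the callback can be checked by hand. The figures are consistent with a flat rate of $0.02 per 1,000 tokens; that rate is inferred from the numbers above and is an assumption, not something the notebook states.

```python
# Token counts copied from the callback output above.
prompt_tokens = 20411
completion_tokens = 4060
total_tokens = prompt_tokens + completion_tokens

# Assumed price: USD per 1,000 tokens (inferred from the reported cost).
usd_per_1k_tokens = 0.02

cost = total_tokens / 1000 * usd_per_1k_tokens
print(total_tokens, round(cost, 5))
```

The long tail of digits in the reported `Total Cost (USD)` is ordinary floating-point rounding of the same product.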

In [52]:
print(results)

 This paper examines how to ensure document-level models for information extraction tasks make correct inferences from complex text. It evaluates the performance of the model on two datasets and discusses various interpretation methods, attention regularization, entropy maximization, human rationales, axiomatic attribution, multi-instance multi-label learning, and adversarial multi-lingual neural relation extraction. Results show that DeepLIFT and Input Gradient perform slightly better than the other two methods.


---