## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/541151/extracting-information-from-research-papers-using-langchain-openai

For my other articles on Daniweb.com, please see this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Downloading and Importing Required Libraries

In [38]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken
!pip install rich



In [2]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [3]:
import os
# Placeholder value: substitute your own key and never commit a real key to a shared notebook
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
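
A key written directly into a notebook cell is easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the key is either already exported in the shell or typed in interactively, is a small helper. The name `load_openai_key` is hypothetical and not part of LangChain or the OpenAI client.

```python
import os
from getpass import getpass


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it never appears in notebook output.
        key = getpass(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `load_openai_key()` once before creating the embeddings leaves the key in `os.environ`, where the rest of the notebook expects to find it.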

## Reading and Chunking Text Documents

In [6]:
pdf_reader = PdfReader(r'D:\Datasets\1907.11692.pdf')

In [7]:
pdf_text = ''
for page in pdf_reader.pages:
    page_content = page.extract_text()
    # Skip pages that yield no extractable text
    if page_content:
        pdf_text += page_content

In [8]:
pdf_text

'arXiv:1907.11692v1  [cs.CL]  26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\n†Paul G. Allen School of Computer Science & Engineering,\nUniversity of Washington, Seattle, WA\n{mandar90,lsz }@cs.washington.edu\n§Facebook AI\n{yinhanliu,myleott,naman,jingfeidu,\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\nAbstract\nLanguage model pretraining has led to sig-\nniﬁcant performance gains but careful com-\nparison between different approaches is chal-\nlenging. Training is computationally expen-\nsive, often done on private datasets of different\nsizes, and, as we will show, hyperparameter\nchoices have signiﬁcant impact on the ﬁnal re-\nsults. We present a replication study of BERT\npretraining ( Devlin et al. ,2019 ) that carefully\nmeasures the impact of many key hyperparam-\neters and training data size. We ﬁnd that BERT\nwas signiﬁcantly un
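
The extracted text above contains typographic ligatures (for example `ﬁ`) and words hyphenated across line breaks (`sig-\nniﬁcant`). The notebook uses the text as-is. If you want cleaner chunks, a rough clean-up pass might look like the sketch below. `clean_pdf_text` is a hypothetical helper of my own, and the de-hyphenation rule is a heuristic that will also merge genuine hyphenated compounds that happen to fall at a line end.

```python
import re
import unicodedata


def clean_pdf_text(text: str) -> str:
    """Normalize ligatures and rejoin words hyphenated across line breaks."""
    # NFKC maps compatibility characters such as the "ﬁ" ligature to "fi".
    text = unicodedata.normalize("NFKC", text)
    # Remove a hyphen + newline only when it sits between two word characters.
    return re.sub(r"(?<=\w)-\n(?=\w)", "", text)
```

Other newlines are left untouched, so a splitter that separates on `"\n"` still has line boundaries to work with.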

In [9]:
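splitter = CharacterTextSplitter(
    separator="\n",        # split on line breaks
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=200,     # characters shared between consecutive chunks
    length_function=len,   # measure chunk length in characters
)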
text_chunks = splitter.split_text(pdf_text)
print(f"Total chunks {len(text_chunks)}")
print("============================")
print(text_chunks[0])

Total chunks 61
arXiv:1907.11692v1  [cs.CL]  26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†
Danqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§
†Paul G. Allen School of Computer Science & Engineering,
University of Washington, Seattle, WA
{mandar90,lsz }@cs.washington.edu
§Facebook AI
{yinhanliu,myleott,naman,jingfeidu,
danqi,omerlevy,mikelewis,lsz,ves }@fb.com
Abstract
Language model pretraining has led to sig-
niﬁcant performance gains but careful com-
parison between different approaches is chal-
lenging. Training is computationally expen-
sive, often done on private datasets of different
sizes, and, as we will show, hyperparameter
choices have signiﬁcant impact on the ﬁnal re-
sults. We present a replication study of BERT
pretraining ( Devlin et al. ,2019 ) that carefully
measures the impact of many key hyperparam-
eters and training data size. We ﬁnd that BERT
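
To picture what `separator`, `chunk_size`, and `chunk_overlap` do, here is a simplified sketch of greedy chunking with overlap. It is my own illustration, not LangChain's implementation: `split_with_overlap` is a hypothetical function, and a single piece longer than `chunk_size` is simply kept as an oversized chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most chunk_size
    characters, carrying up to chunk_overlap trailing characters forward."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail that fits within the overlap budget.
            while current and joined_len(current) > chunk_overlap:
                current.pop(0)
            # Shrink the tail further if the new piece would not fit.
            while current and joined_len(current + [piece]) > chunk_size:
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=10, chunk_overlap=4)
```

In the toy call, each chunk holds two four-character lines and repeats the last line of the previous chunk. The same idea is why text near a chunk boundary in the paper appears in two neighbouring chunks.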


In [13]:
embeddings = OpenAIEmbeddings()
embedding_vectors = FAISS.from_texts(text_chunks, embeddings)
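
Conceptually, the vector store holds one embedding per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea as an exhaustive Euclidean search over toy two-dimensional vectors. It assumes a flat (brute-force) index and does not call FAISS or OpenAI. `top_k_nearest` and the toy vectors are illustrative only.

```python
import math


def top_k_nearest(query_vec, stored_vecs, k=4):
    """Return indices of the k stored vectors closest to query_vec (Euclidean)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(stored_vecs)
    ]
    distances.sort()
    return [idx for _, idx in distances[:k]]


# Toy 2-D "embeddings"; real embedding vectors have far more dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = top_k_nearest([1.0, 0.05], toy_vectors, k=2)
```

Here `nearest` holds the positions of the two vectors closest to the query. In the notebook, the matching chunks (rather than indices) are what `similarity_search` hands back.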

## Extracting Information from Research Papers

In [14]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

qa_chain = load_qa_chain(OpenAI(), 
                         chain_type="stuff")

In [17]:
question = "Can you give me a list of datasets used in this paper?"

research_paper = embedding_vectors.similarity_search(question)

qa_chain.run(input_documents = research_paper, 
             question = question)

' The datasets used in this paper include Book Corpus and English Wikipedia, CC-NEWS, OpenWebText, and Stories.'
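
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question, and the model is called once. A rough sketch of that assembly step is below. The function name and the prompt wording are my own assumptions for illustration, not the exact template LangChain uses.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into a single prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


demo_prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "Can you give me a list of datasets used in this paper?",
)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit within the model's context window, which is one reason the chunk size and the number of retrieved chunks are kept small.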

In [26]:
question = "Can you summarize the benchmark results from the paper?"

research_paper = embedding_vectors.similarity_search(question)

qa_chain.run(input_documents = research_paper, 
             question = question)

' The paper found that RoBERTa trained for 500K steps outperformed XLNet LARGE across most tasks on GLUE, SQuaD, and RACE. On GLUE, RoBERTa achieved state-of-the-art results on all 9 of the tasks in the first setting (single-task, dev) and 4 out of 9 tasks in the second setting (ensembles, test), and had the highest average score to date. On SQuAD and RACE, RoBERTa trained for 500K steps achieved significant gains in downstream task performance.'

In [28]:
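def get_answer(questions):
    """Answer each question using the chunks most similar to it."""
    answers = []
    for question in questions:
        # Retrieve the chunks closest to the question in embedding space
        research_paper = embedding_vectors.similarity_search(question)

        # Pass the retrieved chunks and the question to the QA chain
        answer = qa_chain.run(input_documents=research_paper,
                              question=question)
        answers.append(answer)

    return answers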

In [44]:
from rich import print

questions = ["Can you give me a list of datasets used in this paper?",
             "What are the evaluation metrics used in the paper?",
             "Can you summarize the benchmark results from the paper?"]

answers = get_answer(questions)

for i, (question, answer) in enumerate(zip(questions, answers), start=1):
    print(f"[bold]Question {i}: {question}[/bold]")
    print(f"[bold]Answer:[/bold] {answer}")
    print("==================================================")

