# LangChain PDF finder and summarization

Install the required packages:

```
pip install langchain
pip install python-dotenv
pip install pypdf
pip install openai
pip install chromadb
pip install tiktoken
pip install arxiv
pip install pymupdf
```

Alternatively, install everything from requirements.txt:

```
pip install -r requirements.txt
```

In [1]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain import OpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.retrievers import ArxivRetriever
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

from dotenv import load_dotenv
import os
import re

### Load the API Key stored in the .env file

In [2]:
load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

# `not` already covers both None and an empty string.
if not OPENAI_API_KEY:
    print('Error: OPENAI_API_KEY is not set in .env file')
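The check above only prints a message, so later cells would still run and fail at the first API call. A minimal sketch of a fail-fast alternative is shown here; `require_env` is a hypothetical helper name, not part of python-dotenv or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust to taste.
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return value
```

After `load_dotenv()`, the key could then be read with `OPENAI_API_KEY = require_env('OPENAI_API_KEY')`.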


### Setup LLM and Initial PDF

In [3]:
pdfpath = './text_classification_with_llm.pdf'

In [4]:
pdf_loader = PyPDFLoader(pdfpath)
doc = pdf_loader.load()
documents = [doc]
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

### Read the PDF document into Chroma

In [5]:
def split_docs(docs):
    return RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
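To make `chunk_size=500` and `chunk_overlap=50` concrete, here is a simplified sketch of overlapping chunking in plain Python. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas this hypothetical `sliding_chunks` helper cuts at fixed character offsets.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: 50 characters are shared
```

The overlap means a sentence that straddles a chunk boundary is still fully present in at least one chunk, at the cost of embedding some text twice.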

In [6]:
docstore = Chroma.from_documents(split_docs(doc), embedding=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY))

### Get relevant sources from PDF

In [24]:
src_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docstore.as_retriever())
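Conceptually, a retrieval chain with `chain_type="stuff"` fetches the chunks most similar to the question and places all of them into a single prompt. The sketch below illustrates that idea with toy two-dimensional vectors; the function names and the prompt wording are hypothetical and do not reproduce LangChain's or Chroma's internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             indexed_chunks: List[Tuple[str, Sequence[float]]],
             k: int = 2) -> List[str]:
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    ranked = sorted(indexed_chunks,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question: str, chunks: List[str]) -> str:
    """'Stuff' every retrieved chunk into one prompt (illustrative wording)."""
    context = "\n\n".join(chunks)
    return f"Use the following context to answer the question.\n\n{context}\n\nQuestion: {question}"


# Toy index: in the notebook the vectors would come from OpenAIEmbeddings.
index = [("cats", [1.0, 0.0]), ("dogs", [0.0, 1.0]), ("kittens", [0.9, 0.1])]
print(retrieve([1.0, 0.0], index, k=2))  # ['cats', 'kittens']
```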

In [25]:
def extract_arxiv_id(text):
    # New-style arXiv IDs are YYMM.NNNN or YYMM.NNNNN, optionally followed by a version (e.g. v2).
    matches = re.findall(r"arXiv:(\d{4}\.\d{4,5}(?:v\d+)?)", text)
    return matches[0] if matches else ''
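`extract_arxiv_id` only recognizes IDs that carry the `arXiv:` prefix, so an old-style identifier such as `cs/0506075` (which appears in the `sources` output further down) is silently dropped. A hedged sketch of a broader matcher follows. The `extract_any_arxiv_id` name and both patterns are assumptions for illustration, and whether `ArxivRetriever` resolves old-style IDs is not verified here. Making the prefix optional also raises the risk of matching unrelated numbers.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
NEW_STYLE = r"\d{4}\.\d{4,5}(?:v\d+)?"
# Old-style IDs: archive[.SUBJECT]/YYMMNNN, e.g. cs/0506075 or math.GT/0309136.
OLD_STYLE = r"[a-z-]+(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?"
# The "arXiv:" prefix is optional in this sketch.
ARXIV_ID = re.compile(rf"(?:arXiv:\s*)?\b({NEW_STYLE}|{OLD_STYLE})\b")


def extract_any_arxiv_id(text: str) -> str:
    """Return the first new-style or old-style arXiv ID found in text, or '' if none."""
    match = ARXIV_ID.search(text)
    return match.group(1) if match else ""


print(extract_any_arxiv_id(" arXiv:2203.02155"))  # 2203.02155
print(extract_any_arxiv_id(" cs/0506075"))        # cs/0506075
```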

In [26]:
sources = src_chain.run(
    """Give me a list of the most relevant sources for this document.
       The list should include the arXiv ID (e.g. 1605.08386)."""
    )
# The model answers with a comma-separated string; split it into one entry per source.
sources = sources.split(',')
# Keep only the entries from which an arXiv ID could be extracted.
arxiv_list = [arxiv_id for arxiv_id in map(extract_arxiv_id, sources) if arxiv_id]

In [27]:
sources

[' arXiv:2203.02155',
 ' cs/0506075',
 ' arXiv:2104.06599',
 ' arXiv:1901.02860',
 ' arXiv:2212.10509',
 ' arXiv:1810.04805',
 ' arXiv:2104.08762',
 ' arXiv:1810.04805']

In [29]:
arxiv_list

['2203.02155',
 '2104.06599',
 '1901.02860',
 '2212.10509',
 '1810.04805',
 '2104.08762',
 '1810.04805']
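The list above contains `1810.04805` twice, so the same paper would be downloaded and summarized twice. A small sketch for removing duplicates while keeping the original order is shown below; `dedupe_preserving_order` is a hypothetical helper name.

```python
def dedupe_preserving_order(items):
    """Drop repeated entries, keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


ids = ['2203.02155', '2104.06599', '1901.02860', '2212.10509',
       '1810.04805', '2104.08762', '1810.04805']
print(dedupe_preserving_order(ids))
# ['2203.02155', '2104.06599', '1901.02860', '2212.10509', '1810.04805', '2104.08762']
```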

### Download arXiv resources

In [13]:
retriever = ArxivRetriever(load_max_docs=1)

In [14]:
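for src in arxiv_list:
    # Use a separate name so the `doc` loaded from the local PDF is not overwritten.
    arxiv_docs = retriever.get_relevant_documents(query=src)
    if not arxiv_docs:
        continue
    documents.append(arxiv_docs)
    print(arxiv_docs)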

[Document(page_content='Training language models to follow instructions\nwith human feedback\nLong Ouyang∗\nJeff Wu∗\nXu Jiang∗\nDiogo Almeida∗\nCarroll L. Wainwright∗\nPamela Mishkin∗\nChong Zhang\nSandhini Agarwal\nKatarina Slama\nAlex Ray\nJohn Schulman\nJacob Hilton\nFraser Kelton\nLuke Miller\nMaddie Simens\nAmanda Askell†\nPeter Welinder\nPaul Christiano∗†\nJan Leike∗\nRyan Lowe∗\nOpenAI\nAbstract\nMaking language models bigger does not inherently make them better at following\na user’s intent. For example, large language models can generate outputs that\nare untruthful, toxic, or simply not helpful to the user. In other words, these\nmodels are not aligned with their users. In this paper, we show an avenue for\naligning language models with user intent on a wide range of tasks by ﬁne-tuning\nwith human feedback. Starting with a set of labeler-written prompts and prompts\nsubmitted through the OpenAI API, we collect a dataset of labeler demonstrations\nof the desired model behavi

### Summarize retrieved documents with GPT

In [15]:
prompt_template = """Write a comprehensive summary of the following:


{text}
"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

refine_template = (
    "Your job is to produce a final summary.\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "We have the opportunity to refine the existing summary "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary.\n"
    "If the context isn't useful, return the original summary."
)
refine_prompt = PromptTemplate(
    input_variables=["existing_answer", "text"],
    template=refine_template,
)

In [16]:
documents = [item for sublist in documents for item in sublist]
documents

[Document(page_content='Text Classiﬁcation via Large Language Models\nXiaofei Sun\x07, Xiaoya Li|, Jiwei Li\x07;|, Fei Wu\x07\nShangwei GuoN, Tianwei Zhangª, Guoyin WangF\nAbstract\nDespite the remarkable success of large-\nscale Language Models (LLMs) such as\nGPT-3, their performances still signiﬁcantly\nunderperform ﬁne-tuned models in the task of\ntext classiﬁcation. This is due to (1) the lack\nof reasoning ability in addressing complex\nlinguistic phenomena (e.g., intensiﬁcation,\ncontrast, irony etc); (2) limited number of\ntokens allowed in in-context learning.\nIn this paper, we introduce Clue And\nReasoning Prompting (CARP). CARP adopts\na progressive reasoning strategy tailored to\naddressing the complex linguistic phenomena\ninvolved in text classiﬁcation: CARP ﬁrst\nprompts LLMs to ﬁnd superﬁcial clues\n(e.g., keywords, tones, semantic relations,\nreferences, etc), based on which a diagnostic\nreasoning process is induced for ﬁnal\ndecisions. To further address the limited

In [17]:
summary_chain = load_summarize_chain(llm, chain_type="refine", question_prompt=PROMPT, refine_prompt=refine_prompt)
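The `refine` strategy summarizes the first document with the question prompt, then feeds each subsequent document to the model together with the summary so far. The sketch below shows that loop in plain Python with a stand-in for the model. It is a conceptual illustration, not LangChain's implementation; `refine_summarize` and `fake_llm` are hypothetical names.

```python
from typing import Callable, List

QUESTION_TEMPLATE = "Write a comprehensive summary of the following:\n\n\n{text}\n"
REFINE_TEMPLATE = (
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary.\n"
)


def refine_summarize(chunks: List[str], call_llm: Callable[[str], str]) -> str:
    """Summarize the first chunk, then refine that summary once per remaining chunk."""
    if not chunks:
        return ""
    summary = call_llm(QUESTION_TEMPLATE.format(text=chunks[0]))
    for chunk in chunks[1:]:
        summary = call_llm(REFINE_TEMPLATE.format(existing_answer=summary, text=chunk))
    return summary


# Stand-in for a real model call: records each prompt and returns a numbered placeholder.
prompts = []

def fake_llm(prompt: str) -> str:
    prompts.append(prompt)
    return f"summary-{len(prompts)}"


print(refine_summarize(["first doc", "second doc", "third doc"], fake_llm))  # summary-3
```

Because every step depends on the previous summary, the calls run sequentially: one model call per document.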

In [18]:
res = summary_chain({"input_documents": documents}, return_only_outputs=True)

In [19]:
res

{'output_text': '\nThis paper introduces Clue And Reasoning Prompting (CARP), a framework for text classification via large language models. CARP adopts a progressive reasoning strategy tailored to addressing the complex linguistic phenomena involved in text classification. CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc), based on which a diagnostic reasoning process is induced for final decisions. To further address the limited-token issue, CARP uses a fine-tuned model on the supervised dataset for kNN demonstration search in the in-context learning, allowing the model to take the advantage of both LLM’s generalization ability and the task-specific evidence provided by the full labeled dataset. CARP also uses a demonstration sampling strategy that includes random sampling and kNN sampling, which retrieves semantically similar examples. Additionally, CARP uses a progressive reasoning strategy that involves clue collection, 

In [21]:
res['output_text']


'\nThis paper introduces Clue And Reasoning Prompting (CARP), a framework for text classification via large language models. CARP adopts a progressive reasoning strategy tailored to addressing the complex linguistic phenomena involved in text classification. CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc), based on which a diagnostic reasoning process is induced for final decisions. To further address the limited-token issue, CARP uses a fine-tuned model on the supervised dataset for kNN demonstration search in the in-context learning, allowing the model to take the advantage of both LLM’s generalization ability and the task-specific evidence provided by the full labeled dataset. CARP also uses a demonstration sampling strategy that includes random sampling and kNN sampling, which retrieves semantically similar examples. Additionally, CARP uses a progressive reasoning strategy that involves clue collection, reasoning, and d