#### Here I use the OpenAI API to summarize arXiv research papers: the paper is downloaded from its URL, its text is extracted and summarized, and the summary is saved to a .docx file

In [1]:
import requests
from io import BytesIO
import PyPDF2  
import openai
from docx import Document

##### Method to download a PDF file and extract its text

In [2]:
def download_and_extract_text(pdf_url, output_filepath=None):
    # Download the whole PDF into memory; the timeout avoids hanging on a dead server
    response=requests.get(pdf_url,timeout=30)
    response.raise_for_status()

    with BytesIO(response.content) as pdf_buffer:
        pdf_reader=PyPDF2.PdfReader(pdf_buffer)
        if pdf_reader.is_encrypted:
            raise ValueError("Research paper is password-protected. Decrypt or provide password.")
        text=""
        # enumerate() supplies the page index for the error message
        for page_index,page in enumerate(pdf_reader.pages):
            try:
                # extract_text() can return an empty result for image-only pages
                text+=page.extract_text() or ""
            except Exception as e:
                # Skip pages PyPDF2 cannot parse instead of aborting the whole paper
                print(f"Error extracting text from page {page_index+1}: {e}")
        # Optionally keep a plain-text copy of the extracted paper
        if output_filepath:
            with open(output_filepath,"w",encoding="utf-8") as f:
                f.write(text)
    return text
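
Text extracted from a PDF usually keeps the page layout: words are hyphenated across line breaks and every visual line ends in a newline. The following is a minimal sketch of a clean-up step that could run on the extracted text before summarizing. `clean_extracted_text` is a hypothetical helper, not part of the notebook above, and the regular expressions are simple heuristics rather than a complete solution.

```python
import re

def clean_extracted_text(text):
    """Tidy text extracted from a PDF (hypothetical helper, heuristic only)."""
    # Rejoin words hyphenated across a line break: "Trans-\nformer" -> "Transformer"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces but keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "The Trans-\nformer relies on   self-attention\nmechanisms.\n\nNext paragraph."
print(clean_extracted_text(sample))
```

One known limitation: a genuine compound that happens to break at the hyphen (for example "self-" at the end of a line followed by "attention") loses its hyphen, so this step trades a few such errors for fewer broken words overall.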

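##### Method to use OpenAI to generate the summary. I am using gpt-3.5-turbo-instruct, whose context size is 4096 tokens shared between the prompt and the completion, so longer papers must be truncated or the API returns an error. Note that the code checks `len(text)`, which counts characters rather than tokens, so the 4096-character cut-off is a conservative proxy for the token limit.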

In [3]:
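def summarize_with_openai(text, max_tokens=4096):
    # Keep the prompt within the model's limits.
    # Note: len(text) counts characters, not tokens, so max_tokens acts here as a
    # conservative character cut-off (a token usually spans several characters).
    if len(text) > max_tokens:
        print(f"Text exceeds length limit ({max_tokens} tokens). Truncating.")
        text=text[:max_tokens]
    prompt="Summarize the following research paper:\n"+text
    # The openai client expects an API key, e.g. via the OPENAI_API_KEY environment variable
    response=openai.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=600,  # upper bound on the length of the generated summary
        temperature=0.7,
        n=1,
        stop=None,
    )
    summary=response.choices[0].text.strip()
    return summary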
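
Truncating to the first 4096 characters means the model sees little more than the opening of the paper. One alternative is to split the text into chunks that each fit the context window, summarize every chunk, and then summarize the partial summaries. The sketch below illustrates that idea under stated assumptions: `split_into_chunks` and `summarize_long_text` are hypothetical helpers, the figure of about 4 characters per token is a rough heuristic for English text rather than a real token count, and the 50-token allowance for the instruction line is a guess. The demo uses a stand-in summarizer so it runs without an API key.

```python
def split_into_chunks(text, context_tokens=4096, completion_tokens=600,
                      prompt_overhead_tokens=50, chars_per_token=4):
    """Split text into pieces whose estimated size fits the prompt budget."""
    # Tokens left for the paper text once the completion and the instruction are reserved
    budget_tokens = context_tokens - completion_tokens - prompt_overhead_tokens
    if budget_tokens <= 0:
        raise ValueError("No room left for the input text.")
    max_chars = budget_tokens * chars_per_token  # heuristic, not a real token count

    chunks, current = [], ""
    for word in text.split():
        # Start a new chunk when adding this word would exceed the character budget
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(text, summarize, **chunk_kwargs):
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_into_chunks(text, **chunk_kwargs)
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]
    if len(partial) == 1:
        return partial[0]
    return summarize("\n".join(partial))


# Stand-in summarizer for the demo: keeps the first three words of its input
def fake_summarize(chunk):
    return " ".join(chunk.split()[:3])

demo_text = "word " * 50
demo_kwargs = dict(context_tokens=20, completion_tokens=5,
                   prompt_overhead_tokens=5, chars_per_token=4)
demo_chunks = split_into_chunks(demo_text, **demo_kwargs)
demo_summary = summarize_long_text(demo_text, fake_summarize, **demo_kwargs)
print(len(demo_chunks), "chunks")
print(demo_summary)
```

To use this with the function above, something like `summarize_long_text(text, lambda chunk: summarize_with_openai(chunk, max_tokens=len(chunk)))` would pass each chunk through without the character cut-off. Two caveats apply: text dense with formulas or citations may average fewer than 4 characters per token, so a lower `chars_per_token` is the safer choice, and for very long papers the joined partial summaries can themselves exceed the budget, in which case the final step would need another round of chunking.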

##### Running the two functions above on an arXiv URL. Change `pdf_url` to summarize any other paper; the summary is printed and saved to a .docx file.

In [5]:
# replace with your PDF URL
pdf_url="https://arxiv.org/pdf/1706.03762.pdf"
text=download_and_extract_text(pdf_url)
summary=summarize_with_openai(text)
print("Summary:")
print("\n")
print(summary)
print("\n\n")
print("************************************************************************************************")
print("Saving to a file")
doc=Document()
doc.add_paragraph(summary)
doc.save("research_paper_summarized.docx")

Text exceeds length limit (4096 tokens). Truncating.
Summary:
state-of-the-art sequence-to-sequence models
[9, 11, 6]. Most prominently, the Transformer [ 31] architecture combines a self-attention mechanism
with a convolutional architecture, achieving state-of-the-art results in machine translation and other
sequence tasks. This paper proposes a new model architecture, also based solely on attention
mechanisms, dispensing with recurrence and convolutions entirely. In contrast to the existing
approaches, our model, the Transformer, adopts a global attentional mechanism, where each output
position attends over all input positions. In the Transformer, each input or output position computes
an output as a weighted sum of the input positions, where the weight is determined by a compatibility
function of the two positions. The compatibility function is a dot product between the input and
output positions, divided by a factor that scales the dot product according to the dimensionality of the