A simple proof of concept showing that we can extract key information from the PDF of a research paper. We demonstrate extraction of the title, abstract, and first section.

Recovering every section through OCR turns out to be tricky. Figures and other visual components are not recoverable, and math is rendered incorrectly, or not at all, depending on the formatting.

In [1]:
import PyPDF2
import pdfplumber

import re

import pytesseract
from pdf2image import convert_from_path

def extract_text_from_pdf(pdf_path, max_pages=10):
    # Render each page to an image and OCR it; only the first max_pages pages are read
    images = convert_from_path(pdf_path)
    return ''.join(pytesseract.image_to_string(image) for image in images[:max_pages])

def extract_title(text):
    lines = text.split('\n')
    for line in lines:
        if line.strip():
            return line.strip()
    return None

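def extract_abstract(text):
    # Assumes an "Abstract" heading followed by a blank line, with "1 Introduction" as the next heading
    abstract_start = 'Abstract\n\n'
    abstract_end = '\n\n1 Introduction'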

    abstract_regex = rf"{abstract_start}(.+?){abstract_end}"
    match = re.search(abstract_regex, text, re.S)
    
    if match:
        abstract = match.group(1).strip()
        abstract = re.sub(r'\s+', ' ', abstract)  # Collapse runs of whitespace (including newlines) into single spaces
        return abstract
    else:
        return None

def extract_section(text, section_number):
    lines = text.split('\n')
    section_start_pattern = f"^{section_number} "
    section_end_pattern = f"^{section_number + 1} "

    section_started = False
    section_content_lines = []

    for line in lines:
        if re.match(section_end_pattern, line):
            break

        if section_started:
            section_content_lines.append(line)

        if re.match(section_start_pattern, line):
            section_started = True

    section_content = " ".join(section_content_lines)
    section_content = re.sub(r'\s+', ' ', section_content)  # Collapse runs of whitespace (including newlines) into single spaces

    return section_content if section_content else None

def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

This just extracts the text, with no regard for formatting.

In [2]:
pdf_path = "Validation_Papers/example.pdf"
text = extract_text_from_pdf(pdf_path)
print("Number of characters in text")
print(len(text))

48381
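
Raw OCR output keeps the line wrapping of the page, including words hyphenated across line breaks. A minimal cleanup sketch follows; clean_ocr_text is a hypothetical helper, not part of this notebook, and it assumes paragraphs are separated by blank lines. It cannot tell a wrapped word from a genuine hyphenated compound that happens to break at the end of a line.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Hypothetical helper: undo common line-wrapping artifacts in OCR output."""
    # Re-join words hyphenated across a line break: "extrac-\ntion" -> "extraction"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn a single newline inside a paragraph into a space; blank lines stay as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Because this removes single line breaks, it is safer to apply it to the pieces returned by the extraction helpers than to the full text they search.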


If you have the LaTeX source of the paper, you can read it in directly instead.

In [2]:
text_file_path = 'Validation_Papers/example.tex'
text = read_text_file(text_file_path)
print("Number of characters in text")
print(len(text))

63029
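
The LaTeX source is longer than the OCR text partly because it carries comments and a preamble. One way to trim it before sending it to a model is sketched below. latex_body is a hypothetical helper; it assumes comments start at an unescaped % and that the body sits between \begin{document} and \end{document}. It does not handle verbatim environments or a % that follows a \\ line break.

```python
import re


def latex_body(source: str) -> str:
    """Hypothetical helper: keep only the document body, with comments removed."""
    # Drop comments: an unescaped % through the end of the line (\% is a literal percent sign)
    no_comments = re.sub(r"(?<!\\)%.*", "", source)
    # Keep what lies between \begin{document} and \end{document}, if both are present
    match = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", no_comments, re.S)
    return (match.group(1) if match else no_comments).strip()
```

The \title command often sits in the preamble, so read the title from the full source before trimming.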


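If you trust the format of the paper, we can extract the title, abstract, and section text. The helpers assume the plain-text layout produced by OCR, so on the LaTeX source loaded above they return the first line of the file as the title and find no abstract or section. You will most likely need to add the abstract manually.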

In [14]:
title = extract_title(text)
abstract = extract_abstract(text)

print("Title:")
print(title)
print("\nAbstract:")
print(abstract)

section_1_text = extract_section(text, 1)
print("\nSection 1:")
print(section_1_text)

Title:
\documentclass{article}

Abstract:
None

Section 1:
None
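
The output above shows the OCR-oriented helpers failing on LaTeX input. A minimal sketch of LaTeX-aware equivalents follows. All three functions are hypothetical and assume the paper uses the standard \title{...}, abstract environment, and \section{...} markup; escaped braces inside the title and section headings containing nested braces are not handled.

```python
import re


def braced_argument(source: str, command: str):
    """Return the brace-delimited argument of the first use of the command, or None."""
    start = source.find("\\" + command + "{")
    if start == -1:
        return None
    i = start + len(command) + 2  # first character after the opening brace
    depth, begin = 1, i
    while i < len(source):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[begin:i].strip()
        i += 1
    return None  # unbalanced braces


def latex_abstract(source: str):
    """Return the text of the abstract environment with whitespace collapsed, or None."""
    match = re.search(r"\\begin\{abstract\}(.*?)\\end\{abstract\}", source, re.S)
    return re.sub(r"\s+", " ", match.group(1)).strip() if match else None


def latex_sections(source: str):
    """Map each section heading to the text that follows it, in document order."""
    parts = re.split(r"\\section\*?\{([^{}]*)\}", source)
    # parts is [text before first section, heading 1, body 1, heading 2, body 2, ...]
    return {
        heading.strip(): re.sub(r"\s+", " ", body).strip()
        for heading, body in zip(parts[1::2], parts[2::2])
    }
```

With these, braced_argument(text, "title") and latex_abstract(text) would stand in for extract_title and extract_abstract. The body of the last section runs to the end of the file, so it still includes anything after it, such as the bibliography and \end{document}.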


In the next part we experiment with the OpenAI API. You need to provide a key of your own.

In [8]:
import openai
import tiktoken

def num_tokens_from_string(string: str, encoding_name = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Provide your key here
openai.api_key = ""

use_model="gpt-3.5-turbo"

First, let's calculate the number of tokens in our paper

In [5]:
num_tokens_from_string(text)

15247

The GPT-4 API has a context size of 8k tokens, roughly enough to nearly fit an entire NeurIPS-formatted paper and expect a reasonable summary of it. This paper, at 15,247 tokens, does not fit in a single request, and we also need room for a system prompt that communicates the current objective to GPT. We want something simple but effective, so we make a single pass over the paper:

1. GPT will take parts of the paper as input and take notes relevant to a system-prompt describing the task + abstract + rubric
2. GPT will then organize these notes
3. GPT will write a review using a system-prompt describing NeurIPS style reviewer guidelines
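
The three steps above can be sketched independently of any particular API by passing the model call in as a function. Everything here is an illustrative assumption: review_paper, split_into_chunks, and the (system_prompt, user_text) -> reply signature of llm are hypothetical names, and chunks are cut by character count for simplicity.

```python
from typing import Callable, List

# Assumed signature of the model call: (system_prompt, user_text) -> reply text
LLM = Callable[[str, str], str]


def split_into_chunks(text: str, chunk_chars: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def review_paper(paper: str, abstract: str, llm: LLM, note_prompt: str,
                 organize_prompt: str, review_prompt: str,
                 chunk_chars: int = 12000) -> str:
    # Step 1: take notes on each chunk of the paper independently
    notes = [llm(note_prompt, chunk) for chunk in split_into_chunks(paper, chunk_chars)]
    # Step 2: organize the concatenated notes
    organized = llm(organize_prompt, "\n".join(notes))
    # Step 3: write the review from the abstract plus the organized notes
    return llm(review_prompt, abstract + "\n" + organized)
```

Injecting llm keeps the flow testable with a stub function before any API calls are spent on it.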

In [5]:
# Load note_instructions
print("Token lengths")
note_prompt = read_text_file("Note_Instruction.txt")
print(num_tokens_from_string(note_prompt))

# Load note organization instructions
organize_prompt = read_text_file("Organize_Notes.txt")
print(num_tokens_from_string(organize_prompt))

# Load reviewer guidelines
review_guidelines = read_text_file("NeurIPS_Guidelines.txt")
print(num_tokens_from_string(review_guidelines))

364
337
1638
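
As a rough worked example of the token budget, the sketch below estimates how many requests the note-taking step needs. The paper length (15,247 tokens) and note prompt length (364 tokens) come from the outputs above; treating "8k" as 8192 tokens and reserving 1000 tokens for the reply are assumptions, and the small per-message overhead of the chat format is ignored. chunks_needed is a hypothetical helper.

```python
import math


def chunks_needed(paper_tokens: int, prompt_tokens: int,
                  context_limit: int, reply_tokens: int) -> int:
    """Requests needed if each one carries the prompt, one chunk, and room for the reply."""
    room_for_paper = context_limit - prompt_tokens - reply_tokens
    if room_for_paper <= 0:
        raise ValueError("prompt and reply reservation already fill the context window")
    return math.ceil(paper_tokens / room_for_paper)


# 8192 - 364 - 1000 = 6828 tokens of paper per request; ceil(15247 / 6828) = 3
print(chunks_needed(15247, 364, 8192, 1000))  # 3
```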


In [6]:
# Prompt scaffold for note taking; current_notes optionally carries earlier notes as context
def notes(text, prompt, current_notes = ''):
    
    response = openai.ChatCompletion.create(
      model=use_model,
      messages=[
            {"role": "system", "content": "You are a helpful assistant. " + prompt},
            {"role": "user", "content": "\n Text: " + text + "\n Context: " + current_notes + "\n Notes: "},
        ]
    )

    summary = response['choices'][0]['message']['content']
    return summary

# Prompt scaffold for writing the final review
def review(text, prompt):
    
    response = openai.ChatCompletion.create(
      model=use_model,
      messages=[
            {"role": "system", "content": "You are a helpful assistant. " + prompt},
            {"role": "user", "content": "\n Text: " + text + "\n 1. Summary and contributions: "},
        ]
    )

    summary = response['choices'][0]['message']['content']
    return summary

The code needs to be babysat because the API is inconsistent. If it throws an error in the middle of aggregating notes, comment out the line that resets the "all_notes" variable and set the chunk variable to the chunk that failed before rerunning the cell.
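
One way to reduce the babysitting is to retry a failed call automatically with exponential backoff. The sketch below is an assumption about how that could look, not something this notebook does; with_retries is a hypothetical helper, and it catches every Exception for brevity where real code would catch only the API's transient error types.

```python
import time


def with_retries(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); after a failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)
```

In the loop below, the call to notes could then be wrapped as with_retries(lambda: notes(paper_text, note_prompt)).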

In [11]:
chunk = 1  # 1-based index of the next chunk; set this to the failed chunk when resuming
context_size = 12000  # chunk size in characters, not tokens
#context_size = 15000
done = False  # named "done" so the builtin exit() is not shadowed
all_notes = ''  # comment this line out when resuming so earlier notes are kept
print("Length of paper: " + str(len(text)))
while not done:
    print("Current Chunk: " + str(chunk))
    print("Progress: " + str((chunk-1)*context_size) + " out of " + str(len(text)))
    print("Length of notes: " + str(len(all_notes)))
    start = context_size*(chunk-1)
    # Slicing past the end of a string is safe, so the last chunk needs no special case
    paper_text = text[start:start + context_size]
    #print(num_tokens_from_string(note_prompt + paper_text))
    paper_notes = notes(paper_text, note_prompt)
    #print(paper_notes)
    done = start + context_size >= len(text)
    chunk += 1
    all_notes += paper_notes

Length of paper: 63029
Current Chunk: 3
Progress: 24000 out of 63029
Length of notes: 4072
Current Chunk: 4
Progress: 36000 out of 63029
Length of notes: 5591
Current Chunk: 5
Progress: 48000 out of 63029
Length of notes: 5625
Current Chunk: 6
Progress: 60000 out of 63029
Length of notes: 7104
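
The run above was resumed by hand from chunk 3. A small checkpoint file could make resuming automatic; the sketch below is one possible approach under stated assumptions. load_checkpoint, save_checkpoint, and the JSON layout with "next_chunk" and "all_notes" keys are all hypothetical.

```python
import json
import os


def load_checkpoint(path):
    """Return (next_chunk, all_notes); start at chunk 1 with empty notes if no file exists."""
    if not os.path.exists(path):
        return 1, ""
    with open(path, "r", encoding="utf-8") as f:
        state = json.load(f)
    return state["next_chunk"], state["all_notes"]


def save_checkpoint(path, next_chunk, all_notes):
    """Write to a temporary file first so an interrupted write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"next_chunk": next_chunk, "all_notes": all_notes}, f)
    os.replace(tmp, path)
```

The loop would call load_checkpoint once before starting and save_checkpoint after each chunk's notes are appended, so a crash loses at most the chunk in flight.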


In [1]:
print("Organizing Notes")
organized_notes = notes(all_notes, organize_prompt)
print("Generating Review")
# abstract is None when extract_abstract found no match, so fall back to an empty string
review_text = review((abstract or "") + "\n" + organized_notes, review_guidelines)
print(review_text)

Organizing Notes


NameError: name 'notes' is not defined