Define a function to extract text from a PDF.

In [1]:
from typing import List
from PyPDF2 import PdfReader


def extract_pdf_text(filepath: str) -> List[str]:
    """
    Extracts text from each page of a PDF file using PyPDF2 and returns it as a list of strings.
    
    Parameters:
    filepath (str): The path of the PDF file to extract text from.
    
    Returns:
    List[str]: A list of strings containing the extracted text from each page of the PDF.
    """
    # The context manager closes the file even if extraction raises an error
    with open(filepath, 'rb') as pdf_file:
        pdf_reader = PdfReader(pdf_file)
        # Collect the extracted text of each page, in page order
        text_list = [page.extract_text() for page in pdf_reader.pages]

    return text_list


Test the function.

In [2]:
%%time
pdf_text_list = extract_pdf_text('2022.pdf')
print(pdf_text_list)


CPU times: total: 3.03 s
Wall time: 15.7 s


Print a page.

In [3]:
print(pdf_text_list[5])

Index
FORWARD-LOOKING STATEMENTS
In this Annual Report, the Company makes, and from time to time may otherwise make in its public filings, press releases and discussions by Company
management, forward-looking statements concerning the Company’s operations, performance and financial condition, as well as its strategic objectives. Some of
these forward-looking statements relate to future events and expectations and can be identified by the use of forward-looking words such as “believes”, “expects”,
“may”, “will”, “should”, “seeks”, “approximately”, “intends”, “plans”, “estimates”, or “anticipates” or the negative of those words or other comparable
terminology. Such forward-looking statements speak only as of the time they are made and are subject to various risks and uncertainties and the Company claims
the protection afforded by the safe harbor for forward-looking statements contained in the Private Securities Litigation Reform Act of 1995. Actual results could
differ materially from th

Define functions to create a question-answer dataframe.

In [4]:
from typing import List
import pandas as pd
import openai

In [5]:
openai.api_key = "ENTER YOUR API KEY HERE"

Create the dataframe.

In [6]:
df = pd.DataFrame(pdf_text_list)
df.columns = ['context']
df.shape

(307, 1)

The function takes a single argument, `context`, a string containing the text for which questions should be generated. It returns a string containing the numbered questions generated by the API.

The function uses `try` and `except` to catch any errors that might occur while interacting with the API. It sends a POST request to the OpenAI Completion API using the `openai.Completion.create()` method, passing in various parameters such as the GPT-3 engine to use, the prompt to use (which includes the context and a placeholder for the question), and settings for temperature, max tokens, and penalties.

If the request is successful, the function extracts the question text from the response dictionary and returns it. If there was an error, the function returns an empty string.

The function has type hints for its argument and return value, and includes a docstring that describes what the function does, what argument it takes, and what it returns.

In [7]:
def get_questions(context: str) -> str:
    """
    Given a text context, generates a numbered list of questions using OpenAI's GPT-3 API.

    Args:
    - context: A string representing the context for which questions should be generated.

    Returns:
    - A string containing the numbered questions generated by the API.
    """
    
    try:
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v3",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=200,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"]
        )
        # Extract question text from the response
        question_text = response['choices'][0]['text']
        return question_text
    except Exception:
        # Return an empty string if there was an error
        return ""
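
To see why the prompt ends with `1.`, it can help to build the prompt on its own, without calling the API. The sketch below is only an illustration: `build_question_prompt` is a hypothetical helper that is not part of the notebook, and the sample text is made up. Because the prompt already ends with `1.`, the completion starts with the text of the first question, so the `1.` has to be added back when the result is stored.

```python
def build_question_prompt(context: str) -> str:
    """Hypothetical helper: returns the same prompt string that get_questions sends."""
    # The trailing "1." nudges the model to continue a numbered list of questions
    return f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1."


# Made-up sample text, used only to inspect the prompt layout
prompt = build_question_prompt("Acme Corp was founded in 1990.")
print(prompt)
```

Printing the prompt this way is a cheap check of the layout before spending API calls on every page.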

Run on real data.

In [8]:
%%time
df['questions'] = df.context.apply(get_questions)
df['questions'] = "1." + df.questions
print(df[['questions']].values[0][0])

1. What is the name of the company that is the subject of the text?
2. What is the state of incorporation of the company?
3. What is the company's Employer Identification Number?
4. What is the company's principal executive office address?
5. Which exchange is the company's Common Stock listed on?
6. Are any of the company's securities registered pursuant to Section 12(b) of the Securities Exchange Act of 1934?
7. Is the company a well-known seasoned issuer, as defined in Rule 405 of the Securities Act?
8. Has the company filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months?
9. Has the company submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T during the preceding 12 months?
10. Is the company a large accelerated filer, an accelerated filer, a
CPU times: total: 219 ms
Wall time: 6min 22s
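
The output above is one string with a question per numbered line, and the last item is cut off where the completion hit `max_tokens`. If individual questions are needed, the string could be split as in the sketch below. `split_questions` is a hypothetical helper, not part of the notebook, and dropping a final item that does not end with `?` is only a heuristic for detecting truncation.

```python
import re
from typing import List


def split_questions(text: str) -> List[str]:
    """Hypothetical helper: split a '1. ...' style string into a list of questions."""
    questions = []
    for line in text.splitlines():
        # Match lines such as "3. What is ...?" and keep only the text after the number
        match = re.match(r"\s*\d+\.\s*(.*)", line)
        if match and match.group(1).strip():
            questions.append(match.group(1).strip())
    # Heuristic: a last item without a question mark was probably cut off by the token limit
    if questions and not questions[-1].endswith("?"):
        questions.pop()
    return questions


sample = "1. What is the name of the company?\n2. Is the company a large accelerated filer, a"
print(split_questions(sample))
```

A string that contains only `1.`, which is what a failed API call leaves behind after the prefix is added, yields an empty list.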


The function takes a single argument, `row`, a row of the dataframe (a pandas Series when the function is applied with `axis=1`) containing the 'context' and 'questions' columns. It returns a string containing the answers generated by the API.

The function uses `try` and `except` to catch any errors that might occur while interacting with the API. It sends a POST request to the OpenAI Completion API using the `openai.Completion.create()` method, passing in various parameters such as the GPT-3 engine to use, the prompt to use (which includes the context, the question, and a placeholder for the answer), and settings for temperature, max tokens, and penalties.

If the request is successful, the function extracts the answer text from the response dictionary and returns it. If there was an error, the function prints the error message and returns an empty string.

The function has type hints for its argument and return value, and includes a docstring that describes what the function does, what argument it takes, and what it returns.

In [9]:
def get_answers(row: pd.Series) -> str:
    """
    Given a dataframe row containing context and questions, generates an answer using OpenAI's GPT-3 API.

    Args:
    - row: A pandas dataframe row containing 'context' and 'questions' columns.

    Returns:
    - A string containing the answer generated by the API.
    """
    
    try:
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v3",
            prompt=f"Write answer (limit to 1 paragraph) based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=800,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        # Extract answer text from the response
        answer_text = response['choices'][0]['text']
        return answer_text
    except Exception as e:
        # Print the error message and return an empty string if there was an error
        print(e)
        return ""

Run on real data.

In [None]:
%%time
df['answers'] = df.apply(get_answers, axis=1)
df['answers'] = "1." + df.answers
# Note: dropna() removes only NaN/None; rows where get_answers returned "" (now "1.") are kept
df = df.dropna().reset_index(drop=True)
print(df[['answers']].values[0][0])
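
Both columns now hold numbered strings, so one way to get individual question-answer pairs is to align the items by their number. The sketch below assumes that answer `n` corresponds to question `n`, which the prompt encourages but does not guarantee. `parse_numbered` and `pair_qa` are hypothetical helpers that are not part of the notebook.

```python
import re
from typing import Dict, List, Optional, Tuple

_ITEM = re.compile(r"^\s*(\d+)\.\s*(.*)$")


def parse_numbered(text: str) -> Dict[int, str]:
    """Hypothetical helper: map item number -> item text for a '1. ...' style string."""
    items: Dict[int, str] = {}
    current: Optional[int] = None
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            current = int(match.group(1))
            items[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # A non-numbered line is treated as a continuation of the previous item
            items[current] = (items[current] + " " + line.strip()).strip()
    return items


def pair_qa(questions: str, answers: str) -> List[Tuple[str, str]]:
    """Pair questions and answers that share a number; skip empty or unmatched items."""
    q = parse_numbered(questions)
    a = parse_numbered(answers)
    return [(q[n], a[n]) for n in sorted(q) if n in a and q[n] and a[n]]


pairs = pair_qa("1. Who?\n2. Where?\n3. Wh", "1. Alice.\n2. Paris.")
print(pairs)
```

Aligning by number rather than by position means an unanswered or missing item does not shift the remaining pairs.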

Save

In [12]:
df.to_csv('tmp_output_ar.csv')