## Download

In [1]:
pip install "openai<1.0"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting aiohttp (from openai)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multidict<7.0,>=4.5 (from aiohttp->openai)
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting async-timeout<5.0,>=4.0.0a3 (from aiohttp->openai)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0 (from aiohttp->openai)
  Downloadin

## Import OpenAI

In [2]:
import openai
openai.api_key = "<YOUR_API_KEY>"

In [4]:
pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


## Define `extract_pdf_text` Function

In [5]:
from typing import List
from PyPDF2 import PdfReader


def extract_pdf_text(filepath: str) -> List[str]:
    """
    Extracts text from each page of a PDF file using PyPDF2 and returns it as a list of strings.

    Parameters:
    filepath (str): The path of the local PDF file to extract text from.

    Returns:
    List[str]: A list of strings containing the extracted text from each page of the PDF.
    """
    # The context manager closes the file even if extraction raises.
    with open(filepath, 'rb') as pdf_file:
        pdf_reader = PdfReader(pdf_file)
        # Iterate over the pages directly instead of indexing by page number.
        return [pdf_page.extract_text() for pdf_page in pdf_reader.pages]


### Scrape PDF

In [6]:
%%time
name = "/content/the-economic-potential-of-generative-ai-the-next-productivity-frontier-vf.pdf"
pdf_text_list = extract_pdf_text(name)
print(pdf_text_list)

['The economic potential of generative AI \nJune 2023The economic \npotential of generative AI \nThe next productivity frontier\nAuthors\nMichael ChuiEric HazanRoger RobertsAlex SinglaKate SmajeAlex SukharevskyLareina YeeRodney Zemmel', 'ii The economic potential of generative AI: The next productivity frontier', 'Contents\nKey insights\n3\nChapter 1: Generative AI  \nas a technology catalyst 4\nGlossary \n6\nChapter 2: Generative AI use \ncases across functions and industries8\nSpotlight: Retail and \nconsumer packaged goods 27\nSpotlight: Banking \n28Spotlight: Pharmaceuticals and medical products 30\nChapter 3: The generative \nAI future of work: Impacts on work activities, economic growth, and productivity 32\nChapter 4: Considerations  \nfor businesses and society 48\nAppendix \n53\n1 The economic potential of generative AI: The next productivity frontier', '2 The economic potential of generative AI: The next productivity frontier', '1. G enerative AI’s impact on \nproductivity co

### Create DataFrame

Build a one-column DataFrame with one row per PDF page.

In [8]:
import pandas as pd

In [9]:
df = pd.DataFrame(pdf_text_list, columns=['context'])
df.shape

(68, 1)

### Define Function: `get_questions`

In [10]:
def get_questions(context: str) -> str:
    """
    Given a text context, generates a numbered list of questions using OpenAI's GPT-3 API.

    Args:
    - context: A string representing the context for which questions should be generated.

    Returns:
    - A string containing the generated questions, without the leading "1." that the
      prompt already supplies, or an empty string if the API call failed.
    """

    try:
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v3",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=200,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"]
        )
        # Extract question text from the response
        question_text = response['choices'][0]['text']
        return question_text
    except Exception as e:
        # Print the error message and return an empty string if there was an error
        print(e)
        return ""

Run on real data

In [11]:
%%time
df['questions'] = df.context.apply(get_questions)
df['questions'] = "1." + df.questions
print(df[['questions']].values[0][0])

1. What is the potential of generative AI?
2. What are the benefits of generative AI?
3. What are the challenges of generative AI?
CPU times: user 613 ms, sys: 87.7 ms, total: 701 ms
Wall time: 1min 9s
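
Each entry in the `questions` column is a single string holding a numbered list, as in the output above. If individual questions are needed later, a minimal sketch of splitting such a string could look like this. The helper name `split_numbered_items` is hypothetical and not part of the original notebook.

```python
import re
from typing import List


def split_numbered_items(text: str) -> List[str]:
    """Split a string like '1. A?\\n2. B?' into ['A?', 'B?'].

    Hypothetical helper, not part of the original notebook.
    """
    items = []
    for line in text.splitlines():
        # Strip a leading "<number>." marker and surrounding whitespace.
        cleaned = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if cleaned:
            items.append(cleaned)
    return items


sample = (
    "1. What is the potential of generative AI?\n"
    "2. What are the benefits of generative AI?\n"
    "3. What are the challenges of generative AI?"
)
questions = split_numbered_items(sample)
print(len(questions))
print(questions[0])
```

A row where the API call failed contains only the `"1."` prefix, which this sketch turns into an empty list.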


### Define Function: `get_answers`

In [13]:
def get_answers(row: pd.Series) -> str:
    """
    Given a dataframe row containing context and questions, generates answers using OpenAI's GPT-3 API.

    Args:
    - row: A pandas Series (one dataframe row) with 'context' and 'questions' entries.

    Returns:
    - A string containing the answers generated by the API, or an empty string on error.
    """

    try:
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v3",
            prompt=f"Write answer (limit to 1 paragraph) based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        # Extract answer text from the response
        answer_text = response['choices'][0]['text']
        return answer_text
    except Exception as e:
        # Print the error message and return an empty string if there was an error
        print(e)
        return ""

Run on real data

In [14]:
%%time
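df['answers'] = df.apply(get_answers, axis=1)
# get_answers returns "" on failure (never NaN), so filter on the empty string
# before prefixing "1."; dropna() would not remove the failed rows.
df = df[df['answers'] != ""].reset_index(drop=True)
df['answers'] = "1." + df['answers']
print(df['answers'].iloc[0])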

This model's maximum context length is 2049 tokens, however you requested 2397 tokens (1897 in your prompt; 500 for the completion). Please reduce your prompt; or completion length.
This model's maximum context length is 2049 tokens, however you requested 2399 tokens (1899 in your prompt; 500 for the completion). Please reduce your prompt; or completion length.
1. The potential of generative AI is vast. It has the ability to create new products and services, and to drive innovation.
2. The benefits of generative AI are many. It can help businesses to be more productive and efficient, and to create new products and services. It can also help to improve decision-making and to boost innovation.
3. The challenges of generative AI are also many. It can be difficult to manage and control, and businesses need to be sure that they have the resources to implement it effectively.
CPU times: user 831 ms, sys: 93 ms, total: 924 ms
Wall time: 1min 47s
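
The two error messages above come from pages whose prompt plus the 500-token completion exceeds the model's 2049-token limit. One possible workaround is to trim the context before building the prompt. The sketch below is an assumption-laden illustration, not the notebook's method: the helper `truncate_context`, the 150-token overhead allowance, and the 4-characters-per-token ratio are all hypothetical, and a real tokenizer would give more accurate counts than this character heuristic.

```python
MODEL_CONTEXT_TOKENS = 2049   # limit reported in the error messages
COMPLETION_TOKENS = 500       # max_tokens passed in get_answers
CHARS_PER_TOKEN = 4           # assumption: rough average for English text


def truncate_context(context: str, overhead_tokens: int = 150) -> str:
    """Trim context so prompt + completion roughly fits the model limit.

    overhead_tokens is a guessed allowance for the instructions and questions.
    """
    budget_tokens = MODEL_CONTEXT_TOKENS - COMPLETION_TOKENS - overhead_tokens
    max_chars = max(budget_tokens, 0) * CHARS_PER_TOKEN
    if len(context) <= max_chars:
        return context
    # Cut at the last space before the limit to avoid splitting a word.
    cut = context.rfind(" ", 0, max_chars)
    return context[: cut if cut > 0 else max_chars]


long_page = "word " * 5000
trimmed = truncate_context(long_page)
print(len(long_page), len(trimmed))
```

Because the estimate is approximate, a trimmed page can still exceed the limit for unusually token-dense text, so the error handling in `get_answers` remains useful.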


In [16]:
df.head()

| | context | questions | answers |
|---|---|---|---|
| 0 | The economic potential of generative AI \nJune... | 1. What is the potential of generative AI?\n2.... | 1. The potential of generative AI is vast. It ... |
| 1 | ii The economic potential of generative AI: Th... | 1. What is generative AI?\n2. What are the ben... | 1. Generative AI is a type of AI that is able ... |
| 2 | Contents\nKey insights\n3\nChapter 1: Generati... | 1. What are the key insights of the text?\n2. ... | 1. The key insights of the text are that gener... |
| 3 | 2 The economic potential of generative AI: The... | 1. What is generative AI?\n2. What are the ben... | 1. Generative AI is a type of AI that is able ... |
| 4 | 1. G enerative AI’s impact on \nproductivity c... | 1. What are the four areas in which generative... | 1. The four areas in which generative AI has t... |


## Save DataFrame

In [15]:
df.to_csv('mckinsey_gen_ai.csv')