# Creating a synthetic Q&A dataset

We use `davinci-instruct-beta-v3`, a model specialized in following instructions, to create questions based on a given context. We then use the same model to answer those questions, given the same context.

## 2.1 Read in the data & create a context

Create a context by concatenating the title, the section heading, and the text of each article section.

In [None]:
import pandas as pd

df = pd.read_csv("artificial_intelligence.csv")
df['context'] = df.title + "\n" + df.section + "\n\n" + df.text
df.head()

| Unnamed: 0 | title | section | text | ntokens | context |
|---|---|---|---|---|---|
| 0 | Artificial intelligence | Summary | Artificial intelligence (AI) is intelligence d... | 536 | Artificial intelligence\nSummary\n\nArtificial... |
| 1 | Artificial intelligence | History | Artificial beings with intelligence appeared a... | 1113 | Artificial intelligence\nHistory\n\nArtificial... |
| 2 | Artificial intelligence | Goals | The general problem of simulating (or creating... | 50 | Artificial intelligence\nGoals\n\nThe general ... |
| 3 | Artificial intelligence | Reasoning, problem-solving | Early researchers developed algorithms that im... | 120 | Artificial intelligence\nReasoning, problem-so... |
| 4 | Artificial intelligence | Knowledge representation | Knowledge representation and knowledge enginee... | 335 | Artificial intelligence\nKnowledge representat... |
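
To make the shape of the context string concrete, here is a minimal sketch that applies the same concatenation to a small in-memory DataFrame. The rows in `toy` are made-up stand-ins for `artificial_intelligence.csv`, not data from the original file. The second row deliberately has no text, to show that a missing field makes the whole context missing.

```python
import pandas as pd

# Hypothetical rows standing in for artificial_intelligence.csv.
toy = pd.DataFrame({
    "title": ["Artificial intelligence", "Artificial intelligence"],
    "section": ["Summary", "Goals"],
    "text": ["AI is intelligence demonstrated by machines.", None],
})

# Same concatenation as in the cell above:
# title, newline, section heading, blank line, body text.
toy["context"] = toy.title + "\n" + toy.section + "\n\n" + toy.text

# repr() makes the newline layout visible.
print(repr(toy.context[0]))

# A missing text value propagates, so the second context is missing too.
print(toy.context.isna().tolist())
```

Rows with a missing context are worth filtering out before they are sent to the model.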


## 2.2 Create questions based on the context

Use `davinci-instruct-beta-v3` to generate a number of plausible questions relating to the Wikipedia section contents.

Note: The `temperature` parameter is set to 0, but experimenting with a higher temperature may yield more diverse questions.

In [9]:
import openai
import os

# Read the API key from the environment instead of hard-coding it in the notebook.
openai.api_key = os.getenv("OPENAI_API_KEY")

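def get_questions(context):
    """Ask the model for a numbered list of questions about `context`."""
    response = openai.Completion.create(
        engine="davinci-instruct-beta-v3",
        # The prompt ends with "1." so the model continues a numbered list.
        prompt=f"Write questions based on the text below:\n\nText: {context}\n\nQ:\n1.",
        temperature=0.0,
        max_tokens=100,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["\n\n"],  # stop at the first blank line
    )

    return response.choices[0].text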

df['questions'] = df.context.apply(get_questions)
# The completion continues after the "1." in the prompt, so restore that prefix.
df['questions'] = "1." + df.questions
print(df.questions.iloc[0])

1. What are the goals of AI research?
2. What are some of the tools used in AI research?
3. What are the risks associated with artificial intelligence?
4. What is the definition of artificial intelligence?
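
The questions come back as a single numbered string, so downstream code may want them as a list. The following is a minimal sketch of one way to split such a block. `split_numbered` is a hypothetical helper, not part of the original notebook, and it assumes each question sits on its own line and starts with a number followed by a period.

```python
import re

def split_numbered(text):
    """Split a block like '1. foo\\n2. bar' into ['foo', 'bar']."""
    items = []
    for line in text.splitlines():
        # Accept optional leading whitespace, a number, a period, then the item text.
        match = re.match(r"\s*\d+\.\s*(.+)", line)
        if match:
            items.append(match.group(1).strip())
    return items

sample = (
    "1. What are the goals of AI research?\n"
    "2. What are some of the tools used in AI research?\n"
    "3. What are the risks associated with artificial intelligence?\n"
    "4. What is the definition of artificial intelligence?"
)

question_list = split_numbered(sample)
for question in question_list:
    print(question)
```

Lines that do not start with a number are skipped, so stray blank lines in the completion do not produce empty entries.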


## 2.3 Create answers based on the context

Use `davinci-instruct-beta-v3` to generate answers to the questions.

Note: The `temperature` parameter is set to 0, but experimenting with a higher temperature may yield more diverse answers.

In [10]:
def get_answers(row):
    response = openai.Completion.create(
        engine="davinci-instruct-beta-v3",
        prompt=f"Write answer based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
        temperature=0.0,
        max_tokens=100,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
    )

    return response.choices[0].text

df['answers'] = df.apply(get_answers, axis=1)
# The completion continues after the "1." in the prompt, so restore that prefix.
df['answers'] = "1." + df.answers
df = df.dropna().reset_index(drop=True)
print(df.answers.iloc[0])

1. The goals of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception, and the ability to move and manipulate objects.

2. AI research uses a variety of tools, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics, probability, and economics.

3. The risks associated with artificial intelligence include the possibility of existential risk to humanity, unemployment, and redundancies.

4. Artificial intelligence is intelligence demonstrated
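
The fourth answer above stops mid-sentence, which is consistent with the completion hitting the `max_tokens=100` limit. Below is a minimal sketch of pairing each question with its answer by number and flagging answers that look cut off. `pair_qa` and the `complete` flag are hypothetical, not part of the original notebook, and the check (an answer should end in sentence punctuation) is only a heuristic assumption.

```python
import re

# Matches lines such as "2. Some text" and captures the number and the text.
NUMBERED = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$", re.MULTILINE)

def pair_qa(questions, answers):
    """Pair numbered questions with numbered answers that share an index."""
    q = {int(n): text for n, text in NUMBERED.findall(questions)}
    a = {int(n): text for n, text in NUMBERED.findall(answers)}
    pairs = []
    for n in sorted(q):
        answer = a.get(n)
        # Heuristic: treat an answer as complete only if it ends a sentence.
        complete = answer is not None and answer.endswith((".", "!", "?"))
        pairs.append({"question": q[n], "answer": answer, "complete": complete})
    return pairs

questions = (
    "1. What are the goals of AI research?\n"
    "2. What are some of the tools used in AI research?\n"
    "3. What are the risks associated with artificial intelligence?\n"
    "4. What is the definition of artificial intelligence?"
)
answers = (
    "1. The goals of AI research include reasoning, knowledge representation, "
    "planning, learning, natural language processing, perception, and the "
    "ability to move and manipulate objects.\n\n"
    "2. AI research uses a variety of tools, including search and mathematical "
    "optimization, formal logic, artificial neural networks, and methods based "
    "on statistics, probability, and economics.\n\n"
    "3. The risks associated with artificial intelligence include the "
    "possibility of existential risk to humanity, unemployment, and "
    "redundancies.\n\n"
    "4. Artificial intelligence is intelligence demonstrated"
)

pairs = pair_qa(questions, answers)
for pair in pairs:
    print(pair["complete"], "|", pair["question"])
```

Pairs flagged as incomplete could be dropped or regenerated with a larger `max_tokens`. Note that this sketch only captures the first line of each answer, so answers that wrap across several lines would need a different pattern.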
