# Get questions answered from documentation using the GPT API

**TLDR**: This notebook shows how to use GPT to answer questions about a large document that has a table of contents (index). It works in two steps: (1) for a given question, ask GPT which section or chapter of the index is most likely to contain the answer; (2) fetch the content of that section or chapter and include it in the prompt as context, so that GPT can draw on it when answering.


**Motivation**

With the availability of the GPT API, we can get natural-language answers from data crawled by OpenAI. However, to get answers from private documents, we need to either fine-tune the GPT model on those documents or include them in the context of the question. If the document is too large for the context (maximum 4096 tokens, i.e. at most ~3000 words including the answer), we first need to find the sections of the document that may contain the answer. There are several ways to do this:

**Embeddings**


1.   Split the document into chunks
2.   Embed these text chunks using GPT
3.   Embed the question and use as context the chunk(s) with the maximum cosine similarity (that is, the minimum cosine distance) between chunk and question embeddings

See https://platform.openai.com/docs/tutorials/web-qa-embeddings
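
The retrieval step in the list above can be sketched in a few lines. This is a minimal illustration and not the code from the linked tutorial: the function names are hypothetical, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_chunk(question_embedding, chunk_embeddings):
    """Return the index of the chunk whose embedding is most similar to the question."""
    scores = [cosine_similarity(question_embedding, c) for c in chunk_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vectors standing in for real embeddings (invented for illustration)
chunk_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
question_embedding = [0.9, 0.1, 0.0]

best = most_similar_chunk(question_embedding, chunk_embeddings)
print(best)
```

The chunk at the returned index would then be pasted into the prompt as context.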

**Using the table of contents/index (this notebook)**

If the document is structured into chapters, sections and sub-sections, we can ask GPT, for a given question, which section has the highest probability of containing the answer. Then we can include the text of that section in a second prompt, which helps GPT answer the user's question.

Since we cannot force GPT to answer a multiple-choice question, we instead use the likelihood it assigns to each section and pick the one with the maximum likelihood.

**Data**

As an example we use the OpenAI API documentation. We extract the content and table of contents, then we ask questions.

https://platform.openai.com/docs/api-reference

To download the html, use a headless browser such as:

google-chrome --headless --dump-dom 'https://platform.openai.com/docs/api-reference' > ~/api_reference.html

In [66]:
# Helper class where we store the table of contents and the content of each section
class TextSection:
  def __init__(self, name, text, sub_sections = None):
    self.name = name
    self.text = text
    self.sub_sections = sub_sections
  
  def get_index_text(self, level=0):
    index_text = " "*level + self.name
    if self.sub_sections:
      return  '\n'.join([index_text]+[s_s.get_index_text(level+1) for s_s in self.sub_sections])
    else:
      return index_text
  
  def flatten(self):
    sub_section_list = [self]
    if self.sub_sections:
      for s_s in self.sub_sections:
        sub_section_list += s_s.flatten()
    return sub_section_list
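
To make the two helper methods concrete, here is a small usage sketch. It repeats a compact copy of the class so that the snippet runs on its own; the two-level "Manual" document is hypothetical and invented for illustration.

```python
class TextSection:
    # Compact copy of the helper class above so this snippet runs on its own
    def __init__(self, name, text, sub_sections=None):
        self.name, self.text, self.sub_sections = name, text, sub_sections

    def get_index_text(self, level=0):
        # One line per section, indented by one space per nesting level
        lines = [" " * level + self.name]
        for child in self.sub_sections or []:
            lines.append(child.get_index_text(level + 1))
        return "\n".join(lines)

    def flatten(self):
        # Depth-first list of this section followed by all of its descendants
        sections = [self]
        for child in self.sub_sections or []:
            sections += child.flatten()
        return sections

# Hypothetical two-level document
manual = TextSection("Manual", "", [
    TextSection("Setup", "How to install.", [TextSection("Install", "Run the installer.")]),
    TextSection("Usage", "How to use."),
])

index_text = manual.get_index_text()
names = [s.name for s in manual.flatten()]
print(index_text)
print(names)
```

`get_index_text` produces the indented table of contents that goes into the prompt, while `flatten` gives the list of candidate sections to score.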


In [67]:
# Using BeautifulSoup to extract the text from the HTML

from pathlib import Path
from bs4 import BeautifulSoup
filename = "api_reference.html"
html = Path(filename).read_text()
soup = BeautifulSoup(html, features="html.parser")
docs_body = soup.find('div','docs-body')

api_reference = TextSection("OpenAI API reference", "", [])

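# Each child of the first element in the docs body is treated as one section
for i in list(list(docs_body.children)[0].children):
  text_section = TextSection(i.h2.get_text(), i.get_text(), [])
  # The second CSS class separates top-level sections ('md') from endpoints ('endpoint')
  if i['class'][1] == 'md':
    api_reference.sub_sections.append(text_section)
  # Endpoints are attached to the most recent top-level section
  elif i['class'][1] == 'endpoint':
    api_reference.sub_sections[-1].sub_sections.append(text_section)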

In [68]:
# Table of contents (index)
print(api_reference.get_index_text())

OpenAI API reference
 Introduction
 Authentication
 Making requests
 Models
  List models
  Retrieve model
 Completions
  Create completion
 Chat
  Create chat completionBeta
 Edits
  Create edit
 Images
  Create imageBeta
  Create image editBeta
  Create image variationBeta
 Embeddings
  Create embeddings
 Audio
  Create transcriptionBeta
  Create translationBeta
 Files
  List files
  Upload file
  Delete file
  Retrieve file
  Retrieve file content
 Fine-tunes
  Create fine-tune
  List fine-tunes
  Retrieve fine-tune
  Cancel fine-tune
  List fine-tune events
  Delete fine-tune model
 Moderations
  Create moderation
 Engines
  List enginesDeprecated
  Retrieve engineDeprecated
 Parameter details


In [99]:
# Text content of the first section (Introduction)
print(api_reference.sub_sections[0].text)

IntroductionYou can interact with the API through HTTP requests from any language, via our official Python bindings, our official Node.js library, or a community-maintained library.To install the official Python bindings, run the following command:pip install openaiTo install the official Node.js library, run the following command in your Node.js project directory:npm install openai


In [69]:
# Flattened version of the sections
[a.name for a in api_reference.flatten()]

['OpenAI API reference',
 'Introduction',
 'Authentication',
 'Making requests',
 'Models',
 'List models',
 'Retrieve model',
 'Completions',
 'Create completion',
 'Chat',
 'Create chat completionBeta',
 'Edits',
 'Create edit',
 'Images',
 'Create imageBeta',
 'Create image editBeta',
 'Create image variationBeta',
 'Embeddings',
 'Create embeddings',
 'Audio',
 'Create transcriptionBeta',
 'Create translationBeta',
 'Files',
 'List files',
 'Upload file',
 'Delete file',
 'Retrieve file',
 'Retrieve file content',
 'Fine-tunes',
 'Create fine-tune',
 'List fine-tunes',
 'Retrieve fine-tune',
 'Cancel fine-tune',
 'List fine-tune events',
 'Delete fine-tune model',
 'Moderations',
 'Create moderation',
 'Engines',
 'List enginesDeprecated',
 'Retrieve engineDeprecated',
 'Parameter details']

In [None]:
!pip install openai

In [47]:
# Insert your own API key
import openai
import getpass

openai.api_key = getpass.getpass()

··········


In [90]:
# Question
question = "How can I get to known all the models available in the API?"

# Part 1: Prompt to find which section answers the question

In [91]:
prompt = f"This is a list of sections containing the API reference of OpenAI: \n{api_reference.get_index_text()}. \nProvided the previous list of sections. Which section can better answer the following question: \n\"{question}\"? \nThe section:"


In [92]:
print(prompt)

This is a list of sections containing the API reference of OpenAI: 
OpenAI API reference
 Introduction
 Authentication
 Making requests
 Models
  List models
  Retrieve model
 Completions
  Create completion
 Chat
  Create chat completionBeta
 Edits
  Create edit
 Images
  Create imageBeta
  Create image editBeta
  Create image variationBeta
 Embeddings
  Create embeddings
 Audio
  Create transcriptionBeta
  Create translationBeta
 Files
  List files
  Upload file
  Delete file
  Retrieve file
  Retrieve file content
 Fine-tunes
  Create fine-tune
  List fine-tunes
  Retrieve fine-tune
  Cancel fine-tune
  List fine-tune events
  Delete fine-tune model
 Moderations
  Create moderation
 Engines
  List enginesDeprecated
  Retrieve engineDeprecated
 Parameter details. 
Provided the previous list of sections. Which section can better answer the following question: 
"How can I get to known all the models available in the API?"? 
The section:


In [93]:
# The open-ended answer is good, but we need a structured one
openai.Completion.create(model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=20).choices[0].text

' "Models" \nList models'

In [94]:
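# Score every section by appending its name to the prompt as the answer.
# With max_tokens=0 and echo=True nothing is generated; the API returns the log-probabilities of the prompt tokens.
section_likelihoods = []
for section in api_reference.flatten():
  completion = openai.Completion.create(model="text-davinci-003", prompt=prompt+section.name, temperature=0, max_tokens=0, echo=True, logprobs=1)
  # Skip the first token, which has no log-probability, and sum the rest
  section_likelihoods.append(sum(completion.choices[0].logprobs.token_logprobs[1:]))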
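
The loop above sums the log-probabilities of the whole prompt plus the section name. Because the prompt is the same for every candidate, the ranking is largely driven by the section-name tokens. A possible variant, sketched here under assumptions, sums only those tokens by using the `text_offset` list that the Completions API returns next to `token_logprobs`. The helper `continuation_logprob` is hypothetical, and the numbers are made up rather than taken from a real response.

```python
def continuation_logprob(token_logprobs, text_offsets, prompt_length):
    """Sum the log-probabilities of tokens that start at or after prompt_length characters."""
    return sum(
        lp
        for lp, offset in zip(token_logprobs, text_offsets)
        if offset >= prompt_length and lp is not None
    )

# Made-up values: three prompt tokens followed by two section-name tokens.
# The first entry is None because the first token has no log-probability.
token_logprobs = [None, -2.0, -1.5, -0.5, -0.25]
text_offsets = [0, 4, 9, 13, 18]

score = continuation_logprob(token_logprobs, text_offsets, prompt_length=13)
print(score)
```

In the notebook this would be called with `completion.choices[0].logprobs.token_logprobs`, `completion.choices[0].logprobs.text_offset` and `len(prompt)`. Note that a plain sum still favors section names with fewer tokens.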

In [95]:
# Print the log-likelihoods and keep the best section ("Models" and "List models" score highest)
max_likelihood = section_likelihoods[0]
max_section = api_reference.flatten()[0]
for section, likelihood in zip(api_reference.flatten(), section_likelihoods):
  print(section.name, likelihood)
  if likelihood>max_likelihood:
    max_section=section
    max_likelihood=likelihood

OpenAI API reference -546.3433321591601
Introduction -546.3458362236801
Authentication -548.6128641759998
Making requests -544.6660877487195
Models -532.7902207637338
List models -528.42620743276
Retrieve model -538.8091972026949
Completions -545.5277878728415
Create completion -549.9212066213499
Chat -548.5513519629001
Create chat completionBeta -552.1376227778402
Edits -546.4442406511398
Create edit -551.8922831572302
Images -547.8282930549102
Create imageBeta -553.9139928172501
Create image editBeta -553.7375977768696
Create image variationBeta -552.7827922148501
Embeddings -549.14612271828
Create embeddings -552.57049525262
Audio -551.60806847418
Create transcriptionBeta -553.1769849809197
Create translationBeta -557.9149830902699
Files -548.3277477801101
List files -550.5578719745204
Upload file -549.24956939049
Delete file -549.2407333835198
Retrieve file -553.8603904980902
Retrieve file content -554.1357165980902
Fine-tunes -551.0679240754097
Create fine-tune -553.2274938205902


In [96]:
print(f"The section \"{max_section.name}\" has the maximum likelihood ({max_likelihood}) to answer the question: \n{question}")

The section "List models" has the maximum likelihood (-528.42620743276) to answer the question: 
How can I get to known all the models available in the API?
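
The scores printed above are log-likelihoods, so the gap between candidates is easier to read once they are converted to relative probabilities. The following is a minimal sketch: `relative_probabilities` is a hypothetical helper that is not part of the notebook, and its output is relative to the candidates passed in, not an absolute probability that a section contains the answer.

```python
import math

def relative_probabilities(log_likelihoods):
    """Softmax over log-likelihoods, shifted by the maximum to avoid underflow."""
    top = max(log_likelihoods.values())
    weights = {name: math.exp(value - top) for name, value in log_likelihoods.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Three of the scores printed above
scores = {
    "Models": -532.7902207637338,
    "List models": -528.42620743276,
    "Retrieve model": -538.8091972026949,
}

probabilities = relative_probabilities(scores)
best_section = max(probabilities, key=probabilities.get)
print(best_section, probabilities)
```

Subtracting the maximum before exponentiating matters here: `math.exp(-528)` on its own underflows to zero, which would make the normalization divide by zero.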


# Part 2: Include the content of the maximum-likelihood section in the prompt and get the final answer

In [97]:
prompt_with_context = f"This is a section from the OpenAI API reference that will be used to answer some questions:\n\n {max_section.text} \n\nTaking into account the previous information, {question}"
print(prompt_with_context)

This is a section from the OpenAI API reference that will be used to answer some questions:

 List modelsget https://api.openai.com/v1/modelsLists the currently available models, and provides basic information about each one such as the owner and availability.Example requestcurlSelect librarycurlpythonnode.jsCopy‍1
2
curl https://api.openai.com/v1/models \
  -H 'Authorization: Bearer YOUR_API_KEY'ResponseCopy‍1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
{
  "data": [
    {
      "id": "model-id-0",
      "object": "model",
      "owned_by": "organization-owner",
      "permission": [...]
    },
    {
      "id": "model-id-1",
      "object": "model",
      "owned_by": "organization-owner",
      "permission": [...]
    },
    {
      "id": "model-id-2",
      "object": "model",
      "owned_by": "openai",
      "permission": [...]
    },
  ],
  "object": "list"
} 

Taking into account the previous information, How can I get to known all the models available in the API?
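
The printed prompt shows that the scraped section text still carries the line-number gutter of the code listings, which spends context tokens on noise. One way to reduce it, sketched here as an assumption and not as something the notebook does, is to drop lines that consist solely of digits before building the prompt. `strip_digit_only_lines` is a hypothetical helper.

```python
import re

def strip_digit_only_lines(text):
    """Remove lines that consist solely of digits (line-number gutter residue)."""
    return re.sub(r"(?m)^\d+\n", "", text)

# Short sample imitating the scraped text above
sample = 'ResponseCopy1\n2\n3\n{\n  "object": "list"\n}'

cleaned = strip_digit_only_lines(sample)
print(cleaned)
```

This leaves the first number fused to "Copy" untouched and would also delete any legitimate line that contains only digits, so it should be checked against the actual document before use.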


In [98]:
# Answer
openai.Completion.create(model="text-davinci-003", prompt=prompt_with_context, temperature=0, max_tokens=1000).choices[0].text

'\n\nTo get to know all the models available in the API, you can make a request to the endpoint https://api.openai.com/v1/models. This will return a list of all the models available, along with basic information about each one such as the owner and availability.'