# PDF Manual to SOP creation 
For details, see [this document](https://docs.google.com/document/d/1l-2EdPYP0_R5SOJuKoiyLypdzArFpDHG4fPG8Xz40RE/edit?usp=sharing).

The code is based on the OpenAI Cookbook notebook [Question Answering using Embeddings](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb).

The test PDF, a microwave operating manual, is taken from [here](https://fs.panasonic.com/pdf/user_manual/Convection/NE-C1275/A00033C5ABP_140922.pdf).

**Items still to be addressed:**

* Prompt engineering to bring the output closer to the SOP format and tone
* Add a reference section linking back to the page numbers the information was extracted from - done
* SOP creation context: decide how to handle filtered input context that is still longer than the max tokens for ChatGPT
* How to handle pictures and diagrams
* PDF extraction still needs a better way to preprocess sections/headings
* Better embeddings search, if required

**Bugs:**

* Fix API call backoff - done 
* Switch token counting to tiktoken - done
* Need to ensure headings (used as the index) are not duplicated - for now, duplicates are simply dropped

**TODO:**
* Write a "what this does" section
* Deploy somewhere for people to try - framework is done
* Refactor to have major code in functions as much as possible to split outputs - mostly done



## Aesthetics

In [1]:
# Notebook-only
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Import dependencies

In [4]:
# Notebook-only
!pip install openai tiktoken PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.0-py3-none-any.whl (70 kB)
Collecting tiktoken
  Downloading tiktoken-0.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Collecting requests>=2.20
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
Collecting blobfile>=2
  Downloading blobfile-2.0.1-py3-none-any.

In [35]:
import pandas as pd
import tiktoken
import PyPDF2
import numpy as np
import time
import json
import openai
import pickle
import tenacity

In [92]:
# App-modify
# Based on user input
PDF_FILE = 'microwave.pdf'
PROCESSED_FILE = "processed.csv"
EMBEDDING_FILE = "embeddings.csv"
QUERY2 = "How to turn on the microwave"
QUERY = "A part of my food heats up but the other part doesn't, what do I do?"

# Defaults
EMBEDDING_MODEL = "text-embedding-ada-002"
COMPLETIONS_MODEL = "gpt-3.5-turbo"   # "text-davinci-003"
ENCODING = "cl100k_base"  # encoding for ChatGPT models

# Prompt defaults
PROMPT_HEADER = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
MAX_SECTION_LEN = 1500  # token budget for context sections; keeps the whole prompt under ~2000 tokens and leaves 2000 for the completion

CHAT_COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 2000,
    "model": COMPLETIONS_MODEL,
}

BASE_MESSAGE = [{"role": "system", "content": "You are a kind helpful assistant"}]
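
`MAX_SECTION_LEN` and `max_tokens` have to fit inside the model's context window together. A minimal sketch of that arithmetic follows. It assumes a 4,096-token window for `gpt-3.5-turbo` (an assumption, not read from the API; check the current model documentation), and `completion_budget` is a hypothetical helper that is not part of this notebook.

```python
def completion_budget(prompt_tokens, context_window=4096, requested=2000):
    """Return how many completion tokens can safely be requested.

    context_window=4096 is an assumed limit, not a value read from the API.
    """
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the context window")
    return min(requested, remaining)


# A 1,700-token prompt still leaves room for the full 2,000-token completion.
print(completion_budget(1700))
# A 3,000-token prompt forces a smaller completion.
print(completion_budget(3000))
```

A check like this could run before the API call so that an oversized prompt fails early instead of being rejected by the service.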

In [37]:
# App-modify
with open('token.json') as f:
    data = json.load(f)
    key = data['OPENAI_TOKEN']

openai.api_key = key

## PDF text extraction and preprocessing

In [72]:
# PDF extraction and preprocessing helper functions

def extract_pdf(pdf_data, pg_start=1, pg_end=None):
  # For now, each page is treated as a separate section of content and the first
  # line of the page is treated as the heading.
  # pg_start and pg_end are 1-based, inclusive page bounds (default: whole document).
  pdf_reader = PyPDF2.PdfReader(pdf_data)
  if pg_end is None:
    pg_end = len(pdf_reader.pages)

  # Extract the text content from the PDF
  headings = []
  contents = []
  pg_nums = []
  for page in range(max(0, pg_start - 1), min(pg_end, len(pdf_reader.pages))):
      text_content = pdf_reader.pages[page].extract_text()
      headings.append(text_content.split('\n')[0])
      contents.append(text_content)
      pg_nums.append(page + 1)  # start at pg 1

  # Create a Pandas dataframe from the headings and content
  return pd.DataFrame({'heading': headings, 'content': contents, 'pg_number': pg_nums})

def num_tokens_from_string(string, encoding_name):
  """Returns the number of tokens in a text string."""
  encoding = tiktoken.get_encoding(encoding_name)
  num_tokens = len(encoding.encode(string))
  return num_tokens

def count_tokens(df):
  # Count tokens per row with the encoding configured for the chat model
  return df.apply(lambda x: num_tokens_from_string(x.content, ENCODING), axis=1)

def preprocess_pdf_data(pdf_file):
  pdf_sections = extract_pdf(pdf_file)
  # basic text processing - this can be improved later
  pdf_sections.replace('\n',' ', regex=True, inplace=True)           # replace new line characters with space
  pdf_sections.replace(r'^(\d+)', '', regex=True, inplace=True)      # remove any leading numeric characters
  pdf_sections["tokens"] = count_tokens(pdf_sections)
  pdf_sections.drop_duplicates(subset=["heading"], keep="first", inplace=True)
  return pdf_sections
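
Taking the first line of each page as its heading is fragile when a page opens with a page number or a blank line. One possible refinement is to skip blank and purely numeric lines. The sketch below uses a hypothetical helper, `guess_heading`, which is not part of the notebook.

```python
def guess_heading(page_text, fallback="(untitled)"):
    """Return the first line that is neither blank nor purely numeric."""
    for line in page_text.split("\n"):
        candidate = line.strip()
        if candidate and not candidate.isdigit():
            return candidate
    return fallback


print(guess_heading("12\n\nCommon Problems\nPROBLEM SOLUTION"))  # Common Problems
```

This would replace `text_content.split('\n')[0]` in `extract_pdf`; whether it improves results on a given manual needs checking against the extracted text.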

In [79]:
# App-modify
# Open the PDF file in binary mode
with open(PDF_FILE, 'rb') as pdf_file:
  # Create a PDF reader object
  pdf_sections = preprocess_pdf_data(pdf_file)

# App-modify
# save as CSV to be loaded later
pdf_sections.to_csv(PROCESSED_FILE, index=False)

In [80]:
# App-modify
pdf_sections = pd.read_csv(PROCESSED_FILE)

In [81]:
# set column index for future search 
pdf_sections.set_index("heading", inplace=True)

In [82]:
# App-modify
# final preprocessed data
pdf_sections.sample(5)

| heading | content | pg_number | tokens |
|---|---|---|---|
| How to Change the Beep Tone | How to Change the Beep Tone INFORMATION ACTION... | 38 | 119 |
| NOTES: | NOTES: DO NOT attempt to reheat any food that ... | 65 | 663 |
| CAUTION | CAUTION 1. To reduce the risk of burns, electr... | 6 | 472 |
| SD Memory Card Part No. RP-SD016BCS0 | SD Memory Card Part No. RP-SD016BCS0 An SD Mem... | 15 | 375 |
| CONTROLS TO USE INFORMATION ACTION DISPLAY | CONTROLS TO USE INFORMATION ACTION DISPLAY 1Op... | 32 | 254 |
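
Generic headings such as `NOTES:` or `CAUTION` may appear on more than one page, and `drop_duplicates(..., keep="first")` keeps only the first of them. A sketch of an alternative that keeps every page by appending the page number to repeated headings is shown below. The helper name `disambiguate_headings` is hypothetical, and the page numbers in the example are for illustration only.

```python
from collections import Counter


def disambiguate_headings(headings, pg_numbers):
    """Append the page number to any heading that occurs more than once."""
    counts = Counter(headings)
    return [
        f"{h} (p. {pg})" if counts[h] > 1 else h
        for h, pg in zip(headings, pg_numbers)
    ]


print(disambiguate_headings(["NOTES:", "CAUTION", "NOTES:"], [12, 6, 65]))
# ['NOTES: (p. 12)', 'CAUTION', 'NOTES: (p. 65)']
```

Because the headings double as the dataframe index and as embedding keys, they only need to be unique, not pretty.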


## Create embeddings from PDF content 

In [16]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(10))
def get_embedding(text):
    result = openai.Embedding.create(
      model=EMBEDDING_MODEL,
      input=text
    )
    # time.sleep(2)  # force sleep 2 seconds for now 
    return result["data"][0]["embedding"]

def compute_doc_embeddings(df):
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }

def load_embeddings(fname):
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "heading"])
    return {
           r.heading: [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }
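
The `@retry` decorator on `get_embedding` delegates the waiting policy to tenacity. To make the idea concrete, here is a simplified stand-in for exponential backoff with jitter, written with the standard library only. It is not tenacity's exact algorithm; `retry_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def retry_with_backoff(func, max_attempts=10, base=1.0, cap=60.0, sleep=time.sleep):
    """Call func until it succeeds, sleeping a random, growing delay between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles each attempt, capped at `cap` seconds.
            upper = min(cap, base * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))


# Usage: a flaky call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []  # record the delays instead of actually sleeping
print(retry_with_backoff(flaky, sleep=delays.append))  # ok
print(len(delays))  # 2
```

The random component spreads retries from concurrent clients apart, which is the main reason to prefer it over a fixed `time.sleep`.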

In [17]:
# App-modify
# compute embeddings
document_embeddings = compute_doc_embeddings(pdf_sections)

In [18]:
# App-modify
# saving embeddings for future load
pd.DataFrame.from_dict(document_embeddings, orient="index").to_csv(EMBEDDING_FILE)

In [19]:
# App-modify
# load embedding
embeddings_raw = pd.read_csv(EMBEDDING_FILE)
embeddings_raw.columns.values[0] = "heading"
embeddings_raw.set_index("heading", inplace=True)
document_embeddings = embeddings_raw.T.to_dict('list')

In [20]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

Operating Instructions and User Guide : [-0.006402482278645, -0.0063926973380148, -0.0111415581777691, -0.0288584623485803, -0.0222439765930175]... (1536 entries)
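
The save and load cells above round-trip a heading-to-vector dictionary through CSV with pandas. The same layout (one row per heading, one column per dimension) can be sketched with the standard library alone. The helper names below are hypothetical and the vectors are toy values, not real embeddings.

```python
import csv
import io


def dump_embeddings(embeddings, fh):
    """Write {heading: vector} as CSV rows: heading, dim 0, dim 1, ..."""
    writer = csv.writer(fh)
    dim = len(next(iter(embeddings.values())))
    writer.writerow(["heading"] + [str(i) for i in range(dim)])
    for heading, vector in embeddings.items():
        writer.writerow([heading] + list(vector))


def read_embeddings(fh):
    """Read the CSV written by dump_embeddings back into {heading: vector}."""
    reader = csv.reader(fh)
    next(reader)  # skip the header row
    return {row[0]: [float(x) for x in row[1:]] for row in reader}


toy = {"Common Problems": [0.25, -0.5], "Reheating, by Microwave": [0.0, 1.0]}
buffer = io.StringIO()
dump_embeddings(toy, buffer)
buffer.seek(0)
print(read_embeddings(buffer) == toy)  # True
```

Headings that contain commas survive because the `csv` module quotes them on write.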


## Find relevant sections based on query

In [21]:
def vector_similarity(x, y):
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query, contexts):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
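
The claim in the `vector_similarity` docstring can be checked numerically: for vectors scaled to unit length, cosine similarity and the dot product coincide. A small sketch with the standard library and toy vectors (not real embeddings):

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    """Scale x to unit length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


def cosine_similarity(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])
# For unit vectors the denominator of the cosine is 1, so both values agree.
print(math.isclose(cosine_similarity(u, v), dot(u, v)))  # True
```

If embeddings from another source are not unit length, the plain dot product would favor longer vectors, and the full cosine formula would be the safer choice.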

In [77]:
# Notebook-only
# test embedding similarity search 
order_document_sections_by_query_similarity(QUERY, document_embeddings)[:5]

[(0.8364629138645348, 'Common Problems'),
 (0.8178868241122307, 'Affects of the foodsReheating by Microwave'),
 (0.8133949885623041, 'Tips for Reheating your own Homemade Foods'),
 (0.8093917257523815, 'Reheating by Microwave'),
 (0.8088500052216336, '.Containers')]

## Construct query

In [49]:
# get context separators to make prompt easier to read
SEPARATOR = "\n* "
encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

In [84]:
def construct_prompt(question, context_embeddings, df, diagnostic=False):
    """
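    Select the document sections most relevant to the question, up to MAX_SECTION_LEN tokens,
    and build the prompt from them.

    Return a dict with the full prompt ("prompt") and the headings of the chosen sections ("ref").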
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space. may want to change this later       
        document_section = df.loc[section_index]
        # print(section_index, " ", document_section.tokens)
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    if diagnostic:
      print(f"Selected {len(chosen_sections)} document sections:")
      print("\n".join(chosen_sections_indexes))
    
    full_prompt = PROMPT_HEADER + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"

    result = {"prompt" : full_prompt, "ref" : chosen_sections_indexes}
    
    return result
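
`construct_prompt` stops at the first section that would exceed the budget, so a single long page can crowd out shorter relevant ones ranked after it. One possible variation, shown on plain tuples rather than the dataframe, is to skip a section that does not fit and keep trying smaller ones. `pack_sections` is a hypothetical helper. The first four token counts come from the diagnostic output in this notebook; the last one and `separator_len=2` are assumed for illustration.

```python
def pack_sections(ranked, budget, separator_len=0):
    """Greedily keep ranked (name, tokens) pairs that fit within budget.

    Unlike a hard stop, a section that is too large is skipped so that
    later, smaller sections can still be used.
    """
    chosen, used = [], 0
    for name, tokens in ranked:
        cost = tokens + separator_len
        if used + cost > budget:
            continue  # too big for the remaining space; try the next one
        chosen.append(name)
        used += cost
    return chosen, used


ranked = [
    ("Common Problems", 483),
    ("Affects of the foodsReheating by Microwave", 624),
    ("Tips for Reheating your own Homemade Foods", 545),
    ("Reheating by Microwave", 767),
    (".Containers", 150),  # token count assumed for illustration
]
print(pack_sections(ranked, budget=1500, separator_len=2))
```

The trade-off is that lower-ranked sections can end up in the prompt ahead of a more relevant but longer one, so this is worth comparing against the original behavior on real queries.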

In [83]:
# Notebook-only
prompt = construct_prompt(
    QUERY,
    document_embeddings,
    pdf_sections,
    diagnostic=True
)

print("\n===INPUT PROMPT BELOW===\n", prompt)

Common Problems   483
Affects of the foodsReheating by Microwave   624
Tips for Reheating your own Homemade Foods   545
Reheating by Microwave   767
Selected 3 document sections:
Common Problems
Affects of the foodsReheating by Microwave
Tips for Reheating your own Homemade Foods

===INPUT PROMPT BELOW===
 {'prompt': 'Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don\'t know."\n\nContext:\n\n* Common Problems PROBLEM SOLUTION Food cools quickly after Microwave or Combination cooking. Foods take longer to cook, defrost or reheat than stated. Liquids boil over when cooked by  microwave or combination.Vegetables become wrinkly and hard when cooked/reheated by  microwave.Foods heat unevenly. Only one side of cavity heats. Foods heated by microwave are  hard and tough.Foods “explode” during heating. Foods that have been defrosteddo not heat in the centre.Return to oven for additional cooking. Check 

## Generate output based on prompt

In [88]:
def create_reference(df, indexes):
  references = []
  for index in indexes:
    reference = "Page " + str(df.loc[index].pg_number) + ": " + index
    references.append(reference)
  return "\n\nReferenced from: " + ("\n").join(references)

def answer_query_with_context_chatgpt(
    query,
    df,
    document_embeddings,
    show_prompt=False,
    show_diagnostic=False,
    show_reference=True
):
    prompt = construct_prompt(
        query,
        document_embeddings,
        df,
        show_diagnostic
    )

    message = BASE_MESSAGE + [{"role":"user", "content":prompt["prompt"]}]

    if show_prompt:
        print(message)   

    if show_reference:
      reference_text = create_reference(df, prompt["ref"])
    else:
      reference_text = ""

    response = openai.ChatCompletion.create(
                messages=message,
                **CHAT_COMPLETIONS_API_PARAMS
            )
    
    full_response = response["choices"][0]["message"]["content"] + reference_text

    return {"response" : full_response, "query" : message}

In [93]:
answer_query_with_context_chatgpt(QUERY, pdf_sections, document_embeddings, True, True, True)

Selected 2 document sections:
Common Problems
Affects of the foodsReheating by Microwave
[{'role': 'system', 'content': 'You are a kind helpful assistant'}, {'role': 'user', 'content': 'Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don\'t know."\n\nContext:\n\n* Common Problems PROBLEM SOLUTION Food cools quickly after Microwave or Combination cooking. Foods take longer to cook, defrost or reheat than stated. Liquids boil over when cooked by  microwave or combination.Vegetables become wrinkly and hard when cooked/reheated by  microwave.Foods heat unevenly. Only one side of cavity heats. Foods heated by microwave are  hard and tough.Foods “explode” during heating. Foods that have been defrosteddo not heat in the centre.Return to oven for additional cooking. Check oven is plugged into its own 13amp socket. Do not use extension  cable or adapters.The stated times are only a guide.Heating will be i

'Arrange the food so that thicker parts are on the outside of the plate and smaller foods are in the center. Do not heat very dense foods with porous foods as the latter will heat faster. Rearrange the food as above. It is impossible for only one side of the cavity to receive microwaves as the energy is distributed by a rotating guide.\n\nReferenced from: Page 76: Common Problems\nPage 42: Affects of the foodsReheating by Microwave'

In [None]:
# Legacy completion with text-davinci-003, about 10x more expensive than gpt-3.5-turbo

# COMPLETIONS_API_PARAMS = {
#     # We use temperature of 0.0 because it gives the most predictable, factual answer.
#     "temperature": 0.0,
#     "max_tokens": 300,
#     "model": "text-davinci-003",  # COMPLETIONS_MODEL now points at the chat model
# }

# def answer_query_with_context(
#     query,
#     df,
#     document_embeddings,
#     show_prompt=False,
#     show_diagnostic=False
# ):
#     prompt = construct_prompt(
#         query,
#         document_embeddings,
#         df,
#         show_diagnostic
#     )
    
#     if show_prompt:
#         print(prompt["prompt"], "\n")

#     # construct_prompt returns a dict, so pass only the prompt text
#     response = openai.Completion.create(
#                 prompt=prompt["prompt"],
#                 **COMPLETIONS_API_PARAMS
#             )

#     return response["choices"][0]["text"].strip(" \n")

In [None]:
# answer_query_with_context(QUERY, pdf_sections, document_embeddings, True, True)

Selected 1 document sections:
Common Problems
Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* Common Problems PROBLEM SOLUTION Food cools quickly after Microwave or Combination cooking. Foods take longer to cook, defrost or reheat than stated. Liquids boil over when cooked by  microwave or combination.Vegetables become wrinkly and hard when cooked/reheated by  microwave.Foods heat unevenly. Only one side of cavity heats. Foods heated by microwave are  hard and tough.Foods “explode” during heating. Foods that have been defrosteddo not heat in the centre.Return to oven for additional cooking. Check oven is plugged into its own 13amp socket. Do not use extension  cable or adapters.The stated times are only a guide.Heating will be improved with the use ofcorrect containers and arranging. Remember to vary the heating, if the  food is colder or heavier than stated.Ensure the o

'Rearrange foods, so that thicker parts are on the outside of the plate and smaller foods to the centre. Do not heat very dense foods with porous foods as the later will heat faster.'

In [None]:
# print(QUERY2, "\n")
# answer_query_with_context(QUERY2, pdf_sections, document_embeddings)

How to turn on the microwave 



'Open Door. Put in Food. Close Door. Select Power Level. Press Microwave Pad to select correct power. Select Time. Press Number Pads to set a heating time. Press Start Pad.'