# Introducing Specific Context to OpenAI - Course Syllabus
- In this notebook, we write functions that let an OpenAI GPT model answer questions using context from a particular text, in this case a course syllabus.
- Following the OpenAI tutorial on prompt engineering, we split the syllabus into sections and embed each one, match a question about the syllabus to the most closely related sections, and pass those sections to the model so it can give a more targeted answer.

### Future Goals
1. Write a script to parse any syllabus into a useful form.
2. Fine-tune a GPT model on transcripts of Zoom class recordings to answer content questions.

In [None]:
# Import dependencies
import openai
import pandas as pd
import numpy as np
import pickle
from transformers import GPT2TokenizerFast
from typing import List
import tiktoken

# Set up the OpenAI API key
openai.api_key = "<INSERT API KEY>"
COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

Next, we import the data, which should be in a basic "heading, content" format. This format makes it easier to determine which part of the text is most relevant to a given question.
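As a rough illustration of the "heading, content" layout, the sketch below builds a tiny two-row CSV in memory and loads it the same way the notebook loads the real file. The rows and the name `toy_df` are made up for the example and are not taken from the actual syllabus.

```python
import io

import pandas as pd

# Hypothetical two-row syllabus in "heading, content" form (not the real file).
csv_text = (
    "heading,content\n"
    'Textbook,"Physics text, 7th ed."\n'
    "Office Hours,Tuesdays after lab\n"
)

# Load it and index by heading, mirroring the cell that reads the real CSV.
toy_df = pd.read_csv(io.StringIO(csv_text))
toy_df = toy_df.set_index("heading", drop=True)

# Each section's text can now be looked up by its heading.
textbook_text = toy_df.loc["Textbook", "content"]
print(textbook_text)
```

Quoting a field, as in the first row, is what lets section text contain commas without breaking the two-column layout.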

In [33]:
# Read in the syllabus, which is in CSV format
df = pd.read_csv('/Users/reggiebain/Downloads/reggie-syllabus.csv')
df = df.set_index(df.columns[0], drop=True)
df

heading,content
Course Info,Course \nPHY 162 - General Physics 2 (Resident...
Course Summary,PHY 162/PHYS 216 General Physics II is an alge...
College Credit Hours (Dual-Enrollment),This course is dual enrolled with Francis Mari...
Learning Outcomes,"Upon completion of this course, students will ..."
Pedagogy,This course will help students develop key pro...
Lab Learning Outcomes,Lab offers students the chance to gain interac...
Assignments,Students will work complete weekly lab assignm...
Textbook Resources,"Physics: Principles with Applications, 7th ed...."
WebAssign,Students will submit HW assignments and comple...
Needed Supplies,Resources students need for class/lab include:...


In [34]:
# Get a text embedding from OpenAI for a string, using the ada-002 embedding model by default. Returns a list of floats.
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    
    result = openai.Embedding.create(model=model, input=text)
    
    return result["data"][0]["embedding"]

# Run get_embedding on each row of the data frame. Returns {row index: embedding}.
def compute_doc_embeddings(df: pd.DataFrame) -> dict[str, list[float]]:

    return {idx: get_embedding(r.content) for idx, r in df.iterrows()}

In [35]:
def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "title", "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "title" and c != "heading"])
    return {
           (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }

In [None]:
#document_embeddings = load_embeddings("https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv")

# ===== OR, recalculate the embeddings from scratch (one API call per row). ========

document_embeddings = compute_doc_embeddings(df)

RateLimitError: You exceeded your current quota, please check your plan and billing details.
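Because every run of `compute_doc_embeddings` calls the API once per row, one option is to cache the result on disk and reuse it. The following is a minimal sketch of that idea using `pickle`; `load_or_compute`, `fake_compute`, and the file name are hypothetical, and the stand-in function replaces the real API call so the example runs offline. Caching does not resolve a quota error by itself; it only avoids repeating calls that have already succeeded.

```python
import os
import pickle
import tempfile
from typing import Callable


def load_or_compute(path: str, compute: Callable[[], dict]) -> dict:
    """Return the pickled result at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for compute_doc_embeddings(df): counts calls instead of hitting the API.
calls = {"n": 0}


def fake_compute() -> dict:
    calls["n"] += 1
    return {"Course Info": [0.1, 0.2, 0.3]}


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(cache_path, fake_compute)   # computes and writes the cache
second = load_or_compute(cache_path, fake_compute)  # served from disk
```

In the notebook this could be called as `load_or_compute("syllabus_embeddings.pkl", lambda: compute_doc_embeddings(df))`, assuming the cache file is one you created yourself, since unpickling untrusted files is unsafe.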

In [15]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

Course Info : [0.0003045338380616158, 0.005849773995578289, -0.0038816893938928843, -0.01082106027752161, -0.0019834069535136223]... (1536 entries)


In [16]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))
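The docstring's claim can be checked numerically. The sketch below uses two arbitrary 2-D vectors (not real embeddings) and a hypothetical helper `cosine_similarity` to show that, once vectors are scaled to length 1, the plain dot product matches the cosine similarity.

```python
import numpy as np


def cosine_similarity(x, y) -> float:
    """Cosine of the angle between x and y, for vectors of any length."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Two arbitrary vectors that are NOT unit length.
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Rescale each to length 1, as the docstring says OpenAI embeddings are.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_unit = float(np.dot(a_unit, b_unit))  # dot product of the unit vectors
cos_raw = cosine_similarity(a, b)         # cosine similarity of the originals
dot_raw = float(np.dot(a, b))             # dot product without normalizing

print(dot_unit, cos_raw, dot_raw)
```

The first two values agree, while the unnormalized dot product does not, which is why the shortcut in `vector_similarity` relies on the embeddings already having length 1.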

In [17]:
def order_document_sections_by_query_similarity(query: str, contexts: dict[str, list[float]]) -> list[tuple[float, str]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities

In [18]:
order_document_sections_by_query_similarity("When is the first quiz?", document_embeddings)[:5]

[(0.8078118355886577, 'Course Schedule'),
 (0.7829636272300543, 'Grading'),
 (0.78047162605558, 'Chapter and Lab Schedule'),
 (0.7750835078136363, 'Prerequisites'),
 (0.7739317196080004, 'Tests and Exams')]

In [19]:
order_document_sections_by_query_similarity("How many tests will we have?", document_embeddings)[:5]

[(0.8285374304779508, 'Course Schedule'),
 (0.8135638376248444, 'Tests and Exams'),
 (0.8121181837551523, 'Grade Weigting'),
 (0.7804430562638993, 'Prerequisites'),
 (0.7717517663537621, 'Assignments')]
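To see the ranking step in isolation, here is a small sketch that skips the API entirely. The 3-dimensional vectors in `toy_embeddings` and `toy_query` are invented for illustration; real ada-002 embeddings have 1536 entries.

```python
import numpy as np

# Hypothetical low-dimensional "embeddings", one per syllabus heading.
toy_embeddings = {
    "Course Schedule": [1.0, 0.0, 0.0],
    "Grading": [0.0, 1.0, 0.0],
    "Textbook Resources": [0.0, 0.0, 1.0],
}

# Hypothetical query embedding that points mostly toward "Grading".
toy_query = [0.6, 0.8, 0.0]

# Same pattern as the function above: score every section, then sort descending.
ranked = sorted(
    ((float(np.dot(toy_query, vec)), heading) for heading, vec in toy_embeddings.items()),
    reverse=True,
)

for score, heading in ranked:
    print(f"{score:.2f}  {heading}")
```

Because the tuples are sorted as `(score, heading)`, sections with identical scores fall back to ordering by heading text.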

In [20]:
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "
ENCODING = "gpt2"  # GPT-2 encoding, as in the OpenAI tutorial; tiktoken itself maps text-davinci-003 to "p50k_base"

encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

In [21]:
# GPT-2 tokenizer, used to count the tokens in each syllabus section
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """Count the number of tokens in a string."""
    return len(tokenizer.encode(text))

In [22]:
df['tokens'] = df.content.apply(count_tokens)

In [23]:
df

heading,content,tokens
Course Info,Course \nPHY 162 - General Physics 2 (Resident...,175
Course Summary,PHY 162/PHYS 216 General Physics II is an alge...,75
College Credit Hours (Dual-Enrollment),This course is dual enrolled with Francis Mari...,79
Learning Outcomes,"Upon completion of this course, students will ...",177
Pedagogy,This course will help students develop key pro...,57
Lab Learning Outcomes,Lab offers students the chance to gain interac...,83
Assignments,Students will work complete weekly lab assignm...,126
Textbook Resources,"Physics: Principles with Applications, 7th ed....",114
WebAssign,Students will submit HW assignments and comple...,87
Needed Supplies,Resources students need for class/lab include:...,159


In [24]:
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch the sections most relevant to the question and combine them with the question into a prompt.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
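The section-selection loop in `construct_prompt` can be sketched on its own, without embeddings or a data frame. In the example below, `select_sections` is a hypothetical helper, the token counts are taken from the table above, and the ranking order is assumed purely for illustration.

```python
def select_sections(ranked_headings, token_counts, budget, separator_len):
    """Take sections in ranked order until the next one would exceed the budget."""
    chosen, used = [], 0
    for heading in ranked_headings:
        used += token_counts[heading] + separator_len
        if used > budget:
            break  # stop at the first section that does not fit
        chosen.append(heading)
    return chosen


# Token counts from the table above; the ranking itself is hypothetical.
toy_counts = {
    "Assignments": 126,
    "WebAssign": 87,
    "Needed Supplies": 159,
    "Learning Outcomes": 177,
}
toy_ranking = ["Assignments", "WebAssign", "Needed Supplies", "Learning Outcomes"]

picked = select_sections(toy_ranking, toy_counts, budget=500, separator_len=3)
print(picked)
```

Note that the loop stops at the first section that overflows the budget, so a shorter, lower-ranked section that would still fit is never considered.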

In [25]:
prompt = construct_prompt(
    "What types of assignments does this course have?",
    document_embeddings,
    df
)

print("===\n", prompt)

Selected 5 document sections:
Assignments
Homework
Clickers
WebAssign
Grade Weigting
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* Students will work complete weekly lab assignments that will be written in Google Docs and utilize Google Sheets/Excel extensively. Most labs will be submitted electronically to Canvas. Some activities will require students to work in groups of 3-4 and others will require students to turn in work individually. During most weeks, the assignment should be completed and submitted during the lab period.   Students may be asked to complete a more formal lab write-up or other technical writing assignment following a specific rubric (rubric will be posted to Canvas). Such an assignment will count for 15% of the total lab grade. 
* Done via WebAssign, roughly 1-2 assignments will be due per week constituting ~10-20 problems/week. Deadlines are

In [26]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [27]:
# Note: The ** syntax is used to pass the contents of the dictionary as keyword arguments to the create() method.
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[(str, str), np.array],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

In [28]:
query = "What textbook will this course use?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 5 document sections:
Textbook Resources
WebAssign
Learning Outcomes
Prerequisites
Pedagogy

Q: What textbook will this course use?
A: Physics: Principles with Applications, 7th ed., Douglas C. Giancoli.


In [29]:
query = "How will we turn in homework?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 5 document sections:
Assignments
Homework
Grading
WebAssign
Grade Weigting

Q: How will we turn in homework?
A: Homework will be submitted via WebAssign.


In [30]:
query = "What percent of the grade will each test be worth?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 4 document sections:
Grade Weigting
Grading
Course Schedule
Assignments

Q: What percent of the grade will each test be worth?
A: 11.25% for each of the 4 tests.
