# Embedding Video Caption Text for Search

## What are embeddings?

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

- **Search (where results are ranked by relevance to a query string)**
- Recommendations (where items with related text strings are recommended)
- Clustering (where text strings are grouped by similarity)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness.

**Small distances suggest high relatedness and large distances suggest low relatedness.**
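
To make "distance" concrete, here is a minimal sketch using cosine distance on toy three-dimensional vectors. Cosine distance is one common choice of metric, not necessarily the one used later in this project, and the helper name `cosine_distance` and the sample vectors are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# toy vectors standing in for real (much longer) embeddings
query = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.0, 0.0, 1.0]

print(cosine_distance(query, close))  # small value: closely related
print(cosine_distance(query, far))    # 1.0: orthogonal, unrelated
```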

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb

## Motivation

Previously, we proposed integrating lecture-specific information with a student's question so that GPT's responses are grounded in the right context. We can achieve this by creating embeddings from all CS50 video captions, say from 2023, and retrieving relevant caption text based on these embeddings. The top N results, ranked by shortest distance to the question's embedding, will then be incorporated into our few-shot prompting approach.
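
The retrieval step described above can be sketched as follows. This is only an illustration on toy two-dimensional vectors: the function names `cosine_distance` and `top_n`, the sample documents, and the choice of cosine distance are all assumptions, and a real setup would typically delegate this ranking to a vector database.

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_n(query_embedding, documents, n=3):
    """Return the n documents whose embeddings are closest to the query."""
    return heapq.nsmallest(
        n,
        documents,
        key=lambda doc: cosine_distance(query_embedding, doc["embedding"]),
    )


# made-up documents; real ones would carry 1536-dimensional embeddings
documents = [
    {"text": "loops in C", "embedding": [1.0, 0.0]},
    {"text": "pointers", "embedding": [0.0, 1.0]},
    {"text": "while loops", "embedding": [0.8, 0.2]},
]
question_embedding = [1.0, 0.1]

for doc in top_n(question_embedding, documents, n=2):
    print(doc["text"])
```

The selected texts would then be placed into the prompt alongside the student's question.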

## SRT Preprocessing

The following demonstrates one possible data processing pipeline to create embeddings from raw SRT files.

The end result is a JSONL document containing hundreds of thousands of JSON documents. Each document represents a fraction of the lecture caption text, its corresponding embeddings, and useful metadata (e.g., week number, YouTube video ID, etc.). This JSONL document will eventually be loaded into a vector database.

In [None]:
import json
import openai
import os
import tiktoken

In [None]:
CAPTION_PATH = "../data/caption_srts/cs50-2023"
SRT_FILES = {
    "lecture0": { "video_name": "Week 0", "youtube_id": "3LPJfIKxwWc" },
    "lecture1": { "video_name": "Week 1", "youtube_id": "cwtpLIWylAw" },
    "lecture2": { "video_name": "Week 2", "youtube_id": "4vU4aEFmTSo" },
    "lecture3": { "video_name": "Week 3", "youtube_id": "jZzyERW7h1A" },
    "lecture4": { "video_name": "Week 4", "youtube_id": "F9-yqoS7b8w" },
    "lecture5": { "video_name": "Week 5", "youtube_id": "0euvEdPwQnQ" },
    "lecture6": { "video_name": "Week 6", "youtube_id": "EHi0RDZ31VA" },
    "lecture7": { "video_name": "Week 7", "youtube_id": "1RCMYG8RUSE" },
    "lecture8": { "video_name": "Week 8", "youtube_id": "ciz2UaifaNM" },
    "lecture9": { "video_name": "Week 9", "youtube_id": "-aqUek49iL8" },
    "ai": { "video_name": "Artificial Intelligence", "youtube_id": "6X58aP7yXC4" },
    "cybersecurity": { "video_name": "Cybersecurity", "youtube_id": "EKof-cJiTG8" },
    "section1": { "video_name": "Section 1", "youtube_id": "Tw2-No1J5j0" },
    "section2": { "video_name": "Section 2", "youtube_id": "tnbPMzwSN7A" },
    "section3": { "video_name": "Section 3", "youtube_id": "DdaRHPGhe-E" },
    "section4": { "video_name": "Section 4", "youtube_id": "m2WzPVd4QIc" },
    "section5": { "video_name": "Section 5", "youtube_id": "VqCbWinLqsc" },
    "section6": { "video_name": "Section 6", "youtube_id": "Y07zwrbq4Lc" },
    "section7": { "video_name": "Section 7", "youtube_id": "DQ-OAvbaN4k" },
    "section8": { "video_name": "Section 8", "youtube_id": "DIdm5ubBZIs" },
    "section9": { "video_name": "Section 9", "youtube_id": "IkpaPGqlDHU" },
    "mario-less": { "video_name": "Mario (Less Comfortable)", "youtube_id": "NAs4FIWkJ4s" },
    "mario-more": { "video_name": "Mario (More Comfortable)", "youtube_id": "FzN9RAjYG_Q" },
    "cash": { "video_name": "Cash", "youtube_id": "Y3nWGvqt_Cg" },
    "credit": { "video_name": "Credit", "youtube_id": "dF7wNjsRBjI" },
    "readability": { "video_name": "Readability", "youtube_id": "AOVyZEh9zgE" },
    "caesar": { "video_name": "Caesar", "youtube_id": "V2uusmv2wxI" },
    "substitution": { "video_name": "Substitution", "youtube_id": "cXAoZAsgxJ4" },
    "plurality": { "video_name": "Plurality", "youtube_id": "ftOapzDjEb8" },
    "runoff": { "video_name": "Runoff", "youtube_id": "-Vc5aGywKxo" },
    "tideman": { "video_name": "Tideman", "youtube_id": "kb83NwyYI68" },
    "filter-less-intro": { "video_name": "Filter (Less Comfortable)", "youtube_id": "K0v9byp9jd0" },
    "filter-more-intro": { "video_name": "Filter (More Comfortable)", "youtube_id": "vsOsctDernw" },
    "filter-less-blur": { "video_name": "Filter / Blur (Less Comfortable)", "youtube_id": "6opWB7DaFCY" },
    "filter-more-blur": { "video_name": "Filter / Blur (More Comfortable)", "youtube_id": "dxNO-hCjT0w" },
    "filter-grayscale": { "video_name": "Filter / Grayscale", "youtube_id": "A8LA2osnAwM" },
    "filter-sepia": { "video_name": "Filter / Sepia", "youtube_id": "m0_vouQLufc" },
    "filter-reflect": { "video_name": "Filter / Reflect", "youtube_id": "dlWpx8gQdFo" },
    "speller": { "video_name": "Speller", "youtube_id": "_z57x5PGF4w" },
    "speller-load": { "video_name": "Speller / Load", "youtube_id": "-BX4wLZRwbc" },
    "speller-hash": { "video_name": "Speller / Hash", "youtube_id": "aFe05MQ56Rc" },
    "speller-size": { "video_name": "Speller / Size", "youtube_id": "3cD-_NGTw9A" },
    "speller-check": { "video_name": "Speller / Check", "youtube_id": "qPz_Mr69yE0" },
    "speller-unload": { "video_name": "Speller / Unload", "youtube_id": "qkC4l0pUvCk" },
    "dna": { "video_name": "DNA", "youtube_id": "j84b_EgntcQ" },
    "movies": { "video_name": "Movies", "youtube_id": "v5_A3giDlQs" },
    "fiftyville": { "video_name": "Fiftyville", "youtube_id": "x7Q8tJMi7cQ" },
    "finance": { "video_name": "Finance", "youtube_id": "7wPTAwT-6bA" }
}

Let's first take a look at what the SRT file looks like.

https://www.3playmedia.com/blog/create-srt-file/

In [None]:
with open(f"{CAPTION_PATH}/lecture0.srt") as f:
    print("\n".join(f.read().splitlines()[:30]))

**Here are a few things to pay attention to:**

1. We want to convert all timecodes to seconds.
2. We want to attach metadata to each line of caption text for later use.
3. We want to remove the speaker name from the caption text.
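
As a rough illustration of these three steps on a single SRT cue, consider the sketch below. The helper names `to_seconds` and `parse_cue` and the sample cue are made up for this example, and the pattern assumes well-formed `hh:mm:ss,mmm` timecodes. Unlike the notebook code, this sketch keeps fractional seconds instead of rounding.

```python
import re

# assumes well-formed "hh:mm:ss,mmm" timecodes
TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(timecode):
    """Convert an SRT timecode to seconds (fractional part preserved)."""
    match = TIMECODE.fullmatch(timecode.strip())
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_cue(block, speaker="DAVID MALAN:"):
    """Parse one cue (sequence number, timecodes, text lines) into a dict."""
    lines = block.strip().split("\n")
    start, end = lines[1].split(" --> ")
    text = " ".join(lines[2:])
    # strip the speaker label regardless of letter case
    text = re.sub(re.escape(speaker), "", text, flags=re.IGNORECASE).strip()
    return {"start": to_seconds(start), "end": to_seconds(end), "text": text}


# made-up sample cue, not taken from the actual caption files
sample = """1
00:00:01,500 --> 00:00:04,250
DAVID MALAN: All right, this is CS50."""

print(parse_cue(sample))
```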

In [None]:
def parse_timecode(timecode):
    """parse hh:mm:ss,mmm to seconds"""
    timecode = timecode.split(":")
    timecode = int(timecode[0]) * 3600 + int(timecode[1]) * 60 + float(timecode[2].replace(",", "."))
    return round(timecode)

In [None]:
jsonl_document = []
for caption in SRT_FILES:

    with open(f"{CAPTION_PATH}/{caption}.srt") as f:

        # collapse doubled blank lines, then split into caption blocks on blank lines
        lines = f.read()
        lines = lines.replace("\n\n\n", "\n\n")
        lines = lines.split("\n\n")

        for line in lines:

            # remove SRT sequence number and skip empty lines
            line = line.split("\n")[1:]

            if (len(line) < 2):
                continue

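            # remove speaker name (matched case-insensitively)
            speaker = "DAVID MALAN:"
            speaker_index = line[1].upper().find(speaker)
            if speaker_index != -1:
                line[1] = line[1][:speaker_index] + line[1][speaker_index + len(speaker):]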

            # create JSON document
            video_name = SRT_FILES[caption]["video_name"]
            youtube_id = SRT_FILES[caption]["youtube_id"]
            start_time = line[0].split(" --> ")[0]
            end_time = line[0].split(" --> ")[1]

            metadata = {}
            metadata["week"] = video_name
            metadata["youtube_id"] = youtube_id
            metadata["start"] = parse_timecode(start_time)
            metadata["end"] = parse_timecode(end_time)

            json_doc = {}
            json_doc["text"] = (" ").join(line[1:]).strip()
            json_doc["metadata"] = metadata

            # skip empty caption text
            if (json_doc["text"] == ""):
                continue

            # add to JSONL document
            jsonl_document.append(json_doc)

**Let's take a look at what this JSONL document looks like:**

In [None]:
print(f"Total caption text in JSONL document: {len(jsonl_document)}", end="\n\n")

preview_lines = 10
print(f"First {preview_lines} line(s) of JSONL document:")
for index, each in enumerate(jsonl_document[:preview_lines]):
    print(f"{index + 1}:", each)

**Each line of the JSONL document is a valid JSON object containing lecture caption text, start/end time, YouTube video ID, and week name.**

In [None]:
# create processed_data directory if needed
if not os.path.exists("../data/processed_data"):
    os.makedirs("../data/processed_data")

# save to JSONL file
with open("../data/processed_data/lectures_2023_raw.jsonl", "w") as f:
    for chunk in jsonl_document:
        f.write(f"{json.dumps(chunk)}\n")
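
A quick way to confirm that a saved file really is line-delimited JSON is to parse it back line by line. The sketch below does this on an in-memory sample; the function name `validate_jsonl` and the set of required keys are assumptions based on the record layout used above.

```python
import io
import json


def validate_jsonl(stream, required_keys=("text", "metadata")):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError (a ValueError) if malformed
        missing = [key for key in required_keys if key not in record]
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {missing}")
        count += 1
    return count


# in-memory stand-in for an open JSONL file
sample = io.StringIO(
    json.dumps({"text": "hello", "metadata": {"week": "Week 0", "start": 0, "end": 2}}) + "\n"
)
print(validate_jsonl(sample))  # 1
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.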

## Merge caption texts to span a longer duration

### Motivation

Caption text serves as subtitles for a video, so each caption typically spans just a few seconds to stay readable during playback. That granularity works against embedding search.

If we embedded each caption separately, search results would mostly be sentence fragments rather than complete sentences. These fragments often lack the context needed to be useful.

To improve search and provide more useful context, we make the captions coarser (i.e., we merge consecutive captions so that each document covers a longer duration).

**We need to strike a balance when picking the duration so that a single JSON document doesn't end up containing too many caption texts.**

In [None]:
# we merge caption texts to form a new caption text that spans around 30 seconds
SPAN_THRESHOLD = 30 

In [None]:
# open JSONL file
jsonl_document = []
with open('../data/processed_data/lectures_2023_raw.jsonl', 'r') as f:
    for line in f:
        jsonl_document.append(json.loads(line))

# merge consecutive caption texts into chunks that each span roughly SPAN_THRESHOLD seconds
new_jsonl_document = []
total_documents = len(jsonl_document)
span = 0
running_counter = 0
current_pointer = 0

while True:

    # if we are at the end of the file, break
    if (current_pointer + running_counter + 1 > total_documents):
        break

    # sanity check: the span should never be negative
    if (span < 0):
        print("ERROR: span is less than 0")
        print("current pointer: " + str(current_pointer))
        print("span: " + str(span))
        print("running counter: " + str(running_counter))
        break

    # advance to the next document; if we have run past the end of the file,
    # append the current (possibly merged) document and stop
    running_counter += 1
    if (current_pointer + running_counter == total_documents):
        new_jsonl_document.append(jsonl_document[current_pointer])
        break

    if (jsonl_document[current_pointer]["metadata"]["week"] == jsonl_document[current_pointer + running_counter]["metadata"]["week"]):
        span = jsonl_document[current_pointer + running_counter]["metadata"]["end"] - jsonl_document[current_pointer]["metadata"]["start"]
        jsonl_document[current_pointer]["text"] += " " + jsonl_document[current_pointer + running_counter]["text"]
        jsonl_document[current_pointer]["metadata"]["end"] = jsonl_document[current_pointer + running_counter]["metadata"]["end"]

        # if the span has reached the threshold, append the document and reset the pointer
        if (span >= SPAN_THRESHOLD):
            new_jsonl_document.append(jsonl_document[current_pointer])
            span = 0
            current_pointer += running_counter + 1
            running_counter = 0
            continue

    # if the next document is not in the same week, append the document and reset the pointer
    else:
        new_jsonl_document.append(jsonl_document[current_pointer])
        current_pointer += running_counter
        running_counter = 0
        span = 0
        continue
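
For readers who find the pointer arithmetic hard to follow, the same idea can be sketched as a single pass that accumulates captions into a running chunk. This is an illustrative rewrite, not the code used to produce the files: the function name `merge_captions` and the toy captions are assumptions, and although it is intended to mirror the loop above, that equivalence has not been checked against the real data.

```python
import copy


def merge_captions(documents, span_threshold=30):
    """Merge consecutive captions of the same video into chunks of about span_threshold seconds."""
    merged = []
    current = None  # the chunk being built
    for doc in documents:
        # a new video always closes the running chunk
        if current is not None and current["metadata"]["week"] != doc["metadata"]["week"]:
            merged.append(current)
            current = None
        if current is None:
            current = copy.deepcopy(doc)  # start a new chunk without mutating the input
            continue
        current["text"] += " " + doc["text"]
        current["metadata"]["end"] = doc["metadata"]["end"]
        # close the chunk once it covers at least span_threshold seconds
        if current["metadata"]["end"] - current["metadata"]["start"] >= span_threshold:
            merged.append(current)
            current = None
    if current is not None:
        merged.append(current)  # leftover chunk at the end of the input
    return merged


def caption(text, week, start, end):
    return {"text": text, "metadata": {"week": week, "start": start, "end": end}}


# toy captions, 10 seconds each
toy = [
    caption("a", "Week 0", 0, 10),
    caption("b", "Week 0", 10, 20),
    caption("c", "Week 0", 20, 30),
    caption("d", "Week 0", 30, 40),
    caption("e", "Week 1", 0, 10),
]
for chunk in merge_captions(toy):
    print(chunk["metadata"]["week"], chunk["metadata"]["start"], chunk["metadata"]["end"], chunk["text"])
```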

**Let's look at what the new JSONL document looks like:**

In [None]:
print(f"Total caption text in JSONL document: {len(new_jsonl_document)}", end="\n\n")

preview_lines = 5
print(f"First {preview_lines} line(s) of JSONL document:")
for index, each in enumerate(new_jsonl_document[:preview_lines]):
    print(f"{index + 1}:", each)

**Each JSON document's caption text should now be longer and span a longer duration. This gives us more useful and relevant lecture context to include in the prompt.**

In [None]:
# save new JSONL file
with open('../data/processed_data/lectures_2023_merged.jsonl', 'w') as f:
    f.write('\n'.join(json.dumps(i) for i in new_jsonl_document))

## Create embeddings for caption texts

We are now ready to create embeddings for caption texts.

**Note that we generate embeddings only for the caption text, not for the metadata.**

Reference: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb

In [None]:
jsonl_document = []
text_documents = [] # easier to work with OpenAI's API

with open("../data/processed_data/lectures_2023_merged.jsonl") as f:
    for line in f.readlines():
        line = json.loads(line)
        jsonl_document.append(line)
        text_documents.append(line["text"])

In [None]:
print(jsonl_document[0].keys())

In [None]:
openai.api_key = os.getenv("OPENAI_API_KEY")
EMBEDDING_TOKEN_LIMIT = 8191
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request

def num_tokens_from_string(string: str, model_name: str = "gpt-4") -> int:
    """Returns the number of tokens in a text string."""
    # https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [None]:
# calculate total tokens needed and cost
# https://openai.com/pricing#language-models
# each input is tokenized separately by the API, so count tokens per document
total_tokens = sum(num_tokens_from_string(text) for text in text_documents)
print(f"Total tokens: {total_tokens}")
print(f"Total embedding cost: ${round(total_tokens * 0.1 / 1000000, 4)}")
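
The cost arithmetic used above can be isolated into a small helper. The default rate of $0.10 per million tokens is simply the figure implied by the formula in this notebook, so treat it as an assumption and check current pricing before relying on it. The function name `embedding_cost` is made up for this sketch.

```python
def embedding_cost(total_tokens, usd_per_million_tokens=0.10):
    """Estimate the embedding cost in USD for a given number of tokens."""
    return total_tokens * usd_per_million_tokens / 1_000_000


# worked example: 2.5 million tokens at the assumed rate
print(f"${round(embedding_cost(2_500_000), 4)}")  # $0.25
```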

In [None]:
# create embeddings
embeddings = []

try:
    
    # this is a rather expensive operation, proceed with caution 
    if (input("Proceed to create embeddings? (y/n) ") != "y"):
        raise KeyboardInterrupt

    for batch_start in range(0, len(text_documents), BATCH_SIZE):
        batch_end = min(batch_start + BATCH_SIZE, len(text_documents))
        batch = text_documents[batch_start:batch_end]
        print(f"Batch {batch_start} to {batch_end-1} of {len(text_documents)}")

        # call OpenAI API to create embedding for a given caption text
        response = openai.embeddings.create(model=EMBEDDING_MODEL, input=batch)

        # double check embeddings are in same order as input
        for i, be in enumerate(response.data):
            assert i == be.index

        batch_embeddings = [e.embedding for e in response.data]
        embeddings.extend(batch_embeddings)

    print("Finished creating embeddings")
            
except KeyboardInterrupt:
    print("operation aborted")
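
The batching logic can be tried out without calling the API or spending any money. In the sketch below, `fake_embed` is a stand-in that returns a dummy one-number "vector" per input, and `iter_batches` is a hypothetical helper; neither is part of the OpenAI library.

```python
def iter_batches(items, batch_size):
    """Yield (start, end, batch) tuples, where end is exclusive and clamped to len(items)."""
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]


def fake_embed(batch):
    """Stand-in for the API call: returns one dummy vector per input string."""
    return [[float(len(text))] for text in batch]


texts = [f"caption {i}" for i in range(5)]
embeddings = []
for start, end, batch in iter_batches(texts, batch_size=2):
    print(f"Batch {start} to {end - 1} of {len(texts)}")
    embeddings.extend(fake_embed(batch))

print(len(embeddings))  # 5, one vector per input, in input order
```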

In [None]:
# save jsonl file with embeddings
with open("../data/processed_data/lectures_2023_embeddings.jsonl", "w") as f:
    for i, embedding in enumerate(embeddings):

        # store embeddings for a given caption text
        jsonl_document[i]["embedding"] = embedding
        f.write(json.dumps(jsonl_document[i]) + "\n")


**You can take a look at what the embedding looks like for each caption text.**

**Note that because we use the `text-embedding-ada-002` model, we will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside).**

Reference: https://platform.openai.com/docs/guides/embeddings/second-generation-models
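
Before loading the file into a vector database, it can be worth confirming that every record carries a vector of the expected length. The following is a minimal sketch on toy records; the function name `check_embedding_dims` is an assumption.

```python
def check_embedding_dims(records, expected_dim=1536):
    """Return the indexes of records whose embedding is missing or has the wrong length."""
    return [
        index
        for index, record in enumerate(records)
        if len(record.get("embedding", [])) != expected_dim
    ]


# toy records: the second has a short vector, the third has none
records = [
    {"text": "ok", "embedding": [0.0] * 1536},
    {"text": "too short", "embedding": [0.0] * 10},
    {"text": "missing"},
]
print(check_embedding_dims(records))  # [1, 2]
```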

In [None]:
print(f"Total caption text in JSONL document: {len(jsonl_document)}")

In [None]:
preview_line = 42
json_doc = jsonl_document[preview_line-1]

print(f"Total characters: {len(json_doc['text'])}", f"embedding size: {len(json_doc['embedding'])}\n")
print(f"plain_text:\n{json_doc['text']}\n")
print(f"embedding:\n{json_doc['embedding']}\n")
print(f"metadata:\n{json_doc['metadata']}")