# Embedding Video Caption Text for Search

## What are embeddings?

OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

- **Search (where results are ranked by relevance to a query string)**
- Recommendations (where items with related text strings are recommended)
- Clustering (where text strings are grouped by similarity)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness.

**Small distances suggest high relatedness and large distances suggest low relatedness.**
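
To make the idea of distance concrete, here is a minimal sketch using cosine distance, one common choice for comparing embeddings. The function name `cosine_distance` and the three-dimensional toy vectors are assumptions made for illustration; real embeddings have far more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# toy vectors, made up for illustration only
query = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]   # points in nearly the same direction as the query
far = [0.0, 0.0, 1.0]     # orthogonal to the query

# the related vector has the smaller distance
print(cosine_distance(query, close) < cosine_distance(query, far))
```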

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb

## Motivation

Previously, we proposed integrating lecture-specific information with a student's question to ensure that GPT's responses are contextualized accurately. We can achieve this by creating embeddings from all CS50 video captions, say from 2022, and retrieving relevant information based on these embeddings. The top N results, as determined by the shortest distance to the question, will then be incorporated into our few-shot prompting approach.
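
The retrieval step described above can be sketched roughly as follows. This is an illustration under assumptions rather than the project's actual retrieval code: `top_n`, the toy documents, and their two- or three-dimensional vectors are all hypothetical, and a vector database would normally do this ranking for us.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_n(query_embedding, documents, n=3):
    """Return the n documents whose embeddings are closest to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_distance(query_embedding, doc["embedding"]),
    )
    return ranked[:n]

# hypothetical documents with made-up embeddings
documents = [
    {"text": "SQL queries", "embedding": [0.0, 0.0, 1.0]},
    {"text": "pointers in C", "embedding": [0.2, 1.0, 0.0]},
    {"text": "binary and light bulbs", "embedding": [0.9, 0.1, 0.0]},
]
question_embedding = [1.0, 0.0, 0.0]

results = top_n(question_embedding, documents, n=2)
for doc in results:
    print(doc["text"])
```

The texts of the returned documents would then be placed into the prompt as lecture context.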

## SRT Preprocessing

The following demonstrates one possible data processing pipeline to create embeddings from raw SRT files.

The end result is a JSONL document containing thousands of JSON documents. Each document holds a portion of the lecture caption text, its corresponding embedding, and useful metadata (e.g., week number and YouTube video ID). This JSONL document will eventually be loaded into a vector database.

In [1]:
import json
import openai
import os
import tiktoken

In [2]:
SRT_FILES = {
    "lecture0": { "video_name": "Week 0", "youtube_id": "IDDmrzzB14M" },
    "lecture1": { "video_name": "Week 1", "youtube_id": "ywg7cW0Txs4" },
    "lecture2": { "video_name": "Week 2", "youtube_id": "XmYnsO7iSI8" },
    "lecture3": { "video_name": "Week 3", "youtube_id": "4oqjcKenCH8" },
    "lecture4": { "video_name": "Week 4", "youtube_id": "AcWIE9qazLI" },
    "lecture5": { "video_name": "Week 5", "youtube_id": "X8h4dq9Hzq8" },
    "lecture6": { "video_name": "Week 6", "youtube_id": "5Jppcxc1Qzc" },
    "lecture7": { "video_name": "Week 7", "youtube_id": "zrCLRC3Ci1c" },
    "lecture8": { "video_name": "Week 8", "youtube_id": "alnzFK-4xMY" },
    "lecture9": { "video_name": "Week 9", "youtube_id": "oVA0fD13NGI" },
    "lecture10": { "video_name": "Week 10", "youtube_id": "iXG0sXlzuF0" },
    "cybersecurity": { "video_name": "Cybersecurity", "youtube_id": "Kuy4cEXpXEE" },
    "section1": { "video_name": "Section 1", "youtube_id": "EDHpBJNi6KY" },
    "section2": { "video_name": "Section 2", "youtube_id": "FxPHywzblfo" },
    "section3": { "video_name": "Section 3", "youtube_id": "djmUUa6srSY" },
    "section4": { "video_name": "Section 4", "youtube_id": "5aCAQyH-sko" },
    "section5": { "video_name": "Section 5", "youtube_id": "2Og20w6uQTs" },
    "section6": { "video_name": "Section 6", "youtube_id": "2MowgKc_anU" },
    "section7": { "video_name": "Section 7", "youtube_id": "InnVHzZeG7I" },
    "section8": { "video_name": "Section 8", "youtube_id": "Wja6Ng4UXA" },
    "section9": { "video_name": "Section 9", "youtube_id": "RmcIhrBN0m0" },
    "mario-less": { "video_name": "Mario (Less Comfortable)", "youtube_id": "NAs4FIWkJ4s" },
    "mario-more": { "video_name": "Mario (More Comfortable)", "youtube_id": "FzN9RAjYG_Q" },
    "cash": { "video_name": "Cash", "youtube_id": "Y3nWGvqt_Cg" },
    "credit": { "video_name": "Credit", "youtube_id": "dF7wNjsRBjI" },
    "readability": { "video_name": "Readability", "youtube_id": "AOVyZEh9zgE" },
    "caesar": { "video_name": "Caesar", "youtube_id": "V2uusmv2wxI" },
    "substitution": { "video_name": "Substitution", "youtube_id": "cXAoZAsgxJ4" },
    "plurality": { "video_name": "Plurality", "youtube_id": "ftOapzDjEb8" },
    "runoff": { "video_name": "Runoff", "youtube_id": "-Vc5aGywKxo" },
    "tideman": { "video_name": "Tideman", "youtube_id": "kb83NwyYI68" },
    "filter-less-intro": { "video_name": "Filter (Less Comfortable)", "youtube_id": "K0v9byp9jd0" },
    "filter-more-intro": { "video_name": "Filter (More Comfortable)", "youtube_id": "vsOsctDernw" },
    "filter-less-blur": { "video_name": "Filter / Blur (Less Comfortable)", "youtube_id": "6opWB7DaFCY" },
    "filter-more-blur": { "video_name": "Filter / Blur (More Comfortable)", "youtube_id": "dxNO-hCjT0w" },
    "filter-grayscale": { "video_name": "Filter / Grayscale", "youtube_id": "A8LA2osnAwM" },
    "filter-sepia": { "video_name": "Filter / Sepia", "youtube_id": "m0_vouQLufc" },
    "filter-reflect": { "video_name": "Filter / Reflect", "youtube_id": "dlWpx8gQdFo" },
    "speller": { "video_name": "Speller", "youtube_id": "_z57x5PGF4w" },
    "speller-load": { "video_name": "Speller / Load", "youtube_id": "-BX4wLZRwbc" },
    "speller-hash": { "video_name": "Speller / Hash", "youtube_id": "aFe05MQ56Rc" },
    "speller-size": { "video_name": "Speller / Size", "youtube_id": "3cD-_NGTw9A" },
    "speller-check": { "video_name": "Speller / Check", "youtube_id": "qPz_Mr69yE0" },
    "speller-unload": { "video_name": "Speller / Unload", "youtube_id": "qkC4l0pUvCk" },
    "dna": { "video_name": "DNA", "youtube_id": "j84b_EgntcQ" },
    "movies": { "video_name": "Movies", "youtube_id": "v5_A3giDlQs" },
    "fiftyville": { "video_name": "Fiftyville", "youtube_id": "x7Q8tJMi7cQ" },
    "finance": { "video_name": "Finance", "youtube_id": "7wPTAwT-6bA" }
}

Let's first take a look at what an SRT file looks like.

https://www.3playmedia.com/blog/create-srt-file/

In [3]:
with open("../data/caption_srts/cs50-2022/lecture0.srt") as f:
    print("\n".join(f.read().splitlines()[:30]))

1
00:00:00,000 --> 00:00:02,988


2
00:00:02,988 --> 00:00:06,474
[MUSIC PLAYING]

3
00:00:06,474 --> 00:01:13,310


4
00:01:13,310 --> 00:01:15,260
DAVID J. MALAN: All right.

5
00:01:15,260 --> 00:01:19,220
This is CS50, Harvard
University's introduction

6
00:01:19,220 --> 00:01:21,560
to the intellectual
enterprises of computer science

7
00:01:21,560 --> 00:01:22,940
and the arts of programming.



**Here are a few things to pay attention to:**

1. We want to convert all timecodes to seconds.
2. We want to attach metadata to each line of the caption text for later use.
3. We want to remove the speaker name from the caption text.
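
The three steps can be sketched on a single SRT block as follows. This is a minimal illustration, not the notebook's own pipeline: `to_seconds`, `parse_block`, and the `SPEAKER_LABEL` pattern are hypothetical names, and the pattern assumes that a speaker label is an all-caps name followed by a colon at the start of the caption text (as in `DAVID J. MALAN:` above).

```python
import re

# assumption: a speaker label is an all-caps name plus a colon at the very
# start of the caption text, e.g. "DAVID J. MALAN:" or "AUDIENCE:"
SPEAKER_LABEL = re.compile(r"^[A-Z][A-Z .'\-]*:\s*")

def to_seconds(timecode):
    """Convert hh:mm:ss,mmm to a whole number of seconds."""
    hours, minutes, seconds = timecode.split(":")
    total = int(hours) * 3600 + int(minutes) * 60 + float(seconds.replace(",", "."))
    return round(total)

def parse_block(block, metadata):
    """Turn one SRT block (sequence number, time range, text lines) into a dict."""
    lines = block.strip().split("\n")
    start, end = lines[1].split(" --> ")
    text = " ".join(lines[2:]).strip()
    text = SPEAKER_LABEL.sub("", text)  # drop a leading speaker label, if any
    return {
        "text": text,
        "metadata": {**metadata, "start": to_seconds(start), "end": to_seconds(end)},
    }

block = "4\n00:01:13,310 --> 00:01:15,260\nDAVID J. MALAN: All right."
result = parse_block(block, {"week": "Week 0"})
print(result)
```

A pattern-based approach handles any speaker without listing names, at the cost of possibly stripping other all-caps prefixes that end in a colon.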

In [4]:
def parse_timecode(timecode):
    """parse hh:mm:ss,mmm to seconds"""
    timecode = timecode.split(":")
    timecode = int(timecode[0]) * 3600 + int(timecode[1]) * 60 + float(timecode[2].replace(",", "."))
    return round(timecode)

In [5]:
jsonl_document = []
for lecture in SRT_FILES:

    with open(f"../data/caption_srts/cs50-2022/{lecture}.srt") as f:

        # remove double empty lines and split the file into blocks on empty lines
        lines = f.read()
        lines = lines.replace("\n\n\n", "\n\n")
        lines = lines.split("\n\n")

        total_lines = len(lines)
        print(f"Total caption text for {SRT_FILES[lecture]['video_name']}: {total_lines}")

        for line in lines:

            # remove SRT sequence number and skip empty lines
            line = line.split("\n")[1:]

            if (len(line) < 2):
                continue

            # remove speaker name (the captions use "DAVID J. MALAN:", which this check does not match)
            if ("DAVID MALAN:" in line[1].upper()):
                line[1] = line[1].replace("DAVID MALAN:", "")

            # create JSON document
            video_name = SRT_FILES[lecture]["video_name"]
            youtube_id = SRT_FILES[lecture]["youtube_id"]
            start_time = line[0].split(" --> ")[0]
            end_time = line[0].split(" --> ")[1]

            metadata = {}
            metadata["week"] = video_name
            metadata["youtube_id"] = youtube_id
            metadata["start"] = parse_timecode(start_time)
            metadata["end"] = parse_timecode(end_time)

            json_doc = {}
            json_doc["text"] = (" ").join(line[1:]).strip()
            json_doc["metadata"] = metadata

            # skip empty caption text
            if (json_doc["text"] == ""):
                continue

            # add to JSONL document
            jsonl_document.append(json_doc)

Total caption text for Week 0: 2971
Total caption text for Week 1: 3246
Total caption text for Week 2: 3026
Total caption text for Week 3: 2684
Total caption text for Week 4: 3330
Total caption text for Week 5: 3009
Total caption text for Week 6: 2920
Total caption text for Week 7: 2953
Total caption text for Week 8: 3351
Total caption text for Week 9: 2864
Total caption text for Week 10: 2215
Total caption text for Cybersecurity: 976
Total caption text for Section 1: 1104
Total caption text for Section 2: 1158
Total caption text for Section 3: 1264
Total caption text for Section 4: 1061
Total caption text for Section 5: 1222
Total caption text for Section 6: 1043
Total caption text for Section 7: 1406
Total caption text for Section 8: 1126
Total caption text for Section 9: 1138
Total caption text for Mario (Less Comfortable): 217
Total caption text for Mario (More Comfortable): 95
Total caption text for Cash: 156
Total caption text for Credit: 90
Total caption text for Readability: 147
Total caption text for Caesar: 272
Total caption text for Substitution

**Let's take a look at what this JSONL document looks like:**

In [6]:
print(f"Total caption text in JSONL document: {len(jsonl_document)}", end="\n\n")

preview_lines = 10
print(f"First {preview_lines} line(s) of JSONL document:")
for index, each in enumerate(jsonl_document[:preview_lines]):
    print(f"{index + 1}:", each)

Total caption text in JSONL document: 47201

First 10 line(s) of JSONL document:
1: {'text': '[MUSIC PLAYING]', 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 3, 'end': 6}}
2: {'text': 'DAVID J. MALAN: All right.', 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 73, 'end': 75}}
3: {'text': "This is CS50, Harvard University's introduction", 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 75, 'end': 79}}
4: {'text': 'to the intellectual enterprises of computer science', 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 79, 'end': 82}}
5: {'text': 'and the arts of programming.', 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 82, 'end': 83}}
6: {'text': 'My name is David Malan, and I actually took this course myself, back in 1996.', 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 83, 'end': 88}}
7: {'text': 'I was a sophomore at the time.', 'metadata': {'week': 'Wee

**Each line of the JSONL document is a valid JSON object containing lecture caption text, start/end time, YouTube video ID, and week name.**
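
As a small illustration of the format, the sketch below serializes two records to JSONL text and parses them back line by line. The second record's text is shortened for the example; everything stays in memory so nothing is written to disk.

```python
import json

records = [
    {"text": "[MUSIC PLAYING]", "metadata": {"week": "Week 0", "youtube_id": "IDDmrzzB14M", "start": 3, "end": 6}},
    {"text": "All right.", "metadata": {"week": "Week 0", "youtube_id": "IDDmrzzB14M", "start": 73, "end": 75}},
]

# serialize: one JSON object per line (json.dumps escapes newlines inside strings)
jsonl_text = "".join(json.dumps(record) + "\n" for record in records)

# parse: each non-empty line is decoded independently of the others
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(parsed == records)
```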

In [7]:
# create processed_data directory if needed
if not os.path.exists("../data/processed_data"):
    os.makedirs("../data/processed_data")

# save to JSONL file
with open("../data/processed_data/lectures_2022_raw.jsonl", "w") as f:
    for chunk in jsonl_document:
        f.write(f"{json.dumps(chunk)}\n")

## Merge caption texts to span a longer duration

### Motivation

Captions are subtitles: each one typically spans only a few seconds so that it stays readable during playback. That granularity works against embedding search.

Search results over such short captions would mostly be sentence fragments rather than complete sentences. These fragments often lack the surrounding context needed to be useful in a prompt.

To give search results more useful context, we coarsen the granularity by merging consecutive captions so that each merged text spans a longer duration.

**We need to strike a balance when picking the duration so that each JSON document does not end up containing too many caption texts.**
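
One way to express the merging idea is a single greedy pass, sketched below. This is a simplified illustration and not the notebook's exact loop: `merge_captions` is a hypothetical helper, and it assumes each caption carries `week`, `start`, and `end` in its metadata and that captions are already in playback order.

```python
def merge_captions(captions, threshold=30):
    """Greedily merge consecutive captions of the same video until a chunk spans >= threshold seconds."""
    merged = []
    current = None  # the chunk being built, or None if no chunk is open
    for caption in captions:
        meta = caption["metadata"]
        if current is not None and current["metadata"]["week"] == meta["week"]:
            # same video: extend the open chunk
            current["text"] += " " + caption["text"]
            current["metadata"]["end"] = meta["end"]
        else:
            # different video (or no open chunk): flush and start a new chunk
            if current is not None:
                merged.append(current)
            current = {"text": caption["text"], "metadata": dict(meta)}
        # close the chunk once it spans long enough
        if current["metadata"]["end"] - current["metadata"]["start"] >= threshold:
            merged.append(current)
            current = None
    if current is not None:
        merged.append(current)
    return merged

# toy captions, made up for illustration
captions = [
    {"text": "a", "metadata": {"week": "Week 0", "start": 0, "end": 10}},
    {"text": "b", "metadata": {"week": "Week 0", "start": 10, "end": 20}},
    {"text": "c", "metadata": {"week": "Week 0", "start": 20, "end": 31}},
    {"text": "d", "metadata": {"week": "Week 0", "start": 31, "end": 40}},
    {"text": "e", "metadata": {"week": "Week 1", "start": 0, "end": 5}},
]
merged = merge_captions(captions)
print([chunk["text"] for chunk in merged])
```

A larger threshold yields fewer, longer chunks with more context per search hit; a smaller one yields more precise timestamps but more fragmentary text.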

In [8]:
# we merge caption texts to form a new caption text that spans around 30 seconds
SPAN_THRESHOLD = 30 

In [9]:
# open JSONL file
jsonl_document = []
with open('../data/processed_data/lectures_2022_raw.jsonl', 'r') as f:
    for line in f:
        jsonl_document.append(json.loads(line))

# merge consecutive caption texts into chunks that each span roughly 30 seconds
new_jsonl_document = []
total_documents = len(jsonl_document)
span = 0
running_counter = 0
current_pointer = 0

while True:

    # if we are at the end of the file, break
    if (current_pointer + running_counter + 1 > total_documents):
        break

    # sanity check: the span should never be negative
    if (span < 0):
        print("ERROR: span is less than 0")
        print("current pointer: " + str(current_pointer))
        print("span: " + str(span))
        print("running counter: " + str(running_counter))
        break

    # look at the next document
    # if there is none (end of file), append the pending document and stop
    running_counter += 1
    if (current_pointer + running_counter == total_documents):
        new_jsonl_document.append(jsonl_document[current_pointer])
        break

    if (jsonl_document[current_pointer]["metadata"]["week"] == jsonl_document[current_pointer + running_counter]["metadata"]["week"]):
        span = jsonl_document[current_pointer + running_counter]["metadata"]["end"] - jsonl_document[current_pointer]["metadata"]["start"]
        jsonl_document[current_pointer]["text"] += " " + jsonl_document[current_pointer + running_counter]["text"]
        jsonl_document[current_pointer]["metadata"]["end"] = jsonl_document[current_pointer + running_counter]["metadata"]["end"]

        # if the span is greater than the threshold, append the document and reset the pointer
        if (span >= SPAN_THRESHOLD):
            new_jsonl_document.append(jsonl_document[current_pointer])
            span = 0
            current_pointer += running_counter + 1
            running_counter = 0
            continue

    # if the next document is not in the same week, append the document and reset the pointer
    else:
        new_jsonl_document.append(jsonl_document[current_pointer])
        current_pointer += running_counter
        running_counter = 0
        span = 0
        continue

**Let's look at what the new JSONL document looks like:**

In [10]:
print(f"Total caption text in JSONL document: {len(new_jsonl_document)}", end="\n\n")

preview_lines = 5
print(f"First {preview_lines} line(s) of JSONL document:")
for index, each in enumerate(new_jsonl_document[:preview_lines]):
    print(f"{index + 1}:", each)

Total caption text in JSONL document: 4352

First 5 line(s) of JSONL document:
1: {'text': '[MUSIC PLAYING] DAVID J. MALAN: All right.', 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 3, 'end': 75}}
2: {'text': "This is CS50, Harvard University's introduction to the intellectual enterprises of computer science and the arts of programming. My name is David Malan, and I actually took this course myself, back in 1996. I was a sophomore at the time. I was actually concentrating in government, because a year prior, as a first year, I'd come into Harvard thinking that I liked history and constitutional law and similar classes in high school. And so when I got here, I rather gravitated toward that which was familiar. I figured, if I liked and if I were good at that particular subject", 'metadata': {'week': 'Week 0', 'youtube_id': 'IDDmrzzB14M', 'start': 75, 'end': 108}}
3: {'text': "in high school, then that's presumably who I'm supposed to be here. But it wasn't until s

**Each JSON document's caption text is now longer and spans a longer duration, which gives us more useful and relevant lecture context in the prompt.**

In [11]:
# save new JSONL file
with open('../data/processed_data/lectures_2022_merged.jsonl', 'w') as f:
    f.write('\n'.join(json.dumps(i) for i in new_jsonl_document))

## Create embeddings for caption texts

We are now ready to create embeddings for caption texts.

**Note that we only generate embeddings for the caption text; we do not generate embeddings for the metadata.**
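
The overall shape of this step can be sketched without calling any API: pull out the text, embed it in batches, and attach each vector back to its document by position. In the sketch below, `batches` is a hypothetical helper and `fake_embed` is a hypothetical stand-in for a real embeddings request; it returns made-up two-number vectors.

```python
def batches(items, batch_size):
    """Yield (start, end, batch), where end is exclusive and clamped to len(items)."""
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]

def fake_embed(texts):
    """Hypothetical stand-in for an embeddings API call: one small vector per input."""
    return [[float(len(text)), 0.0] for text in texts]

# toy documents, made up for illustration
documents = [{"text": f"caption {i}", "metadata": {"week": "Week 0"}} for i in range(5)]
texts = [doc["text"] for doc in documents]  # only the text is embedded

embeddings = []
for start, end, batch in batches(texts, batch_size=2):
    print(f"Batch {start} to {end - 1} of {len(texts)}")
    embeddings.extend(fake_embed(batch))

# attach each embedding to its document by position; metadata is left untouched
for doc, embedding in zip(documents, embeddings):
    doc["embedding"] = embedding
```

Clamping the end index keeps the progress message accurate for the final, shorter batch.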

Reference: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb

In [12]:
jsonl_document = []
text_documents = [] # easier to work with OpenAI's API

with open("../data/processed_data/lectures_2022_merged.jsonl") as f:
    for line in f.readlines():
        line = json.loads(line)
        jsonl_document.append(line)
        text_documents.append(line["text"])

In [13]:
print(jsonl_document[0].keys())

dict_keys(['text', 'metadata'])


In [14]:
openai.api_key = os.getenv("OPENAI_API_KEY")
EMBEDDING_TOKEN_LIMIT = 8191
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request

def num_tokens_from_string(string: str, model_name: str = EMBEDDING_MODEL) -> int:
    """Returns the number of tokens in a text string."""
    # https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [15]:
# create embeddings
embeddings = []

try:
    
    # calculate total tokens needed and cost
    # https://openai.com/pricing
    total_tokens = num_tokens_from_string("".join(text_documents))
    print(f"Total tokens: {total_tokens}")
    print(f"Total cost: ${round(total_tokens * 0.0004 / 1000, 4)}")
    
    # this is a rather expensive operation, proceed with caution 
    if (input("Continue? (y/n) ") != "y"):
        raise KeyboardInterrupt

    for batch_start in range(0, len(text_documents), BATCH_SIZE):
        batch_end = batch_start + BATCH_SIZE
        batch = text_documents[batch_start:batch_end]
        print(f"Batch {batch_start} to {batch_end-1} of {len(text_documents)}")

        # call OpenAI API to create embedding for a given caption text
        response = openai.embeddings.create(model=EMBEDDING_MODEL, input=batch)

        # double check embeddings are in same order as input
        for i, be in enumerate(response.data):
            assert i == be.index

        batch_embeddings = [e.embedding for e in response.data]
        embeddings.extend(batch_embeddings)

    print("Finished creating embeddings")
            
except KeyboardInterrupt:
    print("operation aborted")

Total tokens: 524877
Total cost: $0.21
Batch 0 to 999 of 4352
Batch 1000 to 1999 of 4352
Batch 2000 to 2999 of 4352
Batch 3000 to 3999 of 4352
Batch 4000 to 4999 of 4352
Finished creating embeddings


In [16]:
# save jsonl file with embeddings
with open("../data/processed_data/lectures_2022_embeddings.jsonl", "w") as f:
    for i, embedding in enumerate(embeddings):

        # store embeddings for a given caption text
        jsonl_document[i]["embedding"] = embedding
        f.write(json.dumps(jsonl_document[i]) + "\n")


**You can take a look at what the embedding looks like for each caption text.**

**Note that because we use the `text-embedding-ada-002` model, we will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside).**

Reference: https://platform.openai.com/docs/guides/embeddings/second-generation-models

In [17]:
print(f"Total caption text in JSONL document: {len(jsonl_document)}")

Total caption text in JSONL document: 4352


In [18]:
preview_line = 42
json_doc = jsonl_document[preview_line-1]

print(f"Total characters: {len(json_doc['text'])}", f"embedding size: {len(json_doc['embedding'])}\n")
print(f"plain_text:\n{json_doc['text']}\n")
print(f"embedding:\n{json_doc['embedding']}\n")
print(f"metadata:\n{json_doc['metadata']}")

Total characters: 628 embedding size: 1536

plain_text:
AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Say it again. AUDIENCE: More lightbulbs. DAVID J. MALAN: Yeah, so more light bulbs. So let me do this. Let me just grab something to put these on, so I can use a few of them at a time. And let me propose that here, instead of having just one light bulb, let me give myself maybe three in total. So all of them are initially off, and if you think of this in miniature form, in your mind's eye, this is like a computer with three transistors. Three switches representing now the number you and I know as 0. Why? They're just all off. So how does a computer go about representing the number 1?

embedding:
[-0.003996677231043577, -0.017532531172037125, 0.006546623073518276, -0.014374825172126293, -0.004406253807246685, 0.011587060987949371, -0.023517636582255363, -0.02413860894739628, -0.03469512239098549, -0.02384794130921364, 0.04087841138243675, 0.044339995831251144, -0.013192337937653065, 0.001135419