The first step is to import the libraries and configure the Azure OpenAI API key and endpoint. Set the following environment variables, for example in a `.env` file that `load_dotenv()` reads:

- `AZURE_OPENAI_KEY` - Your Azure OpenAI API key
- `AZURE_OPENAI_ENDPOINT` - Your Azure OpenAI endpoint

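Before running the notebook, it can help to confirm that the settings are actually present. The following is a minimal sketch, not part of the original notebook: `missing_settings` is a hypothetical helper, and the variable names are passed in as a parameter, so use whichever names your code reads. The example uses a made-up dictionary instead of the real environment so that the result is predictable; call `missing_settings(names)` without the second argument to check `os.environ`.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_settings(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or blank in the given environment."""
    source = os.environ if env is None else env
    return [name for name in names if not source.get(name, "").strip()]


# Made-up environment for illustration: the key is blank, the endpoint is set.
example_env = {
    "AZURE_OPENAI_KEY": "   ",
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid/",
}
print(missing_settings(["AZURE_OPENAI_KEY", "AZURE_OPENAI_ENDPOINT"], example_env))
# prints ['AZURE_OPENAI_KEY']
```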
In [14]:
import os
import pandas as pd
import openai
from sklearn.metrics.pairwise import cosine_similarity

from dotenv import load_dotenv

# Load AZURE_OPENAI_KEY and AZURE_OPENAI_ENDPOINT from a local .env file, if present
load_dotenv()

# Embedding model name; on Azure, `model` must be the name of your embedding deployment
OPENAI_EMBEDDING_ENGINE = "text-embedding-ada-002"
# Minimum cosine similarity for a video segment to count as a match
SIMILARITIES_RESULTS_THRESHOLD = 0.75
DATASET_NAME = "embedding_index_3m.json"

openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_KEY", "").strip()
# openai>=1.0 reads the Azure endpoint from azure_endpoint (api_base is the pre-1.0 name)
openai.azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT", "").strip()
openai.api_version = "2023-07-01-preview"

In [26]:
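def get_embedding(text, model):
    """Return the embedding for `text` as a 1 x N nested list.

    The outer list gives the 2D shape that sklearn's cosine_similarity expects.
    """
    embed = openai.embeddings.create(model=model, input=text, encoding_format="float")
    return [embed.data[0].embedding]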

In [27]:
text = 'the quick brown fox jumped over the lazy dog'
embed = get_embedding(text, OPENAI_EMBEDDING_ENGINE)
print(embed)

[[-0.0044746567, 0.009786528, -0.014904951, -0.0064249854, -0.011353132, 0.015513834, -0.023721071, -0.016414473, -0.015818276, -0.029632311, 0.021298224, 0.021095263, 0.018570933, 0.0041702148, -0.0007155169, -0.007579326, 0.025217908, -0.0042146128, 0.011175542, -0.008587789, -0.009513798, 0.021577295, -0.005993693, -0.008257977, 0.006041262, 0.013040246, 0.0074397903, -0.003516934, -0.008955656, 0.0011939817, 0.0066660014, 0.0038657733, -0.03927296, -0.0025592118, -0.012761175, -0.0217422, -0.00370721, -0.010458835, 0.025979012, -0.045691602, 0.009399633, 0.01565337, -0.02261747, -0.011619519, -0.0028573107, 0.012215717, 0.010534946, -0.012532843, -0.022896541, 0.010363698, -0.0002695576, 0.0065835486, -0.009919721, -0.011625862, -0.004718844, -0.0072368295, -0.0013802936, -0.0062505654, 0.009380605, -0.0057621906, 0.010782305, 0.013547649, -0.005140622, -0.0056321686, 0.0007163097, -0.009399633, 0.011105774, -0.010534946, -0.013052931, -0.004890092, 0.029353239, -0.001888489, 0.000

Next, load the Embedding Index into a pandas DataFrame. The index is stored in a JSON file called `embedding_index_3m.json` and contains the embeddings for each of the YouTube transcripts up until late October 2023.

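To make the shape of the index concrete, here is a hypothetical record. The field names mirror the columns this notebook reads (`videoId`, `title`, `speaker`, `summary`, `seconds`, `ada_v2`), but every value is made up, and the real file may contain additional fields. The sketch uses only the standard `json` module to round-trip the record.

```python
import json

# Hypothetical record: field names mirror the columns this notebook reads;
# every value below is invented for illustration only.
sample_index = [
    {
        "videoId": "VIDEO_ID_PLACEHOLDER",
        "title": "Example session title",
        "speaker": "Example Speaker",
        "summary": "Short summary of this transcript segment.",
        "seconds": 120,  # offset of the segment from the start of the video
        "ada_v2": [0.01, -0.02, 0.03],  # real ada-002 vectors have 1536 values
    }
]

# Round-trip through JSON, the format the index file is stored in.
records = json.loads(json.dumps(sample_index))
for record in records:
    print(record["title"], "| starts at", record["seconds"], "s |",
          len(record["ada_v2"]), "dimensions")
```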
In [7]:
def load_dataset(source: str) -> pd.DataFrame:
    # Load the video session index and drop the raw transcript text, which the search does not use
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")

Next, create a function called `get_videos` that searches the Embedding Index for the query and returns the video segments most similar to it (the top 5 in this notebook). The function works as follows:

1. First, a copy of the Embedding Index is created.
2. Next, the embedding for the query is calculated using the OpenAI Embedding API.
3. Then a new column called `similarity` is added to the copy. It holds the cosine similarity between the query embedding and the embedding of each video segment.
4. Next, the index is filtered to keep only segments whose cosine similarity is greater than or equal to `SIMILARITIES_RESULTS_THRESHOLD` (0.75).
5. Finally, the remaining segments are sorted by `similarity` in descending order and the top `rows` (5 here) are returned.

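Steps 3 to 5 can be illustrated without calling the API. The sketch below is a simplified stand-in, not the notebook's implementation: it uses plain Python lists instead of pandas and scikit-learn, two-dimensional toy vectors instead of real embeddings, and the hypothetical helpers `cosine` and `top_matches`. It assumes no vector is all zeros, since that would divide by zero.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(
    query: Sequence[float],
    segments: List[Tuple[str, Sequence[float]]],
    threshold: float,
    rows: int,
) -> List[Tuple[str, float]]:
    """Score every segment, keep those at or above the threshold, return the best `rows`."""
    scored = [(title, cosine(query, vector)) for title, vector in segments]
    kept = [item for item in scored if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:rows]


# Toy data: segment C is orthogonal to the query, so it falls below the threshold.
toy_segments = [
    ("segment A", [1.0, 0.0]),
    ("segment B", [2.0, 1.0]),
    ("segment C", [0.0, 1.0]),
]
for title, score in top_matches([1.0, 0.0], toy_segments, threshold=0.75, rows=5):
    print(f"{title}: {score:.3f}")
# prints:
# segment A: 1.000
# segment B: 0.894
```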
In [33]:
def get_videos(
    query: str, dataset: pd.core.frame.DataFrame, rows: int
) -> pd.core.frame.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embeddings for the query
    query_embeddings = get_embedding(query, OPENAI_EMBEDDING_ENGINE)
    #print(query_embeddings)

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(query_embeddings, [x])
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort the videos by similarity, highest first, and return the top rows
    return video_vectors.sort_values(by="similarity", ascending=False).head(rows)

The `display_results` function prints the results of the search query.

In [9]:
def display_results(videos: pd.core.frame.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """build a YouTube URL that starts playback at the given offset in seconds"""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for index, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")

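The index stores the start of each segment as a number of seconds, which `_gen_yt_url` appends to the link as the `t` query parameter. If your own data holds `HH:MM:SS` timestamps instead (an assumption, not something this index requires), a conversion step is needed first. The sketch below uses two hypothetical helpers, `timestamp_to_seconds` and `youtube_url_at`, and expects exactly three colon-separated parts. The video ID is taken from the sample output in this notebook.

```python
def timestamp_to_seconds(timestamp: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def youtube_url_at(video_id: str, seconds: int) -> str:
    """Build a YouTube link that starts playback at the given offset."""
    return f"https://youtu.be/{video_id}?t={seconds}"


print(youtube_url_at("k-K3g4FKS_c", timestamp_to_seconds("00:03:03")))
# prints https://youtu.be/k-K3g4FKS_c?t=183
```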
The application ties these functions together as follows:

1. First, the Embedding Index is loaded into a pandas DataFrame.
2. Next, the user is prompted to enter a query.
3. Then the `get_videos` function is called to search the Embedding Index for the query.
4. Finally, the `display_results` function is called to display the results to the user.
5. The user is then prompted to enter another query. This process continues until the user enters `exit`.

![](media/notebook_search.png)

When prompted, enter a query and press Enter. The application returns a list of videos that are relevant to the query, each with a link to the place in the video where the answer to the question is located.

Here are some queries to try out:

- What is Azure Machine Learning?
- How do convolutional neural networks work?
- What is a neural network?
- Can I use Jupyter Notebooks with Azure Machine Learning?
- What is ONNX?

In [34]:
pd_vectors = load_dataset(DATASET_NAME)

# get user query from input
while True:
    query = input("Enter a query: ")
    if query == "exit":
        break
    videos = get_videos(query, pd_vectors, 5)
    display_results(videos, query)

Enter a query: How do convolutional neural networks work?

Videos similar to 'How do convolutional neural networks work?':
 - Data Science, Convolutional Neural Networks, and Machine Learning in the Cloud (Part 3 of 4)
   Summary: In this video, Seth Juarez continues his talk on data science, convolutional neural networks, and...
   YouTube: https://youtu.be/0TwbqkQ9pxk?t=0
   Similarity: [[0.85817626]]
   Speakers: Seth Juarez
 - Demystifying AI
   Summary: In this video, the concept of Convolutional Neural Networks (CNNs) in deep learning for computer...
   YouTube: https://youtu.be/k-K3g4FKS_c?t=183
   Similarity: [[0.84965972]]
   Speakers: Micheleen Harris
 - An Intuitive Approach to Machine Learning Models (Part 1 of 4)
   Summary: In this video, the speaker explains the concept of building convolutional neural networks (CNNs) from...
   YouTube: https://youtu.be/lPyK38sRWLI?t=549
   Similarity: [[0.83924915]]
   Speakers: Seth, Seth Juarez
 - Optimization, Machine Learning Model