# Motivation
I've recently gotten embeddings for each of Anthony Fantano's videos. In this notebook, I want to develop a rudimentary prototype for semantic search over them.

# Setup
The cells below will help to set up the rest of the notebook. 

I'll start by changing my working directory. 

In [1]:
%cd ..

C:\Data\Personal Study\Programming\neural-needle-drop


Now, I'll import some libraries.

In [34]:
# Import statements
import requests
import os
from requests.structures import CaseInsensitiveDict
import numpy as np
import json
from numpy import dot
from numpy.linalg import norm
from tqdm import tqdm
from pathlib import Path
import pandas as pd
import math
import traceback
from time import sleep
from IPython.display import display, Markdown

# Loading Data
Next, I'm going to load in all of the data. 

In [3]:
# Create a DataFrame containing all of the data scraped for each of the videos
tnd_data_df_records = []
for child_dir in tqdm(list(Path("data/theneedledrop_scraping/").iterdir())):
    
    # Extract the video ID from the directory name
    cur_video_id = child_dir.name
    
    # Load in the details.json file
    try:
        with open(f"data/theneedledrop_scraping/{cur_video_id}/details.json", "r") as json_file:
            cur_details_dict = json.load(json_file)
    except (OSError, ValueError):  # missing/unreadable file or invalid JSON
        cur_details_dict = {}
        
    # Load in the transcription.json file
    try:
        with open(f"data/theneedledrop_scraping/{cur_video_id}/transcription.json", "r") as json_file:
            cur_transcription_dict = json.load(json_file)
    except (OSError, ValueError):  # missing/unreadable file or invalid JSON
        cur_transcription_dict = {}
        
    # Load in the embedding
    try:
        with open(f"data/theneedledrop_scraping/{cur_video_id}/whole_video_embedding.json", "r") as json_file:
            whole_video_embedding = json.load(json_file)
    except (OSError, ValueError):  # missing/unreadable file or invalid JSON
        whole_video_embedding = None
        
    # Create a "record" for this video
    tnd_data_df_records.append({
        "video_id": cur_video_id,
        "details_dict": cur_details_dict,
        "transcription_dict": cur_transcription_dict,
        "whole_video_embedding": whole_video_embedding
    })
    
# Now, we want to create a DataFrame from the tnd_data_df_records
tnd_data_df = pd.DataFrame.from_records(tnd_data_df_records)

# Making the embeddings ndarrays instead of lists 
tnd_data_df["whole_video_embedding"] = tnd_data_df["whole_video_embedding"].apply(lambda x: np.asarray(x) if x is not None else None)

# Add a "transcription string" column 
tnd_data_df["transcription_str"] = tnd_data_df["transcription_dict"].apply(lambda x: x['text'] if 'text' in x else None)

# Add a couple of columns indicating how long each transcription is
tnd_data_df["transcription_length"] = tnd_data_df["transcription_str"].apply(lambda x: len(x) if x is not None else None)
tnd_data_df["transcription_approx_tokens"] = tnd_data_df["transcription_str"].apply(lambda x: int(math.ceil(len(x)/3.5)) if x is not None else None)

# Add a couple of columns grabbing the title and URL of the video 
tnd_data_df["video_title"] = tnd_data_df["details_dict"].apply(lambda x: x.get('title'))
tnd_data_df["video_url"] = tnd_data_df["video_id"].apply(lambda x: f"https://www.youtube.com/watch?v={x}")

100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 3974/3974 [00:25<00:00, 158.34it/s]
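
The loading cell repeats the same try/except pattern three times. As a rough sketch of how that could be factored out, here is a small helper that returns a fallback value when a JSON file is missing or can't be parsed. The name `load_json_or_default` is hypothetical (it isn't used elsewhere in this notebook), and the demo writes to a temporary directory rather than the real `data/theneedledrop_scraping/` folder.

```python
import json
import tempfile
from pathlib import Path

def load_json_or_default(path, default=None):
    """Return the parsed JSON at `path`, or `default` if the file is missing or unreadable."""
    try:
        with open(path, "r") as json_file:
            return json.load(json_file)
    except (OSError, ValueError):
        # OSError covers missing files; ValueError covers invalid JSON
        return default

# Small demonstration against a temporary directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = Path(tmp_dir) / "details.json"
    good_path.write_text(json.dumps({"title": "example"}))
    loaded = load_json_or_default(good_path, default={})
    missing = load_json_or_default(Path(tmp_dir) / "missing.json", default={})

print(loaded, missing)
```

With a helper like this, each of the three loads in the loop would shrink to a single call, with `{}` or `None` passed as the default.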


# Methods
Below, I'm going to write a few helper methods: two for generating embeddings and one for computing cosine similarity.

In [17]:
# This method will return a list of ndarrays, one embedding per string in input_text_list
# (or a list of None values, one per input string, if the API request fails)
def generate_embeddings(input_text_list, print_exceptions=False):
    
    # Get the OpenAI API key from the environment variables 
    api_key = os.getenv("OPENAI_API_KEY", "")
    
    # Build the API request
    url = "https://api.openai.com/v1/embeddings"
    headers = CaseInsensitiveDict()
    headers["Content-Type"] = "application/json"
    headers["Authorization"] = "Bearer " + api_key
    data = json.dumps({"input": input_text_list, "model": "text-embedding-ada-002"})
    
    # Send the API request
    resp = requests.post(url, headers=headers, data=data)
    
    # If the request was successful, return ndarrays of the embeddings. Otherwise, return None objects 
    if resp.status_code == 200:
        return [np.asarray(data_object['embedding']) for data_object in resp.json()['data']]
    else:
        if (print_exceptions):
            print(resp.json())
        return [None for txt in input_text_list]
    
# This method will generate the embedding for a single string
def generate_embedding(txt_input):
    return (generate_embeddings([txt_input])[0])
    
# This method will return the cosine similarity of two ndarrays
def cosine_sim(a, b):
    return dot(a, b)/(norm(a)*norm(b))
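
As a quick sanity check on `cosine_sim`, here is a minimal sketch using made-up toy vectors (not real embeddings). Vectors pointing the same way should score close to 1, perpendicular vectors close to 0, and opposite vectors close to -1. The function is redefined here only so the snippet runs on its own.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    # Dot product divided by the product of the vector lengths
    return dot(a, b) / (norm(a) * norm(b))

# Toy vectors, chosen purely for illustration
same_direction = cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_sim(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction and not on length, scaling either vector leaves the similarity unchanged.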

# Whole-Video Search Prototype

In [56]:
# Specify the search string, then generate an embedding for it
search_txt = "ameer vann"
search_txt_emb = generate_embedding(search_txt)

Now that we have this embedding, we can search the Fantano videos! 

In [57]:
# Search across all of the different embeddings to determine which videos are similar
tnd_data_with_embs_df = tnd_data_df[tnd_data_df["whole_video_embedding"].notna()].copy()
search_result_sim = tnd_data_with_embs_df.copy()
search_result_sim["cosine_sim_to_search"] = search_result_sim["whole_video_embedding"].apply(
    lambda x: cosine_sim(search_txt_emb, x))
search_result_sim = search_result_sim.sort_values("cosine_sim_to_search", ascending=False)

# We're going to print out the top_n results
top_n = 10
for rank, row in enumerate(search_result_sim.head(top_n).itertuples(), start=1):
    markdown_str = f"**#{rank}:** [{row.video_title}]({row.video_url})<br>Similarity: {row.cosine_sim_to_search:.3f}<br>"
    display(Markdown(markdown_str))

**#1:** [Ameer Vann - Emmanuel EP REVIEW](https://www.youtube.com/watch?v=zcn-Kp_OWfI)<br>Similarity: 0.785<br>

**#2:** [Various Artists - NOW That's What I Call Music, Vol. 69 COMPILATION REVIEW](https://www.youtube.com/watch?v=2ADxX13pFhU)<br>Similarity: 0.782<br>

**#3:** [ZelooperZ - Van Gogh's Left Ear ALBUM REVIEW](https://www.youtube.com/watch?v=9Dn926lHlKE)<br>Similarity: 0.772<br>

**#4:** [Paysage d'Hiver - Im Wald ALBUM REVIEW](https://www.youtube.com/watch?v=qwhoHT727TA)<br>Similarity: 0.772<br>

**#5:** [Myrkur - Mareridt ALBUM REVIEW](https://www.youtube.com/watch?v=NOvcBmZCRJ8)<br>Similarity: 0.770<br>

**#6:** [Clown Core - Van ALBUM REVIEW](https://www.youtube.com/watch?v=R68JzG7Vb7w)<br>Similarity: 0.768<br>

**#7:** [SXSW 2012 Vlog 2](https://www.youtube.com/watch?v=nbOzVokN5mc)<br>Similarity: 0.765<br>

**#8:** [The Velvet Underground & Nico - Self-Titled ALBUM REVIEW](https://www.youtube.com/watch?v=zaumyfPx3Ao)<br>Similarity: 0.764<br>

**#9:** [RECORD STORE DAY 2017 PICKS!!!!](https://www.youtube.com/watch?v=GnLciv6GeXE)<br>Similarity: 0.764<br>

**#10:** [SXSW 2012 Vlog 1](https://www.youtube.com/watch?v=hYRO6gXD7eA)<br>Similarity: 0.763<br>
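
The search cell above scores each video with a row-by-row `apply`. One possible alternative, sketched here under some assumptions, is to stack the embeddings into a single 2-D array and compute every similarity in one matrix operation. The function name `rank_by_cosine_sim` and the toy data are hypothetical, and the sketch assumes every embedding has the same length.

```python
import numpy as np

def rank_by_cosine_sim(query_emb, emb_matrix, top_n=10):
    """Return (row indices, similarities) for the top_n rows most similar to query_emb."""
    # Divide each row's dot product with the query by the product of the two norms
    row_norms = np.linalg.norm(emb_matrix, axis=1)
    sims = emb_matrix @ query_emb / (row_norms * np.linalg.norm(query_emb))
    # Sort by descending similarity and keep the first top_n positions
    top_idx = np.argsort(-sims)[:top_n]
    return top_idx, sims[top_idx]

# Toy stand-ins for the real embeddings (made-up values)
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])
top_idx, top_sims = rank_by_cosine_sim(toy_query, toy_matrix, top_n=2)

print(top_idx, top_sims)
```

With this notebook's data, the matrix could be built with something like `np.vstack(tnd_data_with_embs_df["whole_video_embedding"].tolist())`, and the returned positions would then map back to rows via `.iloc`.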