# Motivation
I want to re-create the video search using MySQL and Pinecone so that I can quickly prototype the Dash UI. This notebook contains the code for that.

# Setup
The cells below will help set up the rest of the notebook. 

I'm going to start by changing my working directory to the repo root. 

In [1]:
%cd ..

d:\data\programming\neural-needle-drop


Next, I'll import different modules. 

In [5]:
# Import statements
import pandas as pd
import pinecone
import os
import mysql.connector
import traceback
import numpy as np
import itertools
import math
from tqdm import tqdm
from time import sleep
from time import time
import requests
from requests.structures import CaseInsensitiveDict
import json
from pathlib import Path
from IPython.display import display, Markdown

After that, I'm going to set up a connection to my MySQL database. This will help me load in the relevant data. 

In [3]:
# Set up the connection to the MySQL server
cnx = mysql.connector.connect(
    user='root', password=os.getenv("MYSQL_PASSWORD"), 
    host='localhost', database='neural-needle-drop')

# Create a cursor 
cursor = cnx.cursor()

Finally, we're going to set up the connection to the Pinecone API. 

In [4]:
# Initialize the Pinecone API connection
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"))

# Setting up the index 
pinecone_index = pinecone.Index("neural-needledrop-prototype")
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.2,
 'namespaces': {'video_embeddings': {'vector_count': 69496}},
 'total_vector_count': 69496}

# Methods
The methods below will also help throughout the rest of the notebook. 

In [107]:
# This method will return a list of ndarrays, each representing text embeddings of 
# the text in each index of the input_text_list list
def generate_embeddings(input_text_list, print_exceptions=False):
    
    # Get the OpenAI API key from the environment variables 
    api_key = os.getenv("OPENAI_API_KEY", "")
    
    # Build the API request
    url = "https://api.openai.com/v1/embeddings"
    headers = CaseInsensitiveDict()
    headers["Content-Type"] = "application/json"
    headers["Authorization"] = "Bearer " + api_key
    data = json.dumps({"input": input_text_list, "model": "text-embedding-ada-002"})
    
    # Send the API request
    resp = requests.post(url, headers=headers, data=data)
    
    # If the request was successful, return ndarrays of the embeddings. Otherwise, return None objects 
    if resp.status_code == 200:
        return [np.asarray(data_object['embedding']) for data_object in resp.json()['data']]
    else:
        if (print_exceptions):
            print(resp.json())
        return [None for txt in input_text_list]
    
# This method will generate the embedding for a single string
def generate_embedding(txt_input, print_exceptions=False):
    return (generate_embeddings([txt_input], print_exceptions)[0])

def query_to_df(query, print_error=False):
    '''Query the active MySQL database and return results in a DataFrame'''

    # Try to return the results as a DataFrame
    try:
        # Execute the query
        cursor.execute(query)

        # Fetch the results 
        res = cursor.fetchall()

        # Return a DataFrame
        return pd.DataFrame(res, columns=[i[0] for i in cursor.description])

    # If we run into an Exception, return None
    except Exception as e:
        if (print_error):
            print(f"Ran into the following error:\n{e}\nStack trace:")
            print(traceback.format_exc())
        return None

# Search Flow
Below, I'm going to replicate the "search flow" step by step; eventually, I'll wrap this up into a method of its own.

1. Get the embedding for your search string from OpenAI
2. Search Pinecone for the most similar embeddings
3. Do some data transformation to determine which videos have the largest number of similar segments
4. Query the MySQL database for the details of that particular set of videos

I'll kick off the first step: getting the embedding for my search string! 

In [227]:
# Declare the user's search string
search_str = "You'll go to hell for what your dirty mind is thinking"

# Get the embedding for the search string
search_str_emb = generate_embedding(search_str, print_exceptions=True)
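
One thing worth noting: `generate_embedding` returns `None` when the API call fails, so calling `.tolist()` on the result in the next step would raise an `AttributeError`. A minimal guard, sketched here with a hypothetical helper name (`embedding_or_raise` is not part of the notebook), might look like this:

```python
def embedding_or_raise(embed_fn, text):
    """Hypothetical helper: call embed_fn(text) and fail loudly if it returned None."""
    emb = embed_fn(text)
    if emb is None:
        raise RuntimeError(f"No embedding returned for search string: {text!r}")
    return emb

# In the notebook, this would be called as:
#   search_str_emb = embedding_or_raise(generate_embedding, search_str)
```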

Next, I'll search Pinecone for the most similar embeddings. I'll only look for embeddings that are segment chunks. 

In [228]:
# Query the Pinecone index for the 5000 most similar segment chunks
pinecone_results = pinecone_index.query(
    vector=search_str_emb.tolist(),
    filter={
        "embedding_type": "segment_chunk"
    },
    top_k=5000,
    include_metadata=True,
    namespace="video_embeddings"
)

Now, I should be able to parse these results into a DataFrame, and then run some manipulations on it in order to get a sense of which videos I ought to query for from MySQL. 

In [242]:
# Create a DataFrame from the Pinecone results 
top_segment_matches_original_df = pd.DataFrame.from_records(
    [{"id": x.id, "score": x.score} | x.metadata
        for x in pinecone_results['matches']])

# Keep only the five highest-scoring segments for each video
grouped_sorted_segment_df = top_segment_matches_original_df.groupby("video_id")
top_segment_matches_df = grouped_sorted_segment_df.apply(
    lambda x: x.sort_values("score", ascending=False).head(5)).reset_index(
    drop=True).copy()

# Determine the average score across the different videos 
avg_segment_sim_by_video_df = top_segment_matches_df.groupby("video_id")["score"].mean(numeric_only=True).reset_index().rename(
    columns={"score": "avg_segment_sim"}).sort_values("avg_segment_sim", ascending=False)

median_segment_sim_by_video_df = top_segment_matches_df.groupby("video_id")["score"].median(numeric_only=True).reset_index().rename(
    columns={"score": "median_segment_sim"}).sort_values("median_segment_sim", ascending=False)

segment_ct_by_video_df = top_segment_matches_df.groupby("video_id").count().reset_index().rename(
    columns={"id": "segment_ct"}).sort_values("segment_ct", ascending=False)[["video_id", "segment_ct"]]

# Create the "scored_video_df", which tries to merge some degree of "relevance" and "frequency"
scored_video_df = segment_ct_by_video_df.merge(avg_segment_sim_by_video_df, on="video_id")
scored_video_df = scored_video_df.merge(median_segment_sim_by_video_df, on="video_id")
scored_video_df["neural_search_score"] = scored_video_df["segment_ct"] * scored_video_df["avg_segment_sim"]
scored_video_df = scored_video_df.sort_values("neural_search_score", ascending=False)

# We'll also add in information about the most similar segment in each video 
top_single_segments_per_video_df = top_segment_matches_df[top_segment_matches_df["video_id"].isin(
    list(scored_video_df.head(10)["video_id"]))]
grouped_sorted_segment_df = top_single_segments_per_video_df.groupby("video_id")
top_single_segments_per_video_df = grouped_sorted_segment_df.apply(
    lambda x: x.sort_values("score", ascending=False).head(1)).reset_index(
    drop=True).copy()
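
To make the scoring concrete, here is a small self-contained sketch on made-up data (the video IDs and scores are invented for illustration, and `score_videos` is a hypothetical name). It keeps at most `per_video_cap` segments per video, then computes the same `segment_ct * avg_segment_sim` score. The `sort_values(...).groupby(...).head(n)` idiom is one alternative to the `groupby(...).apply(...)` pattern used above, not a change to the method.

```python
import pandas as pd

# Made-up matches: video "a" has 3 segments, video "b" has 1
toy_matches_df = pd.DataFrame({
    "id": ["a-0", "a-1", "a-2", "b-0"],
    "video_id": ["a", "a", "a", "b"],
    "score": [0.5, 0.75, 0.25, 0.875],
})

def score_videos(matches_df, per_video_cap=5):
    """Keep the top `per_video_cap` segments per video, then score each video."""
    # Sorting first means head() keeps the highest-scoring rows in each group
    capped_df = (matches_df.sort_values("score", ascending=False)
                 .groupby("video_id").head(per_video_cap))
    # Named aggregation computes the count, mean, and median together
    scored_df = capped_df.groupby("video_id")["score"].agg(
        segment_ct="count",
        avg_segment_sim="mean",
        median_segment_sim="median").reset_index()
    scored_df["neural_search_score"] = (
        scored_df["segment_ct"] * scored_df["avg_segment_sim"])
    return scored_df.sort_values(
        "neural_search_score", ascending=False).reset_index(drop=True)

# With a cap of 2, video "a" keeps scores 0.75 and 0.5
toy_scored_df = score_videos(toy_matches_df, per_video_cap=2)
```

Because every video is capped at five segments, `segment_ct` saturates quickly, so among videos with five or more matches the ranking is driven by the average similarity alone.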

Now that I've got this "scored neural search" DataFrame, I'm going to grab the relevant video information from MySQL. I'll join this with the `scored_video_df`. 

In [243]:
# This query will determine the information for the top videos
top_scored_video_info_query_filter_str = " OR ".join([f'id="{row.video_id}"' for row in scored_video_df.head(10).itertuples()])
top_scored_video_info_query = f"""
SELECT
    *
FROM
    video_details
WHERE {top_scored_video_info_query_filter_str}"""

# Execute the above query 
top_scored_video_info_df = query_to_df(top_scored_video_info_query, print_error=True)

# Merge in some of the scores 
top_scored_video_info_df = top_scored_video_info_df.merge(scored_video_df, left_on="id", right_on="video_id").drop(
    columns=["video_id"]).sort_values("neural_search_score", ascending=False)
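
The filter above is built by interpolating IDs directly into the SQL string. As a hedged alternative, the sketch below builds an `IN (...)` clause with `%s` placeholders, which `mysql.connector` cursors accept through the second argument of `cursor.execute`. The helper name `build_video_details_query` is hypothetical, and `query_to_df` would need an extra argument to pass the parameter values through.

```python
def build_video_details_query(video_ids):
    """Hypothetical helper: return (query, params) for a parameterized IN lookup."""
    video_ids = list(video_ids)
    if not video_ids:
        raise ValueError("video_ids must not be empty")
    # One %s placeholder per ID; the driver handles quoting and escaping
    placeholders = ", ".join(["%s"] * len(video_ids))
    query = f"SELECT * FROM video_details WHERE id IN ({placeholders})"
    return query, tuple(video_ids)

details_query, details_params = build_video_details_query(
    ["q_gBZFYXHQI", "Kh2_xvw3LUA"])
# cursor.execute(details_query, details_params) would then run it
# against the notebook's existing connection
```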

Next, I'm going to query the MySQL database once more to get the text of the transcription segments corresponding to these embeddings. 

In [244]:
# Creating a "filter string" for the transcription query
all_video_filter_str_list = []
for row in top_single_segments_per_video_df.itertuples():
    segment_filter_str = " OR ".join([f"segment={num}" for num in list(range(int(row.start_segment), int(row.end_segment)+1))])
    all_video_filter_str_list.append(f"id='{row.video_id}' AND ({segment_filter_str})")
transcription_filter_str = " OR ".join([f"({cur_vid_filter_str})" for cur_vid_filter_str in all_video_filter_str_list])

# Crafting the transcription query 
top_segment_transcriptions_query = f"""SELECT * FROM transcriptions WHERE {transcription_filter_str}"""

# Executing the transcription query 
top_segment_transcriptions_df = query_to_df(top_segment_transcriptions_query, print_error=True)
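
Since each chunk covers a contiguous run of segments, the chain of `segment=N` comparisons could probably be collapsed into a `BETWEEN` range. This is a minimal sketch, assuming `start_segment` and `end_segment` are inclusive bounds (as the `range(..., end + 1)` call above suggests); `build_transcription_query` is a hypothetical name, and the segment numbers in the usage line are made up.

```python
def build_transcription_query(chunks):
    """Hypothetical helper.

    `chunks` is an iterable of (video_id, start_segment, end_segment) tuples.
    Returns (query, params) suitable for cursor.execute(query, params).
    """
    clauses, params = [], []
    for video_id, start_segment, end_segment in chunks:
        # BETWEEN is inclusive on both ends, matching range(start, end + 1)
        clauses.append("(id = %s AND segment BETWEEN %s AND %s)")
        params.extend([video_id, int(start_segment), int(end_segment)])
    if not clauses:
        raise ValueError("chunks must not be empty")
    query = "SELECT * FROM transcriptions WHERE " + " OR ".join(clauses)
    return query, tuple(params)

transcription_query, transcription_params = build_transcription_query(
    [("q_gBZFYXHQI", 3, 7), ("Kh2_xvw3LUA", 0, 4)])
```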

Finally, we're going to merge these "top segments" back into the `top_scored_video_info_df`. 

In [245]:
# Join together the individual segments to create segment chunks
top_segment_chunk_per_video_df = top_segment_transcriptions_df.groupby("id")["text"].apply(list).reset_index()
top_segment_chunk_per_video_df["text"] = top_segment_chunk_per_video_df["text"].apply(
    lambda seg_list: " ".join([seg.strip() for seg in seg_list]))
top_segment_chunk_per_video_df = top_segment_chunk_per_video_df.rename(columns={"text": "segment_transcription"})
top_segment_chunk_per_video_df = top_segment_chunk_per_video_df.merge(
    top_single_segments_per_video_df[["score", "video_id"]].rename(columns={"score": "top_segment_score"}), 
    left_on="id", right_on="video_id")

# Merge these segment chunks back into the top_scored_video_info_df DataFrame
top_scored_video_info_df = top_scored_video_info_df.merge(top_segment_chunk_per_video_df, on="id")

# Methodized Search
Now that I've got the search flow sketched out, I can turn it into a method. 

In [247]:
def neural_tnd_video_search(search_str):

    """
    This method will run 'neural search' across all of TheNeedleDrop's videos, using 
    the methodology I'd established with Pinecone and MySQL. 

    It'll return a single DataFrame containing the information about the top-scoring 
    videos, along with the transcription of the highest-scoring segment chunk in each video.  
    """

    # Get the embedding for the search string
    search_str_emb = generate_embedding(search_str)

    # ================================================

    # Query the Pinecone index for the 3000 most similar segment chunks
    pinecone_results = pinecone_index.query(
        vector=search_str_emb.tolist(),
        filter={
            "embedding_type": "segment_chunk"
        },
        top_k=3000,
        include_metadata=True,
        namespace="video_embeddings"
    )

    # ================================================

    # Create a DataFrame from the Pinecone results 
    top_segment_matches_original_df = pd.DataFrame.from_records(
        [{"id": x.id, "score": x.score} | x.metadata
            for x in pinecone_results['matches']])

    # Keep only the five highest-scoring segments for each video
    grouped_sorted_segment_df = top_segment_matches_original_df.groupby("video_id")
    top_segment_matches_df = grouped_sorted_segment_df.apply(
        lambda x: x.sort_values("score", ascending=False).head(5)).reset_index(
        drop=True).copy()

    # Determine the average score across the different videos 
    avg_segment_sim_by_video_df = top_segment_matches_df.groupby("video_id")["score"].mean(numeric_only=True).reset_index().rename(
        columns={"score": "avg_segment_sim"}).sort_values("avg_segment_sim", ascending=False)

    median_segment_sim_by_video_df = top_segment_matches_df.groupby("video_id")["score"].median(numeric_only=True).reset_index().rename(
        columns={"score": "median_segment_sim"}).sort_values("median_segment_sim", ascending=False)

    segment_ct_by_video_df = top_segment_matches_df.groupby("video_id").count().reset_index().rename(
        columns={"id": "segment_ct"}).sort_values("segment_ct", ascending=False)[["video_id", "segment_ct"]]

    # Create the "scored_video_df", which tries to merge some degree of "relevance" and "frequency"
    scored_video_df = segment_ct_by_video_df.merge(avg_segment_sim_by_video_df, on="video_id")
    scored_video_df = scored_video_df.merge(median_segment_sim_by_video_df, on="video_id")
    scored_video_df["neural_search_score"] = scored_video_df["segment_ct"] * scored_video_df["avg_segment_sim"]
    scored_video_df = scored_video_df.sort_values("neural_search_score", ascending=False)

    # We'll also add in information about the most similar segment in each video 
    top_single_segments_per_video_df = top_segment_matches_df[top_segment_matches_df["video_id"].isin(
        list(scored_video_df.head(10)["video_id"]))]
    grouped_sorted_segment_df = top_single_segments_per_video_df.groupby("video_id")
    top_single_segments_per_video_df = grouped_sorted_segment_df.apply(
        lambda x: x.sort_values("score", ascending=False).head(1)).reset_index(
        drop=True).copy()

    # ================================================

    # This query will determine the information for the top videos
    top_scored_video_info_query_filter_str = " OR ".join([f'id="{row.video_id}"' for row in scored_video_df.head(10).itertuples()])
    top_scored_video_info_query = f"""
    SELECT
        *
    FROM
        video_details
    WHERE {top_scored_video_info_query_filter_str}"""

    # Execute the above query 
    top_scored_video_info_df = query_to_df(top_scored_video_info_query, print_error=True)

    # Merge in some of the scores 
    top_scored_video_info_df = top_scored_video_info_df.merge(scored_video_df, left_on="id", right_on="video_id").drop(
        columns=["video_id"]).sort_values("neural_search_score", ascending=False)

    # ================================================

    # Creating a "filter string" for the transcription query
    all_video_filter_str_list = []
    for row in top_single_segments_per_video_df.itertuples():
        segment_filter_str = " OR ".join([f"segment={num}" for num in list(range(int(row.start_segment), int(row.end_segment)+1))])
        all_video_filter_str_list.append(f"id='{row.video_id}' AND ({segment_filter_str})")
    transcription_filter_str = " OR ".join([f"({cur_vid_filter_str})" for cur_vid_filter_str in all_video_filter_str_list])

    # Crafting the transcription query 
    top_segment_transcriptions_query = f"""SELECT * FROM transcriptions WHERE {transcription_filter_str}"""

    # Executing the transcription query 
    top_segment_transcriptions_df = query_to_df(top_segment_transcriptions_query, print_error=True)

    # ================================================

    # Join together the individual segments to create segment chunks
    top_segment_chunk_per_video_df = top_segment_transcriptions_df.groupby("id")["text"].apply(list).reset_index()
    top_segment_chunk_per_video_df["text"] = top_segment_chunk_per_video_df["text"].apply(
        lambda seg_list: " ".join([seg.strip() for seg in seg_list]))
    top_segment_chunk_per_video_df = top_segment_chunk_per_video_df.rename(columns={"text": "segment_transcription"})
    top_segment_chunk_per_video_df = top_segment_chunk_per_video_df.merge(
        top_single_segments_per_video_df[["score", "video_id"]].rename(columns={"score": "top_segment_score"}), 
        left_on="id", right_on="video_id")

    # Merge these segment chunks back into the top_scored_video_info_df DataFrame
    top_scored_video_info_df = top_scored_video_info_df.merge(top_segment_chunk_per_video_df, on="id")

    # ================================================

    return top_scored_video_info_df

With this search method in hand, I want to look at some results! 

In [248]:
top_scored_video_info_df = neural_tnd_video_search(
    "Bo Burnham Peace Hippy .")

top_scored_video_info_df.head(3)

| | id | title | length | channel_id | description | view_ct | channel_name | publish_date | url | segment_ct | avg_segment_sim | median_segment_sim | neural_search_score | segment_transcription | top_segment_score | video_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | q_gBZFYXHQI | Sun Kil Moon - "War On Drugs: Suck My Cock" TR... | 487 | UCt7fwAhXDy3oNFTAzF2o8Pw | Listen: http://sunkilmoon.com/mkwod/index.html... | 73526 | theneedledrop | 2014-10-11 | https://www.youtube.com/watch?v=q_gBZFYXHQI | 5 | 0.806858 | 0.802795 | 4.03429 | who he categorized as hillbillies to shut the ... | 0.823755 | q_gBZFYXHQI |
| 1 | Kh2_xvw3LUA | David Byrne - American Utopia ALBUM REVIEW | 418 | UCt7fwAhXDy3oNFTAzF2o8Pw | Listen: https://www.youtube.com/watch?v=euEgyX... | 96418 | theneedledrop | 2018-03-15 | https://www.youtube.com/watch?v=Kh2_xvw3LUA | 5 | 0.806043 | 0.801537 | 4.030215 | American Utopia. David Byrne, one of the most ... | 0.825939 | Kh2_xvw3LUA |
| 2 | KwTAPEe6A1A | Lil B- I'm Gay (I'm Happy) ALBUM REVIEW | 1223 | UCt7fwAhXDy3oNFTAzF2o8Pw | Listen: http://www.youtube.com/watch?v=PcSfimb... | 692780 | theneedledrop | 2011-07-04 | https://www.youtube.com/watch?v=KwTAPEe6A1A | 5 | 0.800715 | 0.800591 | 4.003576 | and other controversies about him, you know, m... | 0.814706 | KwTAPEe6A1A |


I'm also going to try my hand at printing the top results as Markdown. 

In [250]:
for row in top_scored_video_info_df.head(3).itertuples():
    markdown_str = f"""
    {row.title}

    {row.segment_transcription}
    """

    display(Markdown(markdown_str))


    Sun Kil Moon - "War On Drugs: Suck My Cock" TRACK REVIEW ft. Very Naughty NSFW Words

    who he categorized as hillbillies to shut the fuck up. But the biggest and most controversial incident of this occurred at an Ottawa folk festival where apparently Mark was playing at the same time as The War on Drugs, a Americana alt country psychedelic, Crouch Rock Band, whose album I reviewed earlier this year,
    


    David Byrne - American Utopia ALBUM REVIEW

    American Utopia. David Byrne, one of the most enigmatic musical figures of the 1970s and 80s, is back with a new full-length album his first solo record in like 14 years or so.
    


    Lil B- I'm Gay (I'm Happy) ALBUM REVIEW

    and other controversies about him, you know, making tracks like, I'm God, that's small potatoes at this point. This album finds little Bee at the peak of his newest dilemma, naming his new album, I'm gay, I'm happy.
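
One thing to note about the loop above: because the lines inside the f-string are indented by four spaces, Markdown treats them as a code block, which is why the output renders as preformatted text. A minimal sketch of one way around this is to build the string without leading whitespace. The helper name `format_result_markdown` is hypothetical, and bolding the title is just an assumption about how the result should look.

```python
def format_result_markdown(title, segment_transcription):
    """Hypothetical helper: build an unindented Markdown snippet for one result."""
    # A blank line between the parts makes each render as its own paragraph
    return f"**{title}**\n\n{segment_transcription.strip()}"

snippet = format_result_markdown(
    "David Byrne - American Utopia ALBUM REVIEW", "  American Utopia.  ")
# display(Markdown(snippet)) would render it in the notebook
```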
    