# Motivation
One of the pages I want to create in the Dash app is a "transcription display" page. The user will see a video at the top of the page with its transcription below, and each segment of the transcription will be colored according to how similar it is to a search query.

This notebook builds the test data for a prototype of that page!

# Setup
The cells below will help set up the rest of the notebook. 

I'm going to start by changing my working directory to the repo root. 

In [1]:
%cd ..

d:\data\programming\neural-needle-drop


Next, I'll import the modules used throughout the notebook.

In [2]:
# Import statements
import pandas as pd
import pinecone
import os
import mysql.connector
import traceback
import numpy as np
import itertools
import math
from tqdm import tqdm
from time import sleep
from time import time
import requests
from requests.structures import CaseInsensitiveDict
import json
from pathlib import Path
from IPython.display import display, Markdown

After that, I'm going to set up a connection to my MySQL database. This will help me load in the relevant data. 

In [3]:
# Set up the connection to the MySQL server
cnx = mysql.connector.connect(
    user='root', password=os.getenv("MYSQL_PASSWORD"), 
    host='localhost', database='neural-needle-drop')

# Create a cursor 
cursor = cnx.cursor()

Finally, we're going to set up the connection to the Pinecone API. 

In [4]:
# Initialize the Pinecone API connection
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"))

# Setting up the index 
pinecone_index = pinecone.Index("neural-needledrop-prototype")
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.2,
 'namespaces': {'video_embeddings': {'vector_count': 130836}},
 'total_vector_count': 130836}

# Methods
The methods below will also help throughout the rest of the notebook. 

In [5]:
# This method will return a list of ndarrays, one text embedding for each
# string in input_text_list (or a list of None objects if the request fails)
def generate_embeddings(input_text_list, print_exceptions=False):
    
    # Get the OpenAI API key from the environment variables 
    api_key = os.getenv("OPENAI_API_KEY", "")
    
    # Build the API request
    url = "https://api.openai.com/v1/embeddings"
    headers = CaseInsensitiveDict()
    headers["Content-Type"] = "application/json"
    headers["Authorization"] = "Bearer " + api_key
    data = json.dumps({"input": input_text_list, "model": "text-embedding-ada-002"})
    
    # Send the API request
    resp = requests.post(url, headers=headers, data=data)
    
    # If the request was successful, return ndarrays of the embeddings. Otherwise, return None objects 
    if resp.status_code == 200:
        return [np.asarray(data_object['embedding']) for data_object in resp.json()['data']]
    else:
        if print_exceptions:
            print(resp.json())
        return [None] * len(input_text_list)
    
# This method will generate the embedding for a single string
def generate_embedding(txt_input, print_exceptions=False):
    return generate_embeddings([txt_input], print_exceptions)[0]

def query_to_df(query, print_error=False):
    '''Query the active MySQL database and return results in a DataFrame'''

    # Try to return the results as a DataFrame
    try:
        # Execute the query
        cursor.execute(query)

        # Fetch the results 
        res = cursor.fetchall()

        # Return a DataFrame
        return pd.DataFrame(res, columns=[i[0] for i in cursor.description])

    # If we run into an Exception, return None
    except Exception as e:
        if print_error:
            print(f"Ran into the following error:\n{e}\nStack trace:")
            print(traceback.format_exc())
        return None
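
Because `generate_embeddings` returns a list of `None` objects when the request fails, a caller may want to retry before giving up. Below is a minimal sketch of one way to do that. The names `embed_with_retries`, `max_attempts`, and `base_delay` are hypothetical and not part of this notebook, and the embedding function is passed in as an argument so the sketch can run without an API key.

```python
from time import sleep


def embed_with_retries(embed_fn, texts, max_attempts=3, base_delay=1.0):
    """Call embed_fn(texts) until no result is None or attempts run out.

    embed_fn is assumed to behave like generate_embeddings: it returns one
    item per input text, and a list of None objects when the request fails.
    """
    results = [None] * len(texts)
    for attempt in range(max_attempts):
        results = embed_fn(texts)
        if all(r is not None for r in results):
            return results
        # Back off exponentially before the next attempt (not after the last)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return results
```

In this notebook, it could be called as `embed_with_retries(generate_embeddings, [user_search_query])`.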

# Collecting Data
Now that everything is properly set up, we'll want to collect the data relevant to this page. 

Below, I'll define the inputs that act as "seeds" for the data collection process: the user's search query, the embedding of that query, and the ID of the video being viewed.

In [6]:
# This is the user's search query
user_search_query = "Man on the moon, spaceship going to the firey depths of Hell"

# This is the embedding of the user's search query 
user_search_query_emb = generate_embedding(user_search_query, print_exceptions=True)

# This is the video that the user wants to compare their search query to 
cur_video_id = "J_LvLhFq2IU"

In the app, these values will already exist by the time the user reaches this page. They come from the user's behavior on the "Search" page; when the user clicks on one of the videos there, they'll be moved to this Transcription Display page.

The cell below queries Pinecone for the segment chunks of the selected video, each scored against the search query.

In [8]:
# Query the Pinecone index for all of the segment chunks in the current video
pinecone_results = pinecone_index.query(
    vector=user_search_query_emb,
    filter={
        "embedding_type": "segment_chunk",
        "video_id": cur_video_id
    },
    top_k=10000,
    include_metadata=True,
    namespace="video_embeddings"
)

# Create a DataFrame from the Pinecone results 
top_segment_matches_original_df = pd.DataFrame.from_records(
    [{"id": x.id, "score": x.score} | x.metadata
        for x in pinecone_results['matches']])

Next, we're going to collect the relevant data from the `transcriptions` table. 

In [13]:
cur_video_transcription_df = query_to_df(f"""SELECT * FROM transcriptions WHERE id="{cur_video_id}" AND segment != -1""")
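
The query above formats `cur_video_id` directly into the SQL string. Since that ID will eventually come from user behavior in the app, binding it as a query parameter may be the safer pattern. The sketch below illustrates the idea under some assumptions: it uses the standard library's `sqlite3` with an in-memory table as a stand-in so it runs without a MySQL server, and `fetch_rows` and the sample rows are hypothetical. With `mysql.connector`, the placeholder would be `%s` rather than `?`.

```python
import sqlite3


def fetch_rows(cursor, query, params=()):
    """Run a parameterized query and return the rows as a list of dicts."""
    cursor.execute(query, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in for the transcriptions table (the sample data is made up)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE transcriptions (id TEXT, segment INTEGER, text TEXT)")
cur.executemany(
    "INSERT INTO transcriptions VALUES (?, ?, ?)",
    [("vid_a", -1, "full text"), ("vid_a", 0, "first"), ("vid_b", 0, "other")],
)

# The video ID is bound as a parameter instead of being formatted into the SQL
rows = fetch_rows(
    cur,
    "SELECT * FROM transcriptions WHERE id = ? AND segment != -1",
    ("vid_a",),
)
print(rows)
```

Adapting `query_to_df` would mean giving it an optional `params` argument and forwarding it to `cursor.execute`.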

Finally, we're going to need to transform all of this data. There are two data structures that we want to create: 

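- **Segment Chunks** - This is a list of the different "segment chunks" (each running from a `start_segment` to an `end_segment`) and their relevance `score` for the search query. 
- **Segment Info** - This is essentially just the `cur_video_transcription_df`, reduced to each segment's number, its start time (renamed to `seek`), and its text. 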

In [18]:
# Creating the segment_chunk_df
segment_chunk_df = top_segment_matches_original_df[["start_segment", "end_segment", "score"]].copy()

# Creating the segment_info_df
segment_info_df = cur_video_transcription_df[["segment", "start_time", "text"]].rename(
    columns={"start_time": "seek"}).copy()
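
The page is meant to color each segment by its similarity to the search query, so the two structures will need to be combined at some point. Below is a minimal sketch of how that might look, working on plain records (the same shape as `to_dict("records")` output). Several things here are assumptions and not decisions made in this notebook: `start_segment` and `end_segment` are treated as inclusive segment numbers, a segment takes the highest score of any chunk that covers it, and the color is a yellow highlight whose opacity scales with the score. The function names and sample records are hypothetical.

```python
def segment_scores(chunks, segments):
    """Map each segment number to the best score of any chunk covering it.

    Assumes start_segment and end_segment are inclusive segment numbers.
    Segments that no chunk covers are mapped to None.
    """
    scores = {seg["segment"]: None for seg in segments}
    for chunk in chunks:
        # int() is used in case the metadata numbers come back as floats
        first = int(chunk["start_segment"])
        last = int(chunk["end_segment"])
        for seg_num in range(first, last + 1):
            if seg_num in scores:
                current = scores[seg_num]
                if current is None or chunk["score"] > current:
                    scores[seg_num] = chunk["score"]
    return scores


def score_to_color(score, low, high):
    """Return a CSS rgba() string whose opacity grows with the score."""
    if score is None or high <= low:
        return "rgba(255, 200, 0, 0.00)"
    alpha = (score - low) / (high - low)
    alpha = min(max(alpha, 0.0), 1.0)
    return f"rgba(255, 200, 0, {alpha:.2f})"


# Made-up records in the same shape as the two DataFrames
chunks = [
    {"start_segment": 0, "end_segment": 1, "score": 0.80},
    {"start_segment": 1, "end_segment": 2, "score": 0.90},
]
segments = [
    {"segment": 0, "seek": 0.0, "text": "a"},
    {"segment": 1, "seek": 4.2, "text": "b"},
    {"segment": 2, "seek": 9.1, "text": "c"},
    {"segment": 3, "seek": 12.0, "text": "d"},
]

scores = segment_scores(chunks, segments)
known = [s for s in scores.values() if s is not None]
colors = {seg: score_to_color(s, min(known), max(known)) for seg, s in scores.items()}
```

Taking the maximum is only one option; averaging the scores of overlapping chunks would give a smoother gradient across the transcript.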

We're going to save these DataFrames - they'll be the basis for testing out our page. 

In [19]:
# Saving our DataFrames
segment_chunk_df.to_json("data/test_segment_chunk_df.json", orient="records", indent=2)
segment_info_df.to_json("data/test_segment_info_df.json", orient="records", indent=2)