# **Writing Postgres Queries**
In order to write the API, I'll need a couple of different Postgres queries. I'll test them out throughout this notebook! 

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which reloads imported modules whenever their source changes
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\api


Now I'll import some necessary modules:

In [2]:
# General import statements
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker, declarative_base
import datetime
import pandas as pd

# Importing custom modules
from utils.openai import embed_text
from utils.settings import (
    POSTGRES_USER,
    POSTGRES_PASSWORD,
    POSTGRES_HOST,
    POSTGRES_PORT,
    POSTGRES_DB,
    LOG_TO_CONSOLE,
)
from utils.logging import get_logger
from utils.postgres import query_postgres
import utils.postgres_queries as pg_queries

# Set up a logger for this notebook
logger = get_logger("postgres_notebook", log_to_console=LOG_TO_CONSOLE)

Finally, I'll set up some Postgres connectors: 

In [3]:
# Create the connection string to the database
postgres_connection_string = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the connection engine
engine = create_engine(postgres_connection_string)
metadata = MetaData()
session = sessionmaker(bind=engine)()
Base = declarative_base()

# **Query Experimentation**
Below, I've collected some of my experiments with writing the API queries.

### Searching for Similar Embeddings
The crux of this project: embedding arbitrary text, and then finding the most similar embeddings to that text.

In [4]:
# First: define the query, and then embed it
query_text = "Shredding guitar, heavy drums, and fast-paced vocals"
query_embedding = embed_text(query_text)

# Now, get the most similar embeddings
most_similar_embeddings_df = pg_queries.most_similar_embeddings(query_embedding, engine, n=5)

# Show this DataFrame
most_similar_embeddings_df

Unnamed: 0,id,url,embedding_type,start_segment,end_segment,segment_length,embedding,cos_sim
0,CPVjp0i3WPQ_76_80,https://www.youtube.com/watch?v=CPVjp0i3WPQ,segment_chunk,76,80,4,"[0.0057957903,0.0022525394,-0.031274926,0.0077...",-0.17934
1,CeqFUpfTZOM_0_8,https://www.youtube.com/watch?v=CeqFUpfTZOM,segment_chunk,0,8,8,"[0.04532378,-0.038371727,-0.058565788,0.005624...",-0.183103
2,C08ZLI-JjSg_96_104,https://www.youtube.com/watch?v=C08ZLI-JjSg,segment_chunk,96,104,8,"[0.022653446,-0.005142732,-0.042279977,0.04552...",-0.189034
3,C08ZLI-JjSg_84_88,https://www.youtube.com/watch?v=C08ZLI-JjSg,segment_chunk,84,88,4,"[0.022653446,-0.005142732,-0.042279977,0.04552...",-0.189034
4,Cb3vfW2MRMY_16_24,https://www.youtube.com/watch?v=Cb3vfW2MRMY,segment_chunk,16,24,8,"[-0.013333404,0.03703987,-0.021140829,-0.02033...",-0.195559
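
To keep the idea behind this query concrete, here's a minimal, dependency-free sketch of a brute-force "most similar embeddings" search. It is not how `pg_queries.most_similar_embeddings` is implemented (that runs inside Postgres); the `top_n_similar` helper, the candidate IDs, and the toy 3-dimensional vectors are all made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(query_embedding, candidates, n=5):
    """Return the n (id, score) pairs with the highest cosine similarity.

    `candidates` is a hypothetical mapping of segment ID -> embedding.
    """
    scored = (
        (segment_id, cosine_similarity(query_embedding, embedding))
        for segment_id, embedding in candidates.items()
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy vectors, invented for this example (real embeddings are much longer)
candidates = {
    "a_0_4": [1.0, 0.0, 0.0],
    "b_0_4": [0.0, 1.0, 0.0],
    "c_0_4": [1.0, 1.0, 0.0],
}
ranked = top_n_similar([1.0, 0.1, 0.0], candidates, n=2)
print(ranked)
```

Scoring every row like this is fine for a toy example, but it scales linearly with the number of stored embeddings, which is one reason to push the search into the database instead.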


### Retrieving Video Metadata
To display information about a video, I'll need a general method that looks up its metadata. It should accept several video IDs at once, so that I can fetch metadata for multiple videos in a single call.

In [5]:
# Determine the IDs of the videos we want metadata for
video_ids = ["uCX9A3xROQo"]

# Get the video metadata for these videos
video_metadata_df = pg_queries.retrieve_multiple_video_metadata(video_ids, engine)

video_metadata_df

Unnamed: 0,id,title,length,channel_id,channel_name,short_description,description,view_ct,url,small_thumbnail_url,large_thumbnail_url,video_type,review_score,publish_date,scrape_date
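
Since the method takes a list of IDs, here's a rough sketch of how a parameterized multi-ID lookup could be built. I'm not claiming this is what `retrieve_multiple_video_metadata` does internally; the `build_video_metadata_query` helper and the `video_metadata` table name are assumptions for illustration, and only the `id` column comes from the output above.

```python
def build_video_metadata_query(video_ids, table="video_metadata"):
    """Build a parameterized SELECT for any number of video IDs.

    The table name is a placeholder; the real one isn't shown in this notebook.
    """
    if not video_ids:
        # `IN ()` is invalid SQL, so return a query that matches no rows
        return f"SELECT * FROM {table} WHERE 1 = 0", {}

    # One named bind parameter per ID keeps the values out of the SQL string
    params = {f"id_{i}": video_id for i, video_id in enumerate(video_ids)}
    placeholders = ", ".join(f":{name}" for name in params)
    return f"SELECT * FROM {table} WHERE id IN ({placeholders})", params


sql, params = build_video_metadata_query(["uCX9A3xROQo", "CPVjp0i3WPQ"])
print(sql)
print(params)
```

With SQLAlchemy, a query in this `:name` style could then be run with something like `pd.read_sql(text(sql), engine, params=params)`, where `text` is imported from `sqlalchemy`.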


### Retrieving a Video's Transcript
Another method will retrieve a video's entire transcript!

In [6]:
# Determine the ID of the video we want the transcript for
video_id = "uCX9A3xROQo"

# Query for the entire transcript
video_transcript_df = pg_queries.retrieve_multiple_video_transcripts([video_id], engine)

video_transcript_df

Unnamed: 0,url,text,segment_id,segment_seek,segment_start,segment_end,video_id
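
The transcript comes back as one row per segment, so the rows still need stitching to get the full text. Here's a small sketch of that step using plain dicts with the `video_id`, `segment_start`, and `text` column names from the output above. The `stitch_transcript` helper and the sample rows are made up; they aren't real transcript data.

```python
def stitch_transcript(segments):
    """Join transcript segments into one string per video, ordered by start time.

    `segments` is a list of dicts keyed by the transcript table's column names.
    """
    by_video = {}
    ordered = sorted(segments, key=lambda s: (s["video_id"], s["segment_start"]))
    for segment in ordered:
        by_video.setdefault(segment["video_id"], []).append(segment["text"].strip())
    return {video_id: " ".join(texts) for video_id, texts in by_video.items()}


# Invented rows, deliberately out of order
rows = [
    {"video_id": "vid_a", "segment_start": 4.0, "text": " second part."},
    {"video_id": "vid_a", "segment_start": 0.0, "text": "First part,"},
    {"video_id": "vid_b", "segment_start": 0.0, "text": "Another video."},
]
transcripts = stitch_transcript(rows)
print(transcripts["vid_a"])
```

For a DataFrame like `video_transcript_df`, `df.to_dict("records")` would produce rows in this shape.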


### Searching for Similar Embeddings (Filtered Options)
Below, I'm going to write a method to search for similar embeddings (over a filtered set of videos).


In [34]:
# Parameterize the search
# release_date_filter = [datetime.datetime(2023, 1, 1), datetime.datetime(2024, 6, 1)]
# video_type_filter = ["album_review", "mixtape_review"]
# review_score_filter = None

# Run most_similar_embeddings_to_text_filtered from the postgres_queries module
most_similar_embeddings_filtered_df = (
    pg_queries.most_similar_embeddings_to_text_filtered(
        text="tiny string piano embellishment",
        engine=engine,
        n=100,
        # release_date_filter=release_date_filter,
        # video_type_filter=video_type_filter,
        # review_score_filter=review_score_filter,
        include_text=True,
    )
)

# Show the DataFrame
most_similar_embeddings_filtered_df

Unnamed: 0,id,url,embedding_type,start_segment,end_segment,segment_length,embedding,cos_sim,text
0,dCdMYNbK8Vs_48_52,https://www.youtube.com/watch?v=dCdMYNbK8Vs,segment_chunk,48,52,4,"[-0.0062035797,-0.0024234902,0.0007060562,-0.0...",0.546421,"title thanks to a grand upward, mobile musical..."
1,dCdMYNbK8Vs_40_48,https://www.youtube.com/watch?v=dCdMYNbK8Vs,segment_chunk,40,48,8,"[-0.009083154,0.012684595,-0.005445553,-0.0252...",0.477517,tiny string and piano embellishments orbit aro...
2,cS0bI-chYN8_32_36,https://www.youtube.com/watch?v=cS0bI-chYN8,segment_chunk,32,36,4,"[0.011554193,0.013490616,-0.009588517,-0.00783...",0.411181,"Benjamin's vocal delivery is stunning, and the..."
3,CEpKouCO6L8_64_72,https://www.youtube.com/watch?v=CEpKouCO6L8,segment_chunk,64,72,8,"[0.01007167,0.046383973,0.0017122536,-0.037335...",0.404509,"again dramatic, our paginated guitar chords. A..."
4,CEpKouCO6L8_32_36,https://www.youtube.com/watch?v=CEpKouCO6L8,segment_chunk,32,36,4,"[0.026274744,0.031433225,-0.0072821644,-0.0244...",0.396319,The band starts with nothing but feverish riff...
...,...,...,...,...,...,...,...,...,...
95,CS4HP_-HEdI_16_20,https://www.youtube.com/watch?v=CS4HP_-HEdI,segment_chunk,16,20,4,"[-0.0030135568,-0.033237662,-0.012565778,-0.01...",0.296571,"It starts right from the get-go on the track, ..."
96,cDpIC-MaY4Y_56_64,https://www.youtube.com/watch?v=cDpIC-MaY4Y,segment_chunk,56,64,8,"[0.029437153,-0.006786369,-0.031319603,-0.0348...",0.296556,I think sort of putting all this atmosphere in...
97,cDpIC-MaY4Y_52_56,https://www.youtube.com/watch?v=cDpIC-MaY4Y,segment_chunk,52,56,4,"[-0.0076610213,-0.006021503,-0.033267315,-0.02...",0.296219,The song is good. Melodies great. Vocals are s...
98,CCJZO4I2SS8_32_36,https://www.youtube.com/watch?v=CCJZO4I2SS8,segment_chunk,32,36,4,"[0.03816223,0.03799007,0.0060865893,-0.0205444...",0.295856,"point where it lost its appeal. Not only that,..."
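
For the filters themselves, here's a sketch of how the three optional arguments could be turned into a `WHERE` clause with bound parameters. This is my guess at a reasonable shape, not the actual code in `postgres_queries`: the `build_filter_clause` helper is hypothetical, and I'm assuming the date and score filters are `[min, max]` pairs. The `publish_date`, `video_type`, and `review_score` column names come from the video metadata output earlier in the notebook.

```python
import datetime


def build_filter_clause(
    release_date_filter=None, video_type_filter=None, review_score_filter=None
):
    """Turn optional filters into a SQL WHERE fragment plus bound parameters.

    Assumed shapes: date and score filters are [min, max] pairs, and the
    video type filter is a list of allowed values. None means "no filter".
    """
    conditions, params = [], {}
    if release_date_filter is not None:
        conditions.append("publish_date BETWEEN :date_min AND :date_max")
        params["date_min"], params["date_max"] = release_date_filter
    if video_type_filter:
        names = [f"video_type_{i}" for i in range(len(video_type_filter))]
        placeholders = ", ".join(f":{name}" for name in names)
        conditions.append(f"video_type IN ({placeholders})")
        params.update(zip(names, video_type_filter))
    if review_score_filter is not None:
        conditions.append("review_score BETWEEN :score_min AND :score_max")
        params["score_min"], params["score_max"] = review_score_filter
    # With no filters at all, fall back to a condition that keeps every row
    where_sql = " AND ".join(conditions) if conditions else "TRUE"
    return where_sql, params


where_sql, params = build_filter_clause(
    release_date_filter=[datetime.datetime(2023, 1, 1), datetime.datetime(2024, 6, 1)],
    video_type_filter=["album_review", "mixtape_review"],
)
print(where_sql)
```

Returning `TRUE` when nothing is set means the caller can always splice the fragment into the query without special-casing the unfiltered search.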


In [28]:
# Grab the stored embedding of the top result (it comes back as a string)
most_sim_emb = most_similar_embeddings_filtered_df.iloc[0].embedding

In [33]:
most_similar_embeddings_filtered_df.iloc[0].text

'title thanks to a grand upward, mobile musical ascent marked by crashing drums, buzzing guitar chords and when the instrumental chaos clears, Mark hits upon these vocal lines that sound like parts saxophone, part premonition of what a quarter of all alt rock singers would sound like for the next 10 years or at least what they would try to sound like but fail miserably. The jazzy upright'

In [29]:
from utils.openai import embed_text

In [30]:
# Embed the same text that was used for the filtered search above
other_emb = embed_text("tiny string piano embellishment")

In [31]:
# Parse the string list into a list of floats using ast
import ast
most_sim_emb = ast.literal_eval(most_sim_emb)

In [32]:
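from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Return the cosine similarity between vectors a and b."""
    return dot(a, b) / (norm(a) * norm(b))

# Compare the stored embedding of the top result against a fresh embedding of the query text
cos_sim = cosine_similarity(most_sim_emb, other_emb)
print("Cosine Similarity: ", cos_sim)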


Cosine Similarity:  0.5464692687906129
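
The recomputed value is close to the `cos_sim` of 0.546421 that Postgres reported for the same row, but not identical. Here's a quick sketch that makes the comparison explicit with a tolerance instead of eyeballing it. The `abs_tol=1e-4` threshold is an arbitrary choice for this sketch, and I haven't pinned down where the small gap comes from.

```python
import math

stored_cos_sim = 0.546421                # cos_sim column, row 0 of the filtered results
recomputed_cos_sim = 0.5464692687906129  # value printed by the cell above

difference = abs(stored_cos_sim - recomputed_cos_sim)
print(f"Absolute difference: {difference:.2e}")

# Treat the two as matching if they agree to within the chosen tolerance
values_match = math.isclose(stored_cos_sim, recomputed_cos_sim, abs_tol=1e-4)
print("Match within tolerance:", values_match)
```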
