# **Writing Postgres Queries**
To build the API, I'll need several different Postgres queries. I'll test them out in this notebook.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\api


Now I'll import some necessary modules:

In [2]:
# General import statements
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker, declarative_base
import datetime
import pandas as pd

# Importing custom modules
from utils.openai import embed_text
from utils.settings import (
    POSTGRES_USER,
    POSTGRES_PASSWORD,
    POSTGRES_HOST,
    POSTGRES_PORT,
    POSTGRES_DB,
    LOG_TO_CONSOLE,
)
from utils.logging import get_logger
from utils.postgres import query_postgres
import utils.postgres_queries as pg_queries

# Set up a logger for this notebook
logger = get_logger("postgres_notebook", log_to_console=LOG_TO_CONSOLE)

Finally, I'll set up some Postgres connectors: 

In [3]:
# Create the connection string to the database
postgres_connection_string = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the connection engine
engine = create_engine(postgres_connection_string)
metadata = MetaData()
session = sessionmaker(bind=engine)()
Base = declarative_base()
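
One caveat with the f-string above: if the user name or password contains characters such as `@`, `/`, or `:`, the resulting URL won't parse correctly. Here's a minimal sketch of percent-encoding the credentials with the standard library. `build_postgres_url` is a hypothetical helper (it isn't part of `utils`), and the credentials in the example are made up.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, db):
    """Build a postgresql:// URL with percent-encoded credentials (hypothetical helper)."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Made-up credentials, chosen so the password contains reserved characters
example_url = build_postgres_url("api_user", "p@ss/word", "localhost", 5432, "needledrop")
```

The `@` and `/` in the password are encoded as `%40` and `%2F`, so the only unescaped `@` left in the URL is the one that separates the credentials from the host.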

# **Query Experimentation**
Below, I've collected some of my experiments with writing the API queries.

### Searching for Similar Embeddings
The crux of this project: embed arbitrary text, then find the stored embeddings most similar to it.

In [6]:
# First: define the query, and then embed it
query_text = "Shredding guitar, heavy drums, and fast-paced vocals"
query_embedding = embed_text(query_text)

# Now, get the most similar embeddings
most_similar_embeddings_df = pg_queries.most_similar_embeddings(query_embedding, engine, n=100)

# Show this DataFrame
most_similar_embeddings_df


    SELECT *, 1 - (embedding <=> '[0.009869296103715897, 0.015247329138219357, -0.021920427680015564, ... [output truncated]

|  | id | url | embedding_type | start_segment | end_segment | segment_length | embedding | cos_sim |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 67l6E8PVoFI_84_87 | https://www.youtube.com/watch?v=67l6E8PVoFI | segment_chunk | 84 | 87 | 3 | [0.054861132,0.034877438,0.025424674,0.0591738... | -0.082356 |
| 1 | _byxaI-DWeA_393_396 | https://www.youtube.com/watch?v=_byxaI-DWeA | segment_chunk | 393 | 396 | 3 | [0.009323531,0.008237091,-0.06702495,0.0231495... | -0.054609 |
| 2 | LRcZ3nbciUc_411_414 | https://www.youtube.com/watch?v=LRcZ3nbciUc | segment_chunk | 411 | 414 | 3 | [-0.006469526,0.039398286,-0.0050078044,0.0561... | -0.046334 |
| 3 | _uySf429Z7E_33_36 | https://www.youtube.com/watch?v=_uySf429Z7E | segment_chunk | 33 | 36 | 3 | [-0.008398446,-0.028436717,-0.0023254154,0.046... | -0.043805 |
| 4 | iXo7_VP06RI_393_396 | https://www.youtube.com/watch?v=iXo7_VP06RI | segment_chunk | 393 | 396 | 3 | [0.032006793,0.008628478,0.027461171,0.0479867... | -0.041323 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | YqLDx2PnOYM_60_63 | https://www.youtube.com/watch?v=YqLDx2PnOYM | segment_chunk | 60 | 63 | 3 | [0.011138871,-0.015118246,0.049215913,0.098611... | -0.000957 |
| 96 | LRcZ3nbciUc_462_465 | https://www.youtube.com/watch?v=LRcZ3nbciUc | segment_chunk | 462 | 465 | 3 | [-0.012829968,-0.014875153,-0.024233062,0.0406... | -0.000941 |
| 97 | ztqWF4z9kAw_15_18 | https://www.youtube.com/watch?v=ztqWF4z9kAw | segment_chunk | 15 | 18 | 3 | [0.02260463,-0.017916577,-0.027448477,0.067077... | -0.000725 |
| 98 | P51fbOulZy4_528_531 | https://www.youtube.com/watch?v=P51fbOulZy4 | segment_chunk | 528 | 531 | 3 | [0.014276995,-0.01431767,-0.025381325,0.032729... | -0.000675 |
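
The logged SQL computes `1 - (embedding <=> '[...]')`. Assuming the `embedding` column is a pgvector `vector`, `<=>` is the cosine distance operator, so that expression is the cosine similarity reported in the `cos_sim` column. Here's a minimal pure-Python sketch of the same ranking on toy vectors. The function names and the toy data are hypothetical and are not taken from `pg_queries`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(query, candidates, n):
    """Rank (id, vector) pairs by cosine similarity to the query, highest first."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 2-D vectors standing in for real embeddings
toy_query = [1.0, 0.0]
toy_candidates = [
    ("same_direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("opposite", [-1.0, 0.0]),
]
ranked = top_n_similar(toy_query, toy_candidates, n=2)
```

One thing I noticed while comparing this against the output: the `cos_sim` values shown are all slightly negative and increase down the table. If the intent is "highest similarity first", the ordering in the query may be worth double-checking.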


### Retrieving Video Metadata
To display information about a video, I'll need a general method for looking up its metadata. It should accept multiple video IDs so that several videos can be fetched in one call.

In [7]:
# Determine the IDs of the videos we want metadata for
video_ids = ["uCX9A3xROQo"]

# Get the metadata for these videos
video_metadata_df = pg_queries.retrieve_multiple_video_metadata(video_ids, engine)

video_metadata_df

|  | id | title | length | channel_id | channel_name | short_description | description | view_ct | url | small_thumbnail_url | large_thumbnail_url | video_type | review_score | publish_date | scrape_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | uCX9A3xROQo | Armand Hammer - We Buy Diabetic Test Strips AL... | 512 | UCt7fwAhXDy3oNFTAzF2o8Pw | theneedledrop | Listen: https://armandhammer.bandcamp.com/albu... | Listen: https://armandhammer.bandcamp.com/albu... | 154042 | https://www.youtube.com/watch?v=uCX9A3xROQo | https://i.ytimg.com/vi/uCX9A3xROQo/default.jpg | https://i.ytimg.com/vi/uCX9A3xROQo/sddefault.jpg | album_review | 9 | 2023-10-05 | 2024-01-06 00:52:26.235028 |


### Retrieving a Video's Transcript
Another method will retrieve a video's entire transcript.

In [8]:
# Determine the ID of the video we want the transcript for
video_id = "uCX9A3xROQo"

# Query for the entire transcript
video_transcript_df = pg_queries.retrieve_multiple_video_transcripts([video_id], engine)

video_transcript_df

|  | url | text | segment_id | segment_seek | segment_start | segment_end | video_id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://www.youtube.com/watch?v=uCX9A3xROQo | Hi everyone, Speaker of the House here, the I... | 0 | 0 | 0 | 9 | uCX9A3xROQo |
| 1 | https://www.youtube.com/watch?v=uCX9A3xROQo | We by diabetic test strips. | 1 | 0 | 9 | 12 | uCX9A3xROQo |
| 2 | https://www.youtube.com/watch?v=uCX9A3xROQo | Okay, for how much? | 2 | 0 | 12 | 14 | uCX9A3xROQo |
| 3 | https://www.youtube.com/watch?v=uCX9A3xROQo | Uh, zero dollars. | 3 | 0 | 14 | 16 | uCX9A3xROQo |
| 4 | https://www.youtube.com/watch?v=uCX9A3xROQo | It's just a title. | 4 | 0 | 16 | 17 | uCX9A3xROQo |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | https://www.youtube.com/watch?v=uCX9A3xROQo | Hit the like if you like. | 137 | 49042 | 501 | 502 | uCX9A3xROQo |
| 138 | https://www.youtube.com/watch?v=uCX9A3xROQo | Please subscribe and please don't cry. | 138 | 49042 | 502 | 503 | uCX9A3xROQo |
| 139 | https://www.youtube.com/watch?v=uCX9A3xROQo | Hit the bell as well over here next to my hea... | 139 | 49042 | 503 | 507 | uCX9A3xROQo |
| 140 | https://www.youtube.com/watch?v=uCX9A3xROQo | Hit that up with a link to subscribe to the c... | 140 | 49042 | 507 | 509 | uCX9A3xROQo |
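
The embedding IDs from the similarity search (for example `67l6E8PVoFI_84_87`, with `start_segment` 84, `end_segment` 87, and `segment_length` 3) appear to refer to ranges of `segment_id` values in a transcript like this one. Here's a minimal sketch of mapping such a chunk back to its text. It assumes the range is half-open (start included, end excluded), which is consistent with `segment_length = 3` but is not confirmed by anything in this notebook. `chunk_text` is a hypothetical helper, and the sample rows are copied from the output above.

```python
def chunk_text(transcript_rows, start_segment, end_segment):
    """Join the text of rows with start_segment <= segment_id < end_segment.

    Hypothetical helper; the half-open range is an assumption.
    """
    selected = [
        row for row in transcript_rows
        if start_segment <= row["segment_id"] < end_segment
    ]
    # Keep the segments in transcript order before joining them
    selected.sort(key=lambda row: row["segment_id"])
    return " ".join(row["text"].strip() for row in selected)


# A few rows copied from the transcript output
sample_rows = [
    {"segment_id": 1, "text": "We by diabetic test strips."},
    {"segment_id": 2, "text": "Okay, for how much?"},
    {"segment_id": 3, "text": "Uh, zero dollars."},
    {"segment_id": 4, "text": "It's just a title."},
]
chunk = chunk_text(sample_rows, start_segment=1, end_segment=4)
```

With these rows, the chunk covers segments 1 through 3 and leaves out segment 4.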


### Searching for Similar Embeddings (Filtered Options)
Below, I'll write a method that searches for similar embeddings over a filtered set of videos.
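
Before calling the real method, here's a minimal sketch of how optional filters like these might be turned into a parameterized `WHERE` clause. This is not the code in `pg_queries`: `build_filter_clause` is a hypothetical helper, the column names are taken from the video metadata output earlier in this notebook, and treating the date and score filters as inclusive `[min, max]` ranges is an assumption. The `:name` placeholders follow the bind-parameter style used by SQLAlchemy's `text()`.

```python
import datetime


def build_filter_clause(release_date_filter=None, video_type_filter=None,
                        review_score_filter=None):
    """Return (where_sql, params); filters left unset are skipped (hypothetical helper)."""
    conditions, params = [], {}
    if release_date_filter is not None:
        conditions.append("publish_date BETWEEN :min_date AND :max_date")
        params["min_date"], params["max_date"] = release_date_filter
    if video_type_filter:
        # One bind parameter per allowed video type
        names = [f"video_type_{i}" for i in range(len(video_type_filter))]
        placeholders = ", ".join(f":{name}" for name in names)
        conditions.append(f"video_type IN ({placeholders})")
        params.update(zip(names, video_type_filter))
    if review_score_filter is not None:
        conditions.append("review_score BETWEEN :min_score AND :max_score")
        params["min_score"], params["max_score"] = review_score_filter
    where_sql = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    return where_sql, params


example_where, example_params = build_filter_clause(
    release_date_filter=[datetime.datetime(2023, 1, 1), datetime.datetime(2024, 6, 1)],
    video_type_filter=["album_review", "mixtape_review"],
    review_score_filter=[3, 5],
)
```

Keeping the values in `example_params` rather than formatting them into the SQL string lets the database driver handle quoting.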


In [9]:
# Parameterize the search
release_date_filter = [datetime.datetime(2023, 1, 1), datetime.datetime(2024, 6, 1)]
video_type_filter = ["album_review", "mixtape_review"]
review_score_filter = [3, 5]

# Run most_similar_embeddings_to_text_filtered from the postgres_queries module
most_similar_embeddings_filtered_df = (
    pg_queries.most_similar_embeddings_to_text_filtered(
        text="tiny string piano embellishment",
        engine=engine,
        n=300,
        release_date_filter=release_date_filter,
        video_type_filter=video_type_filter,
        review_score_filter=review_score_filter,
        include_text=True,
    )
)

# Show the DataFrame
most_similar_embeddings_filtered_df

ProgrammingError: (psycopg2.errors.UndefinedColumn) column text_segments_to_fetch.video_url does not exist
LINE 10:         text_segments_to_fetch.video_url = transcriptions.ur...
                 ^

[SQL: 
        SELECT
        transcriptions.*,
        text_segments_to_fetch.embedding_id
        FROM
        transcriptions
        JOIN
        text_segments_to_fetch
        ON
        text_segments_to_fetch.video_url = transcriptions.url
        ]
(Background on this error at: https://sqlalche.me/e/20/f405)
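
The error says that `text_segments_to_fetch` has no `video_url` column. One guess, and it is only a guess, is that the join key is named `url`, as it is in the embeddings output earlier in this notebook. Here's a minimal sketch of checking column names before joining. It uses an in-memory SQLite database as a stand-in for Postgres, the table contents are made up, and `column_names` is a hypothetical helper. For a real Postgres table, the equivalent check would query `information_schema.columns`.

```python
import sqlite3

# In-memory stand-in for the two relations named in the error message.
# The schema of text_segments_to_fetch is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transcriptions (url TEXT, text TEXT, segment_id INTEGER);
    CREATE TABLE text_segments_to_fetch (embedding_id TEXT, url TEXT);
    INSERT INTO transcriptions VALUES ('https://example.com/v1', 'hello', 0);
    INSERT INTO text_segments_to_fetch VALUES ('v1_0_3', 'https://example.com/v1');
    """
)


def column_names(connection, table):
    """List a table's column names (hypothetical helper; SQLite-specific PRAGMA)."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk)
    return [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]


segment_columns = column_names(conn, "text_segments_to_fetch")
has_video_url = "video_url" in segment_columns

# Join on the column that actually exists in this stand-in schema
rows = conn.execute(
    """
    SELECT transcriptions.*, text_segments_to_fetch.embedding_id
    FROM transcriptions
    JOIN text_segments_to_fetch
      ON text_segments_to_fetch.url = transcriptions.url
    """
).fetchall()
```

If `text_segments_to_fetch` turns out to be a CTE rather than a table, the fix would instead be in the `SELECT` list that defines it.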