# **Creating a Neural Search Method**
Now that I have some helper methods from the **`01. Writing Postgres Queries`** notebook, I can write a "neural search" method for the API to use.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\api


Now I'll import some necessary modules:

In [2]:
# General import statements
import pandas as pd
import datetime
from IPython.display import Markdown, display

# Importing custom modules
from utils.settings import (
    POSTGRES_USER,
    POSTGRES_PASSWORD,
    POSTGRES_HOST,
    POSTGRES_PORT,
    POSTGRES_DB,
)
import utils.postgres_queries as pg_queries
import utils.postgres as postgres
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker, declarative_base

I'll also set up my connection to the Postgres server: 

In [3]:
# Create the connection string to the database
postgres_connection_string = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the connection engine
engine = create_engine(postgres_connection_string)
metadata = MetaData()
session = sessionmaker(bind=engine)()
Base = declarative_base()

# **Prototyping Neural Search**
Below, I'm going to prototype a search method. 

I'll start by parameterizing the search:

In [72]:
# Parameterize the search
query = "radiohead"
release_date_filter = [datetime.datetime(2015, 1, 1), datetime.datetime(2024, 6, 1)]
video_type_filter = ["album_review", "mixtape_review"]
review_score_filter = [6, 10]

# Extra parameters
n_chunks_to_consider_initially = 10000
n_most_similar_chunks_per_video = 5
n_videos_to_return = 10
n_segment_chunks_to_showcase = 3
min_similar_chunks = 2

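Now I'll identify the most similar segments. First, I'll raise the pgvector index search parameters (`hnsw.ef_search` and `ivfflat.probes`), which trade query speed for better recall: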

In [73]:
postgres.query_postgres(
    "SET hnsw.ef_search = 1000",
    engine,
)

postgres.query_postgres(
    "SET ivfflat.probes = 100;",
    engine
)

In [74]:
# Run the query
similar_chunks_df = pg_queries.most_similar_embeddings_to_text_filtered(
    text=query,
    engine=engine,
    n=n_chunks_to_consider_initially,
    release_date_filter=release_date_filter,
    video_type_filter=video_type_filter,
    review_score_filter=review_score_filter,
    include_text=False,
    nearest_neighbors_to_screen=None,
)


EXPLAIN of the query:

                                           QUERY PLAN
0   Limit  (cost=10366.57..14678.49 rows=10000 wid...
1     ->  Hash Join  (cost=10366.57..26015.39 rows...
2           Hash Cond: ((embeddings_1.id)::text = ...
3           ->  Index Scan using embeddings_embedd...
4                 Order By: (embedding <=> '[0.017...
5           ->  Hash  (cost=6667.16..6667.16 rows=...
6                 ->  Hash Join  (cost=937.08..666...
7                       Hash Cond: ((embeddings.ur...
8                       ->  Seq Scan on embeddings...
9                       ->  Hash  (cost=926.02..92...
10                            ->  Seq Scan on vide...
11                                  Filter: ((publ...
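The `<=>` in the plan above is pgvector's cosine distance operator. Assuming the `cos_sim` column used later is derived as one minus that distance (I have not shown the query itself here, so treat that as an assumption), the relationship can be sketched in plain Python with made-up vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, which is what pgvector's `<=>` operator computes."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors, made up for illustration; real embeddings have far more dimensions
query_vec = [1.0, 0.0]
chunk_vec = [1.0, 1.0]

sim = cosine_similarity(query_vec, chunk_vec)
dist = cosine_distance(query_vec, chunk_vec)
```

Ordering by ascending distance is the same as ordering by descending similarity, which is why the index scan can serve a "most similar first" query.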


With these chunks in hand, I'll aggregate them by video and score each video to find the best matches.

In [75]:
# Groupby `url`, and take the top `n_most_similar_chunks_per_video` chunks per video
aggregated_similar_chunks_df = similar_chunks_df.groupby("url").head(
    n_most_similar_chunks_per_video
)

# Aggregate the similarity statistics
aggregated_similar_chunks_df = (
    aggregated_similar_chunks_df.groupby("url")
    .agg(
        median_similarity=("cos_sim", "median"),
        average_similarity=("cos_sim", "mean"),
        n_similar_chunks=("cos_sim", "count"),
        max_similarity=("cos_sim", "max"),
    )
    .reset_index()
)

# Add a z-score column for the maximum similarity
aggregated_similar_chunks_df["cos_sim_z_score"] = (
    aggregated_similar_chunks_df["max_similarity"]
    - aggregated_similar_chunks_df["max_similarity"].mean()
) / aggregated_similar_chunks_df["max_similarity"].std()

# Add a weighted z-score column (max-similarity z-score multiplied by the chunk count)
aggregated_similar_chunks_df["weighted_z_score"] = (
    aggregated_similar_chunks_df["cos_sim_z_score"]
    * aggregated_similar_chunks_df["n_similar_chunks"]
)

# Add a weighted median similarity column
aggregated_similar_chunks_df["weighted_median_similarity"] = (
    aggregated_similar_chunks_df["median_similarity"]
    * aggregated_similar_chunks_df["n_similar_chunks"]
)

# Add a weighted average similarity column
aggregated_similar_chunks_df["weighted_average_similarity"] = (
    aggregated_similar_chunks_df["average_similarity"]
    * aggregated_similar_chunks_df["n_similar_chunks"]
)



# Sort by the weighted z-score, drop videos with too few similar chunks, and keep the top videos
aggregated_similar_chunks_df = (
    aggregated_similar_chunks_df.sort_values("weighted_z_score", ascending=False)
    .query(f"n_similar_chunks>={min_similar_chunks}")
    .head(n_videos_to_return)
)
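To make the `weighted_z_score` ranking concrete, here is a small standard-library sketch on made-up per-video aggregates (the URLs and numbers are hypothetical). One thing it illustrates: because a z-score can be negative, multiplying by the chunk count pushes a below-average video further down the more chunks it has.

```python
import statistics

# Hypothetical per-video aggregates: (url, max_similarity, n_similar_chunks)
videos = [
    ("video_a", 0.40, 5),
    ("video_b", 0.30, 5),
    ("video_c", 0.30, 2),
    ("video_d", 0.20, 5),
    ("video_e", 0.20, 2),
]

max_sims = [max_sim for _, max_sim, _ in videos]
mean = statistics.mean(max_sims)
std = statistics.stdev(max_sims)  # sample standard deviation, like pandas' default .std()

# z-score of each video's max similarity, weighted by its chunk count
weighted = {
    url: (max_sim - mean) / std * n_chunks
    for url, max_sim, n_chunks in videos
}

# Highest weighted z-score first
ranking = sorted(weighted, key=weighted.get, reverse=True)
```

Here `video_b` outranks `video_c` (same similarity, more chunks), but `video_d` lands below `video_e` for the same reason in reverse. That only affects the tail of the ranking, so it may not matter when just the top `n_videos_to_return` videos are kept.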

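Now, we're going to get some metadata about each video back. This will involve uploading the aggregated results to a scratch table in Postgres (a regular table that is replaced on each run, not a true `TEMPORARY` table), and then joining it to the `video_metadata` table: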

In [76]:
# Write aggregated_similar_chunks_df to a scratch table called `temp_similar_chunks` (replaced if it already exists)
with engine.connect() as conn:
    aggregated_similar_chunks_df.to_sql(
        "temp_similar_chunks", conn, if_exists="replace", index=False
    )

# Now, select the entire `video_metadata` table for each of the videos in the `temp_similar_chunks` table
similar_chunks_video_metadata_df = postgres.query_postgres(
    """
    SELECT 
        video_metadata.*, 
        temp_similar_chunks.median_similarity, 
        temp_similar_chunks.n_similar_chunks, 
        temp_similar_chunks.weighted_median_similarity,
        temp_similar_chunks.weighted_z_score
    FROM video_metadata
    JOIN temp_similar_chunks
    ON video_metadata.url = temp_similar_chunks.url
    ORDER BY temp_similar_chunks.weighted_median_similarity DESC
    """,
    engine=engine,
)
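Since `to_sql(..., if_exists="replace")` drops and recreates a table with a fixed name, two API requests running at the same time could overwrite each other's rows. One possible mitigation, sketched here with a hypothetical helper that is not part of `utils`, is to give each request its own table name:

```python
import uuid


def make_scratch_table_name(prefix="temp_similar_chunks"):
    """Return a unique table name for a single search request.

    The name uses only letters, digits, and underscores, and with the default
    prefix it stays under Postgres's 63-byte identifier limit.
    """
    return f"{prefix}_{uuid.uuid4().hex}"


name_1 = make_scratch_table_name()
name_2 = make_scratch_table_name()
```

The caller would pass this name to `to_sql` and the join query, then drop the table once the metadata has been fetched.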

Finally, we're going to prepare our results. The result is `similar_chunks_video_metadata_df` plus, for each video, its `n_segment_chunks_to_showcase` most similar segment chunks.

In [77]:
# Create a DataFrame containing the segment chunks I want to showcase
segment_chunks_to_showcase_df = (
    (
        pg_queries.fetch_text_for_segments(
            similar_chunks_df[
                similar_chunks_df["url"].isin(
                    similar_chunks_video_metadata_df["url"].unique()
                )
            ],
            engine,
        )
        .sort_values("cos_sim", ascending=False)
        .groupby("url")
        .head(n_segment_chunks_to_showcase)
        .sort_values(["url", "cos_sim"], ascending=False)
    )
    .groupby("url")
    .agg(
        top_segment_chunks=("text", lambda x: list(x)),
    )
    .reset_index()
)

# Merge this DataFrame with the video metadata
segment_chunks_to_showcase_df = segment_chunks_to_showcase_df.merge(
    similar_chunks_video_metadata_df, on="url"
).sort_values("weighted_z_score", ascending=False)

Now, I'm going to print the results using some nice formatting:

In [78]:
for _, row in segment_chunks_to_showcase_df.head(3).iterrows():
    display(Markdown(f"**{row['title']}**"))
    display(Markdown("\n".join([f"* {chunk}" for chunk in row['top_segment_chunks']])))


**Radiohead - A Moon Shaped Pool ALBUM REVIEW**

* Radiohead sounded like they were moving in a direction which was very soft Very subtle nuanced and maybe even depressing and that's in fact what happens on the rest of the album and And in album like that naturally is not going to appeal to everybody
* Dismal like with the song Dex Dark where Leirically Tom's coming in and telling the story about a UFO that comes in and sort of cripples humanity with this Definning sound a metaphor
* The internet's busiest music nerd and it's time for you of this new radio head album a moon shaped pool Lp number nine From UK based art rock outfit radio head five years later following up their last full length album the king of limbs Which received mixed reviews personally wasn't too crazy about the album you could check out my review for it

**The Smile - Wall of Eyes ALBUM REVIEW**

* This is the latest following Thel-P from the smile, a new-ish band out of the UK with musicians in it. Musicians you may have heard of, like guitarist, keyboardist, composer, Johnny Greenwood, as well as singer and songwriter Tom York. Both of whom are known for work they've done previously, and a little band named radiohead.
* There is something to it. It brings something different to the table different than what you usually hear out of a radiohead or radiohead adjacent track. Because while the smile is a different band, I don't think Johnny and Tom have made too much of an effort.
* a many of which have jazz or jazz funk leanings, which made him kind of a curious choice for this new project with Johnny and Tom. Given that's not necessarily where radiohead is coming from stylistically. But I can't deny the groovy and cerebral rhythms that he often lays in the background of the Smiles tracks.

**Lianne La Havas - Self-Titled ALBUM REVIEW**

* 
* There's almost certainly a statement being made in the album being self-titled too, because this album in my view is certainly unfiltered, it feels pure, it's almost as if Leanne Now this change in sound is a bit of a trade-off as the sound of Lian LaHavis here is
* 
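The last result above shows two empty bullets, which suggests some segments came back with blank text. A minimal sketch of one way to handle that, using plain Python rows instead of a DataFrame (the function name and sample rows are made up), is to drop blank text before picking the top chunks per video:

```python
from collections import defaultdict


def top_chunks_per_video(rows, n):
    """Group (url, text, cos_sim) rows by url and keep the n most similar non-blank texts."""
    by_url = defaultdict(list)
    for url, text, cos_sim in rows:
        if text and text.strip():  # skip missing or whitespace-only text
            by_url[url].append((cos_sim, text))
    return {
        url: [text for _, text in sorted(chunks, key=lambda c: c[0], reverse=True)[:n]]
        for url, chunks in by_url.items()
    }


# Hypothetical rows: (url, text, cos_sim)
rows = [
    ("video_a", "first chunk", 0.31),
    ("video_a", "", 0.35),
    ("video_a", "second chunk", 0.28),
    ("video_a", "third chunk", 0.40),
    ("video_b", "only chunk", 0.22),
]

showcase = top_chunks_per_video(rows, n=2)
```

This differs from the cell above, which keeps blank text if it ranks in the top `n_segment_chunks_to_showcase`. Filtering first means a video still gets its full quota of readable chunks when enough exist.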

# **Functionalized Version**
I wrapped all of the code above into a single method, `neural_search`, which lives in `utils.search`. Below, I'll show it off:

In [79]:
from utils.search import neural_search
import json

# Run the search
results = neural_search(
    query="jim carrey",
    release_date_filter=None,
    video_type_filter=None,
    review_score_filter=None,
    n_most_similar_chunks_per_video=5,
    n_videos_to_return=10,
    n_segment_chunks_to_showcase=3,
    min_similar_chunks=2,
    n_chunks_to_consider_initially=1000,
    nearest_neighbors_to_screen=None,
    ivfflat_probes=50
)

# Create a DataFrame
results_df = pd.DataFrame(json.loads(results))

results_df


EXPLAIN of the query:

                                          QUERY PLAN
0  Limit  (cost=4042.32..4676.62 rows=1000 width=78)
1    ->  Hash Left Join  (cost=4042.32..113886.75...
2          Hash Cond: ((embeddings.url)::text = (...
3          ->  Nested Loop  (cost=3116.30..109713...
4                ->  Index Scan using embeddings_...
5                      Order By: (embedding <=> '...
6                ->  Index Scan using embeddings_...
7                      Index Cond: ((id)::text = ...
8          ->  Hash  (cost=873.23..873.23 rows=42...
9                ->  Seq Scan on video_metadata  ...
Total time to get similar chunks: 1.4599530696868896 seconds.
Total time to aggregate and sort: 0.006056547164916992 seconds.
Total time to get video metadata and text: 3.284275770187378 seconds.


Unnamed: 0,url,top_segment_chunks,id,title,length,channel_id,channel_name,short_description,description,view_ct,small_thumbnail_url,large_thumbnail_url,video_type,review_score,publish_date,scrape_date,median_similarity,n_similar_chunks,weighted_median_similarity,weighted_z_score
0,https://www.youtube.com/watch?v=knCe71tRofw,[who historically has shown herself to be obse...,knCe71tRofw,Lorde - Solar Power ALBUM REVIEW,675,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,Listen: https://www.youtube.com/watch?v=P103bW...,Listen: https://www.youtube.com/watch?v=P103bW...,655464,https://i.ytimg.com/vi/knCe71tRofw/default.jpg,https://i.ytimg.com/vi/knCe71tRofw/sddefault.jpg,album_review,4.0,1629676800000,1704542437455,0.249871,5,1.249354,14.842961
1,https://www.youtube.com/watch?v=L3E0kq9YkjA,[That was my gun. But that's because we run in...,L3E0kq9YkjA,THE WORST ALBUM OF 2016 (Corey Feldman's Angel...,3053,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,Listen: https://www.youtube.com/watch?v=A51nLF...,Listen: https://www.youtube.com/watch?v=A51nLF...,2824802,https://i.ytimg.com/vi/L3E0kq9YkjA/default.jpg,https://i.ytimg.com/vi/L3E0kq9YkjA/sddefault.jpg,misc,,1474934400000,1704633144694,0.268495,5,1.342477,14.416697
2,https://www.youtube.com/watch?v=Ni258ri7U3o,"[It's so funny because like, I feel like while...",Ni258ri7U3o,TND Podcast #57: How To Get Into Death Grips f...,5218,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,"In this episode of the TND Podcast, my guest i...","In this episode of the TND Podcast, my guest i...",250735,https://i.ytimg.com/vi/Ni258ri7U3o/default.jpg,https://i.ytimg.com/vi/Ni258ri7U3o/sddefault.jpg,tnd_podcast,,1491436800000,1704641419607,0.265107,5,1.325536,11.474882
3,https://www.youtube.com/watch?v=Q3x4SX9Uu_4,[I guess you just seems very emotionless and t...,Q3x4SX9Uu_4,Top-20 Music VIdeos of 2014,688,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,===================================\nTop-50 al...,===================================\nTop-50 al...,108157,https://i.ytimg.com/vi/Q3x4SX9Uu_4/default.jpg,https://i.ytimg.com/vi/Q3x4SX9Uu_4/sddefault.jpg,misc,,1418860800000,1704636434930,0.240362,5,1.201809,10.068568
4,https://www.youtube.com/watch?v=z6UYz2eNoS8,[Another interview episode. And in this episod...,z6UYz2eNoS8,TND Podcast #17 ft. George Miller a.k.a. Filth...,5538,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,"On this latest episode of the TND Podcast, we ...","On this latest episode of the TND Podcast, we ...",1295290,https://i.ytimg.com/vi/z6UYz2eNoS8/default.jpg,https://i.ytimg.com/vi/z6UYz2eNoS8/sddefault.jpg,tnd_podcast,,1441497600000,1704635100134,0.255719,5,1.278595,9.220244
5,https://www.youtube.com/watch?v=J_EXmjacmOc,"[Hey, be you, you know, you, you, everybody el...",J_EXmjacmOc,Pink Guy - Pink Season ALBUM REVIEW,675,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,Listen: https://www.youtube.com/watch?v=0c_mhr...,Listen: https://www.youtube.com/watch?v=0c_mhr...,2344548,https://i.ytimg.com/vi/J_EXmjacmOc/default.jpg,https://i.ytimg.com/vi/J_EXmjacmOc/sddefault.jpg,album_review,,1484006400000,1704639364975,0.249878,4,0.999514,9.18785
6,https://www.youtube.com/watch?v=lmaCK2mGzd0,[Are we gonna see Biggie Rapping? Are we gonna...,lmaCK2mGzd0,Do we respect the past? (Tupac Hologram Reaction),367,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,Watch: http://www.youtube.com/watch?v=ajVGIRsK...,Watch: http://www.youtube.com/watch?v=ajVGIRsK...,155560,https://i.ytimg.com/vi/lmaCK2mGzd0/default.jpg,https://i.ytimg.com/vi/lmaCK2mGzd0/sddefault.jpg,misc,,1334707200000,1704634651827,0.249952,3,0.749857,9.052297
7,https://www.youtube.com/watch?v=JW5_AedbPa4,"[I've had four hours sleep, can you tell? This...",JW5_AedbPa4,Anthony Speaks at Wesleyan: Lecture Pt. 2 + Q&...,1797,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,"On Thursday, September 13th, I was invited to ...","On Thursday, September 13th, I was invited to ...",14448,https://i.ytimg.com/vi/JW5_AedbPa4/default.jpg,https://i.ytimg.com/vi/JW5_AedbPa4/hqdefault.jpg,misc,,1348358400000,1704641393164,0.270129,5,1.350645,8.977977
8,https://www.youtube.com/watch?v=i2uzq3A9YbU,"[like Tim Heidickers, the comedy, or that new ...",i2uzq3A9YbU,Childish Gambino - Because The Internet ALBUM ...,666,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,Listen: http://www.youtube.com/watch?v=tG35R8F...,Listen: http://www.youtube.com/watch?v=tG35R8F...,1021918,https://i.ytimg.com/vi/i2uzq3A9YbU/default.jpg,https://i.ytimg.com/vi/i2uzq3A9YbU/sddefault.jpg,album_review,5.0,1386633600000,1704633484072,0.31664,2,0.633281,8.820244
9,https://www.youtube.com/watch?v=_byxaI-DWeA,"[Yeah, I guess it's kind of what you could say...",_byxaI-DWeA,TND Podcast #56: 10 Worst Rappers (Revisited) ...,4925,UCt7fwAhXDy3oNFTAzF2o8Pw,theneedledrop,"In this episode of The Needle Drop Podcast, D....","In this episode of The Needle Drop Podcast, D....",342889,https://i.ytimg.com/vi/_byxaI-DWeA/default.jpg,https://i.ytimg.com/vi/_byxaI-DWeA/sddefault.jpg,tnd_podcast,,1488844800000,1704632180600,0.255606,5,1.278031,8.542788
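The `publish_date` and `scrape_date` columns above appear to be milliseconds since the Unix epoch, presumably a side effect of the JSON round trip. Assuming that reading is right, a small standard-library helper (hypothetical, not part of `utils`) can turn them back into datetimes:

```python
from datetime import datetime, timezone


def epoch_ms_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Value taken from the `publish_date` column of the first row above
publish_date = epoch_ms_to_datetime(1629676800000)
```

On the DataFrame itself, `pd.to_datetime(results_df["publish_date"], unit="ms")` should do the same conversion for a whole column.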
