# **Creating a Neural Search Method**
Now that I've written some helper methods in the **`01. Writing Postgres Queries`** notebook, I can write a "neural search" method, which the API will use.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\api


Now I'll import some necessary modules:

In [2]:
# General import statements
import pandas as pd
import datetime
from IPython.display import Markdown, display

# Importing custom modules
from utils.settings import (
    POSTGRES_USER,
    POSTGRES_PASSWORD,
    POSTGRES_HOST,
    POSTGRES_PORT,
    POSTGRES_DB,
)
import utils.postgres_queries as pg_queries
import utils.postgres as postgres
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker, declarative_base

I'll also set up my connection to the Postgres server: 

In [3]:
# Create the connection string to the database
postgres_connection_string = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the connection engine
engine = create_engine(postgres_connection_string)
metadata = MetaData()
session = sessionmaker(bind=engine)()
Base = declarative_base()
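
One caveat with building the connection string through an f-string: if the password contains characters such as `@`, `:` or `/`, the resulting URL is ambiguous. Below is a minimal sketch of percent-encoding the credentials first. `build_postgres_url` is a hypothetical helper, not part of `utils`, and the credentials are made up.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, db):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Hypothetical credentials, for illustration only
example_url = build_postgres_url(
    "neural", "p@ss:word/1", "localhost", 5432, "needledrop"
)
print(example_url)
```

The encoded string can be passed to `create_engine` in the same way as the original one.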

# **Prototyping Neural Search**
Below, I'm going to prototype a search method. 

I'll start by parameterizing the search:

In [5]:
# Parameterize the search
query = "deep, slowed drone background vocals, glitchy percussion and a heavy bassline"
release_date_filter = [datetime.datetime(2010, 1, 1), datetime.datetime(2024, 6, 1)]
video_type_filter = ["album_review", "mixtape_review"]
review_score_filter = [6, 10]

# Extra parameters
n_chunks_to_consider_initially = 250
n_most_similar_chunks_per_video = 10
n_videos_to_return = 10
n_segment_chunks_to_showcase = 3

Now, I'll try to identify the most similar segments:

In [6]:
# Run the query
similar_chunks_df = pg_queries.most_similar_embeddings_to_text_filtered(
    text=query,
    engine=engine,
    n=n_chunks_to_consider_initially,
    release_date_filter=release_date_filter,
    video_type_filter=video_type_filter,
    review_score_filter=review_score_filter,
    include_text=True,
)

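With these chunks in hand, I'll aggregate them per video and score each video. The score is the median cosine similarity of a video's top chunks multiplied by the number of those chunks, so videos with many strong matches tend to rank above videos with a single strong match.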

In [7]:
# Sort by similarity so that `head` keeps each video's best chunks, then take
# the top `n_most_similar_chunks_per_video` chunks per video
aggregated_similar_chunks_df = (
    similar_chunks_df.sort_values("cos_sim", ascending=False)
    .groupby("url")
    .head(n_most_similar_chunks_per_video)
)

# Aggregate the similarity statistics
aggregated_similar_chunks_df = (
    aggregated_similar_chunks_df.groupby("url")
    .agg(
        median_similarity=("cos_sim", "median"),
        n_similar_chunks=("cos_sim", "count"),
    )
    .reset_index()
)

# Add a weighted median similarity column
aggregated_similar_chunks_df["weighted_median_similarity"] = (
    aggregated_similar_chunks_df["median_similarity"]
    * aggregated_similar_chunks_df["n_similar_chunks"]
)

# Sort by the weighted median similarity
aggregated_similar_chunks_df = aggregated_similar_chunks_df.sort_values(
    "weighted_median_similarity", ascending=False
).head(n_videos_to_return)
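
To make the scoring arithmetic concrete, here is a dependency-free sketch of the same idea on made-up data. `rank_videos` is a hypothetical name and the similarity values are invented; this is not the code used in `utils`.

```python
from collections import defaultdict
from statistics import median


def rank_videos(chunks, n_per_video, n_videos):
    """Rank videos by (median similarity of top chunks) * (number of top chunks)."""
    by_url = defaultdict(list)
    for url, cos_sim in chunks:
        by_url[url].append(cos_sim)

    scored = []
    for url, sims in by_url.items():
        # Keep only each video's best `n_per_video` chunks
        top = sorted(sims, reverse=True)[:n_per_video]
        scored.append((url, median(top) * len(top)))

    # Highest score first, truncated to the requested number of videos
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n_videos]


# Made-up scores: video "a" has one strong chunk, video "b" has three moderate ones
toy_chunks = [("a", 0.9), ("b", 0.5), ("b", 0.6), ("b", 0.4)]
ranking = rank_videos(toy_chunks, n_per_video=10, n_videos=10)
print(ranking)
```

Here video "b" scores 0.5 × 3 = 1.5 and outranks video "a" at 0.9 × 1 = 0.9, which shows how strongly the chunk count drives the ranking. Since the count is capped at `n_most_similar_chunks_per_video`, that parameter also caps how much a single video can gain from volume alone.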

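Now, we're going to fetch some metadata about each video. This involves writing the aggregated scores to a scratch table in Postgres, and then joining it to the `video_metadata` table. Despite its name, `temp_similar_chunks` is a regular table that `to_sql` drops and recreates on every run, not a Postgres `TEMPORARY` table:

In [8]:
# Write the aggregated_similar_chunks_df DataFrame to a table called `temp_similar_chunks`, replacing any existing copy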
with engine.connect() as conn:
    aggregated_similar_chunks_df.to_sql(
        "temp_similar_chunks", conn, if_exists="replace", index=False
    )

# Now, select every `video_metadata` column for each of the videos in the `temp_similar_chunks` table
similar_chunks_video_metadata_df = postgres.query_postgres(
    """
    SELECT 
        video_metadata.*, 
        temp_similar_chunks.median_similarity, 
        temp_similar_chunks.n_similar_chunks, 
        temp_similar_chunks.weighted_median_similarity
    FROM video_metadata
    JOIN temp_similar_chunks
    ON video_metadata.url = temp_similar_chunks.url
    ORDER BY temp_similar_chunks.weighted_median_similarity DESC
    """,
    engine=engine,
)
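
Because `if_exists="replace"` rewrites a shared table on every search, concurrent API requests could overwrite each other's scores. One possible alternative, sketched below, is to pass the URLs as query parameters and attach the scores client-side. In-memory SQLite stands in for Postgres here so the example runs anywhere; the table contents and score values are made up.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; the table and rows are made up
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE video_metadata (url TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO video_metadata VALUES (?, ?)",
    [("u1", "First review"), ("u2", "Second review"), ("u3", "Third review")],
)

# Scores computed client-side, keyed by url (hypothetical values)
scores = {"u3": 2.4, "u1": 1.1}

# One placeholder per url keeps the values out of the SQL string
placeholders = ", ".join("?" for _ in scores)
rows = conn.execute(
    f"SELECT url, title FROM video_metadata WHERE url IN ({placeholders})",
    list(scores),
).fetchall()

# Attach the scores and sort client-side, highest score first
results = sorted(
    ((url, title, scores[url]) for url, title in rows),
    key=lambda row: row[2],
    reverse=True,
)
conn.close()
print(results)
```

With at most `n_videos_to_return` URLs per search, the parameter list stays small, and no table is written at all.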

Finally, we're going to prepare our results. This is `similar_chunks_video_metadata_df` with one extra column holding the `n_segment_chunks_to_showcase` most similar segment chunks per video.

In [9]:
# Create a DataFrame containing the segment chunks I want to showcase
segment_chunks_to_showcase_df = (
    (
        similar_chunks_df[
            similar_chunks_df["url"].isin(
                similar_chunks_video_metadata_df["url"].unique()
            )
        ]
        .sort_values("cos_sim", ascending=False)
        .groupby("url")
        .head(n_segment_chunks_to_showcase)
        .sort_values(["url", "cos_sim"], ascending=False)
    )
    .groupby("url")
    .agg(
        top_segment_chunks=("text", list),
    )
    .reset_index()
)

# Merge this DataFrame with the video metadata
segment_chunks_to_showcase_df = segment_chunks_to_showcase_df.merge(
    similar_chunks_video_metadata_df, on="url"
).sort_values("weighted_median_similarity", ascending=False)
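
The "top chunks per video" step can also be read as a plain top-k selection per group. Here is a small sketch of that logic without pandas, using made-up chunks; `top_chunks_per_video` is a hypothetical name.

```python
from collections import defaultdict
from heapq import nlargest


def top_chunks_per_video(chunks, k):
    """Return {url: [text, ...]} with each video's k highest-scoring chunks, best first."""
    grouped = defaultdict(list)
    for url, cos_sim, text in chunks:
        grouped[url].append((cos_sim, text))
    return {
        url: [text for _, text in nlargest(k, pairs, key=lambda pair: pair[0])]
        for url, pairs in grouped.items()
    }


# Made-up (url, cos_sim, text) rows
toy = [
    ("u1", 0.2, "weak"),
    ("u1", 0.8, "strong"),
    ("u1", 0.5, "medium"),
    ("u2", 0.7, "only"),
]
showcase = top_chunks_per_video(toy, k=2)
print(showcase)
```

A video with fewer than `k` chunks simply contributes all of the chunks it has.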

Now, I'm going to render the top three results as Markdown:

In [10]:
for _, row in segment_chunks_to_showcase_df.head(3).iterrows():
    display(Markdown(f"**{row['title']}**"))
    display(Markdown("\n".join([f"* {chunk}" for chunk in row['top_segment_chunks']])))


**Actress - Ghettoville ALBUM REVIEW**

* There's a breathy vocal sample on this track, some deep bass and I love the sharp sounds of the drums on this track. And the tracks which is back and forth between these robotic drum beats and these
* where it seems like he tries to strip back as much treble as possible with some pretty nice basslines on the song too, and then there's skyline which is incredibly minimal, it's glitchy, and there's some weird metallic drones hanging in the background of this track too, and I love
* while it isn't really a heavy track, it is an incredibly eerie song with a very sour drone hanging in the background with some very tense hyads playing in the foreground. There's some odd synthesizer keys that play on this track too, and longer this song goes,

**Nicolas Jaar - Cenizas ALBUM REVIEW**

* a bit more low fidelity. Honestly, it all sounds like a little bit of a AIA, a erogruper, but with Nick's voice on top of it instead, which is relatively haunting, especially as he digs into his lower register and sings very close to the microphone. It's like a pulter guys does in my ear
* instrumental palettes. From the watery keys and shuffling, percussive noise and wailing read leads on the track rubble to the track Xerox, which is a stunning drone piece, loaded with a lot of raw, pitchy instrumentation that is shrouded in ambient noise. There might be
* of sorts, doesn't really compare to anything else on the album. So while Sonesis is an exactly the most cohesive record in this somewhat ambient style that you're ever going to hear, most of the pieces are still pretty evocative. We have the opening track vanish, which features

**Liars - Mess ALBUM REVIEW**

* And the beats, the rhythms, the grooves on this LP range from monstrous, massive, industrialized nightmares to kind of upbeat, high tempo, glitchy and strange dance tracks. There are even a series of songs on this LP that head back into that ambient direction that you
* which has all these bubbling synthesizer arpeggios keeping a really tight pace. And there are some pretty weird vocal samples playing throughout this track too, like it might face. But their placement into the song seems so much like, I don't know, just like a electric body music from the
* busy piece of electronic music, very dense. And the way the song progresses, it doesn't really make a huge significant change in the midst of its runtime or anything like that. It just progressively gets noiseier and noiseier and noiseier. I wish there was a bit more change across this track, though it feels

# **Functionalized Version**
I wrapped all of the code above into a single function, `neural_search`, which lives in `utils.search`. Below, I'll call it:

In [4]:
from utils.search import neural_search

# Run the search
neural_search(
    query="deep, slowed drone background vocals, glitchy percussion and a heavy bassline",
    release_date_filter=None,
    video_type_filter=None,
    review_score_filter=None,
    n_most_similar_chunks_per_video=8,
    n_videos_to_return=3,
    n_segment_chunks_to_showcase=3,
)

