# **Creating a Hybrid Search Method**
In this notebook, I'm going to combine the results of the neural search with those of the keyword search. The result is a "hybrid search" algorithm that draws on the strengths of both.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\api


Now I'll import some necessary modules:

In [234]:
# General import statements
import pandas as pd
import json
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker, declarative_base
from datetime import datetime

# Importing modules custom-built for this project
from utils.settings import (
    LOG_TO_CONSOLE,
    POSTGRES_DB, 
    POSTGRES_HOST, 
    POSTGRES_PASSWORD, 
    POSTGRES_PORT, 
    POSTGRES_USER
)
from utils.logging import get_logger
from utils.search import neural_search, keyword_search, rerank_segment_chunks_for_urls
import utils.postgres_queries as pg_queries
from concurrent.futures import ThreadPoolExecutor

# Set up a logger for this notebook
logger = get_logger("postgres_notebook", log_to_console=LOG_TO_CONSOLE)

Next up: we're going to set up the Postgres engine via SQLAlchemy!

In [9]:
# Create the connection string to the database
postgres_connection_string = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the connection engine
engine = create_engine(postgres_connection_string)
metadata = MetaData()
session = sessionmaker(bind=engine)()
Base = declarative_base()

# Prototyping Hybrid Search
Below, I'm going to prototype a hybrid search. I'll start by running the neural search and the keyword search in parallel:

In [223]:
# Parameterize the search
query = "jim carrey"
release_date_filter = None
video_type_filter = ["album_review"]
review_score_filter = [7, 10]
max_video_per_search_method = 20
max_results = 5
keyword_weight = 1
neural_weight = 0.6

# Run the neural search and keyword search in parallel
with ThreadPoolExecutor(max_workers=2) as executor:
    future_neural = executor.submit(
        neural_search,
        query=query,
        release_date_filter=release_date_filter,
        video_type_filter=video_type_filter,
        review_score_filter=review_score_filter,
        n_videos_to_return=max_video_per_search_method,
    )
    future_keyword = executor.submit(
        keyword_search,
        query=query,
        release_date_filter=release_date_filter,
        video_type_filter=video_type_filter,
        review_score_filter=review_score_filter,
        n_most_similar_videos=max_video_per_search_method,
    )

# Get the results
neural_results_json_str = future_neural.result()
keyword_results_json_str = future_keyword.result()

# Get DataFrame from the results
neural_results_df = pd.DataFrame(json.loads(neural_results_json_str)).reset_index(
    drop=True
)
keyword_results_df = pd.DataFrame(json.loads(keyword_results_json_str)).reset_index(
    drop=True
)

I'm also going to create a DataFrame containing the metadata for every video returned by either search, de-duplicated by URL.

In [228]:
# Create a DataFrame with the metadata of every video returned by either search
all_resulting_videos_metadata_df = (
    pd.concat(
        [
            neural_results_df.drop(
                columns=[
                    "top_segment_chunks",
                    "n_segment_matches",
                    "neural_search_score",
                    "neural_search_score_z_score",
                ]
            ).copy(),
            keyword_results_df.drop(
                columns=[
                    "top_segment_chunks",
                    "n_segment_matches",
                    "keyword_search_score",
                    "keyword_search_score_z_score",
                ]
            ).copy(),
        ]
    )
    .drop_duplicates(subset=["url"])
    .reset_index(drop=True)  # drop the old index so it isn't kept as an extra column
)

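Next, I'm going to do some "reciprocal rank fusion" (RRF), a pretty simple algorithm for merging two rankings: each video gets a score from each search method based only on its rank in that method's results, and the two scores are summed. Here the usual `1 / (k + rank)` term is scaled by `k`, so a first-place result scores `k / (k + 1)`, just under 1, and a video missing from one list gets 0 for that method. Ties in the fused score are broken by a weighted average of the two z-scores, which is the only place `keyword_weight` and `neural_weight` come into play.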

In [205]:
# Do some reciprocal rank fusion
k = 60
neural_results_df["neural_score"] = k / (neural_results_df.index + 1 + k)
keyword_results_df["keyword_score"] = k / (keyword_results_df.index + 1 + k)
fused_df = (
    neural_results_df[["url", "neural_score", "neural_search_score_z_score"]]
    .copy()
    .merge(
        keyword_results_df[["url", "keyword_score", "keyword_search_score_z_score"]],
        on="url",
        how="outer",
    )
    .fillna(0)
)
fused_df["fused_score"] = fused_df["neural_score"] + fused_df["keyword_score"]
fused_df["avg_z_score"] = (
    fused_df["neural_search_score_z_score"] * neural_weight
    + fused_df["keyword_search_score_z_score"] * keyword_weight
) / (neural_weight + keyword_weight)
fused_df = (
    fused_df.sort_values(
        [
            "fused_score",
            "avg_z_score",
        ],
        ascending=[False, False],
    )
    .reset_index(drop=True)
    .head(max_results)
)

fused_df

| | url | neural_score | neural_search_score_z_score | keyword_score | keyword_search_score_z_score | fused_score | avg_z_score |
|---|---|---|---|---|---|---|---|
| 0 | https://www.youtube.com/watch?v=LiQfSv1p_IA | 0.0 | 0.0 | 0.983607 | 2.0 | 0.983607 | 1.25 |
| 1 | https://www.youtube.com/watch?v=zfDFz8OmgzE | 0.983607 | 2.929927 | 0.0 | 0.0 | 0.983607 | 1.098723 |
| 2 | https://www.youtube.com/watch?v=nDZqxWjRLCU | 0.967742 | 1.062586 | 0.0 | 0.0 | 0.967742 | 0.39847 |
| 3 | https://www.youtube.com/watch?v=-deQEteXUK4 | 0.952381 | 0.974586 | 0.0 | 0.0 | 0.952381 | 0.36547 |
| 4 | https://www.youtube.com/watch?v=74NdKIdbXJc | 0.9375 | 0.649602 | 0.0 | 0.0 | 0.9375 | 0.243601 |
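
To make the arithmetic concrete, here's a minimal, self-contained sketch of the same fusion on toy data, in plain Python rather than pandas. The video IDs and z-scores are made up for illustration, and `fuse_rankings` is a hypothetical helper, not something from `utils.search`.

```python
def fuse_rankings(neural, keyword, k=60, neural_weight=0.6, keyword_weight=1.0):
    """Fuse two ranked lists of (video_id, z_score) pairs, best match first.

    Mirrors the notebook cell above: each list contributes k / (k + rank),
    and a weighted average of the z-scores is used only to break ties.
    """
    rows = {}
    for method, results in (("neural", neural), ("keyword", keyword)):
        for rank, (video_id, z_score) in enumerate(results, start=1):
            # A video missing from one list keeps 0.0 for that method
            row = rows.setdefault(
                video_id,
                {"neural_score": 0.0, "neural_z": 0.0, "keyword_score": 0.0, "keyword_z": 0.0},
            )
            row[f"{method}_score"] = k / (k + rank)
            row[f"{method}_z"] = z_score

    fused = []
    for video_id, row in rows.items():
        fused_score = row["neural_score"] + row["keyword_score"]
        avg_z = (
            row["neural_z"] * neural_weight + row["keyword_z"] * keyword_weight
        ) / (neural_weight + keyword_weight)
        fused.append((video_id, fused_score, avg_z))

    # Sort by fused score, then by the weighted z-score, both descending
    return sorted(fused, key=lambda r: (r[1], r[2]), reverse=True)


# Toy inputs (hypothetical IDs and z-scores)
neural_results = [("vid_a", 2.5), ("vid_b", 1.0)]
keyword_results = [("vid_c", 2.0), ("vid_a", 0.5)]

fused = fuse_rankings(neural_results, keyword_results)
for video_id, fused_score, avg_z in fused:
    print(f"{video_id}: fused_score={fused_score:.6f}, avg_z_score={avg_z:.4f}")
```

A video found by both methods (`vid_a` here) ends up well ahead of anything found by only one, because its two rank scores add up.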


Now that I've got some top results, I'm going to do something a little more sophisticated: for each of the videos, I'm going to determine the best matches across *all* of the segment chunks for that video. Think of it as something of a "segment re-ranking". 

In [216]:
# Run the reranking of the segment chunks
reranked_segment_chunks_df = rerank_segment_chunks_for_urls(
    query=query,
    urls=fused_df["url"].tolist(),
    search_method="hybrid",
    n_top_segment_chunks=3
)

# Merge the reranked segment chunks with the fused_df
fused_with_text_df = fused_df.merge(
    reranked_segment_chunks_df,
    on="url",
    how="left",
)
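
The results table further down includes a `cos_sim` column, so the re-ranking presumably scores chunks by cosine similarity against the query embedding. As a rough sketch of that idea (not the actual `rerank_segment_chunks_for_urls` implementation, which isn't shown in this notebook), the snippet below ranks one video's chunks against a query vector. The embeddings are tiny made-up vectors, and `top_chunks_by_cosine` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_chunks_by_cosine(query_embedding, chunks, n_top=3):
    """Return the n_top (text, similarity) pairs, most similar first."""
    scored = [(text, cosine_similarity(query_embedding, emb)) for text, emb in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n_top]


# Made-up 3-dimensional embeddings; real ones would come from an embedding model
query_embedding = [1.0, 0.0, 0.0]
video_chunks = [
    ("chunk about drums", [0.0, 1.0, 0.0]),
    ("chunk about the query topic", [0.9, 0.1, 0.0]),
    ("chunk loosely related", [0.5, 0.5, 0.5]),
    ("chunk pointing the other way", [-1.0, 0.0, 0.0]),
]

top_chunks = top_chunks_by_cosine(query_embedding, video_chunks, n_top=3)
for text, similarity in top_chunks:
    print(f"{similarity:.3f}  {text}")
```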

Finally, I'll merge this back with the `all_resulting_videos_metadata_df`. 

In [229]:
# Merge the fused, re-ranked results with the video metadata
result_df = fused_with_text_df.copy().merge(
    all_resulting_videos_metadata_df, on="url", how="left"
)

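# Functionalizing the Method
I've wrapped the steps above into a `hybrid_search` function in `utils.search`. Below, I call it with a longer, more descriptive query and no filters: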


In [272]:
from utils.search import hybrid_search

hybrid_search_results_json_str = hybrid_search(
    query="beautiful piano strings, slipping through the trees like dreams",
    release_date_filter=None,
    video_type_filter=None,
    review_score_filter=None,
    max_video_per_search_method=20,
    max_results=10,
    keyword_weight=1,
    neural_weight=0.85
)

hybrid_search_results_df = pd.DataFrame(json.loads(hybrid_search_results_json_str))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\trevb_b7z2dw1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Retrieved 20 neural results and 20 keyword results.




In [274]:
hybrid_search_results_df[
    [
        "title",
        "top_segment_chunks",
        "neural_score",
        "neural_search_score_z_score",
        "keyword_score",
        "keyword_search_score_z_score",
        "fused_score",
        "avg_z_score",
        "cos_sim"
    ]
]

| | title | top_segment_chunks | neural_score | neural_search_score_z_score | keyword_score | keyword_search_score_z_score | fused_score | avg_z_score | cos_sim |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TND Podcast #45 ft. FrankJavCee | [with little sketchings and you would tell you... | 0.0 | 0.0 | 0.983607 | 36.094834 | 0.983607 | 19.510721 | [0.3563498132, 0.3514914722, 0.3506569684] |
| 1 | TND Podcast #42 ft. Digibro | [but I love in the mountain in the cloud and, ... | 0.0 | 0.0 | 0.967742 | 19.894517 | 0.967742 | 10.753793 | [0.312143445, 0.296794856, 0.2914485584] |
| 2 | TND Podcast #52 ft. Pyrocynical and NFKRZ | [It's got a chill wave vibe to it. You know, I... | 0.0 | 0.0 | 0.952381 | 15.103193 | 0.952381 | 8.163888 | [0.3711246031, 0.3406806588, 0.2706609772] |
| 3 | TND Podcast #57: How To Get Into Death Grips f... | [I think that's what makes them so fresh and t... | 0.0 | 0.0 | 0.9375 | 11.888761 | 0.9375 | 6.426357 | [0.3680948458, 0.3656269094, 0.3617364764] |
| 4 | TND Podcast #55: clipping. Live Interview @ S.... | [they can be and just kind of really kind of t... | 0.0 | 0.0 | 0.923077 | 11.834967 | 0.923077 | 6.397279 | [0.3549130049, 0.2874077942, 0.2858831499] |
| 5 | TND Podcast #17 ft. George Miller a.k.a. Filth... | [Yeah, absolutely. I discovered porn on new gr... | 0.0 | 0.0 | 0.909091 | 11.555187 | 0.909091 | 6.246047 | [0.2062668266, 0.1992480516, 0.1962147568] |
| 6 | TND Podcast #60: Worst Rappers in the Game 201... | [that's that's on his own music and I wasn't r... | 0.0 | 0.0 | 0.895522 | 11.160964 | 0.895522 | 6.032954 | [0.3201868129, 0.3192802883, 0.2916520238] |
| 7 | TND Podcast #58: How To Review An Album ft. Zo... | [to kind of stress to the listener kind of cat... | 0.0 | 0.0 | 0.882353 | 10.827391 | 0.882353 | 5.852644 | [0.2940289624, 0.2933736444, 0.2817878387] |
| 8 | TND Podcast #54 ft. Adam of YourMovieSucks | [bit of Regina Spector in my album maybe a lit... | 0.0 | 0.0 | 0.869565 | 10.766742 | 0.869565 | 5.81986 | [0.3599791956, 0.3583581661, 0.3267615438] |
| 9 | TND Podcast #56: 10 Worst Rappers (Revisited) ... | [10 out of 10, bitch. Yeah. Maybe somebody can... | 0.0 | 0.0 | 0.857143 | 8.734978 | 0.857143 | 4.72161 | [0.3345420064, 0.2706295525, 0.2679424722] |


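Here's a helper for turning a free-text query into a Postgres `tsquery` string. It drops English stopwords, then ORs together every n-gram (up to `max_n` words) of the remaining words, joining the words within an n-gram with `<->` so they match as a phrase.

In [271]: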
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

# Ensure you have the stopwords dataset downloaded
nltk.download("stopwords")


def generate_tsquery(input_query, max_n=3):
    # Tokenize the input string into words
    words = input_query.split()

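    # Filter out stopwords (load the list once, as a set, rather than once per word)
    english_stopwords = set(stopwords.words("english"))
    filtered_words = [
        word for word in words if word.lower() not in english_stopwords
    ]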

    # Generate n-grams for each n from 1 to max_n
    phrases = []
    for n in range(1, max_n + 1):
        for ngram in ngrams(filtered_words, n):
            phrase = " <-> ".join(ngram)  # Use <-> for phrase search in tsquery
            phrases.append(phrase)

    # Combine all phrases with the OR operator for tsquery
    tsquery = " | ".join(phrases)

    return tsquery


# Example usage
input_query = "the quick brown fox"
tsquery = generate_tsquery(input_query, 3)
print(tsquery)

quick | brown | fox | quick <-> brown | brown <-> fox | quick <-> brown <-> fox


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\trevb_b7z2dw1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
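
One thing `generate_tsquery` doesn't handle is punctuation: `split()` leaves it attached to words, so a query like the one passed to `hybrid_search` earlier would produce a token such as `strings,`. I haven't checked how Postgres treats that inside a `tsquery`, so the variant below takes the cautious route: it lowercases the query and keeps only alphanumeric tokens before building the n-grams. It's a sketch with two assumptions worth flagging: the tiny `STOPWORDS` set is a stand-in for NLTK's English list, and `generate_clean_tsquery` is a hypothetical name.

```python
import re

# Stand-in for NLTK's English stopword list (assumption: the real list is much longer)
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to"}


def generate_clean_tsquery(input_query, max_n=3):
    """Build an OR-of-phrases tsquery string from alphanumeric, non-stopword tokens."""
    # Lowercase, then keep runs of letters and digits; everything else is a separator
    tokens = re.findall(r"[a-z0-9]+", input_query.lower())
    words = [token for token in tokens if token not in STOPWORDS]

    phrases = []
    for n in range(1, max_n + 1):
        # Slide a window of n words over the remaining tokens
        for start in range(len(words) - n + 1):
            phrases.append(" <-> ".join(words[start : start + n]))

    # Combine all phrases with the OR operator
    return " | ".join(phrases)


clean_tsquery = generate_clean_tsquery("the quick, brown fox!", max_n=2)
print(clean_tsquery)
```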
