## Semantic Search Text Fields Aggregation

This notebook documents how I aggregate each racquet's text-based metadata into a single queryable field. I build two versions of that field.

The first is a structured field that lists each column in a dictionary-style format (name of attribute: text\n).

The second is a more natural-language field that slots each piece of metadata into a predefined sentence template (for example: looking for {racquet_weight}oz racquet with {racquet_power} power.).
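
To make the two formats concrete, here is a minimal sketch on a single hypothetical record. The field names mirror columns in the dataset, but the values and the exact sentence wording are made up for illustration and are not the final templates.

```python
# Hypothetical record: keys mirror dataset columns, values are invented.
record = {
    "racquet_name": "Example Racquet 100",
    "racquet_power": "Low-Medium",
    "racquet_strung_weight_oz": 11.2,
}

# Option 1: structured field, one "Label: value" pair per line.
structured = (
    f"Racquet Name: {record['racquet_name']}\n"
    f"Racquet Power Level: {record['racquet_power']}\n"
)

# Option 2: natural field, values slotted into a sentence template.
natural = (
    f"The {record['racquet_name']} is a {record['racquet_power'].lower()} "
    f"powered racquet with a {record['racquet_strung_weight_oz']} ounce strung weight."
)

print(structured)
print(natural)
```

The structured form keeps attribute labels next to their values, while the natural form reads closer to how a user might phrase a search query.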

## Imports and Data Loading

In [5]:
from itertools import zip_longest
import datashelf.core as ds
import numpy as np
import pandas as pd
from searchlite.document import Document

In [2]:
ds.ls("coll-files")

+------------------------+------------------------------------------------------------------+---------------------+----------------------+---------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-----------+
| name                   | hash                                                             | date_created        | date_last_modified   | tag     |   version | message                                                                                                                                                                                 | file_path                                                                                                       | deleted   |
| racquets_metadata.yaml |                            

In [9]:
basic_cleaned_data = ds.load(
    collection_name = "racquets",
    hash_value = "ab63aaff96cc00a62726899d4f5ec493772b7b9ab3b5c42ef7aed3b0178e0981"
)
basic_cleaned_data.shape

(321, 24)

In [10]:
df = basic_cleaned_data.copy()

In [11]:
object_cols = df.select_dtypes(include = ["object"]).columns.to_list()
numeric_cols = df.select_dtypes(include = [np.number]).columns.to_list()

print(pd.DataFrame(
    list(zip_longest(object_cols, numeric_cols, fillvalue = None)),
    columns = ["Object Columns", "Numeric Columns"]
))

          Object Columns           Numeric Columns
0          racquet_brand            racquet_rating
1            racquet_img             racquet_price
2           racquet_name       racquet_swingweight
3           racquet_desc   racquet_head_size_sq_in
4    racquet_composition     racquet_balance_HH_HL
5          racquet_power        racquet_balance_in
6   racquet_stroke_style  racquet_strung_weight_oz
7    racquet_swing_speed         racquet_length_in
8         racquet_colors         racquet_stiffness
9           racquet_grip    racquet_avg_beam_width
10                  None             racquet_mains
11                  None           racquet_crosses
12                  None     racquet_tension_lower
13                  None     racquet_tension_upper


## Structured Combined Text Field

In [14]:
struct_combined_df = df.copy()
struct_combined_df["structured_combined_text"] = ""

In [27]:
struct_combined_df

Unnamed: 0,racquet_brand,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,racquet_swingweight,racquet_composition,racquet_power,racquet_stroke_style,...,racquet_balance_in,racquet_strung_weight_oz,racquet_length_in,racquet_stiffness,racquet_avg_beam_width,racquet_mains,racquet_crosses,racquet_tension_lower,racquet_tension_upper,structured_combined_text
0,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero Pink,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,13.70,9.7,27.0,65.0,24.000000,16.0,19.0,50.0,55.0,
1,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,13.70,9.7,27.0,65.0,24.000000,16.0,19.0,50.0,55.0,
2,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 2023,4.6,289.0,"With its ""best of class"" combination of speed...",318.0,Graphite,Low-Medium,Medium-Full,...,12.99,11.2,27.0,66.0,24.000000,16.0,19.0,50.0,59.0,
3,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 98 2023,4.9,299.0,Babolat adds another chapter to the most cont...,321.0,Graphite,Low-Medium,Medium-Full,...,12.79,11.4,27.0,66.0,22.000000,16.0,20.0,50.0,59.0,
4,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero Lite 2023,4.4,269.0,Updated with a softer feel and wider string s...,304.0,Graphite,Low-Medium,Medium-Full,...,13.38,10.0,27.0,65.0,24.000000,16.0,19.0,50.0,59.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
316,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98+,5.0,305.0,Yonex adds another chapter to the VCORE 98+! ...,333.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,13.18,11.4,27.5,62.0,22.333333,16.0,19.0,45.0,60.0,
317,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 100L,4.2,305.0,"With the 2023 version of the VCORE 100L, Yone...",312.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,13.38,10.5,27.0,66.0,24.200000,16.0,19.0,45.0,60.0,
318,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 95,4.8,305.0,Yonex adds another chapter to the VCORE 95! L...,321.0,2G-Namd FlexForce/H.M. Graphite,Low,Full,...,12.59,11.5,27.0,61.0,21.666667,16.0,20.0,45.0,60.0,
319,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98,4.8,305.0,"With its redesigned beam and softer feel, the...",318.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,12.79,11.4,27.0,62.0,22.333333,16.0,19.0,45.0,60.0,


In [29]:
def structured_combine_text(df:pd.DataFrame, object_cols:list[str]) -> pd.Series:
    _df = df.copy()
    _df["combined_col"] = ""
    
    _replacements = str.maketrans({
        "!":".",
        "_":" ",
        "&":"and",
        "²":"",
        "\xa0":"",
        "\n":"",
        "\r":"",
        '"':"",
        '“':"",
        '”':"",
        "+":"plus",
        "%":"percent"
    })
    
    _title_dict = {
        "racquet_brand":"Racquet Brand",
        "racquet_name":"Racquet Name",
        "racquet_desc":"Racquet Description",
        "racquet_composition":"Racquet Composition",
        "racquet_power":"Racquet Power Level",
        "racquet_stroke_style":"Racquet Stroke Style",
        "racquet_swing_speed":"Racquet Swing Speed",
        "racquet_colors":"Racquet Colors",
        "racquet_grip":"Racquet Grip Type"
    }
    
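    for col in object_cols:
        # Expand "in²" before translating: the table above deletes "²",
        # so the pattern would never match afterwards.
        _content = _df[col].str.replace("in²", "inches squared", regex = False)
        _content = _content.str.translate(_replacements)
        # .str.replace works on substrings; Series.replace matches whole values.
        _content = _content.str.replace("  ", " ", regex = False)
        _df["combined_col"] += _title_dict[col] + ": " + _content + "\n"
        
    return _df["combined_col"]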
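
One pandas detail matters for the cleaning step: `Series.replace` matches entire cell values by default, whereas `Series.str.replace` operates on substrings. The order of operations matters too, because a translation table that deletes a character also removes it from any pattern searched for later. The sketch below illustrates both points on a small made-up Series; the sample strings are assumptions, not rows from the dataset.

```python
import pandas as pd

# Made-up sample values containing the "in²" unit.
s = pd.Series(["100 in² head", "in²"])

# Series.replace: only cells that equal "in²" exactly are replaced.
whole_value = s.replace("in²", "inches squared")

# Series.str.replace: every occurrence inside each string is replaced.
substring = s.str.replace("in²", "inches squared", regex=False)

# Deleting "²" first means the "in²" pattern can no longer match.
table = str.maketrans({"²": ""})
translated_first = s.str.translate(table).str.replace(
    "in²", "inches squared", regex=False
)

print(whole_value.tolist())
print(substring.tolist())
print(translated_first.tolist())
```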

In [30]:
struct_combined_df["structured_combined_text"] = structured_combine_text(
    df = struct_combined_df,
    object_cols = [col for col in object_cols if col != "racquet_img"]
)

In [31]:
struct_combined_df[["racquet_name", "racquet_price", "structured_combined_text"]]

Unnamed: 0,racquet_name,racquet_price,structured_combined_text
0,Babolat Boost Aero Pink,119.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
1,Babolat Boost Aero,119.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
2,Babolat Pure Aero 2023,289.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
3,Babolat Pure Aero 98 2023,299.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
4,Babolat Pure Aero Lite 2023,269.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
...,...,...,...
316,Yonex VCORE 98+,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
317,Yonex VCORE 100L,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
318,Yonex VCORE 95,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
319,Yonex VCORE 98,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...


In [44]:
## Save to DataShelf

## Natural Combined Text Field

In [35]:
natural_combined_df = df.copy()
natural_combined_df["nat_combined_text"] = ""

In [38]:
def create_natural_combined_text(row:pd.Series) -> str:
    def safe(val):
        # Convert to str before stripping so numeric values do not raise.
        return "unknown" if pd.isna(val) else str(val).strip()
    
    # Swingweight is not measured in ounces, so it is reported without a unit.
    combined_text = (
        f"The {safe(row['racquet_name'])} is a {safe(row['racquet_power']).lower()} powered racquet designed for players with "
        f"{safe(row['racquet_stroke_style']).lower()} strokes and {safe(row['racquet_swing_speed']).lower()} swings. "
        f"It features a stiffness rating of {safe(row['racquet_stiffness'])} and a {safe(row['racquet_composition']).lower()} "
        f"composition. The racquet has a swingweight of {safe(row['racquet_swingweight'])}, a {safe(row['racquet_head_size_sq_in'])} "
        f"square inch head size, a {safe(row['racquet_strung_weight_oz'])} ounce strung weight, "
        f"and a {safe(row['racquet_mains'])} by {safe(row['racquet_crosses'])} string pattern."
    )
    
    return " ".join(combined_text.split())
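
The null-safe formatting idea can be checked in isolation. The sketch below uses a hypothetical helper, `safe_text`, and a made-up row with one missing value; it is meant to show the pattern, not to reproduce the function above exactly.

```python
import numpy as np
import pandas as pd

def safe_text(val) -> str:
    """Return a placeholder for missing values, else the stripped string form."""
    return "unknown" if pd.isna(val) else str(val).strip()

# Made-up row: padded name, missing power level, numeric stiffness.
row = pd.Series({
    "racquet_name": "  Example Racquet 100 ",
    "racquet_power": np.nan,
    "racquet_stiffness": 65.0,
})

sentence = (
    f"The {safe_text(row['racquet_name'])} is a "
    f"{safe_text(row['racquet_power']).lower()} powered racquet "
    f"with a stiffness rating of {safe_text(row['racquet_stiffness'])}."
)

print(sentence)
```

Converting with `str(val)` before calling `.strip()` lets the same helper handle both text and numeric columns.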

In [39]:
natural_combined_df["nat_combined_text"] = natural_combined_df.apply(create_natural_combined_text, axis = 1)

In [40]:
natural_combined_df[["racquet_name", "racquet_price", "nat_combined_text"]]

Unnamed: 0,racquet_name,racquet_price,nat_combined_text
0,Babolat Boost Aero Pink,119.0,The Babolat Boost Aero Pink is a low-medium po...
1,Babolat Boost Aero,119.0,The Babolat Boost Aero is a low-medium powered...
2,Babolat Pure Aero 2023,289.0,The Babolat Pure Aero 2023 is a low-medium pow...
3,Babolat Pure Aero 98 2023,299.0,The Babolat Pure Aero 98 2023 is a low-medium ...
4,Babolat Pure Aero Lite 2023,269.0,The Babolat Pure Aero Lite 2023 is a low-mediu...
...,...,...,...
316,Yonex VCORE 98+,305.0,The Yonex VCORE 98+ is a low-medium powered ra...
317,Yonex VCORE 100L,305.0,The Yonex VCORE 100L is a low-medium powered r...
318,Yonex VCORE 95,305.0,The Yonex VCORE 95 is a low powered racquet de...
319,Yonex VCORE 98,305.0,The Yonex VCORE 98 is a low-medium powered rac...


In [42]:
## Save to DataShelf