# Semantic Search Text Fields Aggregation Experiments

This notebook documents how the text-based metadata of each racquet is aggregated into a single queryable field. I create two different options.

The first field is a structured field that lists each column in a dictionary-style format:

```
name of attribute: text\nname of attribute: text\n...
``` 

The second field is a more organically aggregated field that fits each piece of metadata into a pre-defined sentence structure:
```
"Looking for {attribute} racquet with {attribute} power..."
```
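
As a rough illustration of the two formats, the sketch below builds both strings for a single hand-written record. The `record` dictionary and its values are made up for the example, and the sentence template is shortened; the real columns are handled in the sections that follow.

```python
# Toy record: keys mirror a few of the dataset's text columns, values are illustrative only.
record = {
    "racquet_name": "Example Racquet 100",
    "racquet_power": "Low-Medium",
    "racquet_colors": "Black/Pink",
}

# Option 1: structured, one "attribute: text" pair per line.
structured = "\n".join(
    f"{key.replace('_', ' ').title()}: {value}" for key, value in record.items()
)

# Option 2: natural language, attributes slotted into a fixed sentence template.
natural = (
    f"The {record['racquet_name']} is a {record['racquet_power'].lower()} powered "
    f"racquet with a {record['racquet_colors'].lower()} colorway."
)

print(structured)
print(natural)
```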

## Imports and data loading

In [1]:
from itertools import zip_longest
import datashelf.core as ds
import numpy as np
import pandas as pd
from searchlite.document import Document

In [2]:
ds.ls("coll-files", collection_name="racquets")

In [3]:
cleaned_data = ds.load(
    collection_name = "racquets",
    hash_value = "80c4ecec5ce5c6e86d3e4e7b08f1c2b7fab5c41fc7da3320a0a3d843a0f0c8bf"
)
cleaned_data.shape

(326, 24)

In [4]:
df = cleaned_data.copy()

In [5]:
object_cols = df.select_dtypes(include = ["object"]).columns.to_list()
numeric_cols = df.select_dtypes(include = [np.number]).columns.to_list()

print(pd.DataFrame(
    list(zip_longest(object_cols, numeric_cols, fillvalue = None)),
    columns = ["Object Columns", "Numeric Columns"]
))

          Object Columns           Numeric Columns
0          racquet_brand            racquet_rating
1            racquet_img             racquet_price
2           racquet_name       racquet_swingweight
3           racquet_desc   racquet_head_size_sq_in
4    racquet_composition         racquet_length_in
5          racquet_power  racquet_strung_weight_oz
6   racquet_stroke_style        racquet_balance_in
7    racquet_swing_speed     racquet_balance_HH_HL
8         racquet_colors         racquet_stiffness
9           racquet_grip    racquet_avg_beam_width
10                  None             racquet_mains
11                  None           racquet_crosses
12                  None     racquet_tension_lower
13                  None     racquet_tension_upper


## Structured Combined Text Field

In [6]:
struct_combined_df = df.copy()
struct_combined_df["structured_combined_text"] = ""

In [7]:
struct_combined_df

Unnamed: 0,racquet_brand,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,racquet_swingweight,racquet_composition,racquet_power,racquet_stroke_style,...,racquet_strung_weight_oz,racquet_balance_in,racquet_balance_HH_HL,racquet_stiffness,racquet_avg_beam_width,racquet_mains,racquet_crosses,racquet_tension_lower,racquet_tension_upper,structured_combined_text
0,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero Pink,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,9.7,13.70,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,
1,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,9.7,13.70,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,
2,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 2023,4.6,289.0,"With its ""best of class"" combination of speed...",318.0,Graphite,Low-Medium,Medium-Full,...,11.2,12.99,4.0,66.0,24.000000,16.0,19.0,50.0,59.0,
3,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 98 2023,4.9,299.0,Babolat adds another chapter to the most cont...,321.0,Graphite,Low-Medium,Medium-Full,...,11.4,12.79,6.0,66.0,22.000000,16.0,20.0,50.0,59.0,
4,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero Lite 2023,4.6,269.0,Updated with a softer feel and wider string s...,304.0,Graphite,Low-Medium,Medium-Full,...,10.0,13.38,1.0,65.0,24.000000,16.0,19.0,50.0,59.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98+,5.0,305.0,Yonex adds another chapter to the VCORE 98+! ...,333.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,11.4,13.18,5.0,62.0,22.333333,16.0,19.0,45.0,60.0,
322,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 100L,4.2,305.0,"With the 2023 version of the VCORE 100L, Yone...",312.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,10.5,13.38,1.0,66.0,24.200000,16.0,19.0,45.0,60.0,
323,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 95,4.8,305.0,Yonex adds another chapter to the VCORE 95! L...,321.0,2G-Namd FlexForce/H.M. Graphite,Low,Full,...,11.5,12.59,7.0,61.0,21.666667,16.0,20.0,45.0,60.0,
324,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98,4.8,305.0,"With its redesigned beam and softer feel, the...",318.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,11.4,12.79,6.0,62.0,22.333333,16.0,19.0,45.0,60.0,


In [8]:
def structured_combine_text(df:pd.DataFrame, object_cols:list[str]) -> pd.Series:
    _df = df.copy()
    _df["combined_col"] = ""
    
    _replacements = str.maketrans({
        "!":".",
        "_":" ",
        "&":"and",
        "²":"",
        "\xa0":"",
        "\n":"",
        "\r":"",
        '"':"",
        '“':"",
        "+":"plus",
        "%":"percent"
    })
    
    _title_dict = {
        "racquet_brand":"Racquet Brand",
        "racquet_name":"Racquet Name",
        "racquet_desc":"Racquet Description",
        "racquet_composition":"Racquet Composition",
        "racquet_power":"Racquet Power Level",
        "racquet_stroke_style":"Racquet Stroke Style",
        "racquet_swing_speed":"Racquet Swing Speed",
        "racquet_colors":"Racquet Colors",
        "racquet_grip":"Racquet Grip Type"
    }
    
    for col in object_cols:        
        # Expand "in²" before translating, because the translation table deletes "²"
        _content = _df[col].fillna("").astype(str).str.replace("in²", "inches squared", regex=False)
        _content = _content.str.translate(_replacements).str.replace("  ", " ", regex=False)
        _df["combined_col"] += _title_dict[col] + ": " + _content + "\n"
            
    return _df["combined_col"]
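
The order of the two cleaning steps matters: if the translation table deletes `²` before the `in²` replacement runs, that replacement can never match. The sketch below shows both orders on a made-up string using plain `str` methods. The sample text and the reduced table are assumptions for illustration, not values from the dataset.

```python
# A reduced translation table. When str.maketrans is given a dict,
# every key must be a single character; values may be longer strings.
table = str.maketrans({"²": "", "&": "and", "+": "plus"})

sample = "Head size: 100 in² & 16+ mains"  # made-up text for illustration

# Translating first removes "²", so the later "in²" replacement finds nothing.
translate_first = sample.translate(table).replace("in²", "inches squared")

# Replacing first expands the unit before the table strips the character.
replace_first = sample.replace("in²", "inches squared").translate(table)

print(translate_first)
print(replace_first)
```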


In [9]:
struct_combined_df["structured_combined_text"] = structured_combine_text(
    df = struct_combined_df,
    object_cols = [col for col in object_cols if col != "racquet_img"]
)

In [10]:
struct_combined_df[["racquet_name", "racquet_price", "structured_combined_text"]]

Unnamed: 0,racquet_name,racquet_price,structured_combined_text
0,Babolat Boost Aero Pink,119.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
1,Babolat Boost Aero,119.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
2,Babolat Pure Aero 2023,289.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
3,Babolat Pure Aero 98 2023,299.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
4,Babolat Pure Aero Lite 2023,269.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
...,...,...,...
321,Yonex VCORE 98+,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
322,Yonex VCORE 100L,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
323,Yonex VCORE 95,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
324,Yonex VCORE 98,305.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...


In [None]:
## Save to DataShelf


In [11]:
struct_combined_df

Unnamed: 0,racquet_brand,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,racquet_swingweight,racquet_composition,racquet_power,racquet_stroke_style,...,racquet_strung_weight_oz,racquet_balance_in,racquet_balance_HH_HL,racquet_stiffness,racquet_avg_beam_width,racquet_mains,racquet_crosses,racquet_tension_lower,racquet_tension_upper,structured_combined_text
0,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero Pink,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,9.7,13.70,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
1,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,9.7,13.70,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
2,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 2023,4.6,289.0,"With its ""best of class"" combination of speed...",318.0,Graphite,Low-Medium,Medium-Full,...,11.2,12.99,4.0,66.0,24.000000,16.0,19.0,50.0,59.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
3,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 98 2023,4.9,299.0,Babolat adds another chapter to the most cont...,321.0,Graphite,Low-Medium,Medium-Full,...,11.4,12.79,6.0,66.0,22.000000,16.0,20.0,50.0,59.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
4,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero Lite 2023,4.6,269.0,Updated with a softer feel and wider string s...,304.0,Graphite,Low-Medium,Medium-Full,...,10.0,13.38,1.0,65.0,24.000000,16.0,19.0,50.0,59.0,Racquet Brand: Babolat\nRacquet Name: Babolat ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98+,5.0,305.0,Yonex adds another chapter to the VCORE 98+! ...,333.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,11.4,13.18,5.0,62.0,22.333333,16.0,19.0,45.0,60.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
322,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 100L,4.2,305.0,"With the 2023 version of the VCORE 100L, Yone...",312.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,10.5,13.38,1.0,66.0,24.200000,16.0,19.0,45.0,60.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
323,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 95,4.8,305.0,Yonex adds another chapter to the VCORE 95! L...,321.0,2G-Namd FlexForce/H.M. Graphite,Low,Full,...,11.5,12.59,7.0,61.0,21.666667,16.0,20.0,45.0,60.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...
324,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98,4.8,305.0,"With its redesigned beam and softer feel, the...",318.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,11.4,12.79,6.0,62.0,22.333333,16.0,19.0,45.0,60.0,Racquet Brand: Yonex\nRacquet Name: Yonex VCOR...


### Test Semantic Search

In [12]:
from searchlite.document import Document

In [13]:
text = struct_combined_df["structured_combined_text"]
metadata = struct_combined_df[["racquet_brand", "racquet_name", "racquet_rating", "racquet_price"]].to_dict(orient="records")

doc = Document(texts=text, metadata=metadata)

In [14]:
doc

Document instance with 326 texts. Metadata contains the following fields: racquet_brand, racquet_name, racquet_rating, racquet_price. Embeddings: Not Ready.
Embedder:TFIDFEmbedder object implemented using scikit-learn.
 Embedder fitted: True

In [15]:
doc.embed()

In [16]:
query = "I want a black and pink colored racquet, with low power that is under $200"
result = doc.query(query_text=query, top_k=10)

doc.display_results(output_list_dicts=result, style="tabulate")  # It does an OK job, but I think this way of combining text adds noise

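
The query above includes a numeric constraint ("under $200"). A TF-IDF embedding can only match those tokens literally, so it has no notion of a price range. One option, sketched here under assumptions, is to pull such a constraint out of the query and apply it as a filter on the metadata. The `extract_max_price` helper, its regular expression, and the toy `results` list are hypothetical and are not part of `searchlite`.

```python
import re
from typing import Optional

def extract_max_price(query: str) -> Optional[float]:
    """Return the amount following 'under' or 'below', if any (hypothetical helper)."""
    match = re.search(r"(?:under|below)\s+\$?(\d+(?:\.\d+)?)", query, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

# Toy results shaped like the metadata records built earlier (values are illustrative).
results = [
    {"racquet_name": "Racquet A", "racquet_price": 119.0},
    {"racquet_name": "Racquet B", "racquet_price": 289.0},
]

query = "I want a black and pink colored racquet, with low power that is under $200"
max_price = extract_max_price(query)

# Keep only rows below the price ceiling; keep everything if no ceiling was found.
filtered = [r for r in results if max_price is None or r["racquet_price"] < max_price]
print(max_price, [r["racquet_name"] for r in filtered])
```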

## Natural Combined Text Field

In [17]:
natural_combined_df = df.copy()
natural_combined_df["nat_combined_text"] = ""

In [18]:
natural_combined_df.head(1)

Unnamed: 0,racquet_brand,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,racquet_swingweight,racquet_composition,racquet_power,racquet_stroke_style,...,racquet_strung_weight_oz,racquet_balance_in,racquet_balance_HH_HL,racquet_stiffness,racquet_avg_beam_width,racquet_mains,racquet_crosses,racquet_tension_lower,racquet_tension_upper,nat_combined_text
0,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero Pink,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,9.7,13.7,-2.0,65.0,24.0,16.0,19.0,50.0,55.0,


The Babolat Boost Aero Pink is a low-medium powered racquet designed for players with medium-full strokes and medium-fast swing speeds. It features a stiffness rating of 65.0 and has a graphite composition. The racquet has a 309 ounce swingweight, a 102 square inch head size, and has a 16.0 by 19.0 string pattern. The racquet has a tension range of 50.0 pounds (lbs) to 55.0 pounds (lbs). The racquet also has an average beam width of 24.0 inches, with a 13.7 inch balance point, and is 2 points head light. The racquet has a black/pink colorway and has a price of 119.0 dollars.

In [19]:
def create_natural_combined_text(row:pd.Series) -> str:
    def safe(val):
        # Convert to str before stripping so non-string values do not raise
        return "unknown" if pd.isna(val) else str(val).strip()
    
    if row["racquet_balance_HH_HL"] < 0:
        hh_hl_tag = "head light"
    elif row["racquet_balance_HH_HL"]>0:
        hh_hl_tag = "head heavy"
    else:
        hh_hl_tag = "equally balanced"
    
    combined_text = (
        f"The {safe(row['racquet_name'])} is a {safe(row['racquet_power']).lower()} powered racquet designed for players with "
        f"{safe(row['racquet_stroke_style']).lower()} strokes and {safe(row['racquet_swing_speed']).lower()} swings. "
        f"It features a stiffness rating of {row['racquet_stiffness']} and a {str(row['racquet_composition']).lower()} "
        f"composition. The racquet has a {row['racquet_swingweight']} ounce swing weight, a {row['racquet_head_size_sq_in']} "
        f"square inch head size, a {row['racquet_strung_weight_oz']} ounce strung weight, "
        f"and has a {row['racquet_mains']} by {row['racquet_crosses']} string pattern. "
        f"The racquet has a tension range of {row['racquet_tension_lower']} pounds (lbs) to {row['racquet_tension_upper']} pounds (lbs). "
        f"The racquet has an average beam width of {row['racquet_avg_beam_width']}, with a {row['racquet_balance_in']} inch balance point, "
        f"and is {'' if row['racquet_balance_HH_HL'] == 0 else row['racquet_balance_HH_HL']} {hh_hl_tag}. The racquet has a "
        f"{row['racquet_colors']} colorway and has a price of {row['racquet_price']} dollars."
        )
    
    return " ".join(combined_text.split())
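
The example sentence before the function reads "2 points head light", whereas interpolating the raw `racquet_balance_HH_HL` value keeps its sign and decimal (the chunk output later in this notebook shows "-2.0 head light"). A minimal sketch of a helper that renders the magnitude instead is below. `describe_balance` is a hypothetical name, and the sign convention is deliberately left as an explicit argument: the table earlier pairs a 13.70 inch balance point with -2.0 and a 12.99 inch balance point with 4.0, which, if the frames are 27 inches long (midpoint 13.5 inches), would suggest that positive values mean head light. That is worth verifying against the data source before relying on either label.

```python
import math

def describe_balance(points: float, *, positive_is_head_light: bool) -> str:
    """Render a signed balance value as e.g. '2 points head light' (hypothetical helper).

    The caller must state the sign convention; it is not inferred from the data.
    """
    if math.isnan(points):
        return "unknown balance"
    if points == 0:
        return "evenly balanced"
    head_light = (points > 0) == positive_is_head_light
    label = "head light" if head_light else "head heavy"
    magnitude = abs(points)
    unit = "point" if magnitude == 1 else "points"
    # The ":g" format drops a trailing ".0", so 2.0 is rendered as "2".
    return f"{magnitude:g} {unit} {label}"

print(describe_balance(-2.0, positive_is_head_light=False))
print(describe_balance(-2.0, positive_is_head_light=True))
```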

In [20]:
natural_combined_df["nat_combined_text"] = natural_combined_df.apply(create_natural_combined_text, axis = 1)

In [21]:
natural_combined_df[["racquet_name", "racquet_price", "nat_combined_text"]]

Unnamed: 0,racquet_name,racquet_price,nat_combined_text
0,Babolat Boost Aero Pink,119.0,The Babolat Boost Aero Pink is a low-medium po...
1,Babolat Boost Aero,119.0,The Babolat Boost Aero is a low-medium powered...
2,Babolat Pure Aero 2023,289.0,The Babolat Pure Aero 2023 is a low-medium pow...
3,Babolat Pure Aero 98 2023,299.0,The Babolat Pure Aero 98 2023 is a low-medium ...
4,Babolat Pure Aero Lite 2023,269.0,The Babolat Pure Aero Lite 2023 is a low-mediu...
...,...,...,...
321,Yonex VCORE 98+,305.0,The Yonex VCORE 98+ is a low-medium powered ra...
322,Yonex VCORE 100L,305.0,The Yonex VCORE 100L is a low-medium powered r...
323,Yonex VCORE 95,305.0,The Yonex VCORE 95 is a low powered racquet de...
324,Yonex VCORE 98,305.0,The Yonex VCORE 98 is a low-medium powered rac...


In [None]:
## Save to DataShelf


### Test Semantic Search

In [22]:
text = natural_combined_df["nat_combined_text"]
metadata = natural_combined_df[["racquet_brand", "racquet_name", "racquet_price", "racquet_rating"]].to_dict(orient="records")

In [23]:
doc = Document(texts=text, metadata=metadata)
doc

Document instance with 326 texts. Metadata contains the following fields: racquet_brand, racquet_name, racquet_price, racquet_rating. Embeddings: Not Ready.
Embedder:TFIDFEmbedder object implemented using scikit-learn.
 Embedder fitted: True

In [24]:
doc.embed()

In [25]:
query = "I want a black and pink colored racquet, with low power that is under $200"
result = doc.query(query_text=query, top_k=10)

doc.display_results(output_list_dicts=result, style="tabulate") #it does a slightly better job than the structured text
# Modified the f-string to include all information and it actually does decently well. 
# The only issue is that it has none of the information from the description column -- meaning that info is just unused.

| racquet_brand   | racquet_name               |   racquet_price |   racquet_rating | text

## Natural Combined Text Field + Racquet Description Field

In [26]:
natural_combined_df_v2 = df.copy()
natural_combined_df_v2["nat_combined_text_v2"] = ""

In [27]:
def create_natural_combined_text_v2(row:pd.Series) -> str:
    def safe(val):
        # Convert to str before stripping so non-string values do not raise
        return "unknown" if pd.isna(val) else str(val).strip()
    
    if row["racquet_balance_HH_HL"] < 0:
        hh_hl_tag = "head light"
    elif row["racquet_balance_HH_HL"]>0:
        hh_hl_tag = "head heavy"
    else:
        hh_hl_tag = "equally balanced"
    
    combined_text = (
        f"The {safe(row['racquet_name'])} is a {safe(row['racquet_power']).lower()} powered racquet designed for players with "
        f"{safe(row['racquet_stroke_style']).lower()} strokes and {safe(row['racquet_swing_speed']).lower()} swings. "
        f"It features a stiffness rating of {row['racquet_stiffness']} and a {str(row['racquet_composition']).lower()} "
        f"composition. The racquet has a {row['racquet_swingweight']} ounce swing weight, a {row['racquet_head_size_sq_in']} "
        f"square inch head size, a {row['racquet_strung_weight_oz']} ounce strung weight, "
        f"and has a {row['racquet_mains']} by {row['racquet_crosses']} string pattern. "
        f"The racquet has a tension range of {row['racquet_tension_lower']} pounds (lbs) to {row['racquet_tension_upper']} pounds (lbs). "
        f"The racquet has an average beam width of {row['racquet_avg_beam_width']}, with a {row['racquet_balance_in']} inch balance point, "
        f"and is {'' if row['racquet_balance_HH_HL'] == 0 else row['racquet_balance_HH_HL']} {hh_hl_tag}. The racquet has a "
        f"{row['racquet_colors']} colorway and has a price of {row['racquet_price']} dollars."
        f"\n\nHere is the marketing blurb for the {safe(row['racquet_name'])}:\n{row['racquet_desc']}"
        )
    
    return " ".join(combined_text.split())

In [28]:
natural_combined_df_v2["nat_combined_text_v2"] = natural_combined_df_v2.apply(create_natural_combined_text_v2, axis = 1)

In [29]:
natural_combined_df_v2[["racquet_name", "racquet_price", "nat_combined_text_v2"]]

Unnamed: 0,racquet_name,racquet_price,nat_combined_text_v2
0,Babolat Boost Aero Pink,119.0,The Babolat Boost Aero Pink is a low-medium po...
1,Babolat Boost Aero,119.0,The Babolat Boost Aero is a low-medium powered...
2,Babolat Pure Aero 2023,289.0,The Babolat Pure Aero 2023 is a low-medium pow...
3,Babolat Pure Aero 98 2023,299.0,The Babolat Pure Aero 98 2023 is a low-medium ...
4,Babolat Pure Aero Lite 2023,269.0,The Babolat Pure Aero Lite 2023 is a low-mediu...
...,...,...,...
321,Yonex VCORE 98+,305.0,The Yonex VCORE 98+ is a low-medium powered ra...
322,Yonex VCORE 100L,305.0,The Yonex VCORE 100L is a low-medium powered r...
323,Yonex VCORE 95,305.0,The Yonex VCORE 95 is a low powered racquet de...
324,Yonex VCORE 98,305.0,The Yonex VCORE 98 is a low-medium powered rac...


In [None]:
## Save to DataShelf


### Test Semantic Search

In [30]:
text = natural_combined_df_v2["nat_combined_text_v2"]
metadata = natural_combined_df_v2[["racquet_brand", "racquet_name", "racquet_price"]].to_dict(orient="records")

doc = Document(texts=text, metadata=metadata)
doc

Document instance with 326 texts. Metadata contains the following fields: racquet_brand, racquet_name, racquet_price. Embeddings: Not Ready.
Embedder:TFIDFEmbedder object implemented using scikit-learn.
 Embedder fitted: True

In [31]:
doc.embed()

In [None]:
query = "I want a black and pink colored racquet, with low power that is under $200"
result = doc.query(query_text=query, top_k=10)

doc.display_results(output_list_dicts=result, style="tabulate") #it does a slightly better job than the structured text
# Modified the f-string to include all information, including the marketing description, and it does decently well.



## Test natural_combined_df_v2 for usability with all-MiniLM-L6-v2

In [43]:
# Create a column with the length of each text in whitespace-separated words (not model tokens)
natural_combined_df_v2["text_length"] = natural_combined_df_v2["nat_combined_text_v2"].apply(lambda x: len(x.split()))

In [41]:
natural_combined_df_v2

Unnamed: 0,racquet_brand,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,racquet_swingweight,racquet_composition,racquet_power,racquet_stroke_style,...,racquet_balance_in,racquet_balance_HH_HL,racquet_stiffness,racquet_avg_beam_width,racquet_mains,racquet_crosses,racquet_tension_lower,racquet_tension_upper,nat_combined_text_v2,text_length
0,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero Pink,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,13.70,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,The Babolat Boost Aero Pink is a low-medium po...,321
1,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,13.70,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,The Babolat Boost Aero is a low-medium powered...,313
2,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 2023,4.6,289.0,"With its ""best of class"" combination of speed...",318.0,Graphite,Low-Medium,Medium-Full,...,12.99,4.0,66.0,24.000000,16.0,19.0,50.0,59.0,The Babolat Pure Aero 2023 is a low-medium pow...,268
3,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 98 2023,4.9,299.0,Babolat adds another chapter to the most cont...,321.0,Graphite,Low-Medium,Medium-Full,...,12.79,6.0,66.0,22.000000,16.0,20.0,50.0,59.0,The Babolat Pure Aero 98 2023 is a low-medium ...,304
4,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero Lite 2023,4.6,269.0,Updated with a softer feel and wider string s...,304.0,Graphite,Low-Medium,Medium-Full,...,13.38,1.0,65.0,24.000000,16.0,19.0,50.0,59.0,The Babolat Pure Aero Lite 2023 is a low-mediu...,257
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98+,5.0,305.0,Yonex adds another chapter to the VCORE 98+! ...,333.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,13.18,5.0,62.0,22.333333,16.0,19.0,45.0,60.0,The Yonex VCORE 98+ is a low-medium powered ra...,301
322,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 100L,4.2,305.0,"With the 2023 version of the VCORE 100L, Yone...",312.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,13.38,1.0,66.0,24.200000,16.0,19.0,45.0,60.0,The Yonex VCORE 100L is a low-medium powered r...,315
323,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 95,4.8,305.0,Yonex adds another chapter to the VCORE 95! L...,321.0,2G-Namd FlexForce/H.M. Graphite,Low,Full,...,12.59,7.0,61.0,21.666667,16.0,20.0,45.0,60.0,The Yonex VCORE 95 is a low powered racquet de...,309
324,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98,4.8,305.0,"With its redesigned beam and softer feel, the...",318.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,12.79,6.0,62.0,22.333333,16.0,19.0,45.0,60.0,The Yonex VCORE 98 is a low-medium powered rac...,358


In [42]:
max(natural_combined_df_v2["text_length"])

439

In [49]:
# Prototype the chunking logic on a single row (index 1) before wrapping it in a function
words = natural_combined_df_v2["nat_combined_text_v2"][1].split()
chunks = []
for i in range(0, len(words), 256):
    chunks.append(" ".join(words[i:i+256]))

In [50]:
chunks

["The Babolat Boost Aero is a low-medium powered racquet designed for players with medium-full strokes and medium-fast swings. It features a stiffness rating of 65.0 and a graphite composition. The racquet has a 309.0 ounce swing weight, a 102.0 square inch head size, a 9.7 ounce strung weight, and has a 16.0 by 19.0 string pattern. The racquet has a tension range of 50.0 pounds (lbs) to 55.0 pounds (lbs). The racquet has an average beam width of 24.0, with a 13.7 inch balance point, and is -2.0 head light. The racquet has a Black/Yellow colorway and has a price of 119.0 dollars. Here is the marketing blurb for the Babolat Boost Aero: This racquet comes pre-strung for added convenience and value!The Babolat Boost Aero is ideal for beginners or recreational players who want a great value. This racquet should also work well for juniors who are ready to take the leap into their first adult size racquet. With its light weight and easy acceleration, the Boost Aero moves fast. The crisp feel

In [53]:
len(chunks[0].split())

256

In [54]:
def chunk_text(text, max_len = 256):
    if not text or not isinstance(text, str):
        return []
    
    words = text.strip().split()
    chunks = []
    for i in range(0, len(words), max_len):
        chunk = words[i:i+max_len]
        chunks.append(" ".join(chunk))
    return chunks
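
Fixed windows can cut a sentence in half at the boundary. One common mitigation, sketched below as an assumption rather than something this notebook does, is to let consecutive windows overlap by a few words so that text near a boundary appears in both chunks. `chunk_with_overlap` and its `overlap` parameter are hypothetical.

```python
def chunk_with_overlap(text: str, max_len: int = 256, overlap: int = 32) -> list[str]:
    """Split text into word windows of at most max_len words that share `overlap` words."""
    if not isinstance(text, str) or not text.strip():
        return []
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")

    words = text.split()
    step = max_len - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_len]))
        if start + max_len >= len(words):
            break  # the current window already reached the end of the text
    return chunks

# 439 words is the longest combined text measured above.
toy_text = " ".join(f"w{i}" for i in range(439))
toy_chunks = chunk_with_overlap(toy_text, max_len=256, overlap=32)
print([len(c.split()) for c in toy_chunks])
```

Note that these windows count words. To my knowledge all-MiniLM-L6-v2 truncates input beyond 256 word pieces, and one word can map to several word pieces, so a 256-word window may still be clipped by the model.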

In [55]:
v3_df = natural_combined_df_v2.copy()

In [56]:
v3_df["id"] = v3_df.index

In [57]:
v3_df

Unnamed: 0,racquet_brand,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,racquet_swingweight,racquet_composition,racquet_power,racquet_stroke_style,...,racquet_balance_HH_HL,racquet_stiffness,racquet_avg_beam_width,racquet_mains,racquet_crosses,racquet_tension_lower,racquet_tension_upper,nat_combined_text_v2,text_length,id
0,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero Pink,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,The Babolat Boost Aero Pink is a low-medium po...,321,0
1,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Boost Aero,5.0,119.0,This racquet comes pre-strung for added conve...,309.0,Graphite,Low-Medium,Medium-Full,...,-2.0,65.0,24.000000,16.0,19.0,50.0,55.0,The Babolat Boost Aero is a low-medium powered...,313,1
2,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 2023,4.6,289.0,"With its ""best of class"" combination of speed...",318.0,Graphite,Low-Medium,Medium-Full,...,4.0,66.0,24.000000,16.0,19.0,50.0,59.0,The Babolat Pure Aero 2023 is a low-medium pow...,268,2
3,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero 98 2023,4.9,299.0,Babolat adds another chapter to the most cont...,321.0,Graphite,Low-Medium,Medium-Full,...,6.0,66.0,22.000000,16.0,20.0,50.0,59.0,The Babolat Pure Aero 98 2023 is a low-medium ...,304,3
4,Babolat,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Aero Lite 2023,4.6,269.0,Updated with a softer feel and wider string s...,304.0,Graphite,Low-Medium,Medium-Full,...,1.0,65.0,24.000000,16.0,19.0,50.0,59.0,The Babolat Pure Aero Lite 2023 is a low-mediu...,257,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98+,5.0,305.0,Yonex adds another chapter to the VCORE 98+! ...,333.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,5.0,62.0,22.333333,16.0,19.0,45.0,60.0,The Yonex VCORE 98+ is a low-medium powered ra...,301,321
322,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 100L,4.2,305.0,"With the 2023 version of the VCORE 100L, Yone...",312.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,1.0,66.0,24.200000,16.0,19.0,45.0,60.0,The Yonex VCORE 100L is a low-medium powered r...,315,322
323,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 95,4.8,305.0,Yonex adds another chapter to the VCORE 95! L...,321.0,2G-Namd FlexForce/H.M. Graphite,Low,Full,...,7.0,61.0,21.666667,16.0,20.0,45.0,60.0,The Yonex VCORE 95 is a low powered racquet de...,309,323
324,Yonex,https://img.tennis-warehouse.com/watermark/rs....,Yonex VCORE 98,4.8,305.0,"With its redesigned beam and softer feel, the...",318.0,2G-Namd FlexForce/H.M. Graphite,Low-Medium,Medium-Full,...,6.0,62.0,22.333333,16.0,19.0,45.0,60.0,The Yonex VCORE 98 is a low-medium powered rac...,358,324


In [67]:
chunked_rows = []
for _, row in v3_df.iterrows():
    chunks = chunk_text(row["nat_combined_text_v2"], max_len = 256)
    for chunk in chunks:
        chunked_rows.append({
            "id": row["id"],
            "text": chunk,
            "embedding": None
        })
        
chunked_rows[0:5] #As a first pass, allow for mid-sentence splitting since all-MiniLM-L6-v2 seems to be robust to partial sentences.

[{'id': 0,
  'text': "The Babolat Boost Aero Pink is a low-medium powered racquet designed for players with medium-full strokes and medium-fast swings. It features a stiffness rating of 65.0 and a graphite composition. The racquet has a 309.0 ounce swing weight, a 102.0 square inch head size, a 9.7 ounce strung weight, and has a 16.0 by 19.0 string pattern. The racquet has a tension range of 50.0 pounds (lbs) to 55.0 pounds (lbs). The racquet has an average beam width of 24.0, with a 13.7 inch balance point, and is -2.0 head light. The racquet has a Black/Pink colorway and has a price of 119.0 dollars. Here is the marketing blurb for the Babolat Boost Aero Pink: This racquet comes pre-strung for added convenience and value!Featuring a hot pink flare, the Babolat Boost Aero Pink is ideal for beginners or recreational players who want a great value. This racquet should also work well for juniors who are ready to take the leap into their first adult size racquet. With its light weight and

In [68]:
df_chunked = pd.DataFrame(chunked_rows)

In [69]:
df_chunked

Unnamed: 0,id,text,embedding
0,0,The Babolat Boost Aero Pink is a low-medium po...,
1,0,at net this racquet comes around wonderfully f...,
2,1,The Babolat Boost Aero is a low-medium powered...,
3,1,The speedy feel is also an asset on serves whe...,
4,2,The Babolat Pure Aero 2023 is a low-medium pow...,
...,...,...,...
601,323,"feedback, this update continues with 2-NAMD Fl...",
602,324,The Yonex VCORE 98 is a low-medium powered rac...,
603,324,"new features include SIF Grommets, which deplo...",
604,325,The Yonex VCORE 98 Tour is a low-medium powere...,


In [70]:
df_chunked.id.unique()

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

In [71]:
v3_df.id.unique()

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18
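
Comparing the two printed arrays by eye is error-prone, especially when the output is cut off. A set difference gives a direct answer. The sketch below uses small toy frames in place of `v3_df` and `df_chunked`, and the helper name `missing_ids` is my own.

```python
import pandas as pd


def missing_ids(source: pd.DataFrame, chunked: pd.DataFrame, key: str = "id") -> set:
    """Return key values present in `source` but absent from `chunked`."""
    return set(source[key]) - set(chunked[key])


# Toy frames standing in for v3_df and df_chunked.
source_demo = pd.DataFrame({"id": [0, 1, 2]})
chunked_demo = pd.DataFrame({"id": [0, 0, 1, 1, 2], "text": list("abcde")})

# An empty set means every source id kept at least one chunk.
print(missing_ids(source_demo, chunked_demo))
```

In the notebook this would be called as `missing_ids(v3_df, df_chunked)`. A non-empty result would point at racquets whose text produced no chunks.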

In [64]:
from sentence_transformers import SentenceTransformer

In [65]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [66]:
embeddings = model.encode(df_chunked["text"].tolist(), show_progress_bar = True)

Batches: 100%|██████████| 19/19 [00:04<00:00,  4.03it/s]


In [76]:
df_chunked["embedding"] = embeddings.tolist()

In [77]:
df_chunked

Unnamed: 0,id,text,embedding
0,0,The Babolat Boost Aero Pink is a low-medium po...,"[0.015591485425829887, 0.008179176598787308, -..."
1,0,at net this racquet comes around wonderfully f...,"[0.0009336236980743706, 0.0005472057964652777,..."
2,1,The Babolat Boost Aero is a low-medium powered...,"[0.0034383346792310476, 0.02328336052596569, -..."
3,1,The speedy feel is also an asset on serves whe...,"[-0.016193866729736328, 0.0009101557079702616,..."
4,2,The Babolat Pure Aero 2023 is a low-medium pow...,"[0.010298267006874084, -0.03734634444117546, -..."
...,...,...,...
601,323,"feedback, this update continues with 2-NAMD Fl...","[-0.031446125358343124, 0.015612219460308552, ..."
602,324,The Yonex VCORE 98 is a low-medium powered rac...,"[-0.00898976344615221, 0.02108895033597946, -0..."
603,324,"new features include SIF Grommets, which deplo...","[-0.029609326273202896, 0.006963102146983147, ..."
604,325,The Yonex VCORE 98 Tour is a low-medium powere...,"[0.006291043944656849, 0.029425427317619324, 0..."
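
With one embedding per chunk, a query can be scored against every chunk and then rolled up to the racquet level. The sketch below is one possible approach, not something the notebook has run: it uses cosine similarity and keeps each racquet's best-scoring chunk. The function name `rank_ids` and the two-dimensional toy vectors are assumptions for illustration. In practice the query vector would come from `model.encode(...)` on the query string.

```python
import numpy as np
import pandas as pd


def rank_ids(query_vec, chunk_df: pd.DataFrame, top_k: int = 3) -> pd.Series:
    """Rank ids by their best-matching chunk using cosine similarity."""
    matrix = np.asarray(chunk_df["embedding"].tolist(), dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = pd.Series(matrix @ query, index=chunk_df["id"].to_numpy())
    # A racquet can own several chunks, so keep its highest-scoring one.
    return scores.groupby(level=0).max().sort_values(ascending=False).head(top_k)


# Toy data shaped like df_chunked: two chunks for id 0, one chunk for id 1.
demo = pd.DataFrame({
    "id": [0, 0, 1],
    "embedding": [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]],
})
print(rank_ids([0.0, 1.0], demo))
```

Taking the maximum per id rewards a racquet for having one highly relevant chunk. Averaging the chunk scores would instead favour racquets whose whole description matches the query.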
