# Preliminary Racquet Features EDA
In this notebook, I take a preliminary look at the racquet data I scraped from TW's website. In particular, I note any outlier columns and potential data type issues.

## Table of Contents
1. [Load data and basic snapshot](#load-data-and-basic-snapshot)

2. [Examining object columns that need to be converted to float or int](#examining-object-columns-that-need-to-be-converted-to-float-or-int)

3. [Looking at how junior racquets affect the data](#looking-at-how-junior-racquets-affect-the-data)

4. [Takeaways and next steps](#takeaways-and-next-steps)

In [1]:
import datashelf.core as ds
import pandas as pd
import numpy as np

## Load data and basic snapshot
In this section, I run the basic EDA functions on the data to get an idea of what it actually looks like. I take note of any oddities or potential issues as I look at each output.

In [2]:
ds.ls("coll-files") # enter "racquets" when prompted for collection name

+------------------------+------------------------------------------------------------------+---------------------+----------------------+---------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-----------+
| name                   | hash                                                             | date_created        | date_last_modified   | tag     |   version | message                                                                                                                                                                                 | file_path                                                                                                       | deleted   |
| racquets_metadata.yaml |                            

In [5]:
scraped_racquet_data = ds.load(
    collection_name = "racquets",
    hash_value = "55dabe54d8b602a3c993460db0bf085737dc2c78a148e6d9fe5ea09f75b0e8ef"
)
scraped_racquet_data

Unnamed: 0,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,Head Size,Length,Strung Weight,Balance,Swingweight,...,Swing Speed:,Racquet Colors:,Grip Type:,String Pattern:,String Tension:,Age,Weight,Height,Other,Strung Weight.1
0,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive 2025,4.9,289.00,The Pure Drive is popular for a reason. Boast...,100 in² / 645.16 cm²,27in / 68.58cm,11.2oz / 318g,12.99in / 32.99cm / 4 pts HL,317.0,...,,,,,,,,,,
1,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive 98 2025,4.5,299.00,Originally launched in 2019 under the VS moni...,98 in² / 632.26 cm²,27in / 68.58cm,11.4oz / 323g,13.18in / 33.48cm / 3 pts HL,326.0,...,,,,,,,,,,
2,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive 98 2-Pack 2025,5.0,579.00,This product is for 2 Pure Drive 98 racquets....,98 in² / 632.26 cm²,27in / 68.58cm,11.4oz / 323g,13.18in / 33.48cm / 3 pts HL,323.0,...,,,,,,,,,,
3,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive Plus 2025,5.0,289.00,Babolat adds another chapter to one of the ga...,100 in² / 645.16 cm²,27.5in / 69.85cm,11.2oz / 318g,13in / 33.02cm / 6 pts HL,325.0,...,,,,,,,,,,
4,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive Team 2025,5.0,269.00,The Pure Drive Team 2025 is defined by its us...,100 in² / 645.16 cm²,27in / 68.58cm,10.6oz / 301g,12.85in / 32.64cm / 5 pts HL,308.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,https://img.tennis-warehouse.com/watermark/rs....,Solinco Blackout 285,4.5,229.99,Introducing the Blackout 285! Like the heavie...,100 in² / 645.16 cm²,27in / 68.58cm,10.6oz / 301g,13.38in / 33.99cm / 1 pts HL,315.0,...,,,,,,,,,,
386,https://img.tennis-warehouse.com/watermark/rs....,Solinco Blackout 300 XTD,4.8,229.99,"With the Blackout 300 XTD, Solinco takes the ...",100 in² / 645.16 cm²,27.5in / 69.85cm,11.3oz / 320g,12.8in / 32.51cm / 8 pts HL,328.0,...,,,,,,,,,,
387,https://img.tennis-warehouse.com/watermark/rs....,Solinco Blackout 300 XTD+,5.0,229.99,"With the Blackout 300 XTD+, Solinco gives adv...",100 in² / 645.16 cm²,28in / 71.12cm,11.3oz / 320g,12.8in / 32.51cm / 10 pts HL,333.0,...,,,,,,,,,,
388,https://img.tennis-warehouse.com/watermark/rs....,Lacoste L23,4.5,199.00,Introducing the Lascoste L23! Following on th...,100 in² / 645.16 cm²,27in / 68.58cm,11.1oz / 315g,12.9in / 32.77cm / 5 pts HL,318.0,...,,,,,,,,,,


It might be good to extract the racquet brand name for metadata when I go to create embeddings for each racquet.
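
A minimal sketch of that brand extraction, assuming the brand is always the first whitespace-delimited token of `racquet_name` (this holds for the names visible above, but multi-word brands would need a lookup list instead). The `racquet_brand` column name and the toy frame are only for illustration.

```python
import pandas as pd

# Toy frame built from racquet names that appear in the scraped data
names = pd.DataFrame({
    "racquet_name": [
        "Babolat Pure Drive 2025",
        "Solinco Blackout 285",
        "Lacoste L23",
        "Wilson Pro Staff Six.One 95 v14",
    ]
})

# Assumption: the brand is the first word of the racquet name
names["racquet_brand"] = names["racquet_name"].str.split().str[0]
print(names["racquet_brand"].tolist())
```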

I look at the shape of the df below. It has 390 rows and **37** columns. This is odd because the dictionary I built when writing the scraper did not have 37 keys. My initial guess is that something unexpected happened with the spec tables, since that was the section that caused me the most trouble when scraping.

In [6]:
raw_df = scraped_racquet_data.copy()
raw_df.shape

(390, 37)
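
One way to pin down where the extra columns come from is to compare the actual column names against the keys the scraper was supposed to produce. The sketch below uses a hypothetical `expected_cols` list and a toy empty frame; the real list would come from the scraper's dictionary.

```python
import pandas as pd

# Hypothetical subset of the keys the scraper was meant to produce
expected_cols = ["racquet_name", "Balance", "Swingweight"]

# Toy frame with a few of the unexpected column names seen in the data
toy = pd.DataFrame(
    columns=["racquet_name", "Balance", "Swingweight",
             "Balance:", "Swingweight:", "Age"]
)

# Columns present in the data but not expected, and vice versa
unexpected = [c for c in toy.columns if c not in expected_cols]
missing = [c for c in expected_cols if c not in toy.columns]
print(unexpected, missing)
```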

Below is a list of each column, its non-null count, and its data type. Interestingly, there is a group of columns (20 to 31) that are clearly duplicates and are entirely null.

In [8]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 37 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   racquet_img      390 non-null    object 
 1   racquet_name     390 non-null    object 
 2   racquet_rating   337 non-null    float64
 3   racquet_price    390 non-null    float64
 4   racquet_desc     390 non-null    object 
 5   Head Size        361 non-null    object 
 6   Length           361 non-null    object 
 7   Strung Weight    307 non-null    object 
 8   Balance          309 non-null    object 
 9   Swingweight      309 non-null    float64
 10  Stiffness        308 non-null    object 
 11  Beam Width       309 non-null    object 
 12  Composition      360 non-null    object 
 13  Power Level      309 non-null    object 
 14  Stroke Style     309 non-null    object 
 15  Swing Speed      309 non-null    object 
 16  Racquet Colors   351 non-null    object 
 17  Grip Type       

Looking at the list of column names confirms that the spec columns were duplicated for some reason: the duplicates carry a trailing colon (e.g. 'Balance:' alongside 'Balance'). However, since the rest of the data seems to be in good condition, I think it will be easier to remove the duplicated columns than to edit the scraper.

In [9]:
raw_df.columns

Index(['racquet_img', 'racquet_name', 'racquet_rating', 'racquet_price',
       'racquet_desc', 'Head Size', 'Length', 'Strung Weight', 'Balance',
       'Swingweight', 'Stiffness', 'Beam Width', 'Composition', 'Power Level',
       'Stroke Style', 'Swing Speed', 'Racquet Colors', 'Grip Type',
       'String Pattern', 'String Tension', 'Balance:', 'Swingweight:',
       'Stiffness:', 'Beam Width:', 'Composition:', 'Power Level:',
       'Stroke Style:', 'Swing Speed:', 'Racquet Colors:', 'Grip Type:',
       'String Pattern:', 'String Tension:', 'Age', 'Weight', 'Height',
       'Other', 'Strung  Weight'],
      dtype='object')
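
A minimal sketch of dropping the duplicated columns, assuming the duplicates are exactly the columns whose names end in a colon and that contain no data. It runs on a toy frame rather than `raw_df`, and the `dup_cols` and `cleaned` names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern: real spec columns plus empty ":" duplicates
toy = pd.DataFrame({
    "Balance": ["13in / 33.02cm / 4 pts HL", None],
    "Balance:": [np.nan, np.nan],
    "Swingweight": [317.0, 326.0],
    "Swingweight:": [np.nan, np.nan],
})

# Select columns that end with ":" AND are entirely null
is_colon = toy.columns.str.endswith(":")
is_empty = toy.isna().all().to_numpy()
dup_cols = toy.columns[is_colon & is_empty]

cleaned = toy.drop(columns=dup_cols)
print(list(cleaned.columns))
```

Requiring both conditions guards against dropping a colon-suffixed column that unexpectedly holds real values.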

## Examining object columns that need to be converted to `float` or `int`

Below are the descriptive statistics for the quantitative columns. Notice that some columns that **should** be quantitative, like "Head Size", are not shown. This is because I recorded the spec values as strings (with their units) when scraping TW. I will need to process those columns to convert them into float data types. This is confirmed by the `.describe()` output for the object columns, which includes "Head Size", "Length", "Balance", etc.

In [10]:
raw_df.describe()

Unnamed: 0,racquet_rating,racquet_price,Swingweight,Balance:,Swingweight:,Stiffness:,Beam Width:,Composition:,Power Level:,Stroke Style:,Swing Speed:,Racquet Colors:,Grip Type:,String Pattern:,String Tension:
count,337.0,390.0,309.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,4.725223,199.357769,316.919094,,,,,,,,,,,,
std,0.467566,87.426284,11.176416,,,,,,,,,,,,
min,1.0,12.99,270.0,,,,,,,,,,,,
25%,4.6,129.0,310.0,,,,,,,,,,,,
50%,4.9,199.475,318.0,,,,,,,,,,,,
75%,5.0,269.0,325.0,,,,,,,,,,,,
max,5.0,579.0,345.0,,,,,,,,,,,,


In [11]:
raw_df.describe(include = ["object"])

Unnamed: 0,racquet_img,racquet_name,racquet_desc,Head Size,Length,Strung Weight,Balance,Stiffness,Beam Width,Composition,...,Swing Speed,Racquet Colors,Grip Type,String Pattern,String Tension,Age,Weight,Height,Other,Strung Weight.1
count,390,390,390,361,361,307,309,308,309,360,...,309,351,309,312,306,52,3,3,5,2
unique,387,383,384,39,24,34,81,22,95,54,...,6,85,45,15,23,9,3,3,4,1
top,https://img.tennis-warehouse.com/watermark/rs....,Wilson Pro Staff Six.One 95 v14,Introducing the second generation of the Boom...,100 in² / 645.16 cm²,27in / 68.58cm,11.4oz / 323g,13in / 33.02cm / 4 pts HL,65,23mm / 26mm / 23mm,Graphite,...,Medium-Fast,Blue,Wilson Pro Performance,16 Mains / 19 Crosses\n\n\nMains skip,50-60 pounds,9-10,8.1 ounces / 230 grams,50-55 inches / 127cm-140cm,String Tension: 45-55 pounds\nSolinco recommen...,11.1oz / 315g
freq,2,2,2,149,254,49,50,37,30,117,...,222,61,48,122,95,18,1,1,2,2


In [12]:
# Display columns that need preprocessing
raw_df[["Head Size", "Length", "Strung Weight", "Balance",
        "Stiffness", "Beam Width", "String Pattern",
        "String Tension"]]

Unnamed: 0,Head Size,Length,Strung Weight,Balance,Stiffness,Beam Width,String Pattern,String Tension
0,100 in² / 645.16 cm²,27in / 68.58cm,11.2oz / 318g,12.99in / 32.99cm / 4 pts HL,69,23mm / 26mm / 23mm,16 Mains / 19 CrossesMains skip,46-55 pounds
1,98 in² / 632.26 cm²,27in / 68.58cm,11.4oz / 323g,13.18in / 33.48cm / 3 pts HL,69,21mm / 23mm / 21mm,16 Mains / 20 Crosses\n\n\nMains skip,46-55 pounds
2,98 in² / 632.26 cm²,27in / 68.58cm,11.4oz / 323g,13.18in / 33.48cm / 3 pts HL,69,21mm / 23mm / 21mm,16 Mains / 20 Crosses\n\n\nMains skip,46-55 pounds
3,100 in² / 645.16 cm²,27.5in / 69.85cm,11.2oz / 318g,13in / 33.02cm / 6 pts HL,69,23mm / 26mm / 23mm,16 Mains / 19 CrossesMains skip,46-55 pounds
4,100 in² / 645.16 cm²,27in / 68.58cm,10.6oz / 301g,12.85in / 32.64cm / 5 pts HL,69,23mm / 26mm / 23mm,16 Mains / 19 CrossesMains skip,44-53 pounds
...,...,...,...,...,...,...,...,...
385,100 in² / 645.16 cm²,27in / 68.58cm,10.6oz / 301g,13.38in / 33.99cm / 1 pts HL,70,23.5mm / 26mm / 23mm,16 Mains / 19 Crosses\n\n\nMains skip,50-60 pounds
386,100 in² / 645.16 cm²,27.5in / 69.85cm,11.3oz / 320g,12.8in / 32.51cm / 8 pts HL,70,23.5mm / 26mm / 23mm,16 Mains / 19 Crosses\n\n\nMains skip,50-60 pounds
387,100 in² / 645.16 cm²,28in / 71.12cm,11.3oz / 320g,12.8in / 32.51cm / 10 pts HL,66,23.5mm / 26mm / 23mm,16 Mains / 19 CrossesMains skip,50-60 pounds
388,100 in² / 645.16 cm²,27in / 68.58cm,11.1oz / 315g,12.9in / 32.77cm / 5 pts HL,69,23mm / 25mm / 23mm,16 Mains / 19 Crosses\n\n\nMains skip,51-55 pounds
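
For the single-value specs, a regex that grabs the leading number is probably enough. The sketch below assumes every non-null value starts with the imperial measurement, as in the rows shown above; the output column names (`head_size_sq_in`, `length_in`, `strung_weight_oz`) are hypothetical.

```python
import pandas as pd

# Toy frame using spec strings in the same format as the scraped data
specs = pd.DataFrame({
    "Head Size": ["100 in² / 645.16 cm²", "98 in² / 632.26 cm²", None],
    "Length": ["27in / 68.58cm", "27.5in / 69.85cm", None],
    "Strung Weight": ["11.2oz / 318g", "11.4oz / 323g", None],
})

# Leading integer or decimal number at the start of the string
LEADING_NUMBER = r"^\s*(\d+(?:\.\d+)?)"

# Hypothetical names for the converted columns
new_names = {
    "Head Size": "head_size_sq_in",
    "Length": "length_in",
    "Strung Weight": "strung_weight_oz",
}

for col, new_col in new_names.items():
    extracted = specs[col].str.extract(LEADING_NUMBER, expand=False)
    specs[new_col] = pd.to_numeric(extracted)  # missing values stay NaN

print(specs[list(new_names.values())])
```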


## Looking at how junior racquets affect the data

There are two types of racquets in the data: regular racquets and junior racquets. My guess is that the junior racquets will have much less information on their specs than the regular ones. In the grand context of building a semantic search tool for racquets, I think excluding junior racquets won't have a noticeable effect on the end product. Still, I will investigate whether the NA values change noticeably without junior racquets in the data.

In [15]:
examine_jr_df = raw_df.copy()
examine_jr_df["Junior"] = examine_jr_df["racquet_name"].str.lower().str.contains("junior")
examine_jr_df.head()

Unnamed: 0,racquet_img,racquet_name,racquet_rating,racquet_price,racquet_desc,Head Size,Length,Strung Weight,Balance,Swingweight,...,Racquet Colors:,Grip Type:,String Pattern:,String Tension:,Age,Weight,Height,Other,Strung Weight.1,Junior
0,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive 2025,4.9,289.0,The Pure Drive is popular for a reason. Boast...,100 in² / 645.16 cm²,27in / 68.58cm,11.2oz / 318g,12.99in / 32.99cm / 4 pts HL,317.0,...,,,,,,,,,,False
1,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive 98 2025,4.5,299.0,Originally launched in 2019 under the VS moni...,98 in² / 632.26 cm²,27in / 68.58cm,11.4oz / 323g,13.18in / 33.48cm / 3 pts HL,326.0,...,,,,,,,,,,False
2,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive 98 2-Pack 2025,5.0,579.0,This product is for 2 Pure Drive 98 racquets....,98 in² / 632.26 cm²,27in / 68.58cm,11.4oz / 323g,13.18in / 33.48cm / 3 pts HL,323.0,...,,,,,,,,,,False
3,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive Plus 2025,5.0,289.0,Babolat adds another chapter to one of the ga...,100 in² / 645.16 cm²,27.5in / 69.85cm,11.2oz / 318g,13in / 33.02cm / 6 pts HL,325.0,...,,,,,,,,,,False
4,https://img.tennis-warehouse.com/watermark/rs....,Babolat Pure Drive Team 2025,5.0,269.0,The Pure Drive Team 2025 is defined by its us...,100 in² / 645.16 cm²,27in / 68.58cm,10.6oz / 301g,12.85in / 32.64cm / 5 pts HL,308.0,...,,,,,,,,,,False


Below, I can clearly see that removing the junior racquets reduces the percentage of NA values in the spec columns. For example, "Head Size" falls from 7.44% to 3.74% NA. This NA table also confirms the duplication issues discussed above.

In [16]:
# Create a df showing the number of NA values per column, their percentage NA,
# and their percentage NA when the junior racquets are removed from the data

# Note: this match on "Junior" is case-sensitive, unlike the "Junior" flag column above
non_junior_df = examine_jr_df[~examine_jr_df["racquet_name"].str.contains("Junior")]

na_vals = pd.Series(examine_jr_df.isna().sum(), name = "nas").to_frame()
na_vals["perc_nas"] = ((na_vals["nas"]/len(examine_jr_df)) * 100).round(2)
na_vals["perc_nas_junior_removed"] = ((non_junior_df.isna().sum()/len(non_junior_df)) * 100).round(2)

na_vals.sort_values(by = "perc_nas", ascending = True)

Unnamed: 0,nas,perc_nas,perc_nas_junior_removed
racquet_img,0,0.0,0.0
racquet_desc,0,0.0,0.0
racquet_price,0,0.0,0.0
Junior,0,0.0,0.0
racquet_name,0,0.0,0.0
Head Size,29,7.44,3.74
Length,29,7.44,3.74
Composition,30,7.69,4.05
Racquet Colors,39,10.0,4.05
racquet_rating,53,13.59,1.56
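
The same comparison can also be sketched with a `groupby` on the junior flag, which gives the NA percentage for junior and non-junior racquets side by side. This runs on a toy frame with made-up junior names, and it uses `case=False` so the match is case-insensitive.

```python
import numpy as np
import pandas as pd

# Toy frame: two regular racquets and two made-up junior racquets
toy = pd.DataFrame({
    "racquet_name": ["Babolat Pure Drive 2025", "Toy Junior 25",
                     "Toy junior 23", "Lacoste L23"],
    "Swingweight": [317.0, np.nan, np.nan, 318.0],
    "Head Size": ["100 in² / 645.16 cm²", "100 in² / 645.16 cm²",
                  None, "100 in² / 645.16 cm²"],
})

# Case-insensitive flag, so "Junior" and "junior" are both caught
toy["Junior"] = toy["racquet_name"].str.contains("junior", case=False)

# Mean of the boolean NA mask per group = fraction of NA values
spec_cols = ["Swingweight", "Head Size"]
perc_nas = (toy[spec_cols].isna().groupby(toy["Junior"]).mean() * 100).round(2)
print(perc_nas)
```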


## Takeaways and next steps

1. Get rid of all junior racquets from df

2. Drop all duplicated columns

3. Add a racquet_brand column by extracting the brand name from racquet_name

4. Convert the following columns from object to float or int using regex or str logic:
    - Head Size
    - Length
    - Strung Weight
    - Balance
        - Create two columns: racquet_balance_in and racquet_balance_HH_HL
    - Stiffness
    - Beam Width
    - String Pattern
        - Create two columns: racquet_mains and racquet_crosses
    - String Tension
        - Create two columns: racquet_tension_lower and racquet_tension_upper
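
The two-column splits in step 4 could be sketched with named regex groups, as below. This is only an illustration on toy rows copied from the data. In particular, storing `racquet_balance_HH_HL` as signed points (negative for head-light, positive for head-heavy) is my assumption about what that column should hold, and values like an even balance are not handled.

```python
import pandas as pd

# Toy rows in the same format as the scraped spec strings
specs = pd.DataFrame({
    "Balance": ["12.99in / 32.99cm / 4 pts HL", "13in / 33.02cm / 6 pts HL", None],
    "String Pattern": ["16 Mains / 19 CrossesMains skip",
                       "16 Mains / 20 Crosses\n\n\nMains skip", None],
    "String Tension": ["46-55 pounds", "50-60 pounds", None],
})

# Balance: inches at the start, then points and HL/HH at the end
balance = specs["Balance"].str.extract(
    r"^\s*(?P<inches>\d+(?:\.\d+)?)in.*?(?P<pts>\d+(?:\.\d+)?)\s*pts\s*(?P<side>HL|HH)"
)
specs["racquet_balance_in"] = pd.to_numeric(balance["inches"])
# Assumption: encode head-light as negative points, head-heavy as positive
sign = balance["side"].map({"HL": -1, "HH": 1})
specs["racquet_balance_HH_HL"] = pd.to_numeric(balance["pts"]) * sign

# String pattern: "<mains> Mains / <crosses> Crosses", ignoring trailing text
pattern = specs["String Pattern"].str.extract(
    r"(?P<mains>\d+)\s*Mains\s*/\s*(?P<crosses>\d+)\s*Crosses"
)
specs["racquet_mains"] = pd.to_numeric(pattern["mains"])
specs["racquet_crosses"] = pd.to_numeric(pattern["crosses"])

# String tension: "<lower>-<upper> pounds"
tension = specs["String Tension"].str.extract(r"(?P<lower>\d+)\s*-\s*(?P<upper>\d+)")
specs["racquet_tension_lower"] = pd.to_numeric(tension["lower"])
specs["racquet_tension_upper"] = pd.to_numeric(tension["upper"])

print(specs.drop(columns=["Balance", "String Pattern", "String Tension"]))
```

Rows with missing specs come out as NaN, so the new columns are float rather than int.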

Overall, the scraper did a decent job for the first one I've ever made! For some reason, the spec columns got duplicated. My best guess is that there were some discrepancies in how they were labelled in the HTML. There are junior racquets present in the data, which aren't relevant, and there are several columns that need to be converted to a different data type.

Since the preprocessing required to fix these errors is relatively minimal, I've decided to use the scraper as-is and quickly perform the basic preprocessing on the data afterwards.