# Tennis Warehouse Racket Information Scraper

This notebook is my initial code exploring how to use `requests` and Beautiful Soup to collect every individual racquet page link from a brand page and then scrape each racquet page for its relevant information.

## Table of Contents
1. [Dependencies](#dependencies)
2. [Parsing a brand page](#parsing-a-brand-page-to-get-product-links)
3. [Parsing individual product pages and compiling features](#parsing-individual-product-pages-and-compiling-features)
    - [Racquet description failed extraction methods](#racquet-description-failed-extraction-methods)
    - [Correct racquet description extraction method](#correct-racquet-description-extraction-method)
4. [Racquet specs table extraction](#racquet-specs-table-extraction)
5. [Creating a function to scrape each racquet page](#creating-a-function-to-scrape-each-racquet-page)

## Dependencies

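Install the following packages using pip:
- requests
- beautifulsoup4
- pandas
- numpy

The `re` module is also imported below, but it is part of the Python standard library and does not need to be installed.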

In [None]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

## Parsing a brand page to get product links

In this section, I work through parsing a brand page to get a list of all the product page links. I also retroactively highlight some irregularities that I discovered from downstream testing. 

In [None]:
# Start with a brand page's URL - in this case it's Wilson
URL = "https://www.tennis-warehouse.com/Wilsonracquets.html?_gl=1%2a1c95t5v%2a_up%2aMQ..%2a_gs%2aMQ..&gclid=Cj0KCQjwjdTCBhCLARIsAEu8bpKvCdsiLVcvVUTdp6neK9zOYKzsjzvtS8HnYRXhHqq1hCs1aApcdmcaAsTbEALw_wcB&gbraid=0AAAAADyKN7jLrP48ChXOutfRtNVCtwHO4"
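The query string on the URL above (`_gl`, `gclid`, `gbraid`) looks like ad-tracking residue from the click that opened the page. A minimal sketch of stripping it with the standard library follows, assuming the brand page serves the same content without those parameters (not verified here). The helper name `strip_query` and the shortened placeholder query values are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Placeholder query values, shortened for illustration only
tracked_url = "https://www.tennis-warehouse.com/Wilsonracquets.html?gclid=abc123&gbraid=xyz789"
clean_url = strip_query(tracked_url)
print(clean_url)
```

A shorter URL is easier to read in the notebook and avoids carrying a personal click identifier into shared code.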

In [None]:
# Get html
webpage = requests.get(URL)

In [None]:
# Display html
webpage.content

b'<!DOCTYPE html>\r<html lang="en">\r<head>\r\r<link rel="preconnect" href="https://img.tennis-warehouse.com" crossorigin>\r<link rel="dns-prefetch" href="https://img.tennis-warehouse.com" crossorigin>\r<link rel="preconnect" href="https://static.tennis-warehouse.com" crossorigin>\r<link rel="dns-prefetch" href="https://www.livechatinc.com" crossorigin>\r<link rel="preconnect" href="https://www.googletagmanager.com" crossorigin>\r<link rel="preconnect" href="https://cdn.cookielaw.org" crossorigin>\r<link rel="preload" href="https://js.stripe.com/v3/" as="script">\r<link rel="preload" href="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" as="script">\r<script>\r\rwindow.dataLayer = window.dataLayer || [];\rfunction gtag(){ dataLayer.push(arguments); }\rgtag(\'event\', \'initialPageview\');\rgtag(\'consent\', \'default\', {\r"ad_storage": "denied",\r"ad_user_data": "granted",\r"ad_personalization": "granted",\r"analytics_storage": "granted",\r"functionality_storage": "granted",\r

In [None]:
# Parse html with Beautiful Soup
soup = BeautifulSoup(webpage.content, "html.parser")

In [None]:
# Extract all links on the page - each link should correspond to an individual racket
# Extract the links by looking for <a> tags with the specific class "cattable-wrap-cell-info"
links = soup.find_all("a", attrs = {"class": "cattable-wrap-cell-info"})

In [None]:
# Display the first 5 "links" to check if it worked
links[0:5]

[<a class="cattable-wrap-cell-info" href="https://www.tennis-warehouse.com/Wilson_RF_01_Pro/descpageRCWILSON-WRFPR.html"> <div class="cattable-wrap-cell-info-name">Wilson RF 01 Pro</div> <div class="cattable-wrap-cell-info-price"> <span>$309.00</span> </div> </a>,
 <a class="cattable-wrap-cell-info" href="https://www.tennis-warehouse.com/Wilson_RF_01/descpageRCWILSON-WRF1R.html"> <div class="cattable-wrap-cell-info-name">Wilson RF 01</div> <div class="cattable-wrap-cell-info-price"> <span>$289.00</span> </div> </a>,
 <a class="cattable-wrap-cell-info ga_event" data-gtm_promo_creative="AdBlock" data-gtm_promo_id="RFAB" data-gtm_promo_name="RF Collection" data-gtm_promo_position="Primary 2" data-trackaction="cat-adBlock" data-trackcategory="Banner" data-tracklabel="RF Collection" href="/catpage-RFPP.html"> <div class="cattable-wrap-cell-info-title">RF Collection</div> <span class="cattable-wrap-cell-info-more">Learn More</span> </a>,
 <a class="cattable-wrap-cell-info" href="https://www.

In [None]:
# Extract the actual link content from the a tag
link = links[0].get("href")

In [None]:
# Display the link content
link

'https://www.tennis-warehouse.com/Wilson_RF_01_Pro/descpageRCWILSON-WRFPR.html'
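The first five results above include a promo block ("RF Collection") whose `href` is a relative category link rather than a racquet page. One possible way to keep only product pages is sketched below. It assumes that every product URL contains the substring `descpage`, which is inferred only from the handful of links shown in this notebook. The names `product_links` and `BASE_URL` are hypothetical, and the relative `descpage` href in the sample list is made up to show de-duplication.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tennis-warehouse.com/"


def product_links(hrefs, marker="descpage"):
    """Keep hrefs that look like product pages, as absolute URLs, without duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip missing hrefs and anything that is not a product page (e.g. promo blocks)
        if not href or marker not in href:
            continue
        absolute = urljoin(BASE_URL, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_hrefs = [
    "https://www.tennis-warehouse.com/Wilson_RF_01_Pro/descpageRCWILSON-WRFPR.html",
    "https://www.tennis-warehouse.com/Wilson_RF_01/descpageRCWILSON-WRF1R.html",
    "/catpage-RFPP.html",                         # promo block seen in the output above
    "/Wilson_RF_01/descpageRCWILSON-WRF1R.html",  # made-up relative duplicate
    None,
]
filtered = product_links(sample_hrefs)
print(filtered)
```

In this notebook the input could be built with `product_links(a.get("href") for a in links)`.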

In [None]:
# After downstream testing, I ran into a few "alternative" page layouts
# Create some other product links here to test how to handle those exceptions
diff_div_link = "https://www.tennis-warehouse.com/Wilson_Clash_108_v3/descpageRCWILSON-CL183V.html" # Uses a different div class
diff_tag_link = "https://www.tennis-warehouse.com/Wilson_Ultra_100UL_v4/descpageRCWILSON-WUULV4.html" # Uses a different tag 
babolat_link = "https://www.tennis-warehouse.com/Babolat_EVO_Drive_Lite_W/descpageRCBAB-BEVODW.html" # Different brand racquet

## Parsing individual product pages and compiling features

In this section, I work on the logic for parsing an individual product page for all of the racquet's features and explore the best way to store all of the information.

In [None]:
# Get html from racquet product page
racquet_page = requests.get(babolat_link)

In [None]:
# Parse racquet product page html 
soup = BeautifulSoup(racquet_page.content, "html.parser")

In [None]:
# Extract racquet image link
racquet_image = soup.find("img", attrs = {"class": "main_image is-zoomable"}).get("src"); racquet_image

'https://img.tennis-warehouse.com/watermark/rs.php?path=BEVODW-1.jpg&nw=455'

In [None]:
# Extract racquet name
racquet_name = soup.find("h1", attrs = {"class": "h2 desc_top-head-title"}).text; racquet_name

'Babolat EVO Drive Lite W'

In [None]:
# Extract racquet rating as a float - take note of the exception described below
racquet_rating_out_of_5 = float(soup.find("div", attrs = {"class": "review_agg"}).text); racquet_rating_out_of_5

5.0

Note that in `03-rk-refactored-scraper-test.ipynb` we needed to add an if statement to check whether the "review_agg" class exists, because some racquets have no ratings. In such cases, we entered the value as `np.nan`.
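The guard described above could look roughly like the sketch below. This is not the code from the refactored notebook, only an illustration of the idea on two made-up HTML fragments. The helper name `extract_rating` is hypothetical, and `math.nan` stands in for `np.nan` so the example needs only `bs4` (both are the same float value).

```python
import math

from bs4 import BeautifulSoup


def extract_rating(soup):
    """Return the rating as a float, or NaN when the page has no usable rating."""
    tag = soup.find("div", attrs={"class": "review_agg"})
    if tag is None:
        return math.nan
    try:
        return float(tag.text.strip())
    except ValueError:
        # The div exists but does not hold a number
        return math.nan


# Made-up fragments standing in for a rated and an unrated product page
rated_page = BeautifulSoup('<div class="review_agg">4.4</div>', "html.parser")
unrated_page = BeautifulSoup("<h1>Some racquet</h1>", "html.parser")

print(extract_rating(rated_page), extract_rating(unrated_page))
```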

In [None]:
# Extract racquet price as a float
racquet_price = float(soup.find("span", attrs = {"class": "afterpay-full_price"}).text); racquet_price

129.0

### Racquet description failed extraction methods

Neither of the following two cells works on its own, because some descriptions use a `<div>` instead of a `<p>`.

In [96]:
#racquet_desc = soup.find("p", attrs = {"style": "text-align: justify;"}).text; racquet_desc

In [62]:
#soup.find("div", attrs = {"style": "text-align: justify;"}).text

The following two fallbacks (a try/except and an if/else) don't work either, because they don't account for the variations in `<div>` tagging.

In [63]:
"""
try:
    racquet_desc = soup.find("p", attrs = {"style": "text-align: justify;"}).text
except AttributeError:
    racquet_desc = soup.find("div", attrs = {"style": "text-align: justify;"}).text
    
racquet_desc
"""

In [64]:
"""
if soup.find("p", attrs = {"style": "text-align: justify;"}):
    racquet_desc = soup.find("p", attrs = {"style": "text-align: justify;"}).text
else:
    racquet_desc = soup.find("div", attrs = {"style": "text-align: justify;"}).text
    
racquet_desc
"""

### Correct racquet description extraction method

The cell below is the final solution to the description problem. All descriptions (regardless of deeper nested tagging variations) have the "check_read-inner" div.
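To illustrate why selecting the wrapper is robust, here is a small offline sketch on a made-up fragment in which the description mixes a `<p>` and a `<div>`. The HTML is hypothetical and only loosely modelled on the Babolat page. It also uses `get_text(" ", strip=True)` rather than `.text`, an optional variation that puts a space between adjacent tags and trims the ends.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the wrapper class is the same regardless of the tags inside it
html = """
<div class="check_read-inner">
  <p><strong>Pre-strung for extra savings!</strong></p>
  <div style="text-align: justify;">Introducing the EVO Drive Lite W!</div>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")
inner = demo_soup.find("div", attrs={"class": "check_read-inner"})

# Join the text of all nested tags with single spaces and strip surrounding whitespace
description = inner.get_text(" ", strip=True) if inner else None
print(description)
```

Plain `.text` concatenates the nested strings with no separator, which is why sentences from adjacent tags can run together in the raw output.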

In [None]:
# Extract racquet description using the "check_read-inner" div class
racquet_desc = soup.find("div", attrs = {"class": "check_read-inner"}).text; racquet_desc

" Pre-strung for extra savings!Introducing the EVO Drive Lite W! This racquet features the same specs as the standard EVO Drive Lite but comes in a sleek white cosmetic. With this racquet, Babolat gives beginners, early intermediates and occasional players an easy swinging racquet with impressive all-around playability. In addition to being lighter and faster than the standard EVO Drive, this stick gives you the benefit of a powerful and comfortable 104 square inch head. It also has an open 16x17 string pattern, making it a great tool for learning how to hit with more spin. As with Babolat's premium performance frames, the EVO Drive Lite has SWX EVO Feel, a thin viscoelastic rubber added to multiple locations inside the frame in order to reduce harsh vibrations. The EVO Drive Lite W also benefits from Babolat's time-tested Woofer technology, a grommet system that optimizes frame-string interaction to create added dwell time (for control) along greater shock absorption (for comfort). As

## Racquet specs table extraction

In this section, I explore how to programmatically extract the spec label and spec value from the table of specs listed on each racquet's product page.

In [None]:
# Testing head size extraction from specs table
head_size_string = soup.find("tbody").find_all("tr")[0].text; head_size_string
racquet_head_size = float(head_size_string.split(":")[1].split("in")[0].strip()); racquet_head_size

In [None]:
# Testing head size value extraction
soup.find("tbody").find_all("tr")[0].find("strong").text.strip()
soup.find("tbody").find_all("tr")[0].text.split(":")[1]

' 100 in² / 645.16 cm²'
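The chained `split` calls above work for head size, but other specs use different units. A minimal alternative sketch is to pull the first number out of any spec value with a regular expression. The helper name `first_number` is hypothetical, and taking only the first number is an assumption that fits values such as `11.2oz / 318g` but returns just the lower bound of a range like `50-60 pounds`.

```python
import re

# Matches an integer or a decimal number such as 100, 27 or 11.2
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def first_number(value):
    """Return the first number found in a spec string as a float, or None if there is none."""
    match = _NUMBER.search(value)
    return float(match.group()) if match else None


for raw in [" 100 in² / 645.16 cm²", "11.2oz / 318g", "Graphite"]:
    print(repr(raw), first_number(raw))
```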

In [None]:
# Looking to see how to identify only relevant spec tags
soup.find("tbody").find_all("td")

[<td class="SpecsLt" style="width: 386px;"><strong>Head Size:</strong> 100 in² / 645.16 cm²</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Length:</strong> 27in / 68.58cm</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Strung Weight:</strong> 11.2oz / 318g</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Balance:</strong> 12.99in / 32.99cm / 4 pts HL</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Swingweight:</strong> 317</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Stiffness:</strong> 69</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Beam Width:</strong> 23mm / 26mm / 23mm</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Composition:</strong> Graphite</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Power Level:</strong> Low-Medium</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Stroke Style:</strong> Medium-Full</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Swing Speed:</strong> Medium-Fast

In [None]:
# Extract spec elements by filtering on elements with a class name containing "Specs"
spec_elements = soup.find("tbody").find_all("td", class_=re.compile("Specs")); spec_elements

[<td class="SpecsLt" style="width: 386px;"><strong>Head Size:</strong> 100 in² / 645.16 cm²</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Length:</strong> 27in / 68.58cm</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Strung Weight:</strong> 11.2oz / 318g</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Balance:</strong> 12.99in / 32.99cm / 4 pts HL</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Swingweight:</strong> 317</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Stiffness:</strong> 69</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Beam Width:</strong> 23mm / 26mm / 23mm</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Composition:</strong> Graphite</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Power Level:</strong> Low-Medium</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Stroke Style:</strong> Medium-Full</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Swing Speed:</strong> Medium-Fast

In [None]:
# Alternative way to filter for spec td tags
spec_elements_alt = soup.find("div", attrs = {"id": "product_specs"}).find("div", attrs = {"class": "check_read-inner"}).find("table").find_all("td", class_ = re.compile("Specs")); spec_elements_alt

[<td class="SpecsLt" style="width: 386px;"><strong>Head Size:</strong> 100 in² / 645.16 cm²</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Length:</strong> 27in / 68.58cm</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Strung Weight:</strong> 11.2oz / 318g</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Balance:</strong> 12.99in / 32.99cm / 4 pts HL</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Swingweight:</strong> 317</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Stiffness:</strong> 69</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Beam Width:</strong> 23mm / 26mm / 23mm</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Composition:</strong> Graphite</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Power Level:</strong> Low-Medium</td>,
 <td class="SpecsDk" style="width: 386px;"><strong>Stroke Style:</strong> Medium-Full</td>,
 <td class="SpecsLt" style="width: 386px;"><strong>Swing Speed:</strong> Medium-Fast

In [None]:
# Write an if statement to check whether the product HAS a spec table,
# then record each spec label and value in a dictionary if it does have a table
if soup.find("tbody"):
    spec_elements = soup.find("tbody").find_all("td", class_=re.compile("Specs"))
    specs_dict = {}
    for spec in spec_elements:
        if spec.find("strong"):
            label = spec.find("strong").text.split(":")[0].strip()
            value = spec.text.split(":")[1].strip()

        else:
            label = "Other"
            value = spec.text.strip()

        specs_dict[label] = value
else:
    # Keys carry no trailing colon so they match the labels parsed from real spec tables
    specs_dict = {"Head Size": np.nan,
                  "Length": np.nan,
                  "Strung Weight": np.nan,
                  "Balance": np.nan,
                  "Swingweight": np.nan,
                  "Stiffness": np.nan,
                  "Beam Width": np.nan,
                  "Composition": np.nan,
                  "Power Level": np.nan,
                  "Stroke Style": np.nan,
                  "Swing Speed": np.nan,
                  "Racquet Colors": np.nan,
                  "Grip Type": np.nan,
                  "String Pattern": np.nan,
                  "String Tension": np.nan}


In [None]:
# Check if Wilson dictionary returns nan (Wilson racquet doesn't have any specs listed) - PASSED!

# Don't run this anymore since the cells above and below have been changed to Babolat
# specs_dict

{'Head Size': nan,
 'Length': nan,
 'Strung Weight': nan,
 'Balance': nan,
 'Swingweight': nan,
 'Stiffness': nan,
 'Beam Width': nan,
 'Composition': nan,
 'Power Level': nan,
 'Stroke Style': nan,
 'Swing Speed': nan,
 'Racquet Colors': nan,
 'Grip Type': nan,
 'String Pattern': nan,
 'String Tension': nan}
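Writing the fallback dictionary out by hand makes it easy for its keys to drift away from the labels parsed out of real spec tables. One way to reduce that risk, sketched here under the assumption that the fifteen labels listed in this notebook are the full set of interest, is to keep the labels in a single list and build the defaults from it. The names `SPEC_LABELS`, `empty_specs`, and `merge_specs` are hypothetical, and `math.nan` stands in for `np.nan`.

```python
import math

# Labels copied from the spec table explored above, without trailing colons
SPEC_LABELS = [
    "Head Size", "Length", "Strung Weight", "Balance", "Swingweight",
    "Stiffness", "Beam Width", "Composition", "Power Level", "Stroke Style",
    "Swing Speed", "Racquet Colors", "Grip Type", "String Pattern", "String Tension",
]


def empty_specs():
    """Return a fresh dict mapping every expected label to NaN."""
    return dict.fromkeys(SPEC_LABELS, math.nan)


def merge_specs(parsed):
    """Overlay parsed specs on the defaults so missing labels still appear as NaN."""
    return {**empty_specs(), **parsed}


partial = merge_specs({"Length": "27in / 68.58cm"})
print(partial["Length"], partial["Balance"])
```

Using `merge_specs` for every page, with or without a table, would also give each row the same set of columns when the rows are later combined.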

In [None]:
# Create a dictionary of the features gathered in the previous section
racquet_top_dict = {
    "img": racquet_image,
    "name": racquet_name,
    "rating": racquet_rating_out_of_5,
    "price": racquet_price,
    "description": racquet_desc
}

In [None]:
# Create a combined dictionary that concatenates the feature dictionary with the spec dictionary

total_dict = {**racquet_top_dict, **specs_dict}

In [None]:
# View total dict for Babolat
total_dict

{'img': 'https://img.tennis-warehouse.com/watermark/rs.php?path=BPD25R-1.jpg&nw=455',
 'name': 'Babolat Pure Drive 2025',
 'rating': 4.8,
 'price': 289.0,
 'description': " The\xa0Pure Drive is popular for a reason. Boasting an appeal that cuts across ability levels, this modern player's racquet offers an unmistakably easy learning curve to virtually anyone looking to play aggressive tennis. Although it reserves its greatest charm for the baseliner who likes dictating action with heavy pace, it’s also dangerous in the front court, where the quick handling and powerful response practically beg you to finish volleys with a bang. Serving is another strength of the Pure Drive, where surgically placed slices and kickers are as easy to hit as flat bombs down the T. For 2025, Babolat, enhances comfort with an updated version of NF2 Tech, a dampening technology that uses flax fibers in the throat to smooth out harsh vibrations. This model also inherits FSI Power from the previous generation, c

## Creating a function to scrape each racquet page

In [None]:
# Create a function to scrape a racquet page
def scrape_racquet_page(link):
    #Initialize a dictionary to hold racquet information
    racquet_info = {}
    
    #Request the racquet page and parse with bs4
    racquet_page = requests.get(link)
    soup = BeautifulSoup(racquet_page.content, "html.parser")
    
    #Extract features from top of page
    racquet_info["racquet_img"] = soup.find("img", attrs = {"class": "main_image is-zoomable"}).get("src")
    racquet_info["racquet_name"] = soup.find("h1", attrs = {"class": "h2 desc_top-head-title"}).text
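    # NOTE: raises AttributeError on racquets with no rating (no "review_agg" div)
    racquet_info["racquet_rating"] = float(soup.find("div", attrs = {"class": "review_agg"}).text)
    racquet_info["racquet_price"] = float(soup.find("span", attrs = {"class": "afterpay-full_price"}).text)
    # NOTE: this <p>-based selector fails on pages whose description sits in a <div>;
    # the "check_read-inner" div class found earlier is the more robust choice
    racquet_info["racquet_desc"] = soup.find("p", attrs = {"style": "text-align: justify;"}).text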
    
    # Extract racquet specs
    
    # Check if spec table exists
    if soup.find("tbody"):
        # If it does exist, extract the spec tags from the body of the table and initialize a dictionary
        racquet_spec_elements = soup.find("tbody").find_all("td", class_=re.compile("Specs"))
        racquet_specs = {}
        
        #Iterate over each spec tag and extract the spec label and value
        for spec in racquet_spec_elements:
            # Check if the label is bolded 
            # (main specs that are consistent over majority of racquets)
            if spec.find("strong"):
                label = spec.find("strong").text.split(":")[0].strip()
                value = spec.text.split(":")[1].strip()

            # If there is an additional spec listing, categorize it as "other"
            else:
                label = "Other"
                value = spec.text.strip()

            # Assign the value to the label and add to dictionary
            racquet_specs[label] = value
    
    #If the spec table doesn't exist, return a spec dictionary of NAs
    else:
        racquet_specs = {"Head Size": np.nan,
                         "Length": np.nan,
                         "Strung Weight": np.nan,
                         "Balance": np.nan,
                         "Swingweight": np.nan,
                         "Stiffness": np.nan,
                         "Beam Width": np.nan,
                         "Composition": np.nan,
                         "Power Level": np.nan,
                         "Stroke Style": np.nan,
                         "Swing Speed": np.nan,
                         "Racquet Colors": np.nan,
                         "Grip Type": np.nan,
                         "String Pattern": np.nan,
                         "String Tension": np.nan}

    #Combine top info and specs dictionaries
    racquet_info.update(racquet_specs)
    
    return racquet_info

In [None]:
# Test aggregate function on Wilson RF 01 Pro (the first link extracted above)
wilson_rf01 = scrape_racquet_page(link)

In [None]:
# View resulting df - PASSED!
pd.DataFrame(wilson_rf01, index = [0])

| | racquet_img | racquet_name | racquet_rating | racquet_price | racquet_desc | Head Size | Length | Strung Weight | Balance | Swingweight | Stiffness | Beam Width | Composition | Power Level | Stroke Style | Swing Speed | Racquet Colors | Grip Type | String Pattern | String Tension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://img.tennis-warehouse.com/watermark/rs.... | Wilson RF 01 Pro | 4.4 | 309.0 | Initially designed for the twilight of his pla... | 98 in² / 632.26 cm² | 27in / 68.58cm | 11.9oz / 337g | 12.75in / 32.39cm / 6 pts HL | 331 | 67 | 23.2mm / 23mm / 22mm | Carbon +Carbon Braid | Low-Medium | Medium-Full | Medium-Fast | Black | Leather | 16 Mains / 19 CrossesMains skip | 50-60 pounds |
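A natural next step is to run the scraper over every product link and collect the rows. The sketch below is one possible shape for that loop, not the approach used in the later notebooks. It takes the scraping function as a parameter so it can be tried offline, pauses between requests, and records pages that fail because of the layout variations noted earlier instead of stopping. The names `scrape_all` and `fake_scrape` and the one-second default delay are assumptions.

```python
import time


def scrape_all(links, scrape_fn, delay_seconds=1.0):
    """Apply scrape_fn to each link; return (rows, failures)."""
    rows, failures = [], []
    for i, link in enumerate(links):
        # Pause between requests (not before the first one) to go easy on the server
        if i and delay_seconds:
            time.sleep(delay_seconds)
        try:
            rows.append(scrape_fn(link))
        except (AttributeError, ValueError, IndexError) as exc:
            # Typical symptoms of a missing tag or an unexpected page layout
            failures.append((link, repr(exc)))
    return rows, failures


def fake_scrape(link):
    """Stand-in for scrape_racquet_page so the loop can run without network access."""
    if "bad" in link:
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return {"racquet_name": link}


rows, failures = scrape_all(["page-a", "bad-page", "page-b"], fake_scrape, delay_seconds=0)
print(len(rows), "scraped;", len(failures), "failed")
```

With the real function, `pd.DataFrame(rows)` would turn the collected dictionaries into one table. Network errors raised by `requests` are deliberately not caught in this sketch.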
