
# BestBuy Canada – Product Reviews Scraper & Sentiment Analysis (Selenium + BeautifulSoup + VADER/HF)

This notebook demonstrates how to scrape publicly visible product reviews from **BestBuy Canada** (`bestbuy.ca`) for a single product that has **≥ 50 reviews**, apply multiple **filters/sort orders**, paginate via **"Show more"**, extract key review fields, and run **sentiment analysis**.

> **Important:** Use responsibly. This notebook collects only **publicly visible** information for a **single product**. Before running it, check BestBuy's terms of use and `robots.txt`, throttle requests with randomized delays, and avoid aggressive scraping.
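
One way to act on the `robots.txt` advice is to test a URL against the site's rules before launching the browser. The sketch below uses the standard-library `urllib.robotparser`. The `is_allowed` helper and the sample rules are assumptions made for illustration; they are not BestBuy's actual `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may be fetched according to the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules for illustration only (not the real bestbuy.ca file).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/en-ca/product/12345"))  # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/checkout/cart"))        # False
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the rules in one step.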



## What you'll get
- Robust Selenium-based scraper with **filters & pagination**.
- Fields extracted per review:
  - `review_id` (Primary Key)
  - `title`
  - `review_text`
  - `date` (YYYY-MM-DD)
  - `rating` (0–5)
  - `source` (domain)
  - `reviewer_name`
  - `sort_applied`
- CSV export for downstream analytics.
- Sentiment analysis with:
  - **NLTK VADER** (light-weight, offline)
  - Optional **Hugging Face** pipeline (requires internet to download model on first run).
- A small **insights summary** (top aspects, rating distribution, recommendations).



## 0) Environment Setup
> Run this once; comment out installs after initial run for faster iteration.


In [None]:

# If running locally for the first time, uncomment the lines below.
# %pip install --upgrade pip
# %pip install selenium webdriver-manager undetected-chromedriver beautifulsoup4 lxml pandas numpy python-dateutil nltk transformers torch

import os, time, random, re, sys, json
from dataclasses import dataclass, asdict
from typing import List, Dict, Optional
import pandas as pd
from datetime import datetime
from dateutil import parser as dateparser

# Parsing & scraping
from bs4 import BeautifulSoup

# Selenium stack
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException

# Sentiment
import nltk
nltk.download('vader_lexicon', quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer



## 1) Configuration

- Set the `PRODUCT_URL` to any BestBuy Canada product page **with at least 50 reviews**.
- Choose which sort orders (filters) to apply.
- Tweak timeouts, sleeps, and headless mode as needed.


In [None]:

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# REQUIRED: Paste a BestBuy Canada product URL with >= 50 reviews
# Example (placeholder) — replace with a real product reviews URL:
PRODUCT_URL = "https://www.bestbuy.ca/en-ca/product/replace-with-bestbuy-ca-product-url"

# Which sort orders (filters) to apply. These labels must match the site options exactly.
SORT_ORDERS = [
    "Most relevant",
    "Most helpful",
    "Newest",
    "Highest rating",
    "Lowest rating",
]

# Selenium settings
HEADLESS = True
PAGELOAD_TIMEOUT = 30
WAIT_TIMEOUT = 20

# Randomized polite delays (seconds)
MIN_WAIT, MAX_WAIT = 0.8, 1.8

# Output paths
OUTPUT_CSV = "bestbuy_reviews.csv"        # raw reviews
OUTPUT_CSV_DEDUP = "bestbuy_reviews_dedup.csv"
SENTIMENT_CSV = "bestbuy_reviews_sentiment.csv"
SOURCE_DOMAIN = "bestbuy.ca"
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
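
Because `PRODUCT_URL` ships as a placeholder, a quick sanity check before starting the browser may save a wasted run. This is a minimal sketch; `validate_product_url` is a hypothetical helper, and the `/product/` path test is an assumption based only on the placeholder URL above.

```python
from urllib.parse import urlparse

def validate_product_url(url: str, expected_domain: str = "bestbuy.ca") -> bool:
    """Return True if `url` is an HTTPS product URL on the expected domain."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_domain = host == expected_domain or host.endswith("." + expected_domain)
    # Assumption: product pages contain '/product/' in the path, as in the placeholder URL.
    return parts.scheme == "https" and on_domain and "/product/" in parts.path

# The URLs below are made up for the demo.
print(validate_product_url("https://www.bestbuy.ca/en-ca/product/some-item/12345"))  # True
print(validate_product_url("https://bestbuy.ca.example.com/en-ca/product/x"))         # False
```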



## 2) Browser Helper – launch, navigate, and utilities
- Uses **undetected_chromedriver** to reduce automation signals.
- Adds a **custom user-agent** and disables some automation flags.
- Provides helpers to: accept cookies, open the reviews tab, select sort order, and click "Show more".


In [None]:

def make_driver():
    ua = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
    options = uc.ChromeOptions()
    if HEADLESS:
        options.add_argument("--headless=new")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--no-sandbox")
    options.add_argument(f"--user-agent={ua}")
    options.add_argument("--lang=en-US")  # --lang takes a single locale, not an Accept-Language list
    driver = uc.Chrome(options=options)
    driver.set_page_load_timeout(PAGELOAD_TIMEOUT)
    return driver

def polite_sleep():
    time.sleep(random.uniform(MIN_WAIT, MAX_WAIT))

def safe_click(driver, element):
    """Scroll the element into view and click it; return False if the click fails."""
    try:
        driver.execute_script("arguments[0].scrollIntoView({block:'center'});", element)
        polite_sleep()
        element.click()
        polite_sleep()
        return True
    except (ElementClickInterceptedException, TimeoutException, NoSuchElementException):
        return False

def accept_cookies_if_present(driver):
    # BestBuy.ca sometimes pops a cookie consent. Try common selectors; ignore if absent.
    possible = [
        (By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"),
        (By.CSS_SELECTOR, "button[aria-label='Accept All']"),
        (By.XPATH, "//button[contains(., 'Accept') and contains(., 'cookies')]"),
    ]
    for by, sel in possible:
        try:
            btn = WebDriverWait(driver, 3).until(EC.element_to_be_clickable((by, sel)))
            if safe_click(driver, btn):
                return True
        except TimeoutException:
            pass
    return False

def open_reviews_tab(driver):
    # Click the "Reviews" tab/anchor if not already visible.
    try:
        reviews_tab = WebDriverWait(driver, WAIT_TIMEOUT).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Reviews')] | //a[contains(., 'Reviews')]"))
        )
        safe_click(driver, reviews_tab)
        # Wait for reviews container to load
        WebDriverWait(driver, WAIT_TIMEOUT).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "[data-automation='ugc-reviews']"))
        )
        return True
    except TimeoutException:
        return False

def set_sort_order(driver, sort_label: str):
    """Select a sort option like 'Most helpful', 'Newest', etc."""
    try:
        # Look for sort dropdown and open it
        dropdown = WebDriverWait(driver, WAIT_TIMEOUT).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-automation='ugc-reviews-sort'] select, select#sort"))
        )
        driver.execute_script("arguments[0].scrollIntoView({block:'center'});", dropdown)
        polite_sleep()
        dropdown.click()
        polite_sleep()

        # Select the option by visible text
        from selenium.webdriver.support.ui import Select
        Select(dropdown).select_by_visible_text(sort_label)
        polite_sleep()

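        # The reviews container already exists at this point, so this wait is only a
        # sanity check; the polite_sleep() calls give the re-sorted list time to render.
        WebDriverWait(driver, WAIT_TIMEOUT).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "[data-automation='ugc-reviews']"))
        )
        polite_sleep()
        return True
    except Exception: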
        # Fallback: try clicking a button/filter chips if present
        try:
            chip = driver.find_element(By.XPATH, f"//button[contains(., '{sort_label}')]")
            return safe_click(driver, chip)
        except Exception:
            return False

def click_show_more_until_end(driver, max_clicks: int = 999):
    """Click 'Show more' until it's gone/disabled or we hit max_clicks."""
    clicks = 0
    while clicks < max_clicks:
        polite_sleep()
        try:
            btn = WebDriverWait(driver, 3).until(
                EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Show more')]"))
            )
            if not safe_click(driver, btn):
                break
            clicks += 1
        except TimeoutException:
            break
    return clicks
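
`safe_click` gives up after a single failed attempt, which ends the "Show more" loop early if an overlay briefly intercepts the click. One possible mitigation is a small retry wrapper with exponential backoff and jitter. The sketch below is an assumption about how that could look; `retry_with_backoff` and its parameters are hypothetical names, and the demo uses a stand-in action instead of a real browser.

```python
import random
import time
from typing import Callable

def retry_with_backoff(
    action: Callable[[], bool],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `action` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        if action():
            return True
        if attempt < attempts - 1:
            # Exponential backoff plus random jitter so retries are not evenly spaced.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return False

# Demo with a stand-in action that fails twice, then succeeds.
calls = {"n": 0}

def flaky_click() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

recorded_delays = []  # capture delays instead of actually sleeping
result = retry_with_backoff(flaky_click, sleep=recorded_delays.append)
print(result, calls["n"], len(recorded_delays))  # True 3 2
```

In the notebook this could wrap the existing helper, for example `retry_with_backoff(lambda: safe_click(driver, btn))`.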



## 3) Parsing reviews from the page HTML
We use **BeautifulSoup** on `driver.page_source` to extract review elements and normalize fields.


In [None]:

@dataclass
class Review:
    review_id: str
    title: str
    review_text: str
    date: str  # YYYY-MM-DD
    rating: float
    source: str
    reviewer_name: str
    sort_applied: str

def normalize_date(text: str) -> str:
    try:
        dt = dateparser.parse(text, fuzzy=True)
        return dt.strftime("%Y-%m-%d")
    except Exception:
        return ""

def parse_reviews_from_html(html: str, sort_applied: str) -> List[Review]:
    soup = BeautifulSoup(html, "lxml")

    # Reviews container may vary. We look for common structures.
    review_cards = soup.select("[data-automation='ugc-review']") or soup.select("article.review, li.review")
    out = []
    for card in review_cards:
        # Primary key
        review_id = card.get("id") or card.get("data-review-id") or ""
        # Title
        title_el = card.select_one("[data-automation='ugc-review-title']") or card.select_one(".reviewTitle, h4")
        title = title_el.get_text(strip=True) if title_el else ""

        # Text
        text_el = card.select_one("[data-automation='ugc-review-body']") or card.select_one(".reviewText, .content, p")
        review_text = text_el.get_text(" ", strip=True) if text_el else ""

        # Date
        date_el = card.select_one("[data-automation='ugc-review-date'] time") or card.select_one("time")
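        # Prefer the machine-readable `datetime` attribute when the <time> tag carries one
        raw_date = (date_el.get("datetime") or date_el.get_text(strip=True)) if date_el else ""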
        date = normalize_date(raw_date)

        # Rating (look for aria-label like '4 out of 5' or data-rating attr)
        rating = None
        star = card.select_one("[aria-label*='out of 5']") or card.select_one("[data-rating]")
        if star and star.has_attr("aria-label"):
            m = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", star["aria-label"])
            if m:
                rating = float(m.group(1))
        if rating is None and star and star.has_attr("data-rating"):
            try:
                rating = float(star["data-rating"])
            except ValueError:
                rating = None
        if rating is None:
            # Try textual stars in the card
            txt = card.get_text(" ", strip=True)
            m = re.search(r"(\d(?:\.\d)?)\s*out of\s*5", txt)
            rating = float(m.group(1)) if m else None
        # Note: 0.0 doubles as the placeholder for "no rating found", so filter it out before averaging.
        rating = rating if rating is not None else 0.0

        # Reviewer name
        name_el = card.select_one("[data-automation='ugc-review-author']") or card.select_one(".author, .reviewer")
        reviewer_name = name_el.get_text(strip=True) if name_el else ""

        out.append(Review(
            review_id=review_id or "",
            title=title,
            review_text=review_text,
            date=date,
            rating=rating,
            source=SOURCE_DOMAIN,
            reviewer_name=reviewer_name,
            sort_applied=sort_applied
        ))
    return out
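
The rating logic in `parse_reviews_from_html` can be exercised in isolation, which makes it easier to test against sample labels without a browser. The sketch below is a hypothetical standalone variant, `parse_rating`, that returns `None` for a missing rating rather than `0.0` and rejects values outside 0 to 5. The label wording is an assumption; the real page may phrase it differently.

```python
import re
from typing import Optional

RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*out of\s*5", re.IGNORECASE)

def parse_rating(label: Optional[str]) -> Optional[float]:
    """Extract a rating from text like '4.5 out of 5 stars'; None if absent or out of range."""
    if not label:
        return None
    match = RATING_PATTERN.search(label)
    if not match:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 5.0 else None

print(parse_rating("Rated 4.5 out of 5 stars"))  # 4.5
print(parse_rating("No rating given"))           # None
```

Keeping "missing" distinct from "zero stars" avoids dragging down averages when a card has no parsable rating.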



## 4) Run the scraper
- Launch browser
- Accept cookies, open the Reviews tab
- For each sort order, apply filter, expand **Show more**, and parse reviews
- Save to CSV (and a **deduped** CSV by `review_id`)


In [None]:

def scrape_product_reviews(product_url: str, sort_orders: list) -> pd.DataFrame:
    driver = make_driver()
    all_reviews: List[Dict] = []
    try:
        driver.get(product_url)
        accept_cookies_if_present(driver)
        if not open_reviews_tab(driver):
            print("Could not find/open the Reviews tab. Ensure the URL is a product page with reviews.")
            return pd.DataFrame()

        for sort_label in sort_orders:
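            if not set_sort_order(driver, sort_label):
                # Skip rather than label reviews with a sort order that was never applied.
                print(f"[{sort_label}] Could not apply this sort order; skipping.")
                continue
            polite_sleep()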
            clicks = click_show_more_until_end(driver, max_clicks=999)
            html = driver.page_source
            records = parse_reviews_from_html(html, sort_applied=sort_label)
            all_reviews.extend([asdict(r) for r in records])
            print(f"[{sort_label}] Parsed {len(records)} reviews after {clicks} 'Show more' clicks.")

        df = pd.DataFrame(all_reviews)
        if not df.empty:
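            # Fill in a positional fallback where the page exposed no review ID.
            # fillna() expects a scalar, dict, or Series, so build a Series aligned to df.index.
            if "review_id" in df.columns:
                fallback_ids = pd.Series([f"rev_{i:06d}" for i in range(len(df))], index=df.index)
                df["review_id"] = df["review_id"].replace("", pd.NA).fillna(fallback_ids)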
            # Normalize date
            if "date" in df.columns:
                df["date"] = df["date"].fillna("")
        return df
    finally:
        driver.quit()

# Run (uncomment to execute locally)
# df_reviews = scrape_product_reviews(PRODUCT_URL, SORT_ORDERS)
# df_reviews.to_csv(OUTPUT_CSV, index=False)
# df_reviews.drop_duplicates(subset=["review_id"]).to_csv(OUTPUT_CSV_DEDUP, index=False)
# df_reviews.head()
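
Deduplicating by `review_id` only works when the page exposes an ID. Positional fallbacks such as `rev_000012` differ for the same review scraped under two sort orders, so those rows would survive `drop_duplicates`. A content-derived key is one way around this. The sketch below uses plain dictionaries and the standard library; `fallback_review_id` and `dedupe_reviews` are hypothetical helpers, and the sample rows are invented.

```python
import hashlib
from typing import Dict, List

def fallback_review_id(review: Dict[str, str]) -> str:
    """Derive a stable ID from the review's content for rows without a page-supplied ID."""
    fields = ("reviewer_name", "date", "title", "review_text")
    key = "\x1f".join(str(review.get(f, "")).strip().lower() for f in fields)
    return "rev_" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def dedupe_reviews(reviews: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each review, keyed by review_id or the content hash."""
    seen = set()
    unique = []
    for review in reviews:
        key = review.get("review_id") or fallback_review_id(review)
        if key not in seen:
            seen.add(key)
            unique.append({**review, "review_id": key})
    return unique

# Hypothetical rows: the same review captured under two sort orders, with no ID on the page.
rows = [
    {"review_id": "", "reviewer_name": "Sam", "date": "2024-01-05",
     "title": "Great", "review_text": "Works well.", "sort_applied": "Newest"},
    {"review_id": "", "reviewer_name": "Sam", "date": "2024-01-05",
     "title": "Great", "review_text": "Works well.", "sort_applied": "Most helpful"},
]
print(len(dedupe_reviews(rows)))  # 1
```

The trade-off is that two genuinely different reviews with identical author, date, title, and text would collapse into one, which seems acceptable for this use.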



## 5) Sentiment Analysis
Two options:
1. **VADER** (rule-based, fast, no large downloads). Good for short reviews.
2. **Hugging Face** (e.g., `distilbert-base-uncased-finetuned-sst-2-english`) — better nuance, requires internet on first load.

We'll implement both with a fallback to VADER.


In [None]:

def add_vader_sentiment(df: pd.DataFrame) -> pd.DataFrame:
    sid = SentimentIntensityAnalyzer()
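    scores = df["review_text"].fillna("").map(sid.polarity_scores).tolist()
    # Reuse df's index so the score columns align even after rows were dropped (e.g. deduplication).
    s = pd.DataFrame(scores, index=df.index)
    df = df.copy()
    df[["neg","neu","pos","compound"]] = s[["neg","neu","pos","compound"]]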
    # Label
    def label(c):
        if c >= 0.3:
            return "Positive"
        elif c <= -0.3:
            return "Negative"
        else:
            return "Neutral"
    df["sentiment_label"] = df["compound"].map(label)
    return df

def try_hf_sentiment(df: pd.DataFrame) -> Optional[pd.DataFrame]:
    try:
        from transformers import pipeline
        clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
        preds = clf(df["review_text"].fillna("").tolist(), truncation=True)
        # Map to simple columns
        labels = [p["label"] for p in preds]
        scores = [p["score"] for p in preds]
        out = df.copy()
        out["hf_label"] = labels
        out["hf_score"] = scores
        return out
    except Exception as e:
        print("HF pipeline unavailable; falling back to VADER only.", e)
        return None

ASPECT_KEYWORDS = {
    "Design": ["design", "look", "aesthetic", "style", "build", "finish", "colour", "color"],
    "Quality": ["quality", "durable", "sturdy", "fragile", "cheap"],
    "Battery": ["battery", "charge", "power", "life"],
    "PriceValue": ["price", "cost", "value", "expensive", "cheap"],
    "Delivery": ["delivery", "shipping", "arrived", "packaging"],
    "Setup": ["setup", "install", "installation", "assemble", "assembly", "configuration"],
    "Performance": ["performance", "speed", "lag", "smooth", "fast", "slow"],
    "CustomerService": ["support", "service", "warranty", "return", "refund"],
}

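def tag_aspects(text: str) -> list:
    # Match keywords at the start of a word so 'lag' does not fire on 'flagship',
    # while inflections such as 'charging' or 'looks' still count.
    t = (text or "").lower()
    found = []
    for aspect, kws in ASPECT_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(kw), t) for kw in kws):
            found.append(aspect)
    return found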

def make_human_readable_category(sentiment_label: str, aspects: list) -> list:
    # Build a one-item list such as ['Favourable Design & Quality (Pos)'].
    label_short = "Pos" if sentiment_label == "Positive" else "Neg" if sentiment_label == "Negative" else "Neu"
    if not aspects:
        return [f"{sentiment_label} Overall ({label_short})"]
    # Keep up to 2 aspects in the display label to stay concise
    disp = " & ".join(aspects[:2])
    nice = {
        "Positive": f"Favourable {disp} ({label_short})",
        "Negative": f"Issues: {disp} ({label_short})",
        "Neutral":  f"Mixed/Neutral {disp} ({label_short})"
    }
    return [nice.get(sentiment_label, f"{sentiment_label} ({label_short})")]
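
The labelling rule in `add_vader_sentiment` uses cutoffs of ±0.3 on the compound score, which is stricter than the ±0.05 that VADER's own documentation suggests as typical. The sketch below isolates that rule so the threshold can be tuned, and adds a check that flags reviews whose text sentiment contradicts the star rating. Both helper names and the 4-star/2-star boundaries are assumptions for illustration.

```python
from typing import Optional

def label_from_compound(compound: float, threshold: float = 0.3) -> str:
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

def disagrees_with_rating(label: str, rating: Optional[float]) -> bool:
    """Flag reviews whose text sentiment contradicts the star rating."""
    # None or 0.0 means no rating was found (0.0 is the notebook's placeholder).
    if not rating:
        return False
    # Assumption: 4-5 stars imply a positive review and 1-2 stars a negative one.
    if rating >= 4 and label == "Negative":
        return True
    if rating <= 2 and label == "Positive":
        return True
    return False

print(label_from_compound(0.62))               # Positive
print(label_from_compound(0.10))               # Neutral
print(disagrees_with_rating("Positive", 1.0))  # True
```

Flagged rows are often sarcasm or mixed reviews, so they make a useful manual spot-check of either model.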



## Business Context – How to read the results
After running on a real product:
- **Top drivers of satisfaction**: look at Positive reviews' aspect tags (e.g., *Design*, *Performance*).
- **Top drivers of dissatisfaction**: look at Negative reviews' aspect tags (e.g., *Battery*, *PriceValue*, *Delivery*).
- **Recommendations**: target the highest-frequency negative aspects; amplify marketing around top positive aspects.
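
The reading guide above amounts to counting aspect tags within each sentiment group. A minimal sketch of that tally follows, using plain dictionaries in place of the DataFrame. The helper name `aspect_counts_by_sentiment` and the sample rows are hypothetical; real input would come from the sentiment labels and `tag_aspects` output.

```python
from collections import Counter
from typing import Dict, List

def aspect_counts_by_sentiment(rows: List[dict]) -> Dict[str, Counter]:
    """Count how often each aspect appears under each sentiment label."""
    counts: Dict[str, Counter] = {}
    for row in rows:
        counts.setdefault(row["sentiment_label"], Counter()).update(row["aspects"])
    return counts

# Hypothetical tagged reviews for the demo.
sample = [
    {"sentiment_label": "Positive", "aspects": ["Design", "Performance"]},
    {"sentiment_label": "Positive", "aspects": ["Design"]},
    {"sentiment_label": "Negative", "aspects": ["Battery", "Delivery"]},
    {"sentiment_label": "Negative", "aspects": ["Battery"]},
]
summary = aspect_counts_by_sentiment(sample)
print(summary["Positive"].most_common(1))  # [('Design', 2)]
print(summary["Negative"].most_common(1))  # [('Battery', 2)]
```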
