==================================================================================================================================
# <div align="center">PROJECT 03: Etsy Print-On-Demand Trends</div>
==================================================================================================================================

### 📝 BUSINESS IDEA

**Print-On-Demand (POD) Business** – What the project is about

### ⚠️ PROBLEM

No free API provides the market data needed, so web scraping is required to gather insights – The challenge we’re addressing

### 🔰 SOLUTION FRAMEWORK

Web scrape Etsy for a specific POD product

Collect the data necessary to clean & analyze


| **Development**                                                                                                                                             | **Presentation**                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **Business Idea** → **Problem Definition** → **Data Research & Visualization** → **Insights** → **Interpretation** → **Implications** → **Business Impact** | **Limitations & Considerations** |


---

### 📓 SECTION OVERVIEW

- **Project / Business Idea:** What the project is about

- **Problem:** The challenge we’re addressing

- **Solution / Approach:** How we solve it

- **Research & Plots:** How we analyzed data visually

- **Insights:** What we discovered

- **Interpretation:** Why it matters

- **Implications:** What actions the business can take

- **Business Impact:** Expected results for the business

- **Limitations:** What constraints or gaps exist

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### 🌐 **Which Are the Best-Selling POD Products on Etsy?**

I’m researching print-on-demand products to sell on Etsy that only require **digital artwork and marketing**, while the POD provider handles **printing, packaging, and shipping**.


### ⭐ **Using Google Trends for POD Product Research**
💡 **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020–2025).

Below is the list of product categories I’m comparing:

### 🎯 **Chosen POD product to research:** `tote bags`

| Category              | Subcategories / Examples                                      |
|-----------------------|---------------------------------------------------------------|
| **Custom Apparel**        | T-shirts, Hoodies, Sweatshirts, Tank tops                     |
| **Mug**                   | Ceramic mugs, Color-changing mugs, Espresso mugs, Travel mugs |
| **Tote Bag**              | Cotton totes, All-over print totes                            |
| **Phone Case**            | iPhone / Samsung cases, Tough / Slim cases                    |
| **Stickers**              | Die-cut stickers, Kiss-cut stickers, Sticker sheets           |
| **Hats**                  | Baseball caps, Trucker hats, Beanies                          |
| **Pillows / Cushions**    | Pillow covers, Stuffed pillows, All-over print pillow designs|
| **Blanket**               | Fleece blankets, Sherpa blankets, Woven blankets             |
| **Wall Art**              | Posters, Canvas prints, Framed posters, Metal prints         |
| **Doormat**               | Printed coir doormats, Rubber-backed doormats                |
| **Drinkware**             | Stainless steel tumblers, Water bottles, Wine tumblers       |
| **Calendar**              | Custom printed wall calendars                                 |
| **Yoga Mat**              | Printed yoga mats                                             |
| **Bedding**               | Duvet covers, Pillowcases, All-over print bed sets           |
| **Pet Accessories**       | Pet bandanas, Pet beds, Pet bowls, Pet blankets              |
| **Ornaments**             | Ceramic ornaments, Wood ornaments, Metal ornaments           |


### **BEFORE GETTING STARTED:**

`Etsy` is a dynamic website, so scraping it requires careful handling.

Because `Etsy` loads some content with `JavaScript`, `requests` + `BeautifulSoup` may work for static parts (such as search results), but `Selenium` is more reliable for dynamic content.

I will be using `requests` + `BeautifulSoup` for `product listings` **(title, price, link)**.

**Important note:** Etsy combines dynamic loading with anti-bot protections.

Standard HTML scraping works only as long as Etsy doesn’t block the request.

If the request is blocked, the options are custom request headers, rotating proxies, or the Etsy API.
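
The fallback described above can be sketched roughly as follows. This is only an illustration: it uses the standard library's `urllib` in place of `requests` so that it runs without extra installs, and the header values, the status codes treated as a block (403 and 429), the `captcha` marker, and the helper names are all assumptions rather than documented Etsy behavior.

```python
from urllib.request import Request

# Assumed browser-like headers; the exact values are illustrative only.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

# Assumption: these status codes usually mean the request was refused or rate limited.
BLOCK_STATUS_CODES = {403, 429}


def build_request(url, headers=None):
    """Return a urllib Request carrying browser-like headers (nothing is sent yet)."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return Request(url, headers=merged)


def looks_blocked(status_code, html):
    """Heuristic: True when the response looks like an anti-bot page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Assumption: challenge pages mention "captcha" somewhere in the markup.
    return "captcha" in (html or "").lower()


def choose_scraper(status_code, html):
    """Stay on plain HTTP while it works, otherwise fall back to a real browser."""
    return "selenium" if looks_blocked(status_code, html) else "requests"


req = build_request("https://www.etsy.com/search?q=tote+bag")
print(req.get_header("User-agent"))  # urllib stores header names capitalized
print(choose_scraper(200, "<html>ok</html>"))  # requests
print(choose_scraper(403, ""))                 # selenium
```

Keeping the block check in one small function makes it easy to tighten the heuristic later without touching the scraping code itself.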

==================================================================================================================================
# <div align="center">WEB SCRAPING</div>
==================================================================================================================================

### üßê QUESTIONS

- Which keywords in product titles and descriptions drive the most sales?

- Which product niches have the highest demand?

- What keywords improve search visibility on Etsy?

- When is the best period to sell based on review trends?

- Which price ranges generate the most sales?

- Which country's customers are buying the most of this product?

----

### üß∞ **Install for web scraping**

In [None]:
# install requests & beautifulsoup
!pip install requests beautifulsoup4 fake-useragent pandas

# install selenium and undetected-chromedriver (imported by the scraper below)
!pip install selenium undetected-chromedriver pandas

---

### 📌 **Avoid Getting Blocked**
| Version                                   | Best For          | Pros                                             | Cons                          |
| ----------------------------------------- | ----------------- | ------------------------------------------------ | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                      | Etsy may block the request    |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Loads dynamic content, less likely to be blocked | Slower, requires ChromeDriver |
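
Both versions in the table rely on pagination, so a minimal sketch of that part may help. It is independent of the fetching tool: `fetch_links` is a hypothetical callable that returns the hrefs found on one results page, and the fake pages below stand in for real search results. The loop stops when a page contributes nothing new, which guards against walking repeated pages forever.

```python
import re

LISTING_ID = re.compile(r"/listing/(\d+)")


def collect_listing_ids(fetch_links, limit, max_pages=20):
    """Walk result pages until `limit` unique listing IDs are collected.

    `fetch_links(page)` is a hypothetical callable returning the hrefs found
    on one result page (via requests or Selenium, whichever is in use).
    """
    seen = []
    for page in range(1, max_pages + 1):
        new_on_page = 0
        for href in fetch_links(page):
            match = LISTING_ID.search(href or "")
            if not match:
                continue
            listing_id = match.group(1)
            if listing_id not in seen:
                seen.append(listing_id)
                new_on_page += 1
                if len(seen) >= limit:
                    return seen
        # A page with nothing new means the results are exhausted (or repeating).
        if new_on_page == 0:
            break
    return seen


# Fake pages standing in for real search results.
FAKE_PAGES = {
    1: ["https://www.etsy.com/fr/listing/111/tote-a?ref=x", "https://www.etsy.com/listing/222/tote-b"],
    2: ["https://www.etsy.com/listing/222/tote-b", "https://www.etsy.com/listing/333/tote-c"],
    3: ["https://www.etsy.com/listing/333/tote-c"],
}

ids = collect_listing_ids(lambda page: FAKE_PAGES.get(page, []), limit=10)
print(ids)  # ['111', '222', '333']
```

Deduplicating on the numeric listing ID rather than the full URL means the same product reached through a locale prefix such as `/fr/` is counted once.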


---

## 📌 **Product PAGE**

### ⭐ **Etsy Product Info**
The main data fields to extract from Etsy's product page:

| Field Name             | Python Data Type       | Concise Definition                      | Long Definition                                                                                       |
|------------------------|-----------------------|----------------------------------------|-------------------------------------------------------------------------------------------------------|
| **product_title**          | `str`                 | Product title                           | The full name of the product, same across all variants.                                              |
| **product_url**           | `str`                 | Short URL to product listing        | Etsy listing URL in the format `https://www.etsy.com/listing/product_id/` (e.g., "https://www.etsy.com/listing/1289965137/"). |
| **product_id**             | `str`                 | Unique product ID                       | Extracted from `product_url`; a unique identifier assigned by Etsy for each listing.               |
| **product_txt**             | `str`                 | Product URL slug                    | Text slug that follows `product_id` in the full listing URL (e.g., `boho-embroidered-floral-tote-bag-in-sage`); it only appears once the listing page has been opened via `product_url`.               |
| **var_extension**          | `str`                 | Variation(s) URL extension             | The variation query string (e.g., `?variation0=5886526755`) appended to `product_url` to generate `var_url`.                        |
| **var_url**               | `str`                 | Full URL to a variant product                      | Complete link to a specific variant, formed by appending `var_extension` to `product_url`.          |
| **product_options**        | `List[dict]`          | Available product options (e.g., `variation0`)               | List of all product options (size, color, material, etc.), each stored as a dictionary.             |
| **product_var**            | `dict`                | Variant’s selected options              | Dictionary representing the specific option(s) chosen for this variant.                              |
| **var_current_price**      | `float` or `Decimal`  | Current price for variant               | Price of this variant after applying any discounts.                                                  |
| **var_old_price**          | `float` or `Decimal`  | Original price for variant              | Price of this variant before any discounts were applied (if available).                              |
| **var_discount_percentage**| `float`               | Variant discount percentage             | Discount applied to this variant, calculated if both current and old prices are available.           |
| **product_rating**         | `float`               | Average product rating                  | Average rating of the product out of 5 (e.g., 4.5).                                                 |
| **txt_reviews**            | `str`                 | Concatenated review text                | All review texts or summary text for the product; may include the number of reviews in parentheses. |
| **nbr_reviews**            | `int`                 | Total number of reviews                 | Total count of reviews received by the product.                                                     |
| **listed_date**            | `date`                | Date product was first listed           | The date the product was originally published on Etsy.                                               |
| **product_description**            | `str`                | Product's description           | The text content of the description of the product.                                               |
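
As a worked example of `var_discount_percentage`, the sketch below computes the discount from `var_current_price` and `var_old_price` with `Decimal` and returns `None` when either price is missing. The function name and the two-decimal rounding are assumptions made for illustration; the prices are taken from the sample output further down.

```python
from decimal import Decimal, ROUND_HALF_UP


def discount_percentage(current_price, old_price):
    """Return the discount as a percentage rounded to 2 places, or None."""
    if current_price is None or old_price is None:
        return None
    current, old = Decimal(str(current_price)), Decimal(str(old_price))
    if old <= 0 or current > old:
        return None  # no meaningful discount
    pct = (old - current) / old * 100
    return pct.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(discount_percentage("17.57", "23.44"))  # 25.04
print(discount_percentage("17.57", None))     # None
```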



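URL structure notes:

- `var_extension` with one variation: `?variation0=5886526755`

- `var_extension` with two variations: `?variation0=5886526755&variation1=5886526755`

- Example full listing URL: https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage

- General pattern: `https://www.etsy.com/listing/product_id/product_txt/product_var`

- `product_var` is the query string: `?variation0=<id>&variation1=<id>&variation2=<id>`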
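
The URL pattern noted above can be sketched with `urllib.parse`. The helper names `split_listing_url` and `build_var_url` are hypothetical, and the listing ID and option ID are combined here purely for illustration.

```python
from urllib.parse import urlencode, urlparse


def split_listing_url(url):
    """Return (product_id, product_txt) from an Etsy listing URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Anything before "listing" (for example a locale prefix such as "fr") is ignored.
    if "listing" not in parts:
        return "", ""
    tail = parts[parts.index("listing") + 1:]
    product_id = tail[0] if tail else ""
    product_txt = tail[1] if len(tail) > 1 else ""
    return product_id, product_txt


def build_var_url(product_id, option_ids):
    """Append ?variation0=...&variation1=... to the short product_url."""
    product_url = f"https://www.etsy.com/listing/{product_id}/"
    if not option_ids:
        return product_url
    query = urlencode({f"variation{i}": oid for i, oid in enumerate(option_ids)})
    return f"{product_url}?{query}"


url = "https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage"
product_id, product_txt = split_listing_url(url)
print(product_id, product_txt)
print(build_var_url(product_id, ["5886526755"]))
# https://www.etsy.com/listing/1716154949/?variation0=5886526755
```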

---

### ⭐ **Derived Data: `product_niche` from `product_title` and `product_description`**

| Field Name                 | Python Data Type       | Concise Definition                               |
|---------------------------|-------------------------|---------------------------------------------------|
| **product_niche**             | `str`                     | Product theme or genre (comedy, anime…) based on `product_title` & `product_description`.         |
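
One simple way to derive `product_niche` is keyword matching over the title and description. The sketch below is an assumption about how this could be done, not the method used later in the project: the keyword lists are made up, the text is assumed to be in English, and `infer_product_niche` is a hypothetical name.

```python
import re

# Hypothetical keyword lists; a real mapping would come from the scraped data.
NICHE_KEYWORDS = {
    "floral": ["floral", "flower", "botanical"],
    "anime": ["anime", "manga"],
    "comedy": ["funny", "joke", "sarcastic"],
}


def infer_product_niche(product_title, product_description=""):
    """Return the niche whose keywords occur most often, or 'other'."""
    text = f"{product_title} {product_description}".lower()
    words = re.findall(r"[a-z]+", text)
    scores = {
        niche: sum(words.count(kw) for kw in keywords)
        for niche, keywords in NICHE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


print(infer_product_niche("Boho Embroidered Floral Tote Bag in Sage"))  # floral
print(infer_product_niche("Plain canvas tote"))                          # other
```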

---

### ⭐ **Etsy Product Reviews (Extra dataset)**

Every review of the product, with its `Comment`, `Rating`, and the `Date` it was posted

| Field Name               | Python Data Type | Concise Definition                                      |
|--------------------------|------------------|----------------------------------------------------------|
| **review_product_var**   | `str`            | The specific product variant purchased by the reviewer   |
| **review_rating**        | `float`          | The rating the customer gave the product                 |
| **review_comment**       | `str`            | The text comment the customer wrote                      |
| **review_date**          | `date`           | The date when the customer posted the review             |
| **review_profile_url**   | `str`            | URL to the reviewer's profile page                       |
| **review_username**      | `str`            | The username of the reviewer, extracted from `review_profile_url` |
| **review_country**       | `str`            | The reviewer's country/location                          |
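
Two of these fields lend themselves to a short sketch: deriving `review_username` from `review_profile_url`, and bucketing `review_date` by month to look at seasonality. It assumes the username is the last path segment of the profile URL; the sample URL and dates are made up, and both helper names are hypothetical.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlparse


def username_from_profile_url(review_profile_url):
    """Return the last path segment of the profile URL as the username."""
    segments = [s for s in urlparse(review_profile_url).path.split("/") if s]
    return segments[-1] if segments else ""


def reviews_per_month(review_dates):
    """Count reviews per calendar month (1-12), ignoring the year."""
    return Counter(d.month for d in review_dates)


# Made-up sample values, not scraped data.
print(username_from_profile_url("https://www.etsy.com/people/example_user?ref=l_review"))
counts = reviews_per_month([date(2024, 11, 3), date(2024, 12, 1), date(2023, 12, 20)])
print(counts.most_common(1))  # [(12, 2)]
```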



---

# 📌 **CODE**

### V4 (var_prices)

In [6]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote, quote_plus
from itertools import product

# --------------------------
# Chrome options
# --------------------------
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = uc.Chrome(options=options)

# --------------------------
# Helper functions
# --------------------------
def safe_decimal(s):
    if s is None:
        return None
    try:
        s = str(s).strip().replace("\u00A0", " ")
        s_norm = re.sub(r"[^\d,.\-]", "", s)
        if "," in s_norm and "." in s_norm:
            s_norm = s_norm.replace(",", "")
        elif "," in s_norm and "." not in s_norm:
            s_norm = s_norm.replace(",", ".")
        if s_norm in ("", ".", "-", "-.", None):
            return None
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

def extract_nbr_reviews(txt):
    """Parse the review count from header text such as '... (59)' or '... (1.2k)'."""
    if not txt:
        return 0
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if not match:
        return 0
    value = re.sub(r"\s", "", match.group(1)).upper()
    try:
        if "K" in value:
            # Abbreviated count, e.g. "1.2K" or "1,2K": the separator is a decimal mark.
            return int(round(float(value.replace("K", "").replace(",", ".")) * 1000))
        # Plain count: any comma or dot is a thousands separator, e.g. "1,234".
        return int(re.sub(r"[.,]", "", value))
    except ValueError:
        return 0

CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "€": "EUR",
    "£": "GBP",
    "¥": "JPY",
    "₹": "INR",
}

def detect_currency(ld_data, soup):
    try:
        if isinstance(ld_data, dict):
            offers = ld_data.get("offers", {})
            if isinstance(offers, dict):
                iso = offers.get("priceCurrency") or offers.get("currency")
                if iso and isinstance(iso, str) and len(iso) >= 2:
                    return iso.upper()
    except:
        pass

    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break
    if not price_text:
        candidate = soup.find(string=re.compile(r'[$€£¥₹]'))
        if candidate:
            price_text = candidate.strip()
    if price_text:
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                return CURRENCY_MAP[sym]
        m = re.search(r'([$€£¥₹])', price_text)
        if m:
            return CURRENCY_MAP.get(m.group(1), None)
    return None

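def parse_price_text(text):
    """Return the first plausible price found in `text` as a Decimal, or None."""
    if not text:
        return None
    text = text.replace("\u00A0", " ").replace("+", " ")
    # Number tokens may mix separators, e.g. "17,57", "1,234.56" or "1.234,56".
    for token in re.findall(r"\d[\d.,]*", text):
        token = token.rstrip(".,")
        if "," in token and "." in token:
            # Whichever separator comes last is the decimal mark.
            if token.rfind(",") > token.rfind("."):
                token = token.replace(".", "").replace(",", ".")
            else:
                token = token.replace(",", "")
        else:
            token = token.replace(",", ".")
        val = safe_decimal(token)
        if val is not None and 0 < val < 100000:
            return val
    return None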

def extract_product_txt_from_url(product_url):
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if path_parts and path_parts[0] in ["fr", "de", "es", "it", "nl"]:
        path_parts = path_parts[1:]

    product_id = ""
    product_txt = ""

    if len(path_parts) >= 2 and path_parts[0] == "listing":
        product_id = path_parts[1]
        if len(path_parts) >= 3:
            product_txt = unquote(path_parts[2])

    return product_id, product_txt

# --------------------------
# NEW FUNCTION: extract price from variant URL
# --------------------------
def extract_variant_price(variant_url):
    driver.get(variant_url)
    time.sleep(random.uniform(1.5, 2.5))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    current_price = None
    old_price = None

    price_container = soup.find("div", {"data-buy-box-region": "price"})
    if price_container:
        now_p = price_container.find("p")
        if now_p:
            current_price = parse_price_text(now_p.get_text(" ", strip=True))

        old_span = price_container.find("span", class_=re.compile("wt-text-strikethrough"))
        if old_span:
            old_price = parse_price_text(old_span.get_text(" ", strip=True))

    if current_price and old_price:
        try:
            discount = float(((old_price - current_price) / old_price) * 100)
        except:
            discount = None
    else:
        discount = None

    return (
        float(current_price) if current_price else None,
        float(old_price) if old_price else None,
        discount
    )

# --------------------------
# Etsy product extractor
# --------------------------
def extract_etsy_product(product_url):
    product_id, product_txt = extract_product_txt_from_url(product_url)
    product_url_full = f"https://www.etsy.com/listing/{product_id}/{product_txt}" if product_txt else f"https://www.etsy.com/listing/{product_id}/"

    driver.get(product_url_full)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    ld_data = {}
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.string.strip())
        except:
            ld_data = {}

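    # JSON-LD text can carry HTML entities (the scraped description contained
    # "&#39;"), so decode them; the local import keeps this change self-contained.
    from html import unescape
    product_title = unescape((ld_data.get("name") if isinstance(ld_data, dict) else "") or "")
    product_description = unescape((ld_data.get("description") if isinstance(ld_data, dict) else "") or "")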

    current_price = None
    old_price = None
    var_discount_percentage = None

    price_container = soup.find("div", {"data-buy-box-region": "price"})
    if price_container:
        now_p = price_container.find("p")
        if now_p:
            current_price = parse_price_text(now_p.get_text(" ", strip=True))
        old_span = price_container.find("span", class_=re.compile("wt-text-strikethrough"))
        if old_span:
            old_price = parse_price_text(old_span.get_text(" ", strip=True))

    if old_price and current_price:
        try:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
        except:
            var_discount_percentage = None

    currency_iso = detect_currency(ld_data, soup)

    # --------------------------
    # VARIANT EXTRACTION
    # --------------------------
    product_options = []
    try:
        # JSON source
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            j = json.loads(variants_script.string)
            variations = j.get("listing", {}).get("variations", [])
            for v in variations:
                prop_name = v.get("property_name") or v.get("name") or ""
                opts = []
                for o in v.get("options", []):
                    opt_id = str(o.get("option_id") or o.get("value") or o.get("id") or "")
                    opt_label = str(o.get("listing_option_display_name") or o.get("label") or o.get("value") or o.get("name") or "")
                    if opt_id:
                        opts.append({"id": opt_id, "label": opt_label})
                    else:
                        opts.append({"id": opt_label, "label": opt_label})
                product_options.append({"name": prop_name, "options": opts})

        # fallback HTML <select>
        if not product_options:
            container = soup.find("div", {"data-selector": "listing-page-variations"})
            if container:
                selects = container.find_all("select", {"data-variation-number": True})
                selects_sorted = sorted(selects, key=lambda s: int(s.get("data-variation-number", 0)))
                for sel in selects_sorted:
                    prop_num = sel.get("data-variation-number")
                    parent = sel.find_parent()
                    label_tag = parent.find("label") if parent else None
                    prop_name = label_tag.get_text(" ", strip=True) if label_tag else f"variation_{prop_num}"
                    opts = []
                    for opt in sel.find_all("option"):
                        val = opt.get("value", "").strip()
                        txt = opt.get_text(" ", strip=True)
                        if val == "":
                            continue
                        opts.append({"id": val, "label": txt})
                    if opts:
                        product_options.append({"name": prop_name, "options": opts})
    except:
        pass

    # Ratings
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews_el = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews_el.text.strip() if txt_reviews_el else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

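    # Listed date
    # NOTE: og:updated_time is a last-updated timestamp, so it is only a proxy
    # for the date the product was first listed.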
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except:
        listed_date = None

    return {
        "product_title": product_title,
        "product_url": f"https://www.etsy.com/listing/{product_id}/",
        "product_url_full": product_url_full,
        "product_id": product_id,
        "product_txt": product_txt,
        "product_options": product_options,
        "var_current_price": float(current_price) if current_price else None,
        "var_old_price": float(old_price) if old_price else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage else None,
        "currency": currency_iso,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# -----------------------------------------------------------
# VARIANT EXPANSION + VARIANT PRICE EXTRACTION
# -----------------------------------------------------------
def expand_product_variants(product_data):

    product_id = product_data.get("product_id") or ""
    product_txt = product_data.get("product_txt") or ""
    base_url = f"https://www.etsy.com/listing/{product_id}/"
    product_options = product_data.get("product_options") or []

    # ---------- NO VARIANTS ----------
    if not product_options:
        row = dict(product_data)
        variant_url = product_data.get("product_url_full") or base_url

        vcurr, vold, vdisc = extract_variant_price(variant_url)
        row["product_variant_url"] = variant_url
        row["var_current_price"] = vcurr
        row["var_old_price"] = vold
        row["var_discount_percentage"] = vdisc
        return [row]

    # Build list-of-lists for combinations
    option_lists = []
    for v in product_options:
        cleaned = []
        for o in v.get("options", []):
            oid = str(o.get("id", "")).strip()
            label = str(o.get("label", "")).strip()
            if oid == "":
                oid = label
            cleaned.append({"id": oid, "label": label})
        option_lists.append(cleaned)

    expanded_rows = []

    for combo in product(*option_lists):

        row = dict(product_data)

        # Assign dynamic variant_id_1 ... etc.
        for i, opt in enumerate(combo):
            row[f"variant_id_{i+1}"] = opt.get("id")
            row[f"variant_label_{i+1}"] = opt.get("label")

        # Build variant URL
        query_parts = []
        for i, opt in enumerate(combo):
            val = opt.get("id", "")
            query_parts.append(f"variation{i}={quote_plus(str(val))}")

        query_str = "&".join(query_parts)
        variant_url = f"{base_url}?{query_str}"

        # Extract REAL PRICE from variant URL
        vcurr, vold, vdisc = extract_variant_price(variant_url)

        row["product_variant_url"] = variant_url
        row["var_current_price"] = vcurr
        row["var_old_price"] = vold
        row["var_discount_percentage"] = vdisc

        expanded_rows.append(row)

    return expanded_rows

# --------------------------
# Scrape best-selling products with pagination
# --------------------------
def scrape_best_selling_products_paginated(product_limit=50):
    base_url = "https://www.etsy.com/fr/market/best_seller_tote_bag?is_star_seller=1&page={}"
    product_urls = []
    visited_urls = set()
    page = 1

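    while len(product_urls) < product_limit:
        driver.get(base_url.format(page))
        time.sleep(random.uniform(3,5))

        product_elements = driver.find_elements(By.XPATH, "//a[contains(@href,'/listing/')]")
        new_on_page = 0
        for elem in product_elements:
            href = elem.get_attribute("href")
            if not href:
                continue  # anchor without a usable href
            url = href.split("?")[0]
            if url not in visited_urls:
                visited_urls.add(url)
                product_urls.append(url)
                new_on_page += 1
            if len(product_urls) >= product_limit:
                break

        # Stop when a page adds nothing new; otherwise the loop could run forever
        # on pages that keep returning already-visited listings.
        if new_on_page == 0:
            break
        page += 1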

    print(f"Collected {len(product_urls)} product URLs.")

    all_products = []
    for idx, url in enumerate(product_urls, start=1):
        try:
            print(f"Scraping {idx}/{len(product_urls)}: {url}")
            data = extract_etsy_product(url)
            expanded_rows = expand_product_variants(data)
            all_products.extend(expanded_rows)

        except Exception as e:
            print(f"Error scraping {url}: {e}")
        time.sleep(random.uniform(2,4))

    return all_products

# --------------------------
# Run scraper
# --------------------------
if __name__ == "__main__":
    product_limit = 2
    all_products = scrape_best_selling_products_paginated(product_limit=product_limit)
    df = pd.DataFrame(all_products)
    df.to_csv("../data/raw/03_raw_data.csv", index=False)
    print("Scraping finished!")
df.shape
df.head(1000)

Collected 2 product URLs.
Scraping 1/2: https://www.etsy.com/fr/listing/4387299123/sac-fourre-tout-en-coton-matelasse


KeyboardInterrupt: 

In [9]:
df.head(318)

Unnamed: 0,product_title,product_url,product_url_full,product_id,product_txt,product_options,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description,variant_id_1,variant_label_1,variant_id_2,variant_label_2,product_variant_url
0,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6028745751,Farida Prints,https://www.etsy.com/listing/4387299123/?varia...
1,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538846,Pink Strips Buta,https://www.etsy.com/listing/4387299123/?varia...
2,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538850,Pink Blue Floral,https://www.etsy.com/listing/4387299123/?varia...
3,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6028745761,Red Strawberry Strip,https://www.etsy.com/listing/4387299123/?varia...
4,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538860,Blue Strips Heart,https://www.etsy.com/listing/4387299123/?varia...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
312,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5007392731,"Rose (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
313,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5031616004,"Gold (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
314,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5031616006,"Green (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
315,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5031616008,"Pink (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
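
In the rows shown above, some option labels carry their own price in parentheses (for example `Small| Plain| No Zip (17,57 €)`), while `var_current_price` is the same on every row of a listing. A minimal sketch of reading the price from the label follows. It assumes the price is always the last parenthesised group, uses a comma or dot as the decimal mark, and has no thousands separator; `price_from_label` is a hypothetical helper, not part of the scraper above.

```python
import re
from decimal import Decimal

# Last "(...)" group in the label that contains a digit, e.g. "(17,57 €)".
LABEL_PRICE = re.compile(r"\(([^()]*\d[^()]*)\)\s*$")


def price_from_label(label):
    """Return (clean_label, price) where price is a Decimal or None."""
    label = label or ""
    match = LABEL_PRICE.search(label)
    if not match:
        return label.strip(), None
    number = re.search(r"\d+(?:[.,]\d+)?", match.group(1))
    price = Decimal(number.group(0).replace(",", "."))
    return label[:match.start()].strip(), price


print(price_from_label("Small| Plain| No Zip (17,57 €)"))
print(price_from_label("Rouge"))
```

A label-derived price could serve as a cross-check against the value scraped from the variant URL, flagging rows where the two disagree.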


### V3 (product_variant_url DONE)

In [None]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote, quote_plus
from itertools import product

# --------------------------
# Chrome options
# --------------------------
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = uc.Chrome(options=options)

# --------------------------
# Helper functions
# --------------------------
def safe_decimal(s):
    if s is None:
        return None
    try:
        s = str(s).strip().replace("\u00A0", " ")
        s_norm = re.sub(r"[^\d,.\-]", "", s)
        if "," in s_norm and "." in s_norm:
            s_norm = s_norm.replace(",", "")
        elif "," in s_norm and "." not in s_norm:
            s_norm = s_norm.replace(",", ".")
        if s_norm in ("", ".", "-", "-.", None):
            return None
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

def extract_nbr_reviews(txt):
    """Parse a review count such as "(59)", "(1,234)" or "(1,2k)" into an int."""
    if not txt:
        return 0
    match = re.search(r'\(([\d.,\sKk]+)\)', txt)
    if not match:
        return 0
    value = re.sub(r"\s", "", match.group(1)).upper()
    try:
        if "K" in value:
            # "1,2K" / "1.2K": here the separator is a decimal mark
            return int(float(value.replace("K", "").replace(",", ".")) * 1000)
        # A plain count has no decimals, so "." and "," can only be thousands separators
        digits = re.sub(r"[.,]", "", value)
        return int(digits) if digits else 0
    except ValueError:
        return 0

CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "€": "EUR",
    "£": "GBP",
    "¥": "JPY",
    "₹": "INR",
}

def detect_currency(ld_data, soup):
    try:
        if isinstance(ld_data, dict):
            offers = ld_data.get("offers", {})
            if isinstance(offers, dict):
                iso = offers.get("priceCurrency") or offers.get("currency")
                if iso and isinstance(iso, str) and len(iso) >= 2:
                    return iso.upper()
    except:
        pass

    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break
    if not price_text:
        candidate = soup.find(string=re.compile(r'[$€£¥₹]'))
        if candidate:
            price_text = candidate.strip()
    if price_text:
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                return CURRENCY_MAP[sym]
        m = re.search(r'([$€£¥₹])', price_text)
        if m:
            return CURRENCY_MAP.get(m.group(1), None)
    return None

def parse_price_text(text):
    if not text:
        return None
    text = text.replace("+"," ").replace(",",".")
    matches = re.findall(r"\d+\.\d+|\d+", text)
    for match in matches:
        val = safe_decimal(match)
        if val and val > 0 and val < 100000:
            return val
    return None

def extract_product_txt_from_url(product_url):
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if path_parts and path_parts[0] in ["fr", "de", "es", "it", "nl"]:
        path_parts = path_parts[1:]

    product_id = ""
    product_txt = ""

    if len(path_parts) >= 2 and path_parts[0] == "listing":
        product_id = path_parts[1]
        if len(path_parts) >= 3:
            product_txt = unquote(path_parts[2])

    return product_id, product_txt

# --------------------------
# Etsy product extractor
# --------------------------
def extract_etsy_product(product_url):
    product_id, product_txt = extract_product_txt_from_url(product_url)
    product_url_full = f"https://www.etsy.com/listing/{product_id}/{product_txt}" if product_txt else f"https://www.etsy.com/listing/{product_id}/"

    driver.get(product_url_full)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # JSON-LD
    ld_data = {}
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.string.strip())
        except:
            ld_data = {}

    product_title = (ld_data.get("name") if isinstance(ld_data, dict) else "") or ""
    product_description = (ld_data.get("description") if isinstance(ld_data, dict) else "") or ""

    # Price
    current_price = None
    old_price = None
    var_discount_percentage = None

    price_container = soup.find("div", {"data-buy-box-region": "price"})
    if price_container:
        now_p = price_container.find("p")
        if now_p:
            current_price = parse_price_text(now_p.get_text(" ", strip=True))
        old_span = price_container.find("span", class_=re.compile("wt-text-strikethrough"))
        if old_span:
            old_price = parse_price_text(old_span.get_text(" ", strip=True))

    try:
        if old_price and current_price:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
    except:
        var_discount_percentage = None

    currency_iso = detect_currency(ld_data, soup)

    # === Variants: try JSON first, then fall back to HTML <select>s ===
    product_options = []  # we'll store as list of dicts: { "name": <name>, "options": [ {"id": "604...", "label": "Small|..."} ] }
    try:
        # JSON variant source (preferred)
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            variants_json = json.loads(variants_script.string)
            variations = variants_json.get("listing", {}).get("variations", [])
            for v in variations:
                prop_name = v.get("property_name") or v.get("name") or ""
                opts = []
                for o in v.get("options", []):
                    # Etsy's option in JSON may provide id & value or just text
                    opt_id = str(o.get("option_id") or o.get("value") or o.get("id") or "")
                    opt_label = str(o.get("listing_option_display_name") or o.get("label") or o.get("value") or o.get("name") or o.get("text") or opt_id)
                    if opt_id:
                        opts.append({"id": opt_id, "label": opt_label})
                    else:
                        # if JSON lacks id, fallback to label-only (will be used in URL encoded form)
                        opts.append({"id": opt_label, "label": opt_label})
                product_options.append({"name": prop_name, "options": opts})

        # If there is no JSON or product_options is still empty, parse the HTML <select> elements instead
        if not product_options:
            container = soup.find("div", {"data-selector": "listing-page-variations"})
            if container:
                selects = container.find_all("select", {"data-variation-number": True})
                # sort by variation-number to keep order consistent
                selects_sorted = sorted(selects, key=lambda s: int(s.get("data-variation-number", 0)))
                for sel in selects_sorted:
                    prop_num = sel.get("data-variation-number")
                    label_tag = None
                    # try to find the corresponding label
                    parent = sel.find_parent()
                    if parent:
                        label_tag = parent.find("label")
                    prop_name = ""
                    if label_tag:
                        prop_name = label_tag.get_text(" ", strip=True)
                    else:
                        prop_name = f"variation_{prop_num}"
                    opts = []
                    for opt in sel.find_all("option"):
                        val = opt.get("value", "").strip()
                        txt = opt.get_text(" ", strip=True)
                        if val == "":
                            # skip the "select an option" placeholder
                            continue
                        opts.append({"id": val, "label": txt})
                    if opts:
                        product_options.append({"name": prop_name, "options": opts})
    except Exception:
        product_options = []

    # Ratings & reviews
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews_el = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews_el.text.strip() if txt_reviews_el else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # Listed date
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except:
        listed_date = None

    # Return base product dict (no expanded variants yet)
    return {
        "product_title": product_title,
        "product_url": f"https://www.etsy.com/listing/{product_id}/",
        "product_url_full": product_url_full,
        "product_id": product_id,
        "product_txt": product_txt,
        "product_options": product_options,  # structured list (name + options with id/label)
        "var_current_price": float(current_price) if current_price else None,
        "var_old_price": float(old_price) if old_price else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage else None,
        "currency": currency_iso,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# -----------------------------------------------------------
# >>> VARIANT EXPANSION: produce one row per variant combo <<<
# -----------------------------------------------------------
def expand_product_variants(product_data):
    """
    Input: product_data returned by extract_etsy_product
    Output: list of dicts, each dict is one variant combination row
    Each row includes dynamic keys:
      variant_id_1, variant_id_2, ...  (strings: the option.value from the page/JSON)
    And product_variant_url which is:
      product_url + "?variation0=<id>&variation1=<id>&..."
    """
    product_id = product_data.get("product_id") or ""
    product_txt = product_data.get("product_txt") or ""
    base_url = f"https://www.etsy.com/listing/{product_id}/"
    # make sure product_options is present
    product_options = product_data.get("product_options") or []

    # if no variants, return single row with product_variant_url equal to product_url_full
    if not product_options:
        row = dict(product_data)  # shallow copy
        row["product_variant_url"] = product_data.get("product_url_full") or base_url
        return [row]

    # Build lists of option dicts for itertools.product
    option_lists = []
    for v in product_options:
        # v is {"name": ..., "options": [ {"id":..., "label":...}, ...]}
        opts = v.get("options", [])
        # ensure each option has an id (string)
        cleaned = []
        for o in opts:
            oid = str(o.get("id", "")).strip()
            label = str(o.get("label", "")).strip()
            if oid == "":
                # fallback to label as id if numeric id missing
                oid = label
            cleaned.append({"id": oid, "label": label})
        if cleaned:
            option_lists.append(cleaned)
        else:
            # if a variation exists but has no options, treat as single empty option
            option_lists.append([{"id": "", "label": ""}])

    expanded_rows = []
    # The URL expects variation0, variation1 ... numbering starting at 0 in the order found.
    for combo in product(*option_lists):
        # combo is a tuple of option dicts (one per variation)
        row = dict(product_data)  # base fields
        # Add variant_id_1 ... variant_id_n and also variant_label_1 ...
        for i, opt in enumerate(combo):
            # row keys are 1-based: variant_id_1, variant_id_2, ...
            row[f"variant_id_{i+1}"] = opt.get("id")
            row[f"variant_label_{i+1}"] = opt.get("label")

        # Build product_variant_url using variation0..variationN with IDs (URL-encoded ids if needed)
        query_parts = []
        for i, opt in enumerate(combo):
            # use variation index starting at 0 for URL param name
            val = opt.get("id", "")
            # Etsy usually expects raw numeric ids; quoting them is still safe
            query_parts.append(f"variation{i}={quote_plus(str(val))}")
        query_str = "&".join(query_parts)
        product_variant_url = f"{base_url}?{query_str}" if query_str else base_url
        row["product_variant_url"] = product_variant_url

        expanded_rows.append(row)

    return expanded_rows

# --------------------------
# Scrape best-selling products with pagination
# --------------------------
def scrape_best_selling_products_paginated(product_limit=50):
    base_url = "https://www.etsy.com/fr/market/best_seller_tote_bag?is_star_seller=1&page={}"
    product_urls = []
    visited_urls = set()
    page = 1

    while len(product_urls) < product_limit:
        driver.get(base_url.format(page))
        time.sleep(random.uniform(3,5))

        product_elements = driver.find_elements(By.XPATH, "//a[contains(@href,'/listing/')]")
        new_on_page = 0
        for elem in product_elements:
            href = elem.get_attribute("href")
            if not href:
                continue
            url = href.split("?")[0]
            if url not in visited_urls:
                visited_urls.add(url)
                product_urls.append(url)
                new_on_page += 1
            if len(product_urls) >= product_limit:
                break

        # Stop when a page yields no new listings (empty page or repeated results);
        # otherwise the loop could keep paging forever
        if new_on_page == 0:
            break
        page += 1

    print(f"Collected {len(product_urls)} product URLs.")

    all_products = []
    for idx, url in enumerate(product_urls, start=1):
        try:
            print(f"Scraping {idx}/{len(product_urls)}: {url}")
            data = extract_etsy_product(url)

            # Expand into variant rows and extend
            expanded_rows = expand_product_variants(data)
            all_products.extend(expanded_rows)

        except Exception as e:
            print(f"Error scraping {url}: {e}")
        time.sleep(random.uniform(2,4))

    return all_products

# --------------------------
# Run scraper
# --------------------------
if __name__ == "__main__":
    product_limit = 2  # number of products to scrape
    all_products = scrape_best_selling_products_paginated(product_limit=product_limit)
    df = pd.DataFrame(all_products)
    df.to_csv("../data/raw/03_raw_data.csv", index=False)
    print("Scraping finished! Saved to CSV.")



Collected 2 product URLs.
Scraping 1/2: https://www.etsy.com/fr/listing/4387299123/sac-fourre-tout-en-coton-matelasse
Scraping 2/2: https://www.etsy.com/fr/listing/1852832959/sac-fourre-tout-en-toile-brodee-sac-de
Scraping finished! Saved to CSV.
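
The URLs in the log above carry a `/fr/` locale prefix, while the saved `product_url` values do not. `extract_product_txt_from_url` handles this by stripping a fixed list of prefixes (`fr`, `de`, `es`, `it`, `nl`). A minimal sketch of a locale-agnostic alternative follows; `split_listing_url` is a hypothetical helper, and it assumes the listing id is always the path segment right after `listing`:

```python
from urllib.parse import urlparse, unquote

def split_listing_url(url):
    """Return (listing_id, slug) for a listing URL, ignoring any locale prefix.

    Hypothetical helper: assumes the id is the segment right after 'listing'.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "listing" not in parts:
        return "", ""
    i = parts.index("listing")
    listing_id = parts[i + 1] if len(parts) > i + 1 else ""
    slug = unquote(parts[i + 2]) if len(parts) > i + 2 else ""
    return listing_id, slug

listing_id, slug = split_listing_url(
    "https://www.etsy.com/fr/listing/4387299123/sac-fourre-tout-en-coton-matelasse"
)
canonical_url = f"https://www.etsy.com/listing/{listing_id}/"
print(canonical_url)
```

Looking for the `listing` segment instead of matching known prefixes means a locale that is not in the hard-coded list should not silently produce an empty `product_id`.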




In [5]:
df.head(1000)

Unnamed: 0,product_title,product_url,product_url_full,product_id,product_txt,product_options,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description,variant_id_1,variant_label_1,variant_id_2,variant_label_2,product_variant_url
0,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6028745751,Farida Prints,https://www.etsy.com/listing/4387299123/?varia...
1,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538846,Pink Strips Buta,https://www.etsy.com/listing/4387299123/?varia...
2,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538850,Pink Blue Floral,https://www.etsy.com/listing/4387299123/?varia...
3,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6028745761,Red Strawberry Strip,https://www.etsy.com/listing/4387299123/?varia...
4,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538860,Blue Strips Heart,https://www.etsy.com/listing/4387299123/?varia...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
312,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5007392731,"Rose (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
313,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5031616004,"Gold (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
314,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5031616006,"Green (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
315,"Sac fourre-tout en toile brodée, sac de plage ...",https://www.etsy.com/listing/1852832959/,https://www.etsy.com/listing/1852832959/sac-fo...,1852832959,sac-fourre-tout-en-toile-brodee-sac-de,"[{'name': 'variation_0', 'options': [{'id': '5...",6.95,17.36,59.965438,EUR,4.8,Avis sur cet article (59),59,,"Un sac à main d&#39;inspiration vintage, fait ...",5031616034,Rouge,5031616008,"Pink (12,50 €)",https://www.etsy.com/listing/1852832959/?varia...
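
In the rows above, `var_current_price` is identical for every variant of a listing (6.95 for listing 1852832959), while some variant labels carry their own amount, for example `Rose (12,50 €)`. It seems likely, though it is not confirmed here, that the amount in parentheses is the price of that specific variant. A minimal sketch for separating the two during cleaning follows; `split_variant_label` and the regex are hypothetical and based only on the label shapes visible in this output:

```python
import re
from decimal import Decimal

# Matches a trailing "(12,50 €)"-style suffix. The pattern is an assumption
# drawn from the labels in this dataset, not from a documented Etsy format.
LABEL_PRICE_RE = re.compile(r"\((\d+(?:[.,]\d+)?)\s*[^\d)]*\)\s*$")

def split_variant_label(label):
    """Return (clean_label, price) where price is a Decimal or None."""
    m = LABEL_PRICE_RE.search(label)
    if not m:
        return label.strip(), None
    # Treat a comma as the decimal mark, as in the French-locale labels above
    price = Decimal(m.group(1).replace(",", "."))
    return label[:m.start()].strip(), price

for raw in ["Rose (12,50 €)", "Small| Plain| No Zip (17,57 €)", "Rouge"]:
    print(split_variant_label(raw))
```

Amounts with thousands separators (such as `1.234,56`) are not matched by this pattern and fall through as `None`, so they would need separate handling.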


## V2 (variants!)

In [2]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote, quote_plus
from itertools import product

# --------------------------
# Chrome options
# --------------------------
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = uc.Chrome(options=options)

# --------------------------
# Helper functions
# --------------------------
def safe_decimal(s):
    if s is None:
        return None
    try:
        s = str(s).strip().replace("\u00A0", " ")
        s_norm = re.sub(r"[^\d,.\-]", "", s)
        if "," in s_norm and "." in s_norm:
            s_norm = s_norm.replace(",", "")
        elif "," in s_norm and "." not in s_norm:
            s_norm = s_norm.replace(",", ".")
        if s_norm in ("", ".", "-", "-.", None):
            return None
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

def extract_nbr_reviews(txt):
    """Parse a review count such as "(59)", "(1,234)" or "(1,2k)" into an int."""
    if not txt:
        return 0
    match = re.search(r'\(([\d.,\sKk]+)\)', txt)
    if not match:
        return 0
    value = re.sub(r"\s", "", match.group(1)).upper()
    try:
        if "K" in value:
            # "1,2K" / "1.2K": here the separator is a decimal mark
            return int(float(value.replace("K", "").replace(",", ".")) * 1000)
        # A plain count has no decimals, so "." and "," can only be thousands separators
        digits = re.sub(r"[.,]", "", value)
        return int(digits) if digits else 0
    except ValueError:
        return 0

CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "€": "EUR",
    "£": "GBP",
    "¥": "JPY",
    "₹": "INR",
}

def detect_currency(ld_data, soup):
    try:
        if isinstance(ld_data, dict):
            offers = ld_data.get("offers", {})
            if isinstance(offers, dict):
                iso = offers.get("priceCurrency") or offers.get("currency")
                if iso and isinstance(iso, str) and len(iso) >= 2:
                    return iso.upper()
    except:
        pass

    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break
    if not price_text:
        candidate = soup.find(string=re.compile(r'[$€£¥₹]'))
        if candidate:
            price_text = candidate.strip()
    if price_text:
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                return CURRENCY_MAP[sym]
        m = re.search(r'([$€£¥₹])', price_text)
        if m:
            return CURRENCY_MAP.get(m.group(1), None)
    return None

def parse_price_text(text):
    if not text:
        return None
    text = text.replace("+"," ").replace(",",".")
    matches = re.findall(r"\d+\.\d+|\d+", text)
    for match in matches:
        val = safe_decimal(match)
        if val and val > 0 and val < 100000:
            return val
    return None

def extract_product_txt_from_url(product_url):
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if path_parts and path_parts[0] in ["fr", "de", "es", "it", "nl"]:
        path_parts = path_parts[1:]

    product_id = ""
    product_txt = ""

    if len(path_parts) >= 2 and path_parts[0] == "listing":
        product_id = path_parts[1]
        if len(path_parts) >= 3:
            product_txt = unquote(path_parts[2])

    return product_id, product_txt

# --------------------------
# Etsy product extractor
# --------------------------
def extract_etsy_product(product_url):
    product_id, product_txt = extract_product_txt_from_url(product_url)
    product_url_full = f"https://www.etsy.com/listing/{product_id}/{product_txt}" if product_txt else f"https://www.etsy.com/listing/{product_id}/"

    driver.get(product_url_full)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # JSON-LD
    ld_data = {}
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.string.strip())
        except:
            ld_data = {}

    product_title = (ld_data.get("name") if isinstance(ld_data, dict) else "") or ""
    product_description = (ld_data.get("description") if isinstance(ld_data, dict) else "") or ""

    # Price
    current_price = None
    old_price = None
    var_discount_percentage = None

    price_container = soup.find("div", {"data-buy-box-region": "price"})
    if price_container:
        now_p = price_container.find("p")
        if now_p:
            current_price = parse_price_text(now_p.get_text(" ", strip=True))
        old_span = price_container.find("span", class_=re.compile("wt-text-strikethrough"))
        if old_span:
            old_price = parse_price_text(old_span.get_text(" ", strip=True))

    try:
        if old_price and current_price:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
    except:
        var_discount_percentage = None

    currency_iso = detect_currency(ld_data, soup)

    # === Variants: try JSON first, then fall back to HTML <select>s ===
    product_options = []  # we'll store as list of dicts: { "name": <name>, "options": [ {"id": "604...", "label": "Small|..."} ] }
    try:
        # JSON variant source (preferred)
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            variants_json = json.loads(variants_script.string)
            variations = variants_json.get("listing", {}).get("variations", [])
            for v in variations:
                prop_name = v.get("property_name") or v.get("name") or ""
                opts = []
                for o in v.get("options", []):
                    # Etsy's option in JSON may provide id & value or just text
                    opt_id = str(o.get("option_id") or o.get("value") or o.get("id") or "")
                    opt_label = str(o.get("listing_option_display_name") or o.get("label") or o.get("value") or o.get("name") or o.get("text") or opt_id)
                    if opt_id:
                        opts.append({"id": opt_id, "label": opt_label})
                    else:
                        # if JSON lacks id, fallback to label-only (will be used in URL encoded form)
                        opts.append({"id": opt_label, "label": opt_label})
                product_options.append({"name": prop_name, "options": opts})

        # If there is no JSON or product_options is still empty, parse the HTML <select> elements instead
        if not product_options:
            container = soup.find("div", {"data-selector": "listing-page-variations"})
            if container:
                selects = container.find_all("select", {"data-variation-number": True})
                # sort by variation-number to keep order consistent
                selects_sorted = sorted(selects, key=lambda s: int(s.get("data-variation-number", 0)))
                for sel in selects_sorted:
                    prop_num = sel.get("data-variation-number")
                    label_tag = None
                    # try to find the corresponding label
                    parent = sel.find_parent()
                    if parent:
                        label_tag = parent.find("label")
                    prop_name = ""
                    if label_tag:
                        prop_name = label_tag.get_text(" ", strip=True)
                    else:
                        prop_name = f"variation_{prop_num}"
                    opts = []
                    for opt in sel.find_all("option"):
                        val = opt.get("value", "").strip()
                        txt = opt.get_text(" ", strip=True)
                        if val == "":
                            # skip the "select an option" placeholder
                            continue
                        opts.append({"id": val, "label": txt})
                    if opts:
                        product_options.append({"name": prop_name, "options": opts})
    except Exception:
        product_options = []

    # Ratings & reviews
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews_el = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews_el.text.strip() if txt_reviews_el else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except Exception:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # Listed date
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except Exception:
        listed_date = None

    # Return base product dict (no expanded variants yet)
    return {
        "product_title": product_title,
        "product_url": f"https://www.etsy.com/listing/{product_id}/",
        "product_url_full": product_url_full,
        "product_id": product_id,
        "product_txt": product_txt,
        "product_options": product_options,  # structured list (name + options with id/label)
        "var_current_price": float(current_price) if current_price else None,
        "var_old_price": float(old_price) if old_price else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage else None,
        "currency": currency_iso,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# -----------------------------------------------------------
# >>> VARIANT EXPANSION: produce one row per variant combo <<<
# -----------------------------------------------------------
def expand_product_variants(product_data):
    """
    Input: product_data returned by extract_etsy_product
    Output: list of dicts, each dict is one variant combination row
    Each row includes dynamic keys:
      variant_id_1, variant_id_2, ...  (strings: the option.value from the page/JSON)
    And product_variant_url which is:
      product_url + "?variation0=<id>&variation1=<id>&..."
    """
    product_id = product_data.get("product_id") or ""
    product_txt = product_data.get("product_txt") or ""
    base_url = f"https://www.etsy.com/listing/{product_id}/"
    # make sure product_options is present
    product_options = product_data.get("product_options") or []

    # if no variants, return single row with product_variant_url equal to product_url_full
    if not product_options:
        row = dict(product_data)  # shallow copy
        row["product_variant_url"] = product_data.get("product_url_full") or base_url
        return [row]

    # Build lists of option dicts for itertools.product
    option_lists = []
    for v in product_options:
        # v is {"name": ..., "options": [ {"id":..., "label":...}, ...]}
        opts = v.get("options", [])
        # ensure each option has an id (string)
        cleaned = []
        for o in opts:
            oid = str(o.get("id", "")).strip()
            label = str(o.get("label", "")).strip()
            if oid == "":
                # fallback to label as id if numeric id missing
                oid = label
            cleaned.append({"id": oid, "label": label})
        if cleaned:
            option_lists.append(cleaned)
        else:
            # if a variation exists but has no options, treat as single empty option
            option_lists.append([{"id": "", "label": ""}])

    expanded_rows = []
    # The URL expects variation0, variation1 ... numbering starting at 0 in the order found.
    for combo in product(*option_lists):
        # combo is a tuple of option dicts (one per variation)
        row = dict(product_data)  # base fields
        # Add variant_id_1 ... variant_id_n and also variant_label_1 ...
        for i, opt in enumerate(combo):
            # dict keys are 1-based (variant_id_1, variant_id_2, ...)
            row[f"variant_id_{i+1}"] = opt.get("id")
            row[f"variant_label_{i+1}"] = opt.get("label")

        # Build product_variant_url using variation0..variationN with IDs (URL-encoded ids if needed)
        query_parts = []
        for i, opt in enumerate(combo):
            # use variation index starting at 0 for URL param name
            val = opt.get("id", "")
            # Etsy usually expects raw numeric ids; quoting them is still safe
            query_parts.append(f"variation{i}={quote_plus(str(val))}")
        query_str = "&".join(query_parts)
        product_variant_url = f"{base_url}?{query_str}" if query_str else base_url
        row["product_variant_url"] = product_variant_url

        expanded_rows.append(row)

    return expanded_rows

# --------------------------
# Scrape best-selling products with pagination
# --------------------------
def scrape_best_selling_products_paginated(product_limit=50):
    base_url = "https://www.etsy.com/fr/market/best_seller_tote_bag?is_star_seller=1&page={}"
    product_urls = []
    visited_urls = set()
    page = 1

    while len(product_urls) < product_limit:
        driver.get(base_url.format(page))
        time.sleep(random.uniform(3,5))

        product_elements = driver.find_elements(By.XPATH, "//a[contains(@href,'/listing/')]")
        new_urls = 0
        for elem in product_elements:
            url = elem.get_attribute("href").split("?")[0]
            if url not in visited_urls:
                visited_urls.add(url)
                product_urls.append(url)
                new_urls += 1
            if len(product_urls) >= product_limit:
                break

        # Stop when a page yields no new listings (empty page or only repeats);
        # otherwise the loop could keep requesting pages forever.
        if new_urls == 0:
            break
        page += 1

    print(f"Collected {len(product_urls)} product URLs.")

    all_products = []
    for idx, url in enumerate(product_urls, start=1):
        try:
            print(f"Scraping {idx}/{len(product_urls)}: {url}")
            data = extract_etsy_product(url)

            # Expand into variant rows and extend
            expanded_rows = expand_product_variants(data)
            all_products.extend(expanded_rows)

        except Exception as e:
            print(f"Error scraping {url}: {e}")
        time.sleep(random.uniform(2,4))

    return all_products

# --------------------------
# Run scraper
# --------------------------
if __name__ == "__main__":
    product_limit = 2  # number of products to scrape
    all_products = scrape_best_selling_products_paginated(product_limit=product_limit)
    df = pd.DataFrame(all_products)
    df.to_csv("../data/raw/02_raw_data.csv", index=False)
    print("Scraping finished! Saved to CSV.")

df.head()


Collected 2 product URLs.
Scraping 1/2: https://www.etsy.com/fr/listing/4387299123/sac-fourre-tout-en-coton-matelasse
Scraping 2/2: https://www.etsy.com/fr/listing/1852832959/sac-fourre-tout-en-toile-brodee-sac-de
Scraping finished! Saved to CSV.


Unnamed: 0,product_title,product_url,product_url_full,product_id,product_txt,product_options,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description,variant_id_1,variant_label_1,variant_id_2,variant_label_2,product_variant_url
0,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6028745751,Farida Prints,https://www.etsy.com/listing/4387299123/?varia...
1,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538846,Pink Strips Buta,https://www.etsy.com/listing/4387299123/?varia...
2,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538850,Pink Blue Floral,https://www.etsy.com/listing/4387299123/?varia...
3,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6028745761,Red Strawberry Strip,https://www.etsy.com/listing/4387299123/?varia...
4,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,"[{'name': 'variation_0', 'options': [{'id': '6...",17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,6041538838,"Small| Plain| No Zip (17,57 €)",6041538860,Blue Strips Heart,https://www.etsy.com/listing/4387299123/?varia...
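
The rows above are the Cartesian product of the listing's option lists, one row per combination. The sketch below isolates that step using only the standard library. `build_variant_urls` is a hypothetical helper name, the option ids and labels are copied from the output above, and the `variationN` query-parameter format is the one this notebook assumes rather than a documented Etsy interface.

```python
from itertools import product
from urllib.parse import quote_plus


def build_variant_urls(base_url, option_lists):
    """Return one URL per combination of option ids (hypothetical helper)."""
    urls = []
    for combo in product(*option_lists):
        # variation0, variation1, ... follow the order of the option lists
        query = "&".join(
            f"variation{i}={quote_plus(str(opt['id']))}"
            for i, opt in enumerate(combo)
        )
        urls.append(f"{base_url}?{query}" if query else base_url)
    return urls


# Sample data: ids and labels taken from the DataFrame output above.
option_lists = [
    [{"id": "6041538838", "label": "Small| Plain| No Zip (17,57 €)"}],
    [
        {"id": "6028745751", "label": "Farida Prints"},
        {"id": "6041538846", "label": "Pink Strips Buta"},
    ],
]

urls = build_variant_urls("https://www.etsy.com/listing/4387299123/", option_lists)
for url in urls:
    print(url)
```

The row count grows multiplicatively with the number of options per variation. Every expanded row also keeps the base listing's price (17.57 in the output above), because the scraper does not visit the variant URLs it builds.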


### V1 (product_variants)

In [1]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote
from itertools import product

# --------------------------
# Chrome options
# --------------------------
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = uc.Chrome(options=options)

# --------------------------
# Helper functions
# --------------------------
def safe_decimal(s):
    if s is None:
        return None
    try:
        s = str(s).strip().replace("\u00A0", " ")
        s_norm = re.sub(r"[^\d,.\-]", "", s)
        if "," in s_norm and "." in s_norm:
            s_norm = s_norm.replace(",", "")
        elif "," in s_norm and "." not in s_norm:
            s_norm = s_norm.replace(",", ".")
        if s_norm in ("", ".", "-", "-.", None):
            return None
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

def extract_nbr_reviews(txt):
    if not txt:
        return 0
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if not match:
        return 0
    value = match.group(1).strip().replace(" ", "").replace(",", ".")
    try:
        if "K" in value.upper():
            return int(float(value.upper().replace("K", "")) * 1000)
        return int(float(value))
    except ValueError:
        digits = ''.join(ch for ch in value if ch.isdigit())
        return int(digits) if digits else 0

CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "€": "EUR",
    "£": "GBP",
    "¥": "JPY",
    "₹": "INR",
}

def detect_currency(ld_data, soup):
    try:
        if isinstance(ld_data, dict):
            offers = ld_data.get("offers", {})
            if isinstance(offers, dict):
                iso = offers.get("priceCurrency") or offers.get("currency")
                if iso and isinstance(iso, str) and len(iso) >= 2:
                    return iso.upper()
    except Exception:
        pass

    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break
    if not price_text:
        candidate = soup.find(string=re.compile(r'[$€£¥₹]'))
        if candidate:
            price_text = candidate.strip()
    if price_text:
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                return CURRENCY_MAP[sym]
        m = re.search(r'([$€£¥₹])', price_text)
        if m:
            return CURRENCY_MAP.get(m.group(1), None)
    return None

def parse_price_text(text):
    if not text:
        return None
    text = text.replace("+"," ").replace(",",".")
    matches = re.findall(r"\d+\.\d+|\d+", text)
    for match in matches:
        val = safe_decimal(match)
        if val and val > 0 and val < 100000:
            return val
    return None

def extract_product_txt_from_url(product_url):
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if path_parts and path_parts[0] in ["fr", "de", "es", "it", "nl"]:
        path_parts = path_parts[1:]

    product_id = ""
    product_txt = ""

    if len(path_parts) >= 2 and path_parts[0] == "listing":
        product_id = path_parts[1]
        if len(path_parts) >= 3:
            product_txt = unquote(path_parts[2])

    return product_id, product_txt

# --------------------------
# Etsy product extractor
# --------------------------
def extract_etsy_product(product_url):
    product_id, product_txt = extract_product_txt_from_url(product_url)
    product_url_full = f"https://www.etsy.com/listing/{product_id}/{product_txt}" if product_txt else f"https://www.etsy.com/listing/{product_id}/"

    driver.get(product_url_full)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # JSON-LD
    ld_data = {}
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.string.strip())
        except Exception:
            ld_data = {}

    product_title = (ld_data.get("name") if isinstance(ld_data, dict) else "") or ""
    product_description = (ld_data.get("description") if isinstance(ld_data, dict) else "") or ""

    # Price
    current_price = None
    old_price = None
    var_discount_percentage = None

    price_container = soup.find("div", {"data-buy-box-region": "price"})
    if price_container:
        now_p = price_container.find("p")
        if now_p:
            current_price = parse_price_text(now_p.get_text(" ", strip=True))
        old_span = price_container.find("span", class_=re.compile("wt-text-strikethrough"))
        if old_span:
            old_price = parse_price_text(old_span.get_text(" ", strip=True))

    try:
        if old_price and current_price:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
    except Exception:
        var_discount_percentage = None

    currency_iso = detect_currency(ld_data, soup)

    # Variants
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            variants_json = json.loads(variants_script.string)
            variations = variants_json.get("listing", {}).get("variations", [])
            for v in variations:
                product_options.append({v.get("property_name"): v.get("options")})
    except Exception:
        product_options = []

    # Ratings & reviews
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews_el = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews_el.text.strip() if txt_reviews_el else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except Exception:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # Listed date
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except Exception:
        listed_date = None

    return {
        "product_title": product_title,
        "product_url": f"https://www.etsy.com/listing/{product_id}/",
        "product_url_full": product_url_full,
        "product_id": product_id,
        "product_txt": product_txt,
        "product_options": product_options,
        "var_current_price": float(current_price) if current_price else None,
        "var_old_price": float(old_price) if old_price else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage else None,
        "currency": currency_iso,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# -----------------------------------------------------------
# >>> ADDED FOR VARIANT EXPANSION <<<
# -----------------------------------------------------------
def expand_product_variants(product_data):
    """
    Takes the product dictionary returned by extract_etsy_product()
    and expands it into one row per variant combination.
    """
    product_id = product_data["product_id"]
    product_txt = product_data["product_txt"]
    base_url = f"https://www.etsy.com/listing/{product_id}/{product_txt}"

    # Convert product_options to {variation0: [..], variation1: [..]}
    variants_raw = product_data.get("product_options", [])
    
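    variant_dict = {}
    for v in variants_raw:
        # The JSON "options" value may be None or empty; skip such variations
        # so itertools.product neither fails nor yields zero rows.
        options = (list(v.values())[0] if v else []) or []
        if options:
            variant_dict[f"variation{len(variant_dict)}"] = options

    # If product has no variants -> return only 1 row
    if not variant_dict:
        row = product_data.copy()  # avoid mutating the caller's dict
        row["variant_url"] = product_data["product_url_full"]
        return [row]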

    # Generate all combinations
    variant_keys = list(variant_dict.keys())
    variant_lists = [variant_dict[k] for k in variant_keys]

    expanded_rows = []

    for combination in product(*variant_lists):
        row = product_data.copy()

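        # Add variation fields to row
        for i, value in enumerate(combination):
            row[f"variation{i}"] = value

        # Build Etsy variant URL:
        # ?variation0=value&variation1=value&...
        # Options parsed from the JSON are dicts, so use their id, not the dict itself.
        ids = [
            (v.get("option_id") or v.get("value") or v.get("id") or "") if isinstance(v, dict) else v
            for v in combination
        ]
        query = "&".join(f"variation{i}={opt_id}" for i, opt_id in enumerate(ids))
        row["variant_url"] = f"{base_url}?{query}"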

        expanded_rows.append(row)

    return expanded_rows

# --------------------------
# Scrape best-selling products with pagination
# --------------------------
def scrape_best_selling_products_paginated(product_limit=50):
    base_url = "https://www.etsy.com/fr/market/best_seller_tote_bag?is_star_seller=1&page={}"
    product_urls = []
    visited_urls = set()
    page = 1

    while len(product_urls) < product_limit:
        driver.get(base_url.format(page))
        time.sleep(random.uniform(3,5))

        product_elements = driver.find_elements(By.XPATH, "//a[contains(@href,'/listing/')]")
        new_urls = 0
        for elem in product_elements:
            url = elem.get_attribute("href").split("?")[0]
            if url not in visited_urls:
                visited_urls.add(url)
                product_urls.append(url)
                new_urls += 1
            if len(product_urls) >= product_limit:
                break

        # Stop when a page yields no new listings (empty page or only repeats);
        # otherwise the loop could keep requesting pages forever.
        if new_urls == 0:
            break
        page += 1

    print(f"Collected {len(product_urls)} product URLs.")

    all_products = []
    for idx, url in enumerate(product_urls, start=1):
        try:
            print(f"Scraping {idx}/{len(product_urls)}: {url}")
            data = extract_etsy_product(url)

            # >>> EXPAND VARIANTS <<<
            expanded_rows = expand_product_variants(data)
            all_products.extend(expanded_rows)

        except Exception as e:
            print(f"Error scraping {url}: {e}")
        time.sleep(random.uniform(2,4))

    return all_products

# --------------------------
# Run scraper
# --------------------------
if __name__ == "__main__":
    product_limit = 2  # number of products to scrape
    all_products = scrape_best_selling_products_paginated(product_limit=product_limit)
    df = pd.DataFrame(all_products)
    df.to_csv("../data/raw/01_raw_data.csv", index=False)
    print("Scraping finished! Saved to CSV.")

df.head()


Collected 2 product URLs.
Scraping 1/2: https://www.etsy.com/fr/listing/4392351544/sac-cabas-en-velours-cotele-personnalise
Scraping 2/2: https://www.etsy.com/fr/listing/4387299123/sac-fourre-tout-en-coton-matelasse
Scraping finished! Saved to CSV.


Unnamed: 0,product_title,product_url,product_url_full,product_id,product_txt,product_options,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description,variant_url
0,"Sac cabas en velours côtelé personnalisé, sac ...",https://www.etsy.com/listing/4392351544/,https://www.etsy.com/listing/4392351544/sac-ca...,4392351544,sac-cabas-en-velours-cotele-personnalise,[],10.18,18.51,45.002701,EUR,5.0,Avis sur cet article (1),1,,Le compagnon idéal du quotidien : notre sac ca...,https://www.etsy.com/listing/4392351544/sac-ca...
1,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,[],17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...,https://www.etsy.com/listing/4387299123/sac-fo...
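
`parse_price_text` in the cell above replaces every comma with a period before matching digits. That suits the French-formatted prices in this run (for example `17,57 €`), but a value with a thousands separator such as `1,234.56` would be read as `1.234`. The sketch below shows one more tolerant way to normalise such strings with the standard library only. `normalize_price` is a hypothetical name, and the rule "when both separators appear, the right-most one is the decimal mark" is an assumption, not something the scraped pages guarantee.

```python
import re
from decimal import Decimal, InvalidOperation


def normalize_price(text):
    """Parse a price string into a Decimal, or return None (hypothetical helper).

    Assumption: when both ',' and '.' appear, the right-most one is the
    decimal mark; a lone ',' is treated as a decimal comma.
    """
    if not text:
        return None
    # Keep only digits and separators (drops currency symbols and spaces).
    cleaned = re.sub(r"[^\d,.]", "", text)
    if "," in cleaned and "." in cleaned:
        if cleaned.rfind(",") > cleaned.rfind("."):
            # European style: 1.234,56
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            # US style: 1,234.56
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        cleaned = cleaned.replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


for sample in ["17,57 €", "1,234.56", "1.234,56 €", "n/a"]:
    print(repr(sample), "->", normalize_price(sample))
```

A lone comma stays ambiguous under this rule: `1,234` is parsed as 1.234, which is only correct for decimal-comma locales such as the `/fr/` pages scraped here.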


### V0 (now I can work on the product variations)

In [50]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote

# --------------------------
# Chrome options
# --------------------------
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = uc.Chrome(options=options)

# --------------------------
# Helper functions
# --------------------------
def safe_decimal(s):
    if s is None:
        return None
    try:
        s = str(s).strip().replace("\u00A0", " ")
        s_norm = re.sub(r"[^\d,.\-]", "", s)
        if "," in s_norm and "." in s_norm:
            s_norm = s_norm.replace(",", "")
        elif "," in s_norm and "." not in s_norm:
            s_norm = s_norm.replace(",", ".")
        if s_norm in ("", ".", "-", "-.", None):
            return None
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

def extract_nbr_reviews(txt):
    if not txt:
        return 0
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if not match:
        return 0
    value = match.group(1).strip().replace(" ", "").replace(",", ".")
    try:
        if "K" in value.upper():
            return int(float(value.upper().replace("K", "")) * 1000)
        return int(float(value))
    except ValueError:
        digits = ''.join(ch for ch in value if ch.isdigit())
        return int(digits) if digits else 0

CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "€": "EUR",
    "£": "GBP",
    "¥": "JPY",
    "₹": "INR",
}

def detect_currency(ld_data, soup):
    try:
        if isinstance(ld_data, dict):
            offers = ld_data.get("offers", {})
            if isinstance(offers, dict):
                iso = offers.get("priceCurrency") or offers.get("currency")
                if iso and isinstance(iso, str) and len(iso) >= 2:
                    return iso.upper()
    except Exception:
        pass
    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break
    if not price_text:
        candidate = soup.find(string=re.compile(r'[$€£¥₹]'))
        if candidate:
            price_text = candidate.strip()
    if price_text:
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                return CURRENCY_MAP[sym]
        m = re.search(r'([$€£¥₹])', price_text)
        if m:
            return CURRENCY_MAP.get(m.group(1), None)
    return None

def parse_price_text(text):
    if not text:
        return None
    text = text.replace("+"," ").replace(",",".")
    matches = re.findall(r"\d+\.\d+|\d+", text)
    for match in matches:
        val = safe_decimal(match)
        if val and val > 0 and val < 100000:
            return val
    return None

def extract_product_txt_from_url(product_url):
    """
    Extract product_id and product_txt from a full Etsy listing URL.
    """
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    # Remove language prefix if exists
    if path_parts and path_parts[0] in ["fr", "de", "es", "it", "nl"]:
        path_parts = path_parts[1:]

    product_id = ""
    product_txt = ""

    if len(path_parts) >= 2 and path_parts[0] == "listing":
        product_id = path_parts[1]
        if len(path_parts) >= 3:
            product_txt = unquote(path_parts[2])

    return product_id, product_txt

# --------------------------
# Etsy product extractor
# --------------------------
def extract_etsy_product(product_url):
    product_id, product_txt = extract_product_txt_from_url(product_url)
    product_url_full = f"https://www.etsy.com/listing/{product_id}/{product_txt}" if product_txt else f"https://www.etsy.com/listing/{product_id}/"

    driver.get(product_url_full)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # JSON-LD
    ld_data = {}
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.string.strip())
        except Exception:
            ld_data = {}

    product_title = (ld_data.get("name") if isinstance(ld_data, dict) else "") or ""
    product_description = (ld_data.get("description") if isinstance(ld_data, dict) else "") or ""

    # Price
    current_price = None
    old_price = None
    var_discount_percentage = None

    price_container = soup.find("div", {"data-buy-box-region": "price"})
    if price_container:
        now_p = price_container.find("p")
        if now_p:
            current_price = parse_price_text(now_p.get_text(" ", strip=True))
        old_span = price_container.find("span", class_=re.compile("wt-text-strikethrough"))
        if old_span:
            old_price = parse_price_text(old_span.get_text(" ", strip=True))

    try:
        if old_price and current_price:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
    except Exception:
        var_discount_percentage = None

    currency_iso = detect_currency(ld_data, soup)

    # Variants
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            variants_json = json.loads(variants_script.string)
            variations = variants_json.get("listing", {}).get("variations", [])
            for v in variations:
                product_options.append({v.get("property_name"): v.get("options")})
    except Exception:
        product_options = []

    # Ratings & reviews
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews_el = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews_el.text.strip() if txt_reviews_el else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except Exception:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # Listed date
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except Exception:
        listed_date = None

    return {
        "product_title": product_title,
        "product_url": f"https://www.etsy.com/listing/{product_id}/",
        "product_url_full": product_url_full,
        "product_id": product_id,
        "product_txt": product_txt,
        "product_options": product_options,
        "var_current_price": float(current_price) if current_price else None,
        "var_old_price": float(old_price) if old_price else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage else None,
        "currency": currency_iso,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# --------------------------
# Scrape best-selling products with pagination
# --------------------------
def scrape_best_selling_products_paginated(product_limit=50):
    base_url = "https://www.etsy.com/fr/market/best_seller_tote_bag?is_star_seller=1&page={}"
    product_urls = []
    visited_urls = set()
    page = 1

    while len(product_urls) < product_limit:
        driver.get(base_url.format(page))
        time.sleep(random.uniform(3, 5))  # randomized pause between page loads

        product_elements = driver.find_elements(By.XPATH, "//a[contains(@href,'/listing/')]")
        new_urls = 0
        for elem in product_elements:
            href = elem.get_attribute("href")
            if not href:
                continue
            url = href.split("?")[0]  # drop query parameters so duplicates compare equal
            if url not in visited_urls:
                visited_urls.add(url)
                product_urls.append(url)
                new_urls += 1
            if len(product_urls) >= product_limit:
                break

        # Stop when a page yields no new listings; otherwise the loop could run forever
        if new_urls == 0:
            break
        page += 1

    print(f"Collected {len(product_urls)} product URLs.")

    all_products = []
    for idx, url in enumerate(product_urls, start=1):
        try:
            print(f"Scraping {idx}/{len(product_urls)}: {url}")
            data = extract_etsy_product(url)
            all_products.append(data)
        except Exception as e:
            print(f"Error scraping {url}: {e}")
        time.sleep(random.uniform(2,4))

    return all_products

# --------------------------
# Run scraper
# --------------------------
if __name__ == "__main__":
    product_limit = 2  # number of products to scrape
    all_products = scrape_best_selling_products_paginated(product_limit=product_limit)
    df = pd.DataFrame(all_products)
    df.to_csv("../data/raw/etsy_best_selling_tote_bags.csv", index=False)
    print("Scraping finished! Saved to CSV.")
df.head()


Collected 2 product URLs.
Scraping 1/2: https://www.etsy.com/fr/listing/4392351544/sac-cabas-en-velours-cotele-personnalise
Scraping 2/2: https://www.etsy.com/fr/listing/4387299123/sac-fourre-tout-en-coton-matelasse
Scraping finished! Saved to CSV.


Unnamed: 0,product_title,product_url,product_url_full,product_id,product_txt,product_options,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,"Sac cabas en velours côtelé personnalisé, sac ...",https://www.etsy.com/listing/4392351544/,https://www.etsy.com/listing/4392351544/sac-ca...,4392351544,sac-cabas-en-velours-cotele-personnalise,[],10.18,18.51,45.002701,EUR,5.0,Avis sur cet article (1),1,,Le compagnon idéal du quotidien : notre sac ca...
1,Sac fourre-tout en coton matelassé : imprimé f...,https://www.etsy.com/listing/4387299123/,https://www.etsy.com/listing/4387299123/sac-fo...,4387299123,sac-fourre-tout-en-coton-matelasse,[],17.57,23.44,25.042662,EUR,0.0,,0,,🌸 Sac matelassé en coton à imprimé blocs brodé...


==================================================================================================================================
# <div align="center">DATA CLEANING & ANALYSIS</div>
==================================================================================================================================

#### 🗃️ **Raw data**

- The scraped data is stored in a DataFrame, saved as a CSV file, and uploaded to Google Drive.
- `df_url` must be a direct-download link to that CSV file on Google Drive, not the regular share link.
- The CSV is then loaded for data cleaning and analysis.
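
A regular Google Drive share link opens a preview page rather than the file itself, so `pd.read_csv` cannot read it directly. The sketch below shows one way to derive a direct-download link from a share link. It assumes the commonly used `uc?export=download&id=...` URL pattern, which Google may change and which can behave differently for large files. The helper name `drive_share_to_download_url` and the file ID are hypothetical.

```python
import re


def drive_share_to_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link (assumed URL pattern)."""
    # Share links usually carry the file ID as /d/<FILE_ID>/ or as ?id=<FILE_ID>.
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"No Google Drive file ID found in: {share_url}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?export=download&id={file_id}"


# Hypothetical share link; FILE_ID_PLACEHOLDER is not a real file.
example = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
print(drive_share_to_download_url(example))
```

The returned string can then be assigned to `df_url`, provided the file is shared so that anyone with the link can view it.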

In [None]:
import pandas as pd

# Load RAW DATA CSV
df_url = 'link to the dataFrame collected from scraping as a downloadable link from google drive'
df_etsy = pd.read_csv(df_url)

print("STEP 1 : RAW CSV loaded successfully!")
df_etsy.head()
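
One thing to keep in mind before cleaning: a CSV round trip flattens Python objects into text. The `product_options` column, which the scraper fills with lists of dictionaries, comes back as strings such as `"[]"`, and `listed_date` comes back as text or NaN. The sketch below illustrates a possible way to restore those types. The rows are made-up stand-ins that only mimic the scraper's column names, and `df_raw` / `df_loaded` are hypothetical names.

```python
import ast
import io

import pandas as pd

# Made-up rows that mimic the scraper's columns (values are illustrative only).
df_raw = pd.DataFrame({
    "product_id": [1, 2],
    "product_options": [[{"Size": ["S", "M"]}], []],
    "var_current_price": [10.0, 17.5],
    "listed_date": ["2025-01-15", None],
})

# Round-trip through CSV in memory, as to_csv / read_csv would do on disk.
buffer = io.StringIO()
df_raw.to_csv(buffer, index=False)
buffer.seek(0)
df_loaded = pd.read_csv(buffer)

# After loading, product_options holds strings, so parse them back into lists.
df_loaded["product_options"] = df_loaded["product_options"].apply(ast.literal_eval)

# listed_date is read as text (or NaN); convert it to datetime, keeping gaps as NaT.
df_loaded["listed_date"] = pd.to_datetime(df_loaded["listed_date"], errors="coerce")

print(df_loaded.dtypes)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that originated from a scraped page. It would raise on a missing value, so a real cleaning step may need to fill or skip NaN entries first.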


==================================================================================================================================
# <div align="center">PLOTS</div>
==================================================================================================================================

### 📊 PLOT 01:

In [None]:
# PLOT 1

### 📊 PLOT 02:

In [None]:
# PLOT 2

### 📊 PLOT 03:

In [None]:
# PLOT 3

### 📊 PLOT 04:

In [None]:
# PLOT 4

### 📊 PLOT 05:

In [None]:
# PLOT 5

==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### 🧠 INSIGHT 01:
Text

---

### 🧠 INSIGHT 02:
Text

---

### 🧠 INSIGHT 03:
Text


==================================================================================================================================