==================================================================================================================================
# <div align="center">PROJECT 03: Etsy Print-On-Demand Trends</div>
==================================================================================================================================

### üìù BUSINESS IDEA

**Print-On-Demand (POD) Business** ‚Äì What the project is about

### ‚ö†Ô∏è PROBLEM

No Free API exists to access the market data needed, requiring web scraping to gather insights ‚Äì The challenge we‚Äôre addressing

### üî∞ SOLUTION FRAMEWORK

Web scrape etsy for a specific POD product

Collect the data necessary to clean & analyze


| **Development**                                                                                                                                             | **Presentation**                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **Business Idea** ‚Üí **Problem Definition** ‚Üí **Data Research & Visualization** ‚Üí **Insights** ‚Üí **Interpretation** ‚Üí **Implications** ‚Üí **Business Impact** | **Limitations & Considerations** |


---

### üìì SECTION OVERVIEW

- **Project / Business Idea:** What the project is about

- **Problem:** The challenge we‚Äôre addressing

- **Solution / Approach:** How we solve it

- **Research & Plots:** How we analyzed data visually

- **Insights:** What we discovered

- **Interpretation:** Why it matters

- **Implications:** What actions the business can take

- **Business Impact:** Expected results for the business

- **Limitations:** What constraints or gaps exist

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### üåê **Which Are the Best-Selling POD Products on Etsy?**

I‚Äôm researching print-on-demand products to sell on Etsy that only require **digital artwork and marketing**, while the POD provider handles **printing, packaging, and shipping**.


### ‚≠ê **Using Google Trends for POD Product Research**
üí° **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020‚Äì2025).

Below is the list of product categories I‚Äôm comparing:

### üéØ **Chosen POD product to research is :** `tote bags`

| Category              | Subcategories / Examples                                      |
|-----------------------|---------------------------------------------------------------|
| **Custom Apparel**        | T-shirts, Hoodies, Sweatshirts, Tank tops                     |
| **Mug**                   | Ceramic mugs, Color-changing mugs, Espresso mugs, Travel mugs |
| **Tote Bag**              | Cotton totes, All-over print totes                            |
| **Phone Case**            | iPhone / Samsung cases, Tough / Slim cases                    |
| **Stickers**              | Die-cut stickers, Kiss-cut stickers, Sticker sheets           |
| **Hats**                  | Baseball caps, Trucker hats, Beanies                          |
| **Pillows / Cushions**    | Pillow covers, Stuffed pillows, All-over print pillow designs|
| **Blanket**               | Fleece blankets, Sherpa blankets, Woven blankets             |
| **Wall Art**              | Posters, Canvas prints, Framed posters, Metal prints         |
| **Doormat**               | Printed coir doormats, Rubber-backed doormats                |
| **Drinkware**             | Stainless steel tumblers, Water bottles, Wine tumblers       |
| **Calendar**              | Custom printed wall calendars                                 |
| **Yoga Mat**              | Printed yoga mats                                             |
| **Bedding**               | Duvet covers, Pillowcases, All-over print bed sets           |
| **Pet Accessories**       | Pet bandanas, Pet beds, Pet bowls, Pet blankets              |
| **Ornaments**             | Ceramic ornaments, Wood ornaments, Metal ornaments           |


### **BEFORE GETTING STARTED :**

```Etsy``` is a dynamic website, so scraping it requires careful handling.

Since ```Etsy``` uses ```JavaScript``` to load some content,

```requests``` +  ``BeautifulSoup`` might work for static parts (like search results), 

but for dynamic content, ``Selenium`` is more reliable. 

I will be using ``requests`` + ``BeautifulSoup`` for ```product listings``` **(title, price, link)**

Important Note: Etsy uses dynamic loading + anti-bot protections.

Using code with standard HTML scraping can work as long as Etsy doesn‚Äôt block the request.

If blocked, using headers, rotating proxies, or the Etsy API will be required.

==================================================================================================================================
# <div align="center">WEB SCRAPING</div>
==================================================================================================================================

### üßê QUESTIONS

- Which keywords in product titles and descriptions drive the most sales?

- Which product niches have the highest demand?

- What keywords improve search visibility on Etsy?

- When is the best period to sell based on review trends?

- Which price ranges generate the most sales?

- Which country's customers are buying the most of this product?

----

### üß∞ **Install for web scraping**

In [None]:
# install requests & beautifulsoup
!pip install requests beautifulsoup4 fake-useragent pandas

# install selenium
!pip install selenium pandas

---

### üìå **Avoid web BLOCKED**
| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                    | Etsy may block request        |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Bypasses bot protection, loads dynamic content | Slower, requires ChromeDriver |


---

## üìå **Product PAGE**

### ‚≠ê **Etsy Product Info**
The main data fields to extract from Etsy's product page :

| Field Name             | Python Data Type       | Concise Definition                      | Long Definition                                                                                       |
|------------------------|-----------------------|----------------------------------------|-------------------------------------------------------------------------------------------------------|
| **product_title**          | `str`                 | Product title                           | The full name of the product, same across all variants.                                              |
| **product_url**           | `str`                 | Short URL to product listing        | Etsy listing URL in the format `https://www.etsy.com/listing/product_id/` (e.g., "https://www.etsy.com/listing/1289965137/"). |
| **product_id**             | `str`                 | Unique product ID                       | Extracted from `product_url`; a unique identifier assigned by Etsy for each listing.               |
| **product_txt**             | `str`                 | Unique product TEXT                    | Extracted after loading the page using the `product_url`; a unique text identifier assigned by Etsy for each listing that only shows once product listing is selected.               |
| **var_extension**          | `str`                 | Variations(s) URL extension             | The variation(s) extension added to `product_url` to generate `var_url`.                        |
| **var_url**               | `str`                 | Full URL to a variant product                      | Complete link to a specific variant, formed by appending `var_extension` to `product_url`.          |
| **product_options**        | `List[dict]`          | Available product options / eg variation0               | List of all product options (size, color, material, etc.), each stored as a dictionary.             |
| **product_var**            | `dict`                | Variant‚Äôs selected options              | Dictionary representing the specific option(s) chosen for this variant.                              |
| **var_current_price**      | `float` or `Decimal`  | Current price for variant               | Price of this variant after applying any discounts.                                                  |
| **var_old_price**          | `float` or `Decimal`  | Original price for variant              | Price of this variant before any discounts were applied (if available).                              |
| **var_discount_percentage**| `float`               | Variant discount percentage             | Discount applied to this variant, calculated if both current and old prices are available.           |
| **product_rating**         | `float`               | Average product rating                  | Average rating of the product out of 5 (e.g., 4.5).                                                 |
| **txt_reviews**            | `str`                 | Concatenated review text                | All review texts or summary text for the product; may include the number of reviews in parentheses. |
| **nbr_reviews**            | `int`                 | Total number of reviews                 | Total count of reviews received by the product.                                                     |
| **listed_date**            | `date`                | Date product was first listed           | The date the product was originally published on Etsy.                                               |
| **product_description**            | `str`                | Product's description           | The text content of the description of the product.                                               |



variation0=5886526755

variation0=5886526755 & variation1=5886526755


https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage

https://www.etsy.com/listing/product_id/product_txt/product_var


product_var  ?variation0=&variation1=variation2=

---

### ‚≠ê **Insighted Data `product_niche` from `product_title` and `product_description`**

| Field Name                 | Python Data Type       | Concise Definition                               |
|---------------------------|-------------------------|---------------------------------------------------|
| **product_niche**             | `str`                     | Product theme or genre (comedy, anime‚Ä¶) based on `product_title` & `product_description`.         |

---

### ‚≠ê **Etsy Product Reviews (Extra dataset)**

All of the product Reviews `Comment`, `Rating`, and `Date` when each review was posted

| Field Name               | Python Data Type | Concise Definition                                      |
|--------------------------|------------------|----------------------------------------------------------|
| **review_product_var**   | `str`            | The specific product variant purchased by the reviewer   |
| **review_rating**        | `float`          | The rating the customer gave the product                 |
| **review_comment**       | `str`            | The text comment the customer wrote                      |
| **review_date**          | `date`           | The date when the customer posted the review             |
| **review_profile_url**   | `str`            | URL to the reviewer's profile page                       |
| **review_username**      | `str`            | The username of the reviewer extracted from reviewer_profile_url |
| **review_country**       | `str`            | The reviewer's country/location                          |



---

# üìå **REPLACEABLE CODE**

### REPLACED 11

In [21]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote

# --------------------------
# Chrome / undetected_chromedriver options
# --------------------------
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
# Spoof UA
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = uc.Chrome(options=options)

# --------------------------
# Helpers
# --------------------------
def extract_nbr_reviews(txt: str) -> int:
    """
    Extract integer number of reviews from strings like:
    - "Reviews for this item (39)"
    - "Avis sur cet article (1,4 K)"
    Handles spaces, commas, and 'K' as thousands.
    """
    if not txt:
        return 0
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if not match:
        return 0
    value = match.group(1).strip()
    # normalize: remove spaces, convert comma decimal to dot
    value = value.replace(" ", "").replace(",", ".")
    try:
        if "K" in value.upper():
            return int(float(value.upper().replace("K", "")) * 1000)
        return int(float(value))
    except Exception:
        # fallback: extract digits only
        digits = ''.join(ch for ch in value if ch.isdigit())
        return int(digits) if digits else 0

def safe_decimal(s):
    """Convert string to Decimal safely, return None on failure."""
    if s is None:
        return None
    try:
        s = str(s).strip()
        # normalize non-breaking spaces
        s = s.replace("\u00A0", " ")
        # remove everything except digits, comma, dot, minus
        s_norm = re.sub(r"[^\d,.\-]", "", s)
        # if both comma and dot present, assume dot is decimal and remove commas
        if "," in s_norm and "." in s_norm:
            s_norm = s_norm.replace(",", "")
        # if only comma present, convert to dot
        elif "," in s_norm and "." not in s_norm:
            s_norm = s_norm.replace(",", ".")
        # edge case: empty
        if s_norm in ("", ".", "-", "-.", None):
            return None
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

# mapping symbol -> ISO (fallback)
CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "‚Ç¨": "EUR",
    "¬£": "GBP",
    "¬•": "JPY",
    "‚Çπ": "INR",
}

def detect_currency(ld_data: dict, soup: BeautifulSoup) -> str:
    """
    Return currency ISO code (e.g., "USD", "EUR").
    1) Try JSON-LD offers.priceCurrency
    2) Fallback: inspect visible price text for symbol and map it
    """
    # 1) JSON-LD
    try:
        if isinstance(ld_data, dict):
            offers = ld_data.get("offers", {})
            if isinstance(offers, dict):
                iso = offers.get("priceCurrency") or offers.get("currency")
                if iso and isinstance(iso, str) and len(iso) >= 2:
                    return iso.upper()
    except Exception:
        pass

    # 2) visible price search (attempt)
    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "p[class*='wt-text-title']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break

    if not price_text:
        # look for any text with currency-like chars
        candidate = soup.find(text=re.compile(r'[\$\‚Ç¨\¬£\¬•\‚Çπ]'))
        if candidate:
            price_text = candidate.strip()

    if price_text:
        # try multi-char symbols first
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                return CURRENCY_MAP[sym]
        # single char
        m = re.search(r'([$\‚Ç¨\¬£\¬•\‚Çπ])', price_text)
        if m:
            return CURRENCY_MAP.get(m.group(1), None)

    return None

# --------------------------
# Main extractor
# --------------------------
def extract_etsy_product(product_url: str) -> dict:
    """
    Extract product info from an Etsy listing URL.
    product_id and product_txt are parsed strictly from the input product_url.
    """
    # parse product_id/product_txt from the input URL
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if len(path_parts) >= 3 and path_parts[0] == "listing":
        product_id = path_parts[1]
        product_txt = unquote(path_parts[2])
    else:
        product_id = ""
        product_txt = ""

    var_extension = parsed.query if parsed.query else None

    # load page
    driver.get(product_url)

    # wait for the page to load the main title (or fallback)
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except Exception:
        # fallback short wait
        time.sleep(2)

    # human-like delay
    time.sleep(random.uniform(1.5, 3.0))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # JSON-LD (structured data)
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except Exception:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields (from LD where possible)
    product_title = (ld_data.get("name") if isinstance(ld_data, dict) else "") or ""
    product_description = (ld_data.get("description") if isinstance(ld_data, dict) else "") or ""

    # PRICING: attempt JSON-LD first
    current_price = None
    old_price = None
    var_discount_percentage = None

    try:
        offers = ld_data.get("offers", {}) if isinstance(ld_data, dict) else {}
        if isinstance(offers, dict):
            price_val = offers.get("price")
            if price_val is not None:
                current_price = safe_decimal(price_val)
    except Exception:
        current_price = None

    # visible price fallback
    if current_price is None:
        price_text = None
        price_sel_candidates = [
            "p[data-buy-box-region='price']",
            "p[data-testid='listing-page-price']",
            "p[class*='wt-text-title']",
            "span[class*='currency-value']",
            "div[data-region='price']",
            "span[class*='wt-price']",
        ]
        for sel in price_sel_candidates:
            el = soup.select_one(sel)
            if el and el.get_text(strip=True):
                price_text = el.get_text(" ", strip=True)
                break
        if not price_text:
            # brute force: find text with currency+digits
            el = soup.find(text=re.compile(r'[\$\‚Ç¨\¬£\¬•\‚Çπ]\s*\d')) or soup.find(text=re.compile(r'\d[\d\.,\s]*[\$\‚Ç¨\¬£\¬•\‚Çπ]'))
            if el:
                price_text = el.strip()
        if price_text:
            m = re.search(r'([‚Ç¨$¬£¬•‚Çπ]?\s*[\d\.,\s]+(?:[‚Ç¨$¬£¬•‚Çπ])?)', price_text)
            if m:
                current_price = safe_decimal(m.group(0))

    # OLD price detection (strikethrough or <del>)
    try:
        strike = soup.select_one(".wt-text-strikethrough, .wt-text-strike, del")
        if strike and strike.get_text(strip=True):
            old_price = safe_decimal(strike.get_text(" ", strip=True))
    except Exception:
        old_price = None

    # if old price found but no current price, attempt alternative extraction
    if old_price and current_price is None:
        price_nodes = soup.find_all(text=re.compile(r'[\d\.,\s]'))
        for t in price_nodes:
            txt = t.strip()
            if not txt:
                continue
            if strike and txt in strike.get_text(" ", strip=True):
                continue
            val = safe_decimal(txt)
            if val:
                current_price = val
                break

    # discount %
    try:
        if old_price and current_price:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
        else:
            var_discount_percentage = None
    except Exception:
        var_discount_percentage = None

    # currency ISO
    currency_iso = detect_currency(ld_data, soup)

    # product options (variations) - keep as list of dicts
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            variants_json = json.loads(variants_script.string)
            variations = variants_json.get("listing", {}).get("variations", [])
            for v in variations:
                product_options.append({v.get("property_name"): v.get("options")})
    except Exception:
        product_options = []

    # ratings & reviews
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews_el = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews_el.text.strip() if txt_reviews_el else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except Exception:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # listed date
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except Exception:
        listed_date = None

    # return structured dict (currency field named "currency")
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "product_txt": product_txt,
        "var_extension": var_extension,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": float(current_price) if current_price is not None else None,
        "var_old_price": float(old_price) if old_price is not None else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage is not None else None,
        "currency": currency_iso,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# --------------------------
# Run scraper for a list of product URLs and save CSV
# --------------------------
if __name__ == "__main__":
    product_urls = [
        # add your product URLs here
        "https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage",
        "https://www.etsy.com/fr/listing/4301871513/sac-fourre-tout-en-toile-personnalise",
        "https://www.etsy.com/fr/listing/1574067774/sac-fourre-tout-magique-livre-lapins"
    ]

    all_products = []
    for url in product_urls:
        print("Scraping:", url)
        data = extract_etsy_product(url)
        if data:
            all_products.append(data)
        # polite delay
        time.sleep(random.uniform(2, 5))

    df = pd.DataFrame(all_products)
    df.to_csv("../data/raw/12_extracted_data.csv", index=False)

# return dataframe head (no print)
df.head()


Scraping: https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage
Scraping: https://www.etsy.com/fr/listing/4301871513/sac-fourre-tout-en-toile-personnalise


  price_nodes = soup.find_all(text=re.compile(r'[\d\.,\s]'))


Scraping: https://www.etsy.com/fr/listing/1574067774/sac-fourre-tout-magique-livre-lapins


Unnamed: 0,product_title,product_url,product_id,product_txt,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Sac fourre-tout boh√®me √† fleurs brod√©es en mar...,https://www.etsy.com/listing/1716154949/boho-e...,1716154949.0,boho-embroidered-floral-tote-bag-in-sage,,,[],,26.0,37.14,29.994615,USD,5.0,Avis sur cet article (218),218,,Nous sommes ravis de vous pr√©senter une nouvel...
1,"Sac fourre-tout en toile personnalis√©, sac fou...",https://www.etsy.com/fr/listing/4301871513/sac...,,,,,[],,0.09,0.26,65.384615,USD,4.8,"Avis sur cet article (1,3 K)",1300,,Sacs fourre-tout personnalis√©s | Votre parfait...
2,Sac fourre-tout magique livre lapins - Cottage...,https://www.etsy.com/fr/listing/1574067774/sac...,,,,,[],,17.49,24.99,30.012005,USD,4.9,Avis sur cet article (99),99,,Sac fourre-tout Magical Book Bunnies : enchant...


In [16]:
df.head()

Unnamed: 0,product_title,product_url,product_id,product_txt,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,currency,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Sac fourre-tout boh√®me √† fleurs brod√©es en mar...,https://www.etsy.com/listing/1716154949/boho-e...,1716154949,boho-embroidered-floral-tote-bag-in-sage,,,[],,28.38,40.54,29.995067,EUR,5.0,Avis sur cet article (218),218,,Nous sommes ravis de vous pr√©senter une nouvel...


### REPLACED 10

In [14]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal, InvalidOperation
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
# Spoof UA
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPERS
# ==========================
def extract_nbr_reviews(txt):
    """
    Extract integer number of reviews from strings like:
    - "Reviews for this item (39)"
    - "Avis sur cet article (1,4 K)"
    Handles spaces, commas, and 'K' as thousands.
    """
    if not txt:
        return 0
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if match:
        value = match.group(1).strip()
        value = value.replace(" ", "").replace(",", ".")
        try:
            if "K" in value.upper():
                return int(float(value.upper().replace("K", "")) * 1000)
            return int(float(value))
        except:
            # fallback: extract digits only
            digits = ''.join(ch for ch in value if ch.isdigit())
            return int(digits) if digits else 0
    return 0

def safe_decimal(s):
    """ Convert string to Decimal safely, return None on failure """
    if s is None:
        return None
    try:
        # remove currency symbols and non numeric characters except . and ,
        s = s.strip()
        # replace non-breaking spaces
        s = s.replace("\u00A0", " ")
        # remove grouping commas/spaces, convert comma decimal to dot if needed
        s_norm = re.sub(r'[^\d,.\-]', '', s)
        # if there are both comma and dot, assume dot is decimal, remove commas
        if ',' in s_norm and '.' in s_norm:
            s_norm = s_norm.replace(',', '')
        # if only comma exists, replace with dot
        elif ',' in s_norm and '.' not in s_norm:
            s_norm = s_norm.replace(',', '.')
        return Decimal(s_norm)
    except (InvalidOperation, ValueError):
        return None

# Map visible currency symbol to ISO
CURRENCY_MAP = {
    "$": "USD", "US$": "USD",
    "CA$": "CAD", "C$": "CAD",
    "AU$": "AUD", "A$": "AUD",
    "‚Ç¨": "EUR",
    "¬£": "GBP",
    "¬•": "JPY",
    "‚Çπ": "INR",
}

def detect_currency_symbol_and_iso(ld_data, soup):
    """
    Try JSON-LD first for ISO, then visible price for symbol.
    Returns (currency_symbol, currency_iso)
    """
    currency_iso = None
    currency_symbol = None

    # JSON-LD priceCurrency (ISO code)
    try:
        offers = ld_data.get("offers", {}) if isinstance(ld_data, dict) else {}
        if isinstance(offers, dict):
            currency_iso = offers.get("priceCurrency") or offers.get("currency") or None
    except:
        currency_iso = None

    # Visible price text (try a few selectors)
    price_text = None
    selectors = [
        "p[data-buy-box-region='price']",
        "p[data-testid='listing-page-price']",
        "p[class*='wt-text-title']",
        "span[class*='currency-value']",
        "span[class*='wt-price']",
    ]
    for sel in selectors:
        tag = soup.select_one(sel)
        if tag and tag.get_text(strip=True):
            price_text = tag.get_text(" ", strip=True)
            break

    # fallback: first occurrence of currency-looking text
    if not price_text:
        # find any element containing currency symbols
        candidate = soup.find(text=re.compile(r'[\$\‚Ç¨\¬£\¬•\‚Çπ]'))
        if candidate:
            price_text = candidate.strip()

    # extract symbol from price_text
    if price_text:
        # look for multi-char symbols first (CA$, AU$)
        for sym in sorted(CURRENCY_MAP.keys(), key=len, reverse=True):
            if sym in price_text:
                currency_symbol = sym
                break
        # if not found, try single-char
        if not currency_symbol:
            m = re.search(r'([$\‚Ç¨\¬£\¬•\‚Çπ])', price_text)
            if m:
                currency_symbol = m.group(1)

    # derive ISO if missing
    if not currency_iso and currency_symbol:
        currency_iso = CURRENCY_MAP.get(currency_symbol)

    return currency_symbol, currency_iso

def extract_image_urls(soup):
    """
    Return list of image URLs and main_image.
    Tries og:image, and gallery images.
    """
    images = []

    # og:image
    og = soup.find("meta", {"property": "og:image"})
    if og and og.get("content"):
        images.append(og["content"])

    # Look for image gallery thumbnails
    # Etsy often uses data-src or src on image tags within .carousel or .wt-list-unstyled
    # gather unique urls
    for img in soup.find_all("img"):
        # prefer data-src or srcset fallback to src
        url = img.get("data-src") or img.get("data-original") or img.get("src") or img.get("data-lazy-src")
        if url and url.startswith("http"):
            images.append(url)

    # dedupe preserving order
    seen = set()
    out = []
    for u in images:
        if u not in seen:
            seen.add(u)
            out.append(u)

    main_image = out[0] if out else None
    return out, main_image

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(product_url):
    # Extract product_id and product_txt from the input URL (strictly from the URL)
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if len(path_parts) >= 3 and path_parts[0] == "listing":
        product_id = path_parts[1]
        product_txt = unquote(path_parts[2])
    else:
        product_id = ""
        product_txt = ""

    # var_extension is the query string
    var_extension = parsed.query if parsed.query else None

    # Load page
    driver.get(product_url)

    # Wait for product title (or a fallback element)
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        # fallback short wait; still proceed
        time.sleep(2)

    # human-like delay
    time.sleep(random.uniform(1.5, 3.5))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # JSON-LD
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "") if isinstance(ld_data, dict) else ""
    product_description = ld_data.get("description", "") if isinstance(ld_data, dict) else ""

    # PRICING: current price from JSON-LD if available, otherwise parse visible price
    current_price = None
    old_price = None
    var_discount_percentage = None

    # try JSON-LD first
    try:
        offers = ld_data.get("offers", {}) if isinstance(ld_data, dict) else {}
        if isinstance(offers, dict):
            price_val = offers.get("price")
            if price_val is not None:
                current_price = safe_decimal(str(price_val))
            # sometimes offers has priceValidUntil or priceSpecification with highPrice/lowPrice
    except:
        current_price = None

    # visible current price fallback
    if current_price is None:
        # look for price elements
        price_text = None
        price_sel_candidates = [
            "p[data-buy-box-region='price']",
            "p[data-testid='listing-page-price']",
            "p[class*='wt-text-title']",
            "span[class*='currency-value']",
            "div[data-region='price']",
            "span[class*='wt-price']",
        ]
        for sel in price_sel_candidates:
            el = soup.select_one(sel)
            if el and el.get_text(strip=True):
                price_text = el.get_text(" ", strip=True)
                break
        if not price_text:
            # brute force: find element that contains currency sign + digits
            el = soup.find(text=re.compile(r'[\d][\d\.,\s]*[‚Ç¨$¬£¬•‚Çπ]') ) or soup.find(text=re.compile(r'[\$\‚Ç¨\¬£\¬•\‚Çπ]\s*\d'))
            if el:
                price_text = el.strip()
        if price_text:
            # extract first money-looking substring
            m = re.search(r'([‚Ç¨$¬£¬•‚Çπ]?\s*[\d\.,\s]+(?:[‚Ç¨$¬£¬•‚Çπ])?)', price_text)
            if m:
                current_price = safe_decimal(m.group(0))

    # OLD PRICE detection (strikethrough prices)
    try:
        # Etsy strikethrough price often uses class 'wt-text-strikethrough' or 'wt-display-inline'
        strike = soup.select_one(".wt-text-strikethrough, .wt-text-strike, .wt-text-strikethrough span, .wt-text-body-01 .wt-text-strikethrough")
        if strike and strike.get_text(strip=True):
            old_price = safe_decimal(strike.get_text(" ", strip=True))
        else:
            # alternative: look for <del> tags
            del_tag = soup.find("del")
            if del_tag and del_tag.get_text(strip=True):
                old_price = safe_decimal(del_tag.get_text(" ", strip=True))
    except:
        old_price = None

    # If old_price found but current_price missing, attempt to find another visible price as current
    if old_price and current_price is None:
        # try find price that is NOT strike
        price_nodes = soup.find_all(text=re.compile(r'[\d\.,\s]'))
        for t in price_nodes:
            txt = t.strip()
            if not txt:
                continue
            if strike and txt in strike.get_text(" ", strip=True):
                continue
            # attempt conversion
            val = safe_decimal(txt)
            if val:
                current_price = val
                break

    # compute discount %
    try:
        if old_price and current_price:
            var_discount_percentage = float(((old_price - current_price) / old_price) * 100)
        else:
            var_discount_percentage = None
    except:
        var_discount_percentage = None

    # CURRENCY
    currency_symbol, currency_txt = detect_currency_symbol_and_iso(ld_data, soup)

    # IMAGES
    image_urls, main_image = extract_image_urls(soup)

    # VARIANTS: keep product_options (names + options) but not expanding to rows
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        if variants_script:
            variants_json = json.loads(variants_script.string)
            variations = variants_json.get("listing", {}).get("variations", [])
            for v in variations:
                product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # RATINGS & REVIEWS
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        if rating_div:
            txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0")
            txt_reviews = txt_reviews.text.strip() if txt_reviews else ""
            nbr_reviews = extract_nbr_reviews(txt_reviews)
            rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
        else:
            txt_reviews = ""
            nbr_reviews = 0
            product_rating = 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # LISTED DATE
    try:
        date_meta = soup.find("meta", {"property": "og:updated_time"})
        if date_meta and date_meta.get("content"):
            listed_date = datetime.strptime(date_meta["content"], "%Y-%m-%dT%H:%M:%S%z").date()
        else:
            listed_date = None
    except:
        listed_date = None

    # RETURN DICT
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "product_txt": product_txt,
        "var_extension": var_extension,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": float(current_price) if current_price is not None else None,
        "var_old_price": float(old_price) if old_price is not None else None,
        "var_discount_percentage": float(var_discount_percentage) if var_discount_percentage is not None else None,
        "currency_symbol": currency_symbol,
        "currency_txt": currency_txt,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "image_urls": image_urls,
        "main_image": main_image,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    # put the product URLs you want to scrape here
    "https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)
    time.sleep(random.uniform(2, 5))

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/10_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage


Unnamed: 0,product_title,product_url,product_id,product_txt,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,currency_symbol,currency_txt,product_rating,txt_reviews,nbr_reviews,image_urls,main_image,listed_date,product_description
0,Sac fourre-tout boh√®me √† fleurs brod√©es en mar...,https://www.etsy.com/listing/1716154949/boho-e...,1716154949,boho-embroidered-floral-tote-bag-in-sage,,,[],,28.38,40.54,29.995067,,EUR,5.0,Avis sur cet article (218),218,[https://i.etsystatic.com/31695446/r/il/776cdb...,https://i.etsystatic.com/31695446/r/il/776cdb/...,,Nous sommes ravis de vous pr√©senter une nouvel...


---

### **REPLACED 09 (CURRENT PRICE + PRODUCT URL + PRODUCT ID + PRODUCT TXT + PRODUCT RATING + TXT REVIEWS + NBR REVIEWS + PRODUCT DESCRIPTION)**

In [12]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPER FUNCTION
# ==========================
def extract_nbr_reviews(txt):
    """
    Extracts the number of reviews from strings like:
    - "Reviews for this item (39)"
    - "Avis sur cet article (1,4 K)"
    Handles spaces, commas, and 'K' as thousands.
    Returns an int.
    """
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if match:
        value = match.group(1).strip()
        value = value.replace(" ", "").replace(",", ".")
        if "K" in value.upper():
            value = float(value.upper().replace("K", "")) * 1000
        else:
            value = float(value)
        return int(value)
    return 0

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(product_url):
    # ==========================
    # EXTRACT product_id AND product_txt FROM THE URL
    # ==========================
    parsed = urlparse(product_url)
    path_parts = parsed.path.strip("/").split("/")

    if len(path_parts) >= 3 and path_parts[0] == "listing":
        product_id = path_parts[1]
        product_txt = unquote(path_parts[2])
    else:
        product_id = ""
        product_txt = ""

    # var_extension is query string
    var_extension = parsed.query if parsed.query else None

    driver.get(product_url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", product_url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")

    # -----------------------------
    # VARIANTS
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING & REVIEWS
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        # txt_reviews
        txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0").text.strip()
        # nbr_reviews
        nbr_reviews = extract_nbr_reviews(txt_reviews)

        # Average rating
        rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
        product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "product_txt": product_txt,
        "var_extension": var_extension,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)
    time.sleep(random.uniform(2, 5))

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/09_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage


Unnamed: 0,product_title,product_url,product_id,product_txt,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Sac fourre-tout boh√®me √† fleurs brod√©es en mar...,https://www.etsy.com/listing/1716154949/boho-e...,1716154949,boho-embroidered-floral-tote-bag-in-sage,,,[],,28.38,,,5.0,Avis sur cet article (218),218,,Nous sommes ravis de vous pr√©senter une nouvel...


### REPLACED 08

In [11]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPER FUNCTION
# ==========================
def extract_nbr_reviews(txt):
    """
    Extracts the number of reviews from strings like:
    - "Reviews for this item (39)"
    - "Avis sur cet article (1,4 K)"
    Handles spaces, commas, and 'K' as thousands.
    Returns an int.
    """
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if match:
        value = match.group(1).strip()
        value = value.replace(" ", "").replace(",", ".")  # remove spaces and fix decimal
        if "K" in value.upper():
            value = float(value.upper().replace("K", "")) * 1000
        else:
            value = float(value)
        return int(value)
    return 0

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    # After page loads, get the final URL to extract product_id and product_txt
    final_url = driver.current_url
    parsed = urlparse(final_url)
    path_parts = parsed.path.strip("/").split("/")

    # ==========================
    # FIXED product_id AND product_txt
    # ==========================
    if len(path_parts) >= 3 and path_parts[0] == "listing":
        product_id = path_parts[1]
        product_txt = unquote(path_parts[2])  # decode URL encoding
    else:
        product_id = ""
        product_txt = ""

    # var_extension is everything after the path (query string)
    var_extension = parsed.query if parsed.query else None

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_url = final_url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING & REVIEWS
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        # txt_reviews
        txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0").text.strip()
        # nbr_reviews
        nbr_reviews = extract_nbr_reviews(txt_reviews)

        # Average rating
        rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
        product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "product_txt": product_txt,
        "var_extension": var_extension,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Avoid being blocked

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/06_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage


Unnamed: 0,product_title,product_url,product_id,product_txt,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Sac fourre-tout boh√®me √† fleurs brod√©es en mar...,https://www.etsy.com/fr/listing/1716154949/sac...,,,,,[],,28.38,,,5.0,Avis sur cet article (218),218,,Nous sommes ravis de vous pr√©senter une nouvel...


### REPLACED 07 EEEH?

In [10]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random
import re
from urllib.parse import urlparse, unquote

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPER FUNCTION
# ==========================
def extract_nbr_reviews(txt):
    """
    Extracts the number of reviews from strings like:
    - "Reviews for this item (39)"
    - "Avis sur cet article (1,4 K)"
    Handles spaces, commas, and 'K' as thousands.
    Returns an int.
    """
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if match:
        value = match.group(1).strip()
        value = value.replace(" ", "").replace(",", ".")  # remove spaces and fix decimal
        if "K" in value.upper():
            value = float(value.upper().replace("K", "")) * 1000
        else:
            value = float(value)
        return int(value)
    return 0

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    # After page loads, get the current URL to extract product_txt
    final_url = driver.current_url
    parsed = urlparse(final_url)

    # Extract path parts: /listing/<product_id>/<product_txt>
    path_parts = parsed.path.strip("/").split("/")
    product_id = path_parts[1] if len(path_parts) > 1 else ""
    product_txt = path_parts[2] if len(path_parts) > 2 else ""
    product_txt = unquote(product_txt)  # decode URL encoding

    # var_extension is everything after the path (query + fragment)
    var_extension = parsed.query if parsed.query else None

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_url = final_url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING & REVIEWS
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        # txt_reviews
        txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0").text.strip()
        # nbr_reviews
        nbr_reviews = extract_nbr_reviews(txt_reviews)

        # Average rating
        rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
        product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "product_txt": product_txt,
        "var_extension": var_extension,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Avoid being blocked

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/07_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1716154949/boho-embroidered-floral-tote-bag-in-sage


Unnamed: 0,product_title,product_url,product_id,product_txt,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Sac fourre-tout boh√®me √† fleurs brod√©es en mar...,https://www.etsy.com/fr/listing/1716154949/sac...,listing,1716154949,,,[],,28.38,,,5.0,Avis sur cet article (218),218,,Nous sommes ravis de vous pr√©senter une nouvel...


### REPLACED 06

In [9]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random
import re

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPER FUNCTION
# ==========================
def extract_nbr_reviews(txt):
    """
    Extracts the number of reviews from strings like:
    - "Reviews for this item (39)"
    - "Avis sur cet article (1,4 K)"
    Handles spaces, commas, and 'K' as thousands.
    Returns an int.
    """
    match = re.search(r'\(([\d\.,\sKk]+)\)', txt)
    if match:
        value = match.group(1).strip()
        value = value.replace(" ", "").replace(",", ".")  # remove spaces and fix decimal
        if "K" in value.upper():
            value = float(value.upper().replace("K", "")) * 1000
        else:
            value = float(value)
        return int(value)
    return 0

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_id = url.rstrip('/').split("/")[-1]
    product_url = url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING & REVIEWS
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        # txt_reviews
        txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0").text.strip()
        # nbr_reviews
        nbr_reviews = extract_nbr_reviews(txt_reviews)

        # Average rating
        rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
        product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "var_extension": None,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1310361624"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Important to avoid suspicion

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/06_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1310361624


Unnamed: 0,product_title,product_url,product_id,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,"Sac fourre-tout en toile brod√©e personnalis√©, ...",https://www.etsy.com/listing/1310361624,1310361624,,,[],,,,,4.8,"Avis sur cet article (1,4 K)",1400,,-Le texte mesure 5 pouces de longueur. Plus le...


### replaced 05

In [7]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random
import re

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPER FUNCTION
# ==========================
def extract_nbr_reviews(txt):
    """
    Extracts the number of reviews from a string like "Reviews for this item (39)".
    Handles 'K' as thousands.
    Returns an int.
    """
    match = re.search(r'\(([\d\.Kk]+)\)', txt)
    if match:
        value = match.group(1)
        if "K" in value.upper():
            value = float(value.upper().replace("K", "")) * 1000
        return int(value)
    return 0

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_id = url.rstrip('/').split("/")[-1]
    product_url = url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        # txt_reviews
        txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0").text.strip()
        # nbr_reviews
        nbr_reviews = extract_nbr_reviews(txt_reviews)

        # Average rating
        rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
        product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "var_extension": None,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1310361624"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Important to avoid suspicion

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/05_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1310361624


Unnamed: 0,product_title,product_url,product_id,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,"Sac fourre-tout en toile brod√©e personnalis√©, ...",https://www.etsy.com/listing/1310361624,1310361624,,,[],,,,,4.8,"Avis sur cet article (1,4 K)",0,,-Le texte mesure 5 pouces de longueur. Plus le...


### replaced 04

In [6]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random
import re

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# HELPER FUNCTION
# ==========================
def extract_nbr_reviews(txt):
    """
    Extracts the number of reviews from a string like "Reviews for this item (39)".
    Handles 'K' as thousands.
    Returns an int.
    """
    match = re.search(r'\(([\d\.Kk]+)\)', txt)
    if match:
        value = match.group(1)
        if "K" in value.upper():
            value = float(value.upper().replace("K", "")) * 1000
        return int(value)
    return 0

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_id = url.rstrip('/').split("/")[-1]
    product_url = url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="reviews-header appears-ready")
        # txt_reviews
        txt_reviews = rating_div.find("h2", class_="review-header-text wt-mt-xs-2 wt-mt-lg-0").text.strip()
        # nbr_reviews
        nbr_reviews = extract_nbr_reviews(txt_reviews)

        # Average rating
        rating_value_tag = rating_div.find("span", class_="wt-text-heading-large")
        product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0
    except:
        txt_reviews = ""
        nbr_reviews = 0
        product_rating = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "var_extension": None,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1289965137/"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Important to avoid suspicion

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/04_extracted_data.csv", index=False)

df.head()


Scraping: https://www.etsy.com/listing/1289965137/


Unnamed: 0,product_title,product_url,product_id,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Tote Bag Pr√©nom Personnalis√© - Id√©al pour Cade...,https://www.etsy.com/listing/1289965137/,1289965137,,,[],,,,,5.0,Avis sur cet article (39),39,,Un tote bag √©l√©gant √† personnaliser avec l‚Äôini...


### replace 03

In [4]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")

# Spoof user profile to look real
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_id = url.rstrip('/').split("/")[-1]
    product_url = url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # PRODUCT-SPECIFIC RATING
    # -----------------------------
    try:
        rating_div = soup.find("div", class_="wt-display-flex-xs wt-align-items-center wt-flex-wrap wt-align-content-center wt-justify-content-center wt-flex-direction-column-xs")
        if rating_div:
            # Rating value
            rating_value_tag = rating_div.find("p", class_="wt-text-heading-large")
            product_rating = float(rating_value_tag.text.strip()) if rating_value_tag else 0

            # Number of ratings/reviews
            nbr_reviews_tag = rating_div.find_next("p", class_="wt-text-body-smaller wt-sem-text-secondary")
            nbr_reviews_text = nbr_reviews_tag.text.strip() if nbr_reviews_tag else "0"
            nbr_reviews = int(''.join(filter(str.isdigit, nbr_reviews_text)))
        else:
            product_rating = 0
            nbr_reviews = 0
    except:
        product_rating = 0
        nbr_reviews = 0

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "var_extension": None,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": "",  # Optional: scraping full reviews requires scrolling
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }

# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1289965137/"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Important to avoid suspicion

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/03_extracted_data.csv", index=False)

print("‚úî Data extraction complete!")
print(df.head())


Scraping: https://www.etsy.com/listing/1289965137/
‚úî Data extraction complete!
                                       product_title  \
0  Tote Bag Pr√©nom Personnalis√© - Id√©al pour Cade...   

                                product_url  product_id var_extension var_url  \
0  https://www.etsy.com/listing/1289965137/  1289965137          None    None   

  product_options product_var var_current_price var_old_price  \
0              []        None              None          None   

  var_discount_percentage  product_rating txt_reviews  nbr_reviews  \
0                    None               0                        0   

  listed_date                                product_description  
0        None  Un tote bag √©l√©gant √† personnaliser avec l‚Äôini...  


In [5]:
df.head()

Unnamed: 0,product_title,product_url,product_id,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Tote Bag Pr√©nom Personnalis√© - Id√©al pour Cade...,https://www.etsy.com/listing/1289965137/,1289965137,,,[],,,,,0,,0,,Un tote bag √©l√©gant √† personnaliser avec l‚Äôini...


### replaced 02 (extracted store_rating and store_reviews not the product_rating, txt_nbr, nbr_reviews )

In [None]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
from decimal import Decimal
from datetime import datetime
import time
import random

# ==========================
# CHROME OPTIONS
# ==========================
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("--start-maximized")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")

# Spoof user profile to look real
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

driver = uc.Chrome(options=options)

# ==========================
# SCRAPER FUNCTION
# ==========================
def extract_etsy_product(url):
    driver.get(url)

    # Wait for the product title to appear
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-buy-box-listing-title]"))
        )
    except:
        print("‚ö† Page did not load properly:", url)
        return None

    # Random human-like delay
    time.sleep(random.uniform(2, 4))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # -----------------------------
    # JSON-LD DATA
    # -----------------------------
    json_ld_tag = soup.find("script", type="application/ld+json")
    if json_ld_tag:
        try:
            ld_data = json.loads(json_ld_tag.text.strip())
        except:
            ld_data = {}
    else:
        ld_data = {}

    # Basic fields
    product_title = ld_data.get("name", "")
    product_description = ld_data.get("description", "")
    product_id = url.rstrip('/').split("/")[-1]
    product_url = url

    # -----------------------------
    # VARIANTS (if any)
    # -----------------------------
    product_options = []
    try:
        variants_script = soup.find("script", {"id": "listing-page-data"})
        variants_json = json.loads(variants_script.string)

        variations = variants_json.get("listing", {}).get("variations", [])
        for v in variations:
            product_options.append({v.get("property_name"): v.get("options")})
    except:
        product_options = []

    # -----------------------------
    # PRICING
    # -----------------------------
    try:
        current_price = Decimal(ld_data["offers"]["price"])
    except:
        current_price = None

    old_price = None
    var_discount_percentage = None

    # -----------------------------
    # RATING
    # -----------------------------
    rating_info = ld_data.get("aggregateRating", {})
    product_rating = float(rating_info.get("ratingValue", 0))
    nbr_reviews = int(rating_info.get("reviewCount", 0))
    txt_reviews = ""

    # -----------------------------
    # LISTED DATE
    # -----------------------------
    try:
        date_str = soup.find("meta", {"property": "og:updated_time"})["content"]
        listed_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S%z").date()
    except:
        listed_date = None

    # -----------------------------
    # RESULT
    # -----------------------------
    return {
        "product_title": product_title,
        "product_url": product_url,
        "product_id": product_id,
        "var_extension": None,
        "var_url": None,
        "product_options": product_options,
        "product_var": None,
        "var_current_price": current_price,
        "var_old_price": old_price,
        "var_discount_percentage": var_discount_percentage,
        "product_rating": product_rating,
        "txt_reviews": txt_reviews,
        "nbr_reviews": nbr_reviews,
        "listed_date": listed_date,
        "product_description": product_description
    }


# ==========================
# SCRAPE ALL PRODUCTS
# ==========================
product_urls = [
    "https://www.etsy.com/listing/1289965137/"
]

all_products = []

for url in product_urls:
    print("Scraping:", url)
    data = extract_etsy_product(url)
    if data:
        all_products.append(data)

    time.sleep(random.uniform(2, 5))  # Important to avoid suspicion

# Save to CSV
df = pd.DataFrame(all_products)
df.to_csv("../data/raw/01_extracted_data.csv", index=False)

print("‚úî Data extraction complete!")
print(df.head())


Scraping: https://www.etsy.com/listing/1289965137/
‚úî Data extraction complete!
                                       product_title  \
0  Tote Bag Pr√©nom Personnalis√© - Id√©al pour Cade...   

                                product_url  product_id var_extension var_url  \
0  https://www.etsy.com/listing/1289965137/  1289965137          None    None   

  product_options product_var var_current_price var_old_price  \
0              []        None              None          None   

  var_discount_percentage  product_rating txt_reviews  nbr_reviews  \
0                    None             5.0                      369   

  listed_date                                product_description  
0        None  Un tote bag √©l√©gant √† personnaliser avec l‚Äôini...  


In [3]:
df.head()

Unnamed: 0,product_title,product_url,product_id,var_extension,var_url,product_options,product_var,var_current_price,var_old_price,var_discount_percentage,product_rating,txt_reviews,nbr_reviews,listed_date,product_description
0,Tote Bag Pr√©nom Personnalis√© - Id√©al pour Cade...,https://www.etsy.com/listing/1289965137/,1289965137,,,[],,,,,5.0,,369,,Un tote bag √©l√©gant √† personnaliser avec l‚Äôini...


### replaced 01

In [54]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("‚Ç¨", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()  # Full text
            match = re.search(r"\((\d+)\)", txt_reviews)
            nbr_reviews = int(match.group(1)) if match else 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "product_link": url,
                    "product_id": re.search(r"/listing/(\d+)", url).group(1),
                    "product_variant_url": url,
                    "product_title": title,
                    "Option": None,
                    "current_price": now_price,
                    "old_price": old_price,
                    "discount_percentage": percentage_difference_price,
                    "product_rating": rating,
                    "txt_reviews": txt_reviews,
                    "nbr_reviews": nbr_reviews
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'S√©lectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'S√©lectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        now_price, old_price, percentage_difference_price = get_prices(driver)

                        results.append({
                            "product_link": url,
                            "product_id": re.search(r"/listing/(\d+)", url).group(1),
                            "product_variant_url": f"{url}/{'_'.join(combo)}",
                            "product_title": title,
                            "Option": " | ".join(combo),
                            "current_price": now_price,
                            "old_price": old_price,
                            "discount_percentage": percentage_difference_price,
                            "product_rating": rating,
                            "txt_reviews": txt_reviews,
                            "nbr_reviews": nbr_reviews
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=2)
    df.to_csv("../data/raw/etsy_raw_data.csv", index=False)
    print("[SUCCESS] RAW DATA CSV saved!")


[INFO] Scraping product 1/2: https://www.etsy.com/fr/listing/4377096883/sac-fourre-tout-en-coton-matelasse-a?click_key=LT86d51480109cbb6e2573438fac1eeeea942e488f%3A4377096883&click_sum=c3e93e86&ls=a&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-435685-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&sts=1
[INFO] Scraping product 2/2: https://www.etsy.com/fr/listing/4377096883/sac-fourre-tout-en-coton-matelasse-a?click_key=LT86d51480109cbb6e2573438fac1eeeea942e488f%3A4377096883&click_sum=c3e93e86&ls=a&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-435685-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&sts=1
[SUCCESS] RAW DATA CSV saved!


In [55]:
df.head()

Unnamed: 0,product_link,product_id,product_variant_url,product_title,Option,current_price,old_price,discount_percentage,product_rating,txt_reviews,nbr_reviews
0,https://www.etsy.com/fr/listing/4377096883/sac...,4377096883,https://www.etsy.com/fr/listing/4377096883/sac...,Sac fourre-tout en coton matelass√© √† imprim√© j...,,52.43,52.43,,4.9259,,0
1,https://www.etsy.com/fr/listing/4377096883/sac...,4377096883,https://www.etsy.com/fr/listing/4377096883/sac...,Sac fourre-tout en coton matelass√© √† imprim√© j...,,52.43,52.43,,4.9259,,0


==================================================================================================================================
# <div align="center">DATA CLEANING & ANALYSIS</div>
==================================================================================================================================

#### üóÉÔ∏è **Raw data**

- Web scraped data saved in a DataFrame then a CSV file and uploaded to google drive
- The df_url has to be a downloadable link to the csv file from google drive
- We load the csv to use for data cleaning and analysis

In [None]:
import pandas as pd

# Load RAW DATA CSV
df_url = 'link to the dataFrame collected from scraping as a downloadable link from google drive'
df_etsy = pd.read_csv(df_url)

print("STEP 1 : RAW CSV loaded successfully!")
df_etsy.head()


==================================================================================================================================
# <div align="center">PLOTS</div>
==================================================================================================================================

### üìä PLOT 01:

In [None]:
# PLOT 1

### üìä PLOT 02:

In [None]:
# PLOT 2

### üìä PLOT 03:

In [None]:
# PLOT 3

### üìä PLOT 04:

In [None]:
# PLOT 4

### üìä PLOT 05:

In [None]:
# PLOT 5

==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### üß† INSIGHT 01:
Text

----

### üß† INSIGHT 02:
Text

---

### üß† INSIGHT 03:
Text


==================================================================================================================================