==================================================================================================================================
# <div align="center">PROJECT 03: Etsy Print-On-Demand Trends</div>
==================================================================================================================================

### 📝 BUSINESS IDEA

**Print-On-Demand (POD) business** on Etsy.

### ⚠️ PROBLEM

No free API provides the market data this project needs, so the data has to be gathered by web scraping.

### 🔰 SOLUTION FRAMEWORK

Scrape Etsy for a specific POD product.

Collect the data needed for cleaning and analysis.


| **Development**                                                                                                                                             | **Presentation**                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **Business Idea** → **Problem Definition** → **Data Research & Visualization** → **Insights** → **Interpretation** → **Implications** → **Business Impact** | **Limitations & Considerations** |

### 🧐 QUESTIONS

- Which keywords in product titles and descriptions drive the most sales?

- Which product niches have the highest demand?

- What keywords improve search visibility on Etsy?

- When is the best period to sell based on review trends?

- Which price ranges generate the most sales?
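
The price-range question can be approached by grouping listings into price bands and summing a demand signal per band. The sketch below is only an illustration under stated assumptions: review counts stand in for sales (Etsy search pages do not expose sales directly), the band edges in `PRICE_BANDS` are hypothetical, and the sample rows are made up in the shape of the scraper output (`Price`, `Reviews`).

```python
from collections import defaultdict

# Hypothetical price bands in the listing currency; adjust them to the scraped data.
PRICE_BANDS = [(0, 10), (10, 20), (20, 35), (35, 60)]


def band_label(price):
    """Return a label such as '10-20' for the band containing price, else 'other'."""
    for low, high in PRICE_BANDS:
        if low <= price < high:
            return f"{low}-{high}"
    return "other"


def reviews_by_price_band(rows):
    """Sum review counts per price band, skipping rows with missing values."""
    totals = defaultdict(int)
    for row in rows:
        price, reviews = row.get("Price"), row.get("Reviews")
        if price is None or reviews is None:
            continue
        totals[band_label(price)] += reviews
    return dict(totals)


# Made-up rows for illustration only (not scraped data).
sample = [
    {"Price": 8.5, "Reviews": 120},
    {"Price": 14.0, "Reviews": 300},
    {"Price": 18.9, "Reviews": 50},
    {"Price": 52.43, "Reviews": None},
]
print(reviews_by_price_band(sample))
```

Rows with a missing price or review count are dropped rather than counted as zero, so a band with no usable rows simply does not appear in the result.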


---

### 📓 SECTION OVERVIEW

- **Project / Business Idea:** What the project is about

- **Problem:** The challenge we’re addressing

- **Solution / Approach:** How we solve it

- **Research & Plots:** How we analyzed data visually

- **Insights:** What we discovered

- **Interpretation:** Why it matters

- **Implications:** What actions the business can take

- **Business Impact:** Expected results for the business

- **Limitations:** What constraints or gaps exist

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### 🌐 **Which Are the Best-Selling POD Products on Etsy?**

I’m researching print-on-demand products to sell on Etsy that require only **digital artwork and marketing** from me, while the POD provider handles **printing, packaging, and shipping**.


### ⭐ **Using Google Trends for POD Product Research**

💡 **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020–2025).

### 🎯 **Chosen POD product to research:** `tote bags`

The table below lists the product categories I compared:

| Category              | Subcategories / Examples                                      |
|-----------------------|---------------------------------------------------------------|
| **Custom Apparel**        | T-shirts, Hoodies, Sweatshirts, Tank tops                     |
| **Mug**                   | Ceramic mugs, Color-changing mugs, Espresso mugs, Travel mugs |
| **Tote Bag**              | Cotton totes, All-over print totes                            |
| **Phone Case**            | iPhone / Samsung cases, Tough / Slim cases                    |
| **Stickers**              | Die-cut stickers, Kiss-cut stickers, Sticker sheets           |
| **Hats**                  | Baseball caps, Trucker hats, Beanies                          |
| **Pillows / Cushions**    | Pillow covers, Stuffed pillows, All-over print pillow designs|
| **Blanket**               | Fleece blankets, Sherpa blankets, Woven blankets             |
| **Wall Art**              | Posters, Canvas prints, Framed posters, Metal prints         |
| **Doormat**               | Printed coir doormats, Rubber-backed doormats                |
| **Drinkware**             | Stainless steel tumblers, Water bottles, Wine tumblers       |
| **Calendar**              | Custom printed wall calendars                                 |
| **Yoga Mat**              | Printed yoga mats                                             |
| **Bedding**               | Duvet covers, Pillowcases, All-over print bed sets           |
| **Pet Accessories**       | Pet bandanas, Pet beds, Pet bowls, Pet blankets              |
| **Ornaments**             | Ceramic ornaments, Wood ornaments, Metal ornaments           |


==================================================================================================================================
# <div align="center">WEB SCRAPING</div>
==================================================================================================================================

`Etsy` is a dynamic website, so scraping it requires careful handling.

Because `Etsy` loads some content with `JavaScript`, `requests` + `BeautifulSoup` may work for the static parts (such as search results), but `Selenium` is more reliable for dynamic content.

I will use `requests` + `BeautifulSoup` for the `product listings` **(title, price, link)**.

**Important note:** Etsy combines dynamic loading with anti-bot protections.

Plain HTML scraping works only as long as Etsy does not block the request.

If requests are blocked, the options are browser-like headers, rotating proxies, or the Etsy API.
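
One way to react to a blocked request before moving to proxies or a browser is to retry with growing pauses. The following is a minimal sketch, not a guaranteed workaround: `fetch_with_backoff` and its parameters are hypothetical names, and it assumes a blocked request shows up as a non-200 status code. The `get` argument is any callable that returns an object with a `status_code` attribute, for example `functools.partial(requests.get, headers=headers, timeout=30)`.

```python
import time


def fetch_with_backoff(get, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call get(url) until it returns HTTP 200, waiting longer after each failure.

    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(retries):
        response = get(url)
        if response.status_code == 200:
            return response
        if attempt < retries - 1:
            # Double the pause after each failed attempt: 2s, 4s, 8s, ...
            delay = base_delay * (2 ** attempt)
            print(f"HTTP {response.status_code} for {url}; retrying in {delay:.0f}s")
            sleep(delay)
    return None
```

Passing `sleep` as a parameter keeps the function easy to try out without real waiting; in normal use the default `time.sleep` applies.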

----

### 🧰 **Install for web scraping**

In [None]:
# install requests & beautifulsoup
!pip install requests beautifulsoup4 fake-useragent pandas

# install selenium and undetected-chromedriver (imported by the product-page scraper)
!pip install selenium undetected-chromedriver

---

### 📌 **Avoid Being Blocked**
| Version                                   | Best For          | Pros                                                      | Cons                          |
| ----------------------------------------- | ----------------- | --------------------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                               | Etsy may block the request    |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Loads dynamic content in a browser, less likely to be blocked | Slower, requires ChromeDriver |


----

### 📌 **Pagination + BeautifulSoup**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time


def scrape_products(pages=5, max_items=10):
    base_url = "https://www.etsy.com/search?q=tote+bag&page="
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
    }

    data = []

    for page in range(1, pages + 1):
        url = base_url + str(page)
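        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 200:
            # A non-200 answer (often 403 or 429) usually means the request was blocked,
            # so stop instead of parsing an error page.
            print(f"[WARNING] Page {page} returned HTTP {response.status_code}; stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")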

        products = soup.find_all("li", class_="wt-list-unstyled")

        for item in products:
            if len(data) >= max_items:
                return pd.DataFrame(data)

            # URL
            link = item.find("a", href=True)
            if not link:
                continue
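            # The href may already be absolute; only prefix relative links.
            href = link["href"]
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href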

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    # Leave price as None when the text is not a plain number (e.g. "1,234.56").
                    pass

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        time.sleep(1)

    return pd.DataFrame(data)


# Example: save CSV
if __name__ == "__main__":
    df = scrape_products()
    df.to_csv("../data/interim/0_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")
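
The scraper above turns a price into a number by swapping commas for dots, which fails on values that contain both separators. A slightly more tolerant parser could look like the sketch below. `parse_price` is a hypothetical helper, and its rule of thumb (the last separator is the decimal mark) is an assumption: a string such as `1.234` stays ambiguous and is read here as 1.234.

```python
import re


def parse_price(text):
    """Convert strings such as '52,43', '1.234,56' or '1,234.56' to float, else None."""
    cleaned = re.sub(r"[^\d.,]", "", text)
    if not cleaned:
        return None
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: whichever comes last is the decimal mark.
        if last_comma > last_dot:
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif last_comma != -1:
        # Only commas: a single comma followed by 1-2 digits is read as a decimal mark.
        head, _, tail = cleaned.rpartition(",")
        if cleaned.count(",") == 1 and len(tail) in (1, 2):
            cleaned = head + "." + tail
        else:
            cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning `None` instead of raising keeps the behaviour of the scraper, where an unparsed price is stored as a missing value.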


---

### 📌 **Selenium-Based (ChromeDriver)**

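ChromeDriver downloads: https://googlechromelabs.github.io/chrome-for-testing/#stable (Selenium 4.6 and later can also fetch a matching driver automatically through Selenium Manager.)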

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
import time
import re


def scrape_products_selenium(max_items=10):
    options = Options()
    options.add_argument("--headless")  
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-gpu")
    options.add_argument("start-maximized")
    options.add_argument("user-agent=Mozilla/5.0")

    driver = webdriver.Chrome(options=options)

    data = []
    page = 1

    while len(data) < max_items:
        url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
        driver.get(url)
        time.sleep(4)

        soup = BeautifulSoup(driver.page_source, "html.parser")
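        products = soup.find_all("li", class_="wt-list-unstyled")
        if not products:
            # No listings found (blocked, or past the last page): stop instead of looping forever.
            break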

        for item in products:
            if len(data) >= max_items:
                break

            # URL
            link = item.find("a", href=True)
            if not link:
                continue
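            # The href may already be absolute; only prefix relative links.
            href = link["href"]
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href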

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    # Leave price as None when the text is not a plain number (e.g. "1,234.56").
                    pass

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        page += 1
        time.sleep(2)

    driver.quit()

    df = pd.DataFrame(data)
    return df


# Save CSV
if __name__ == "__main__":
    df = scrape_products_selenium()
    df.to_csv("../data/interim/1_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")


---

## 📌 **Product Page**
The main data fields to extract from an Etsy product page:

### ⭐ **Etsy Product Info**

| Field Name            | Python Data Type       | Concise Definition               | Long Definition                                                                                       |
|-----------------------|-----------------------|---------------------------------|-------------------------------------------------------------------------------------------------------|
| product_id            | `str`                   | Unique Etsy listing ID.          | Unique identifier assigned by Etsy to each product listing.                                           |
| product_title         | `str`                   | Product’s title.                 | The full title/name of the product as shown on the listing page.                                      |
| old_price             | `float` or `Decimal`      | Price before discount.           | The original price before any discounts were applied.                                                 |
| discount_percentage   | `float`                 | Discount rate in percent.        | The discount value expressed as a percentage (e.g., 20.0 for 20%).                                    |
| now_price             | `float` or `Decimal`      | Price after discount.            | The current price after applying discounts.                                                           |
| currency              | `str`                   | Currency code (e.g., USD).       | The currency code used for the product price (e.g., "USD", "EUR").                                    |
| listed_date           | `datetime`              | Date the item was listed.        | The date (and optionally time) when the product was first listed on Etsy.                             |
| product_url           | `str`                   | Link to the product page.        | The direct URL link to the Etsy product page.                                                         |
| product_description   | `str`                   | Product description text.        | The text description of the product, including details, features, and information provided by seller.|
| product_variation     | `list[dict]`            | List of available variations.    | A list of variation options (size, color, material, etc.), each represented as a dictionary.          |
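
The fields above could be carried through the pipeline as one typed record. The sketch below is one possible shape, not the project's actual data model: the class name `EtsyProduct` is hypothetical, and `discount_percentage` is derived from the two prices rather than stored, which is a design assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import List, Optional


@dataclass
class EtsyProduct:
    product_id: str
    product_title: str
    now_price: Decimal
    currency: str
    product_url: str
    old_price: Optional[Decimal] = None
    listed_date: Optional[datetime] = None
    product_description: str = ""
    # Each variation is a dict such as {"name": "Color", "value": "Black"} (illustrative keys).
    product_variation: List[dict] = field(default_factory=list)

    @property
    def discount_percentage(self) -> Optional[float]:
        """Discount in percent derived from old_price and now_price; None without a discount."""
        if not self.old_price or self.old_price <= self.now_price:
            return None
        return round(float((self.old_price - self.now_price) / self.old_price * 100), 2)
```

Using `Decimal` for prices avoids float rounding when values are later summed or compared; plain `float` would also match the table.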


---

### ⭐ **Derived Data**

| Field Name                 | Python Data Type       | Concise Definition                               |
|---------------------------|-------------------------|---------------------------------------------------|
| product_niche             | `str`                     | Product theme or genre (comedy, anime, …) inferred from `product_title` and `product_description`.         |
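
A simple way to derive `product_niche` is keyword matching on the title and description. The sketch below only illustrates the idea: the `NICHE_KEYWORDS` map and the function name `assign_niche` are hypothetical, and real niches would have to come from exploring the scraped data.

```python
import re

# Hypothetical keyword map, for illustration only.
NICHE_KEYWORDS = {
    "anime": {"anime", "manga", "kawaii"},
    "comedy": {"funny", "joke", "sarcastic"},
    "floral": {"floral", "flower", "botanical"},
}


def assign_niche(title, description=""):
    """Return the niche whose keywords occur most often, or 'other' if none match."""
    text = f"{title or ''} {description or ''}".lower()
    words = re.findall(r"[a-z]+", text)
    scores = {
        niche: sum(word in keywords for word in words)
        for niche, keywords in NICHE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

The matching is English-only and counts exact words, so listings scraped from a localized Etsy site (for example French titles) would mostly fall into `other` unless the map is extended.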

---

### ⭐ **Etsy Product Reviews (extra dataset)**

| Field Name                     | Python Data Type | Concise Definition                         |
|-------------------------------|------------------|---------------------------------------------|
| product_reviews         | `pd.DataFrame`     | One row per review, with the rating, the date the review was posted, and the review text.          |
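
Review dates are what the "best period to sell" question relies on. As a rough sketch, reviews can be counted per calendar month to see when activity peaks. The function names and sample dates below are made up for illustration, and a review date lags the purchase date by an unknown amount, so the result is only a proxy.

```python
from collections import Counter
from datetime import date


def reviews_per_month(review_dates):
    """Count reviews per calendar month (1-12) across all years."""
    return Counter(d.month for d in review_dates)


def busiest_months(review_dates, top=3):
    """Return the `top` months with the most reviews as (month, count) pairs."""
    return reviews_per_month(review_dates).most_common(top)


# Made-up dates for illustration only (not scraped data).
sample_dates = [
    date(2024, 11, 3),
    date(2025, 11, 20),
    date(2024, 12, 1),
    date(2024, 12, 15),
    date(2025, 12, 20),
    date(2025, 5, 9),
]
print(busiest_months(sample_dates, top=2))
```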


---

## REPLACEABLE CODE

In [54]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except ValueError:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
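    # De-duplicate while keeping order: a listing can be matched by more than one anchor.
    product_links = list(dict.fromkeys(link.get_attribute("href") for link in product_links))[:limit]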

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()  # Full text
            match = re.search(r"\((\d+)\)", txt_reviews)
            nbr_reviews = int(match.group(1)) if match else 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "product_link": url,
                    "product_id": re.search(r"/listing/(\d+)", url).group(1),
                    "product_variant_url": url,
                    "product_title": title,
                    "Option": None,
                    "current_price": now_price,
                    "old_price": old_price,
                    "discount_percentage": percentage_difference_price,
                    "product_rating": rating,
                    "txt_reviews": txt_reviews,
                    "nbr_reviews": nbr_reviews
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        now_price, old_price, percentage_difference_price = get_prices(driver)

                        results.append({
                            "product_link": url,
                            "product_id": re.search(r"/listing/(\d+)", url).group(1),
                            "product_variant_url": f"{url}/{'_'.join(combo)}",
                            "product_title": title,
                            "Option": " | ".join(combo),
                            "current_price": now_price,
                            "old_price": old_price,
                            "discount_percentage": percentage_difference_price,
                            "product_rating": rating,
                            "txt_reviews": txt_reviews,
                            "nbr_reviews": nbr_reviews
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=2)
    df.to_csv("../data/raw/etsy_raw_data.csv", index=False)
    print("[SUCCESS] RAW DATA CSV saved!")


[INFO] Scraping product 1/2: https://www.etsy.com/fr/listing/4377096883/sac-fourre-tout-en-coton-matelasse-a?click_key=LT86d51480109cbb6e2573438fac1eeeea942e488f%3A4377096883&click_sum=c3e93e86&ls=a&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-435685-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&sts=1
[INFO] Scraping product 2/2: https://www.etsy.com/fr/listing/4377096883/sac-fourre-tout-en-coton-matelasse-a?click_key=LT86d51480109cbb6e2573438fac1eeeea942e488f%3A4377096883&click_sum=c3e93e86&ls=a&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-435685-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&sts=1
[SUCCESS] RAW DATA CSV saved!


In [55]:
df.head()

|   | product_link | product_id | product_variant_url | product_title | Option | current_price | old_price | discount_percentage | product_rating | txt_reviews | nbr_reviews |
|---|--------------|------------|---------------------|---------------|--------|---------------|-----------|---------------------|----------------|-------------|-------------|
| 0 | https://www.etsy.com/fr/listing/4377096883/sac... | 4377096883 | https://www.etsy.com/fr/listing/4377096883/sac... | Sac fourre-tout en coton matelassé à imprimé j... |  | 52.43 | 52.43 |  | 4.9259 |  | 0 |
| 1 | https://www.etsy.com/fr/listing/4377096883/sac... | 4377096883 | https://www.etsy.com/fr/listing/4377096883/sac... | Sac fourre-tout en coton matelassé à imprimé j... |  | 52.43 | 52.43 |  | 4.9259 |  | 0 |
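
Both rows above carry the same listing ID (4377096883), so the same product was scraped twice. One way to avoid this is to keep only the first URL per listing ID before visiting the product pages. The sketch below reuses the `/listing/(\d+)` pattern from the scraper; the helper names `listing_id` and `unique_listing_urls` are hypothetical.

```python
import re


def listing_id(url):
    """Extract the numeric listing ID from an Etsy listing URL, or None."""
    match = re.search(r"/listing/(\d+)", url)
    return match.group(1) if match else None


def unique_listing_urls(urls, limit=None):
    """Keep the first URL seen for each listing ID, preserving order."""
    seen, unique = set(), []
    for url in urls:
        lid = listing_id(url)
        if lid is None or lid in seen:
            continue
        seen.add(lid)
        unique.append(url)
        if limit is not None and len(unique) >= limit:
            break
    return unique
```

Comparing listing IDs instead of whole URLs also catches links that point to the same product but differ in their tracking parameters.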


---

## 📌 **CODE**

In [52]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except ValueError:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
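    # De-duplicate while keeping order: a listing can be matched by more than one anchor.
    product_links = list(dict.fromkeys(link.get_attribute("href") for link in product_links))[:limit]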

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        product_id = url.split("/listing/")[1].split("/")[0]

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "product_link": url,
                    "product_id": product_id,
                    "product_variant_url": url,
                    "product_title": title,
                    "Option": None,
                    "current_price": now_price,
                    "old_price": old_price,
                    "discount_percentage": percentage_difference_price,
                    "product_rating": rating,
                    "txt_reviews": txt_reviews,
                    "nbr_reviews": nbr_reviews
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        # Select each variant option
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(1.5)
                                    break

                        now_price, old_price, percentage_difference_price = get_prices(driver)
                        var_comb_str = "_".join(combo)
                        variant_url = f"{url}{var_comb_str}" if var_comb_str else url

                        results.append({
                            "product_link": url,
                            "product_id": product_id,
                            "product_variant_url": variant_url,
                            "product_title": title,
                            "Option": " | ".join(combo),
                            "current_price": now_price,
                            "old_price": old_price,
                            "discount_percentage": percentage_difference_price,
                            "product_rating": rating,
                            "txt_reviews": txt_reviews,
                            "nbr_reviews": nbr_reviews
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=2)
    df.to_csv("../data/raw/00_raw_data.csv", index=False)
    print("[SUCCESS] RAW DATA CSV saved!")


[INFO] Scraping product 1/2: https://www.etsy.com/fr/listing/4377096883/sac-fourre-tout-en-coton-matelasse-a?click_key=LTf8728d142951c7160aa39e279ecfe7789bf72645%3A4377096883&click_sum=0811a32d&ls=a&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-375020-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&sts=1
[INFO] Scraping product 2/2: https://www.etsy.com/fr/listing/4377096883/sac-fourre-tout-en-coton-matelasse-a?click_key=LTf8728d142951c7160aa39e279ecfe7789bf72645%3A4377096883&click_sum=0811a32d&ls=a&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-375020-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&sts=1
[SUCCESS] RAW DATA CSV saved!


In [53]:
df.head()

Unnamed: 0,product_link,product_id,product_variant_url,product_title,Option,current_price,old_price,discount_percentage,product_rating,txt_reviews,nbr_reviews
0,https://www.etsy.com/fr/listing/4377096883/sac...,4377096883,https://www.etsy.com/fr/listing/4377096883/sac...,Sac fourre-tout en coton matelassé à imprimé j...,,52.43,52.43,,4.9259,,0
1,https://www.etsy.com/fr/listing/4377096883/sac...,4377096883,https://www.etsy.com/fr/listing/4377096883/sac...,Sac fourre-tout en coton matelassé à imprimé j...,,52.43,52.43,,4.9259,,0
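
Both log lines above point at the same listing (`4377096883`), and the frame holds two identical rows. A likely cause, though this run does not confirm it, is that the search-grid XPath matches more than one anchor per listing card. Below is a minimal sketch of de-duplicating links by listing ID before scraping. `unique_listing_urls` and `LISTING_ID_RE` are hypothetical names that do not exist in the scraper above, and dropping the tracking query string is an assumption about what is worth keeping.

```python
import re

# Hypothetical helper: the listing ID is the numeric path segment after /listing/
LISTING_ID_RE = re.compile(r"/listing/(\d+)")


def unique_listing_urls(urls, limit=None):
    """Keep the first URL seen for each listing ID, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        match = LISTING_ID_RE.search(url)
        if match is None:
            continue  # not a listing page (e.g. a search or shop link)
        listing_id = match.group(1)
        if listing_id in seen:
            continue  # same listing reached through a second anchor
        seen.add(listing_id)
        # Assumption: the query string only carries click tracking, so drop it
        unique.append(url.split("?", 1)[0])
        if limit is not None and len(unique) >= limit:
            break
    return unique
```

In `scrape_products`, this could be applied to the collected `href` values in place of the plain `[:limit]` slice, so that `limit` counts distinct listings.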


In [47]:
import time
import pandas as pd
from itertools import product
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# --- Setup Chrome options ---
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")

# --- Setup driver ---
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
wait = WebDriverWait(driver, 10)

# --- Price extraction helper ---
def get_price_info(driver):
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH,
            "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("$", "").replace(",", "").replace("+", "")
            try:
                value = float(text)
            except:
                continue
            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value
        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price
    except:
        now_price, old_price = None, None
    discount_percentage = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, discount_percentage

# --- Main scraping function ---
def extract_etsy_product_data(url):
    driver.get(url)
    time.sleep(3)

    # Product ID
    product_id = url.split("/listing/")[1].split("/")[0]

    # Product Title
    try:
        product_title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
    except:
        product_title = None

    # Product rating
    try:
        rating_elem = driver.find_element(By.CSS_SELECTOR, 'input[name="initial-rating"]')
        product_rating = float(rating_elem.get_attribute("value"))
    except:
        product_rating = None

    # Product reviews
    try:
        reviews_elem = driver.find_element(By.CSS_SELECTOR, 'span[data-review-count]')
        product_reviews = int(reviews_elem.text.strip("()"))
    except:
        product_reviews = None

    # Currency symbol/code
    try:
        price_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title')]/span")
        currency_symbol = price_elem.text.strip()[0]
    except:
        currency_symbol = None

    try:
        currency_txt_elem = driver.find_element(By.CSS_SELECTOR, 'meta[itemprop="priceCurrency"]')
        currency_txt = currency_txt_elem.get_attribute("content")
    except:
        currency_txt = None

    # --- Variants ---
    variants_data = []
    try:
        option_elements = driver.find_elements(By.CSS_SELECTOR, 'select[data-selector="variation-select"]')
        if option_elements:
            options_list = []
            for sel in option_elements:
                options = [o.text for o in sel.find_elements(By.TAG_NAME, "option") if o.get_attribute("value")]
                options_list.append(options)

            combinations = list(product(*options_list))

            for combo in combinations:
                variant_url = url + "/" + "_".join(combo)

                # Select variant options
                for idx, sel in enumerate(option_elements):
                    select = Select(sel)
                    select.select_by_visible_text(combo[idx])
                time.sleep(1)  # wait for price to update

                now_price, old_price, discount_percentage = get_price_info(driver)

                variants_data.append({
                    "product_link": url,
                    "product_id": product_id,
                    "product_variant_url": variant_url,
                    "product_title": product_title,
                    "current_price": now_price,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "currency_symbol": currency_symbol,
                    "currency_txt": currency_txt,
                    "product_rating": product_rating,
                    "product_reviews": product_reviews
                })
        else:
            # No variants
            now_price, old_price, discount_percentage = get_price_info(driver)
            variants_data.append({
                "product_link": url,
                "product_id": product_id,
                "product_variant_url": url,
                "product_title": product_title,
                "current_price": now_price,
                "old_price": old_price,
                "discount_percentage": discount_percentage,
                "currency_symbol": currency_symbol,
                "currency_txt": currency_txt,
                "product_rating": product_rating,
                "product_reviews": product_reviews
            })
    except Exception as e:
        print(f"Error extracting variants: {e}")

    return variants_data

# --- Example usage ---
urls = [
    'https://www.etsy.com/listing/1289965137'
]

all_data = []
for url in urls:
    data = extract_etsy_product_data(url)
    all_data.extend(data)

df = pd.DataFrame(all_data)
print(df)

driver.quit()


                              product_link  product_id  \
0  https://www.etsy.com/listing/1289965137  1289965137   

                       product_variant_url product_title current_price  \
0  https://www.etsy.com/listing/1289965137          None          None   

  old_price discount_percentage currency_symbol currency_txt product_rating  \
0      None                None            None         None           None   

  product_reviews  
0            None  


In [48]:
df.head()

Unnamed: 0,product_link,product_id,product_variant_url,product_title,current_price,old_price,discount_percentage,currency_symbol,currency_txt,product_rating,product_reviews
0,https://www.etsy.com/listing/1289965137,1289965137,https://www.etsy.com/listing/1289965137,,,,,,,,
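
Every scraped field in the row above is empty; only the values derived from the URL survived. The run does not show why (a blocked headless session and changed selectors are both plausible, but that is a guess). A small guard can at least flag such rows so they are retried or logged rather than saved silently. This is a sketch under that assumption; `SCRAPED_FIELDS`, `is_missing`, and `is_empty_scrape` are hypothetical names.

```python
import math

# Fields that come from the page itself, as opposed to being derived from the URL
SCRAPED_FIELDS = (
    "product_title",
    "current_price",
    "old_price",
    "product_rating",
    "product_reviews",
)


def is_missing(value):
    """True for None, an empty string, or a float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def is_empty_scrape(record, fields=SCRAPED_FIELDS):
    """True when none of the page-derived fields holds a value."""
    return all(is_missing(record.get(field)) for field in fields)
```

A caller could check `is_empty_scrape(row)` on each dictionary returned by `extract_etsy_product_data` and print a warning before extending `all_data`.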


### 🧪 MAIN TEST

In [46]:
import time
import pandas as pd
from itertools import product
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# --- Setup Chrome options ---
chrome_options = Options()
chrome_options.add_argument("--headless")  # run headless
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")

# --- Setup driver ---
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
wait = WebDriverWait(driver, 10)

# --- Etsy product scraping function ---
def extract_etsy_product_data(url):
    driver.get(url)
    time.sleep(2)  # wait for page to load

    # --- Product ID from URL ---
    product_id = url.split("/listing/")[1].split("/")[0]

    # --- Product title ---
    try:
        product_title = wait.until(
            EC.presence_of_element_located((By.XPATH, "//h1"))
        ).text.strip()
    except:
        product_title = None

    # --- Product rating ---
    try:
        rating_elem = driver.find_element(By.CSS_SELECTOR, 'input[name="initial-rating"]')
        product_rating = float(rating_elem.get_attribute("value"))
    except:
        product_rating = None

    # --- Product reviews count ---
    try:
        reviews_elem = driver.find_element(By.CSS_SELECTOR, 'span[data-review-count]')
        product_reviews = int(reviews_elem.text.strip("()"))
    except:
        product_reviews = None

    # --- Variants ---
    variants_data = []

    try:
        option_elements = driver.find_elements(By.CSS_SELECTOR, 'select[data-selector="variation-select"]')
        if option_elements:
            options_list = []
            for sel in option_elements:
                options = [o.text for o in sel.find_elements(By.TAG_NAME, "option") if o.get_attribute("value")]
                options_list.append(options)

            combinations = list(product(*options_list))

            for combo in combinations:
                variant_url = url + "/" + "_".join(combo)

                # Select variant options
                for idx, sel in enumerate(option_elements):
                    select = Select(sel)
                    select.select_by_visible_text(combo[idx])

                # Wait for price to update
                time.sleep(1)
                try:
                    price_elem = wait.until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, 'p[data-buy-box-region="price"] span'))
                    )
                    price_text = price_elem.text.strip()
                    currency_symbol = price_text[0]
                    now_price = float(price_text.replace(currency_symbol, "").replace(",", ""))
                except:
                    now_price = None
                    currency_symbol = None

                try:
                    old_price_elem = driver.find_element(By.CSS_SELECTOR, 'p[data-buy-box-region="price"] del span')
                    old_price_text = old_price_elem.text.strip()
                    old_price = float(old_price_text.replace(currency_symbol, "").replace(",", ""))
                except:
                    old_price = None

                discount_percentage = None
                if now_price and old_price:
                    discount_percentage = round((old_price - now_price) / old_price * 100, 2)

                # Currency code
                try:
                    currency_txt_elem = driver.find_element(By.CSS_SELECTOR, 'meta[itemprop="priceCurrency"]')
                    currency_txt = currency_txt_elem.get_attribute("content")
                except:
                    currency_txt = None

                variants_data.append({
                    "product_link": url,
                    "product_id": product_id,
                    "product_variant_url": variant_url,
                    "product_title": product_title,
                    "current_price": now_price,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "currency_symbol": currency_symbol,
                    "currency_txt": currency_txt,
                    "product_rating": product_rating,
                    "product_reviews": product_reviews
                })
        else:
            # No variants
            try:
                price_elem = wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'p[data-buy-box-region="price"] span'))
                )
                price_text = price_elem.text.strip()
                currency_symbol = price_text[0]
                now_price = float(price_text.replace(currency_symbol, "").replace(",", ""))
            except:
                now_price = None
                currency_symbol = None

            old_price = None
            discount_percentage = None

            try:
                currency_txt_elem = driver.find_element(By.CSS_SELECTOR, 'meta[itemprop="priceCurrency"]')
                currency_txt = currency_txt_elem.get_attribute("content")
            except:
                currency_txt = None

            variants_data.append({
                "product_link": url,
                "product_id": product_id,
                "product_variant_url": url,
                "product_title": product_title,
                "current_price": now_price,
                "old_price": old_price,
                "discount_percentage": discount_percentage,
                "currency_symbol": currency_symbol,
                "currency_txt": currency_txt,
                "product_rating": product_rating,
                "product_reviews": product_reviews
            })
    except Exception as e:
        print(f"Error extracting variants: {e}")

    return variants_data

# --- Example usage ---
urls = [
    'https://www.etsy.com/listing/1289965137'  # replace with your URLs
]

all_data = []
for url in urls:
    data = extract_etsy_product_data(url)
    all_data.extend(data)

df = pd.DataFrame(all_data)
print(df)

# --- Close driver ---
driver.quit()
df.head(2)

                              product_link  product_id  \
0  https://www.etsy.com/listing/1289965137  1289965137   

                       product_variant_url product_title current_price  \
0  https://www.etsy.com/listing/1289965137          None          None   

  old_price discount_percentage currency_symbol currency_txt product_rating  \
0      None                None            None         None           None   

  product_reviews  
0            None  


Unnamed: 0,product_link,product_id,product_variant_url,product_title,current_price,old_price,discount_percentage,currency_symbol,currency_txt,product_rating,product_reviews
0,https://www.etsy.com/listing/1289965137,1289965137,https://www.etsy.com/listing/1289965137,,,,,,,,


---

In [None]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# --- Setup Chrome options ---
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run headless
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")

# --- Setup driver using webdriver-manager ---
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

# --- Etsy product scraping function ---
def extract_etsy_product_data(url):
    driver.get(url)
    time.sleep(3)  # wait for page to load

    # --- Product ID from URL ---
    product_id = url.split("/listing/")[1].split("/")[0]

    # --- Product title ---
    try:
        product_title = driver.find_element(By.CSS_SELECTOR, 'h1[data-buy-box-listing-title]').text
    except:
        product_title = ""

    # --- Product rating ---
    try:
        rating_elem = driver.find_element(By.CSS_SELECTOR, 'input[name="initial-rating"]')
        product_rating = float(rating_elem.get_attribute("value"))
    except:
        product_rating = None

    # --- Product reviews count ---
    try:
        reviews_elem = driver.find_element(By.CSS_SELECTOR, 'span[data-review-count]')
        product_reviews = int(reviews_elem.text.strip("()"))
    except:
        product_reviews = None

    # --- Variants ---
    variants_data = []

    try:
        option_elements = driver.find_elements(By.CSS_SELECTOR, 'select[data-selector="variation-select"]')
        if option_elements:
            from itertools import product

            options_list = []
            for sel in option_elements:
                options = [o.text for o in sel.find_elements(By.TAG_NAME, "option") if o.get_attribute("value")]
                options_list.append(options)

            combinations = list(product(*options_list))

            for combo in combinations:
                variant_url = url + "/" + "_".join(combo)

                # Select variant in page
                for idx, sel in enumerate(option_elements):
                    sel.send_keys(combo[idx])
                time.sleep(1)

                # Price and currency
                try:
                    price_elem = driver.find_element(By.CSS_SELECTOR, 'p[data-buy-box-region="price"] span')
                    price_text = price_elem.text.strip()
                    currency_symbol = price_text[0]
                    now_price = float(price_text.replace(currency_symbol, "").replace(",", ""))
                except:
                    now_price = None
                    currency_symbol = None

                try:
                    old_price_elem = driver.find_element(By.CSS_SELECTOR, 'p[data-buy-box-region="price"] del span')
                    old_price_text = old_price_elem.text.strip()
                    old_price = float(old_price_text.replace(currency_symbol, "").replace(",", ""))
                except:
                    old_price = None

                # Discount %
                discount_percentage = None
                if now_price and old_price:
                    discount_percentage = round((old_price - now_price)/old_price*100, 2)

                # Currency code (example: USD)
                try:
                    currency_txt_elem = driver.find_element(By.CSS_SELECTOR, 'meta[itemprop="priceCurrency"]')
                    currency_txt = currency_txt_elem.get_attribute("content")
                except:
                    currency_txt = None

                variants_data.append({
                    "product_link": url,
                    "product_id": product_id,
                    "product_variant_url": variant_url,
                    "product_title": product_title,
                    "current_price": now_price,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "currency_symbol": currency_symbol,
                    "currency_txt": currency_txt,
                    "product_rating": product_rating,
                    "product_reviews": product_reviews
                })
        else:
            # No variants
            try:
                price_elem = driver.find_element(By.CSS_SELECTOR, 'p[data-buy-box-region="price"] span')
                price_text = price_elem.text.strip()
                currency_symbol = price_text[0]
                now_price = float(price_text.replace(currency_symbol, "").replace(",", ""))
            except:
                now_price = None
                currency_symbol = None

            old_price = None
            discount_percentage = None

            try:
                currency_txt_elem = driver.find_element(By.CSS_SELECTOR, 'meta[itemprop="priceCurrency"]')
                currency_txt = currency_txt_elem.get_attribute("content")
            except:
                currency_txt = None

            variants_data.append({
                "product_link": url,
                "product_id": product_id,
                "product_variant_url": url,
                "product_title": product_title,
                "current_price": now_price,
                "old_price": old_price,
                "discount_percentage": discount_percentage,
                "currency_symbol": currency_symbol,
                "currency_txt": currency_txt,
                "product_rating": product_rating,
                "product_reviews": product_reviews
            })
    except Exception as e:
        print(f"Error extracting variants: {e}")

    return variants_data

# --- Example usage ---
urls = [
    'https://www.etsy.com/listing/1289965137'  # replace with your URLs
]

all_data = []
for url in urls:
    data = extract_etsy_product_data(url)
    all_data.extend(data)

df = pd.DataFrame(all_data)
print(df)

# --- Close driver ---
driver.quit()


In [41]:
df.head()

Unnamed: 0,product_link,product_id,product_variant_url,product_title,current_price,old_price,discount_percentage,currency_symbol,currency_txt,product_rating,product_reviews
0,https://www.etsy.com/listing/1289965137,1289965137,https://www.etsy.com/listing/1289965137,,,,,,,,
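
The variant loops in the cells above rely on `itertools.product` to enumerate every combination of option values. The sketch below isolates that step without a browser, so the labels can be checked on their own. `variant_rows` and the `variant_key` field are hypothetical; the `" | "` and `"_"` separators mirror the ones used in the scraper cells.

```python
from itertools import product


def variant_rows(option_groups):
    """Build one label pair per combination of option values.

    option_groups is a list of lists, one inner list per variant selector,
    for example [["S", "M"], ["Red", "Blue"]].
    """
    rows = []
    for combo in product(*option_groups):
        rows.append({
            "Option": " | ".join(combo),      # human-readable label
            "variant_key": "_".join(combo),   # suffix used to tell variants apart
        })
    return rows
```

The number of rows is the product of the group sizes, so a listing with several selectors multiplies the clicks and waits needed per product.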


=====================================================================================================================

---

In [None]:
#### CURRENT PRICE, OLD PRICE, CURRENCY

from bs4 import BeautifulSoup
import re

# Example HTML snippet for reference
html = """
<div class="wt-display-flex-xs wt-align-items-center wt-flex-wrap appears-ready" data-selector="price-only" data-buy-box-region="price">
    <p class="wt-text-title-larger wt-mr-xs-1">
        <span class="wt-screen-reader-only">Price:</span>€7.90+
    </p>
</div>

<div class="variation-price">
    <span>€8.50</span>
</div>

<!-- Optional old price for discount -->
<div class="old-price">
    <span>€10.00</span>
</div>
"""

# Map symbols to currency codes
currency_map = {
    '$': 'USD',
    '€': 'EUR',
    '£': 'GBP',
    '¥': 'JPY',
    '₹': 'INR'
}

def extract_dynamic_price(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Find all potential price elements
    price_elements = soup.select('[data-selector="price-only"] p, .variation-price span')
    
    now_price = None
    currency_symbol = None
    currency_txt = None
    
    for el in price_elements:
        text = el.get_text(strip=True)
        
        if not text:
            continue
        
        # Extract currency symbol (anything non-numeric)
        symbol_match = re.search(r'[^0-9.,\s]+', text)
        symbol = symbol_match.group(0) if symbol_match else None
        
        # Extract numeric price
        num_match = re.search(r'([0-9]+(?:[.,][0-9]+)?)', text)
        if num_match:
            price = float(num_match.group(1).replace(',', '.'))
            if price:
                now_price = price
                currency_symbol = symbol
                currency_txt = currency_map.get(currency_symbol, currency_symbol)
                break

    # Optional: handle old price / discount if present
    old_price_element = soup.select_one('.old-price span')
    old_price = None
    discount_percentage = None
    if old_price_element:
        old_text = old_price_element.get_text(strip=True)
        old_num_match = re.search(r'([0-9]+(?:[.,][0-9]+)?)', old_text)
        if old_num_match:
            old_price = float(old_num_match.group(1).replace(',', '.'))
            if now_price:
                discount_percentage = round((old_price - now_price) / old_price * 100, 2)

    return {
        'now_price': now_price,
        'old_price': old_price,
        'discount_percentage': discount_percentage,
        'currency_symbol': currency_symbol,
        'currency_txt': currency_txt
    }

# Run the function
price_info = extract_dynamic_price(html)
print(price_info)


{'now_price': 7.9, 'old_price': 10.0, 'discount_percentage': 21.0, 'currency_symbol': 'Price:€', 'currency_txt': 'Price:€'}
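
In the printed result, `currency_symbol` carries the screen-reader label `Price:` along with the symbol, because the pattern `[^0-9.,\s]+` accepts any non-numeric run, and `currency_txt` therefore misses the lookup in `currency_map`. One way around this is to match only known currency symbols and to parse the amount separately. The sketch below uses the standard library only; `parse_price`, `CURRENCY_CODES`, and the rule that a separator followed by one or two digits is the decimal point are assumptions, not Etsy behaviour confirmed by the notebook.

```python
import re

CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY", "₹": "INR"}

SYMBOL_RE = re.compile(r"[$€£¥₹]")
# Digits, optionally grouped by '.', ',' or a single whitespace character
AMOUNT_RE = re.compile(r"\d+(?:[.,\s]\d+)*")


def parse_price(text):
    """Return (symbol, currency_code, amount), or None if no number is found.

    Expects a string holding a single price, e.g. "Price:€7.90+" or "52,43 €".
    """
    amount_match = AMOUNT_RE.search(text)
    if amount_match is None:
        return None
    raw = re.sub(r"\s", "", amount_match.group(0))

    # Assumption: the last '.' or ',' is a decimal separator only when it is
    # followed by one or two digits; otherwise it is a thousands separator.
    last_sep = max(raw.rfind("."), raw.rfind(","))
    if last_sep != -1 and len(raw) - last_sep - 1 in (1, 2):
        integer_part = re.sub(r"[.,]", "", raw[:last_sep])
        amount = float(f"{integer_part}.{raw[last_sep + 1:]}")
    else:
        amount = float(re.sub(r"[.,]", "", raw))

    symbol_match = SYMBOL_RE.search(text)
    symbol = symbol_match.group(0) if symbol_match else None
    return symbol, CURRENCY_CODES.get(symbol), amount
```

Because the symbol is searched independently of the number, the same function handles prefix symbols (`€7.90`) and the suffix form shown on the French storefront (`52,43 €`).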


---

In [26]:
import time
import re
import json
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product as iter_product
from datetime import datetime

def get_prices(driver):
    """
    Extract now price, old price, and discount percentage.
    Returns: now_price, old_price, discount_percentage
    """
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace(",", ".").replace("+", "")
            try:
                value = float(re.sub(r"[^\d.]", "", text))
            except ValueError:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    discount_percentage = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, discount_percentage

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    # Collect product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # --- Product ID ---
        product_id_match = re.search(r"/listing/(\d+)", url)
        product_id = product_id_match.group(1) if product_id_match else None

        # --- Title ---
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # --- Listed date from JSON-LD ---
        try:
            json_ld = driver.find_element(By.XPATH, "//script[@type='application/ld+json']").get_attribute("innerHTML")
            data = json.loads(json_ld)

            if isinstance(data, list):
                product_data = next((item for item in data if item.get('@type') == 'Product'), None)
            else:
                product_data = data if data.get('@type') == 'Product' else None

            if product_data and 'releaseDate' in product_data:
                listed_date = datetime.strptime(product_data['releaseDate'], "%Y-%m-%d")
            else:
                listed_date = None
        except:
            listed_date = None

        # --- Currency ---
        try:
            price_elem = driver.find_element(By.XPATH, "//div[@data-selector='price-only']//p[contains(@class,'wt-text-title-larger')]")
            price_text = price_elem.text.strip()
            # Remove screen-reader label if present
            price_text = re.sub(r"Price:|Original Price:", "", price_text, flags=re.I).strip()

            # Extract currency symbol: first character that is not a digit, comma, dot, or plus
            match = re.search(r"[^\d.,+]", price_text)
            currency_symbol = match.group(0) if match else None

            # Map currency symbol to ISO code
            currency_map = {
                "$": "USD",
                "€": "EUR",
                "£": "GBP",
                "¥": "JPY",
                "₹": "INR",
                # Add more as needed
            }
            currency_txt = currency_map.get(currency_symbol, None)

        except:
            currency_symbol, currency_txt = None, None

        # --- Variants ---
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                # Single product, no variant
                now_price, old_price, discount_percentage = get_prices(driver)
                results.append({
                    "product_id": product_id,
                    "product_title": title,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "now_price": now_price,
                    "currency_symbol": currency_symbol,
                    "currency_txt": currency_txt,
                    "listed_date": listed_date,
                    "product_url": url
                })
            else:
                # Multiple variants
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                # Generate all variant combinations
                for combo in iter_product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(1)
                                    break

                        now_price, old_price, discount_percentage = get_prices(driver)

                        results.append({
                            "product_id": product_id,
                            "product_title": title,
                            "old_price": old_price,
                            "discount_percentage": discount_percentage,
                            "now_price": now_price,
                            "currency_symbol": currency_symbol,
                            "currency_txt": currency_txt,
                            "listed_date": listed_date,
                            "product_url": url
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process variant {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)

if __name__ == "__main__":
    df = scrape_products(limit=2)
    df.to_csv("../data/clean/etsy_products.csv", index=False)
    print("[SUCCESS] CSV saved!")

print(df.head())

[INFO] Scraping product 1/2: https://www.etsy.com/fr/listing/1289965137/tote-bag-prenom-personnalise-ideal-pour?click_key=1e747937-6237-450b-903c-e16cfa7504b6%3ALTc6febbeea6174855432740b0616482df4dfd26b1&click_sum=a852f9bc&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-662904-1-1&sr_prefetch=1&pf_from=search&sts=1&nob=1&local_signal_search=1&content_source=1e747937-6237-450b-903c-e16cfa7504b6%253ALTc6febbeea6174855432740b0616482df4dfd26b1
[INFO] Scraping product 2/2: https://www.etsy.com/fr/listing/1289965137/tote-bag-prenom-personnalise-ideal-pour?click_key=1e747937-6237-450b-903c-e16cfa7504b6%3ALTc6febbeea6174855432740b0616482df4dfd26b1&click_sum=a852f9bc&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-662904-1-1&sr_prefetch=1&pf_from=search&sts=1&nob=1&local_signal_search=1&content_source=1e747937-6237-450b-903c-e16cfa7504b6%253ALTc6febbeea6174855432740b0616482d

In [27]:
df.head()

|  | product_id | product_title | old_price | discount_percentage | now_price | currency_symbol | currency_txt | listed_date | product_url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1289965137 | Tote Bag Prénom Personnalisé - Idéal pour Cade... | 7.9 |  | 7.9 | P |  |  | https://www.etsy.com/fr/listing/1289965137/tot... |
| 1 | 1289965137 | Tote Bag Prénom Personnalisé - Idéal pour Cade... | 7.9 |  | 7.9 | P |  |  | https://www.etsy.com/fr/listing/1289965137/tot... |
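
Both rows above carry the same `product_id`, and the log lists the same listing for products 1/2 and 2/2. This suggests, although it is not confirmed here, that the search grid exposes more than one anchor per listing. A minimal sketch of de-duplicating the collected links by listing ID before slicing to `limit` follows; `unique_listing_urls` is a hypothetical helper and the query strings in the sample URLs are shortened placeholders.

```python
import re

def unique_listing_urls(urls, limit=None):
    """Keep the first URL seen for each listing ID, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        if not url:
            continue  # get_attribute("href") can return None
        match = re.search(r"/listing/(\d+)", url)
        key = match.group(1) if match else url  # fall back to the full URL
        if key in seen:
            continue
        seen.add(key)
        unique.append(url)
    return unique[:limit]  # limit=None keeps everything

sample = [
    "https://www.etsy.com/fr/listing/1289965137/tote-bag?ref=a",
    "https://www.etsy.com/fr/listing/1289965137/tote-bag?ref=b",
    "https://www.etsy.com/fr/listing/1836666545/tote-bag?ref=c",
]
print(len(unique_listing_urls(sample)))           # 2
print(len(unique_listing_urls(sample, limit=1)))  # 1
```

In `scrape_products`, this would be applied to the list of `href` values before the `[:limit]` slice, so that `limit=2` yields two distinct listings.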


# tried

In [25]:
import time
import re
import json
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product as iter_product
from datetime import datetime

def get_prices(driver):
    """
    Extract now price, old price, and discount percentage.
    Returns: now_price, old_price, discount_percentage
    """
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace(",", ".").replace("+", "")
            try:
                value = float(re.sub(r"[^\d.]", "", text))
            except ValueError:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    discount_percentage = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, discount_percentage

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    # Collect product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # --- Product ID ---
        product_id_match = re.search(r"/listing/(\d+)", url)
        product_id = product_id_match.group(1) if product_id_match else None

        # --- Title ---
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # --- Listed date from JSON-LD ---
        try:
            json_ld = driver.find_element(By.XPATH, "//script[@type='application/ld+json']").get_attribute("innerHTML")
            data = json.loads(json_ld)

            if isinstance(data, list):
                product_data = next((item for item in data if item.get('@type') == 'Product'), None)
            else:
                product_data = data if data.get('@type') == 'Product' else None

            if product_data and 'releaseDate' in product_data:
                listed_date = datetime.strptime(product_data['releaseDate'], "%Y-%m-%d")
            else:
                listed_date = None
        except:
            listed_date = None

        # --- Currency (fixed) ---
        try:
            # Grab the visible price <p> text
            price_elem = driver.find_element(By.XPATH, "//div[@data-selector='price-only']//p[contains(@class,'wt-text-title-larger')]")
            price_text = price_elem.text.strip()
            # Remove screen-reader label if present
            price_text = re.sub(r"Price:|Original Price:", "", price_text, flags=re.I).strip()
            # Extract first non-digit, non-dot, non-comma, non-plus character as currency
            match = re.search(r"[^\d.,+]", price_text)
            currency = match.group(0) if match else None
        except:
            currency = None

        # --- Variants ---
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                # Single product, no variant
                now_price, old_price, discount_percentage = get_prices(driver)
                results.append({
                    "product_id": product_id,
                    "product_title": title,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "now_price": now_price,
                    "currency": currency,
                    "listed_date": listed_date,
                    "product_url": url
                })
            else:
                # Multiple variants
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                # Generate all variant combinations
                for combo in iter_product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(1)
                                    break

                        now_price, old_price, discount_percentage = get_prices(driver)

                        results.append({
                            "product_id": product_id,
                            "product_title": title,
                            "old_price": old_price,
                            "discount_percentage": discount_percentage,
                            "now_price": now_price,
                            "currency": currency,
                            "listed_date": listed_date,
                            "product_url": url
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process variant {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)

if __name__ == "__main__":
    df = scrape_products(limit=2)
    df.to_csv("../data/clean/etsy_products.csv", index=False)
    print("[SUCCESS] CSV saved!")
df.head()


[INFO] Scraping product 1/2: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=f4b62246-d1bf-4230-8ef8-dc72122bff1b%3ALTc7d98abdf5a6a84d8fcd6000c069e4ae937c38e7&click_sum=c3162c1d&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-7722-1-1&sr_prefetch=1&pf_from=search&pop=1&sts=1&local_signal_search=1&content_source=f4b62246-d1bf-4230-8ef8-dc72122bff1b%253ALTc7d98abdf5a6a84d8fcd6000c069e4ae937c38e7
[INFO] Scraping product 2/2: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=f4b62246-d1bf-4230-8ef8-dc72122bff1b%3ALTc7d98abdf5a6a84d8fcd6000c069e4ae937c38e7&click_sum=c3162c1d&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-7722-1-1&sr_prefetch=1&pf_from=search&pop=1&sts=1&local_signal_search=1&content_source=f4b62246-d1bf-4230-8ef8-dc72122bff1b%253ALTc7d98abdf5a6a84d8fcd6000c069e4ae937

|  | product_id | product_title | old_price | discount_percentage | now_price | currency | listed_date | product_url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1836666545 | Tote Bag Petit Bazar Personnalisé - Idéal pour... | 7.9 |  | 7.9 | P |  | https://www.etsy.com/fr/listing/1836666545/tot... |
| 1 | 1836666545 | Tote Bag Petit Bazar Personnalisé - Idéal pour... | 7.9 |  | 7.9 | P |  | https://www.etsy.com/fr/listing/1836666545/tot... |
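
The `listed_date` column is empty in the rows above. The scraper reads only the first `application/ld+json` script tag, so one possible cause is that the `Product` object sits in a later tag. Another is that the page does not expose `releaseDate` at all, which cannot be verified from this notebook. The sketch below scans several JSON-LD strings for the first `Product` object; `find_product` and `parse_listed_date` are hypothetical helpers, and the sample blobs are invented.

```python
import json
from datetime import datetime

def find_product(json_ld_blobs):
    """Return the first JSON-LD object whose @type is 'Product', or None."""
    for blob in json_ld_blobs:
        try:
            data = json.loads(blob)
        except (TypeError, ValueError):
            continue  # skip empty or malformed script contents
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None

def parse_listed_date(product, key="releaseDate"):
    """Parse a YYYY-MM-DD date from the product object, or return None."""
    if not product or key not in product:
        return None
    try:
        return datetime.strptime(product[key], "%Y-%m-%d")
    except (TypeError, ValueError):
        return None

# Invented sample data: a non-product blob, a broken blob, then a product list.
blobs = [
    '{"@type": "WebSite"}',
    "not json",
    '[{"@type": "Product", "releaseDate": "2024-01-15"}]',
]
print(parse_listed_date(find_product(blobs)))  # 2024-01-15 00:00:00
```

With Selenium, the input list could be built from `driver.find_elements(By.XPATH, "//script[@type='application/ld+json']")` by reading each element's `innerHTML` attribute.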


### FR VERSION

In [None]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    """
    Extract now price, old price, and calculate percentage difference.
    Returns: now_price, old_price, percentage_difference
    """
    now_price, old_price = None, None

    try:
        # Grab all relevant price elements
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except ValueError:
                continue

            # Determine if strikethrough -> old price
            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        # Fallback if only one price found
        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    # Calculate percentage difference
    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None

    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    # Search page for tote bags
    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    # Collect product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # --- Title ---
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # --- Rating ---
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # --- Reviews ---
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # --- Delivery ---
        try:
            delivery_elem = driver.find_element(By.XPATH, "//span[contains(text(),'livraison') or contains(text(),'delivery')]")
            delivery_text = delivery_elem.text.strip()
            delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
        except:
            delivery = None

        # --- Variants ---
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                # Single price
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "URL": url, "Title": title, "Rating": rating,
                    "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                    "Delivery": delivery, "Option": None,
                    "Old_Price": old_price, "Now_Price": now_price,
                    "Percentage_Difference_Price": percentage_difference_price
                })
            else:
                # Handle variants
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                # Generate all combinations
                for combo in product(*all_options):
                    try:
                        # Click each option
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        # Extract prices
                        now_price, old_price, percentage_difference_price = get_prices(driver)

                        results.append({
                            "URL": url, "Title": title, "Rating": rating,
                            "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                            "Delivery": delivery, "Option": " | ".join(combo),
                            "Old_Price": old_price, "Now_Price": now_price,
                            "Percentage_Difference_Price": percentage_difference_price
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")

# Dedented so the notebook renders the preview as the cell's last expression
df.head(10)
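
The review-count logic in the cell above (a number in parentheses, optionally suffixed with "k") can be isolated into a pure function that is easy to test without a browser. This is a sketch under the assumption that the header text looks like the invented samples below; `parse_review_count` is a hypothetical name.

```python
import re

def parse_review_count(header_text):
    """Parse the number in parentheses; return 0 when none is found."""
    match = re.search(r"\(([^)]*)\)", header_text or "")
    if not match:
        return 0
    raw = match.group(1).strip().lower()
    if raw.endswith("k"):
        # "1,2k" uses a decimal comma on the French site
        number = raw[:-1].strip().replace(",", ".")
        return int(round(float(number) * 1000))
    # Drop thousands separators (spaces, dots, commas) and keep the digits
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else 0

print(parse_review_count("Avis (1,2k)"))   # 1200
print(parse_review_count("Avis (3 456)"))  # 3456
print(parse_review_count("Avis"))          # 0
```

Rounding before the `int` conversion guards against floating-point products that land just below a whole number, which plain truncation would round down.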

### Currency & Listed date

In [24]:
import time
import re
import json
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product as iter_product
from datetime import datetime

def get_prices(driver):
    """
    Extract now price, old price, and discount percentage.
    Returns: now_price, old_price, discount_percentage
    """
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace(",", ".").replace("+", "")
            try:
                # Remove non-numeric symbols to convert to float
                value = float(re.sub(r"[^\d.]", "", text))
            except ValueError:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    discount_percentage = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, discount_percentage

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    # Collect product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # --- Product ID ---
        product_id_match = re.search(r"/listing/(\d+)", url)
        product_id = product_id_match.group(1) if product_id_match else None

        # --- Title ---
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # --- Listed date from JSON-LD ---
        try:
            json_ld = driver.find_element(By.XPATH, "//script[@type='application/ld+json']").get_attribute("innerHTML")
            data = json.loads(json_ld)

            if isinstance(data, list):
                product_data = next((item for item in data if item.get('@type') == 'Product'), None)
            else:
                product_data = data if data.get('@type') == 'Product' else None

            if product_data and 'releaseDate' in product_data:
                listed_date = datetime.strptime(product_data['releaseDate'], "%Y-%m-%d")
            else:
                listed_date = None
        except:
            listed_date = None

        # --- Currency (extract symbol only) ---
        try:
            currency_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title')]/span")
            text = currency_elem.text.strip()
            match = re.search(r"([^\d.,\s]+)", text)
            currency = match.group(1) if match else None
        except:
            currency = None

        # --- Variants ---
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                # Single product, no variant
                now_price, old_price, discount_percentage = get_prices(driver)
                results.append({
                    "product_id": product_id,
                    "product_title": title,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "now_price": now_price,
                    "currency": currency,
                    "listed_date": listed_date,
                    "product_url": url
                })
            else:
                # Multiple variants
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                # Generate all variant combinations
                for combo in iter_product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(1)
                                    break

                        now_price, old_price, discount_percentage = get_prices(driver)

                        results.append({
                            "product_id": product_id,
                            "product_title": title,
                            "old_price": old_price,
                            "discount_percentage": discount_percentage,
                            "now_price": now_price,
                            "currency": currency,
                            "listed_date": listed_date,
                            "product_url": url
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process variant {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)

if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/etsy_products.csv", index=False)
    print("[SUCCESS] CSV saved!")

df.head(10)


[INFO] Scraping product 1/10: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=49ce3529-c623-42ca-b5d9-4efee03a9e80%3ALTc8e05e2b8c3c7dd9258eb646e6055b1683895fd9&click_sum=30f2d5f6&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-983877-1-1&sr_prefetch=1&pf_from=search&pop=1&sts=1&local_signal_search=1&content_source=49ce3529-c623-42ca-b5d9-4efee03a9e80%253ALTc8e05e2b8c3c7dd9258eb646e6055b1683895fd9
[INFO] Scraping product 2/10: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=49ce3529-c623-42ca-b5d9-4efee03a9e80%3ALTc8e05e2b8c3c7dd9258eb646e6055b1683895fd9&click_sum=30f2d5f6&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-983877-1-1&sr_prefetch=1&pf_from=search&pop=1&sts=1&local_signal_search=1&content_source=49ce3529-c623-42ca-b5d9-4efee03a9e80%253ALTc8e05e2b8c3c7dd9258eb646e6055

Unnamed: 0,product_id,product_title,old_price,discount_percentage,now_price,currency,listed_date,product_url
0,1836666545,Tote Bag Petit Bazar Personnalisé - Idéal pour...,7.9,,7.9,Prix,,https://www.etsy.com/fr/listing/1836666545/tot...
1,1836666545,Tote Bag Petit Bazar Personnalisé - Idéal pour...,7.9,,7.9,Prix,,https://www.etsy.com/fr/listing/1836666545/tot...
2,1489928611,Tote bag personnalisé pour professeur : toile ...,62.32,50.02,31.15,Prix,,https://www.etsy.com/fr/listing/1489928611/tot...
3,1489928611,Tote bag personnalisé pour professeur : toile ...,62.32,50.02,31.15,Prix,,https://www.etsy.com/fr/listing/1489928611/tot...
4,1825286680,Double Pocket Soft Corduroy Tote Bag (Dark Bro...,138.58,25.0,103.93,Maintenant,,https://www.etsy.com/fr/listing/1825286680/dou...
5,1825286680,Double Pocket Soft Corduroy Tote Bag (Dark Bro...,138.58,25.0,103.93,Maintenant,,https://www.etsy.com/fr/listing/1825286680/dou...
6,953673271,Sac en jute personnalisé – Tote bag cabas natu...,18.9,,18.9,Prix,,https://www.etsy.com/fr/listing/953673271/sac-...
7,953673271,Sac en jute personnalisé – Tote bag cabas natu...,18.9,,18.9,Prix,,https://www.etsy.com/fr/listing/953673271/sac-...
8,1075693684,"Sac personnalisé pour Enfant, tote bag, pochon...",8.99,,8.99,Prix,,https://www.etsy.com/fr/listing/1075693684/sac...
9,1075693684,"Sac personnalisé pour Enfant, tote bag, pochon...",8.99,,8.99,Prix,,https://www.etsy.com/fr/listing/1075693684/sac...
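Every listing appears twice in the output above, which suggests the search-grid XPath matches more than one anchor per listing card. A minimal sketch of one way to de-duplicate the collected links before scraping, assuming the numeric ID in the `/listing/<id>/` path segment identifies a product. `dedupe_listing_urls` is a hypothetical helper (not part of the scraper above), and the sample URLs are shortened placeholders:

```python
import re

def dedupe_listing_urls(urls, limit=None):
    """Keep the first URL seen for each listing ID, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        if not url:
            continue  # skip anchors without an href
        match = re.search(r"/listing/(\d+)", url)
        # Fall back to the full URL when no listing ID is present
        key = match.group(1) if match else url
        if key in seen:
            continue
        seen.add(key)
        unique.append(url)
    return unique if limit is None else unique[:limit]

# Placeholder URLs for illustration only
sample = [
    "https://www.etsy.com/fr/listing/1836666545/tote-bag?ref=a",
    "https://www.etsy.com/fr/listing/1836666545/tote-bag?ref=b",
    "https://www.etsy.com/fr/listing/1489928611/tote-bag-prof",
]
print(dedupe_listing_urls(sample, limit=10))
```

Applying the limit after de-duplication means `limit=10` would yield up to ten distinct listings instead of five listings scraped twice.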


### VARIATIONS

In [23]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product as iter_product
from datetime import datetime

def get_prices(driver):
    """
    Extract now price, old price, and discount percentage.
    Returns: now_price, old_price, discount_percentage
    """
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            # Strip currency symbols and normalise the decimal comma before parsing
            text = elem.text.strip().replace("€", "").replace("$", "").replace(",", ".").replace("+", "")
            try:
                value = float(text)
            except ValueError:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    discount_percentage = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, discount_percentage

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    # Collect product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # --- Product ID ---
        product_id_match = re.search(r"/listing/(\d+)", url)
        product_id = product_id_match.group(1) if product_id_match else None

        # --- Title ---
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

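        # --- Listed date ---
        # NOTE: the XPath and the "%b %d, %Y" format assume the English page; the
        # /fr/ site localizes this label, which is likely why listed_date stays None.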
        try:
            listed_elem = driver.find_element(By.XPATH, "//div[contains(text(),'Listed on')]")
            listed_text = listed_elem.text.strip()
            listed_date = datetime.strptime(re.search(r"Listed on (.*)", listed_text).group(1), "%b %d, %Y")
        except:
            listed_date = None

        # --- Currency ---
        try:
            currency_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title')]/span")
            currency = re.search(r"[^\d.,]+", currency_elem.text.strip()).group(0)
        except:
            currency = None

        # --- Variants ---
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                # Single product, no variant
                now_price, old_price, discount_percentage = get_prices(driver)
                results.append({
                    "product_id": product_id,
                    "product_title": title,
                    "old_price": old_price,
                    "discount_percentage": discount_percentage,
                    "now_price": now_price,
                    "currency": currency,
                    "listed_date": listed_date,
                    "product_url": url
                })
            else:
                # Multiple variants
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                # Generate all variant combinations
                for combo in iter_product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(1)
                                    break

                        now_price, old_price, discount_percentage = get_prices(driver)

                        results.append({
                            "product_id": product_id,
                            "product_title": title,
                            "old_price": old_price,
                            "discount_percentage": discount_percentage,
                            "now_price": now_price,
                            "currency": currency,
                            "listed_date": listed_date,
                            "product_url": url
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process variant {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)

if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/etsy_products.csv", index=False)
    print("[SUCCESS] CSV saved!")

df.head(10)


[INFO] Scraping product 1/10: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=b16b0dd0-2600-402f-8946-1eb9b985c803%3ALTae3da8157379bbbba1166fe4b528fce1a48936d6&click_sum=49f7a704&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-1063293-1-1&sr_prefetch=1&pf_from=search&pop=1&sts=1&local_signal_search=1&content_source=b16b0dd0-2600-402f-8946-1eb9b985c803%253ALTae3da8157379bbbba1166fe4b528fce1a48936d6
[INFO] Scraping product 2/10: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=b16b0dd0-2600-402f-8946-1eb9b985c803%3ALTae3da8157379bbbba1166fe4b528fce1a48936d6&click_sum=49f7a704&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-1063293-1-1&sr_prefetch=1&pf_from=search&pop=1&sts=1&local_signal_search=1&content_source=b16b0dd0-2600-402f-8946-1eb9b985c803%253ALTae3da8157379bbbba1166fe4b52

Unnamed: 0,product_id,product_title,old_price,discount_percentage,now_price,currency,listed_date,product_url
0,1836666545,Tote Bag Petit Bazar Personnalisé - Idéal pour...,7.9,,7.9,Prix :,,https://www.etsy.com/fr/listing/1836666545/tot...
1,1836666545,Tote Bag Petit Bazar Personnalisé - Idéal pour...,7.9,,7.9,Prix :,,https://www.etsy.com/fr/listing/1836666545/tot...
2,1239066659,Tote bag brodé jute / Embroidered tote bag aes...,17.0,25.0,12.75,Maintenant,,https://www.etsy.com/fr/listing/1239066659/tot...
3,1239066659,Tote bag brodé jute / Embroidered tote bag aes...,17.0,25.0,12.75,Maintenant,,https://www.etsy.com/fr/listing/1239066659/tot...
4,1289965137,Tote Bag Prénom Personnalisé - Idéal pour Cade...,7.9,,7.9,Prix :,,https://www.etsy.com/fr/listing/1289965137/tot...
5,1289965137,Tote Bag Prénom Personnalisé - Idéal pour Cade...,7.9,,7.9,Prix :,,https://www.etsy.com/fr/listing/1289965137/tot...
6,1075693684,"Sac personnalisé pour Enfant, tote bag, pochon...",8.99,,8.99,Prix :,,https://www.etsy.com/fr/listing/1075693684/sac...
7,1075693684,"Sac personnalisé pour Enfant, tote bag, pochon...",8.99,,8.99,Prix :,,https://www.etsy.com/fr/listing/1075693684/sac...
8,4390172313,"Sac cabas en velours côtelé personnalisé, sac ...",28.99,45.02,15.94,Maintenant,,https://www.etsy.com/fr/listing/4390172313/sac...
9,4390172313,"Sac cabas en velours côtelé personnalisé, sac ...",28.99,45.02,15.94,Maintenant,,https://www.etsy.com/fr/listing/4390172313/sac...
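In the output above, the `currency` column holds label text (`Prix :`, `Maintenant`) rather than a currency symbol, apparently because the regex `[^\d.,]+` returns the first non-numeric run of the element text. A minimal sketch of parsing the amount and the symbol separately, assuming the price text looks something like `Prix : 7,90 €`. The exact strings Etsy renders are an assumption here, `parse_price_text` is a hypothetical helper, and thousands separators are not handled:

```python
import re

# Symbols to look for; extend this if other currencies appear (assumption)
CURRENCY_SYMBOLS = "€$£"

def parse_price_text(text):
    """Return (amount, currency_symbol) from a price string, or (None, None)."""
    if not text:
        return None, None
    number_match = re.search(r"\d+(?:[.,]\d+)?", text)
    if not number_match:
        return None, None
    # Normalise a decimal comma ("7,90") to a decimal point before converting
    amount = float(number_match.group(0).replace(",", "."))
    symbol_match = re.search(f"[{CURRENCY_SYMBOLS}]", text)
    symbol = symbol_match.group(0) if symbol_match else None
    return amount, symbol

# Illustrative strings, not captured from a real page
for raw in ["Prix : 7,90 €", "Maintenant 12,75 €", "$103.93+"]:
    print(raw, "->", parse_price_text(raw))
```

Searching for the symbol explicitly keeps labels such as `Prix :` out of the `currency` column, whichever language the page is served in.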


---

### REVIEWS DONE!

In [50]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "URL": url, "Title": title, "Rating": rating,
                    "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                    "Option": None,
                    "Old_Price": old_price, "Now_Price": now_price,
                    "Percentage_Difference_Price": percentage_difference_price
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        now_price, old_price, percentage_difference_price = get_prices(driver)

                        results.append({
                            "URL": url, "Title": title, "Rating": rating,
                            "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                            "Option": " | ".join(combo),
                            "Old_Price": old_price, "Now_Price": now_price,
                            "Percentage_Difference_Price": percentage_difference_price
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=2)
    df.to_csv("../data/raw/00_raw_data.csv", index=False)
    print("[SUCCESS] RAW DATA CSV saved!")


[INFO] Scraping product 1/2: https://www.etsy.com/fr/listing/1778190169/sac-fourre-tout-en-coton-matelasse?click_key=9420d0b6-cfdc-4f91-b59c-110c0f9bf256%3ALT5262ab91d5288cfaa6451dba67d3e2c9fdebecaa&click_sum=27133ae6&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-304429-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&content_source=9420d0b6-cfdc-4f91-b59c-110c0f9bf256%253ALT5262ab91d5288cfaa6451dba67d3e2c9fdebecaa
[INFO] Scraping product 2/2: https://www.etsy.com/fr/listing/1778190169/sac-fourre-tout-en-coton-matelasse?click_key=9420d0b6-cfdc-4f91-b59c-110c0f9bf256%3ALT5262ab91d5288cfaa6451dba67d3e2c9fdebecaa&click_sum=27133ae6&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-304429-1-1&sr_prefetch=1&pf_from=search&pro=1&pop=1&content_source=9420d0b6-cfdc-4f91-b59c-110c0f9bf256%253ALT5262ab91d5288cfaa6451dba67d3e2c9fdebecaa
Stacktrace:
Symbols not available. Dumping u

In [51]:
df.head()

Unnamed: 0,URL,Title,Rating,txt_reviews,nbr_reviews,Option,Old_Price,Now_Price,Percentage_Difference_Price
0,https://www.etsy.com/fr/listing/1778190169/sac...,Sac fourre-tout en coton matelassé multicolore...,4.6071,Avis sur cet article (40),40,,43.56,21.78,50.0
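The `nbr_reviews` value above comes from the number in parentheses in the review header (`Avis sur cet article (40)`). A minimal sketch of that parsing step as a standalone function, mirroring the inline logic of the scraper so it can be checked without a browser. `parse_review_count` is a hypothetical name, and every input other than the header shown above is an assumed format:

```python
import re

def parse_review_count(header_text):
    """Return the count in parentheses, expanding a trailing K/k.

    Returns None when there is no header text and 0 when the header has
    no parenthesised count, matching the inline scraper logic.
    """
    if not header_text:
        return None
    match = re.search(r"\(([^)]*)\)", header_text)
    if not match:
        return 0
    raw = match.group(1).strip()
    if raw.lower().endswith("k"):
        # "1,2k" or "2.5K": the separator is a decimal mark here
        value = float(raw[:-1].strip().replace(",", "."))
        return int(round(value * 1000))
    # Otherwise separators are thousands marks: drop everything but digits
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else 0

print(parse_review_count("Avis sur cet article (40)"))
```

Using `round` before `int` is a deliberate difference from the inline version: it guards against floating-point products that land just below a whole number.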


### VIEW DATASET

In [None]:
df.head(10)

In [None]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    now_price, old_price = None, None
    try:
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except:
                continue

            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except:
        now_price, old_price = None, None

    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None
    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote%20bag"
    driver.get(search_url)
    time.sleep(5)

    # Updated XPath for Etsy search results
    product_links_elements = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//a[@data-listing-id]")
    ))
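    # Earlier runs scraped each listing twice (duplicate anchors), so keep only the
    # first href per URL path (query string ignored) before applying the limit.
    unique_links = {}
    for elem in product_links_elements:
        href = elem.get_attribute("href")
        if href:
            unique_links.setdefault(href.split("?")[0], href)
    product_links = list(unique_links.values())[:limit]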

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "URL": url, "Title": title, "Rating": rating,
                    "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                    "Option": None,
                    "Old_Price": old_price, "Now_Price": now_price,
                    "Percentage_Difference_Price": percentage_difference_price
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        now_price, old_price, percentage_difference_price = get_prices(driver)

                        results.append({
                            "URL": url, "Title": title, "Rating": rating,
                            "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                            "Option": " | ".join(combo),
                            "Old_Price": old_price, "Now_Price": now_price,
                            "Percentage_Difference_Price": percentage_difference_price
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")

df.head(10)

### TRY

In [None]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def extract_prices(driver):
    """Extract now_price, old_price (if any), and discount_percentage (if any)."""
    now_price, old_price, discount_percentage = None, None, None

    try:
        price_container = driver.find_element(By.XPATH, "//div[@data-selector='price-only']")

        # Current price (after discount or just price)
        now_elem = price_container.find_element(
            By.XPATH, ".//p[contains(@class,'wt-text-title-larger')]//span[not(contains(@class,'wt-text-strikethrough'))]"
        )
        now_text = now_elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
        now_price = float(now_text)

        # Old price (if any)
        try:
            old_elem = price_container.find_element(By.XPATH, ".//span[contains(@class,'wt-text-strikethrough')]")
            old_text = old_elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            old_price = float(old_text)

            # Discount %
            discount_percentage = round((old_price - now_price) / old_price * 100, 2)
        except:
            old_price = None
            discount_percentage = None

    except Exception as e:
        print(f"[ERROR] Could not extract prices: {e}")

    return now_price, old_price, discount_percentage


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote%20bag"
    driver.get(search_url)
    time.sleep(5)

    # Updated XPath for Etsy search results
    product_links_elements = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//a[@data-listing-id]")
    ))
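    # Earlier runs scraped each listing twice (duplicate anchors), so keep only the
    # first href per URL path (query string ignored) before applying the limit.
    unique_links = {}
    for elem in product_links_elements:
        href = elem.get_attribute("href")
        if href:
            unique_links.setdefault(href.split("?")[0], href)
    product_links = list(unique_links.values())[:limit]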

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, discount_percentage = extract_prices(driver)
                results.append({
                    "URL": url, "Title": title, "Rating": rating,
                    "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                    "Option": None,
                    "Old_Price": old_price, "Now_Price": now_price,
                    "Discount_Percentage": discount_percentage
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select')) and not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        now_price, old_price, discount_percentage = extract_prices(driver)

                        results.append({
                            "URL": url, "Title": title, "Rating": rating,
                            "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                            "Option": " | ".join(combo),
                            "Old_Price": old_price, "Now_Price": now_price,
                            "Discount_Percentage": discount_percentage
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")
    print(df.head(10))


### ANOTHER TRY

In [None]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def extract_prices(driver):
    """
    Extract now_price, old_price (if any), and discount_percentage (if any) from Etsy product page.
    """
    now_price, old_price, discount_percentage = None, None, None
    try:
        price_container = driver.find_element(By.XPATH, "//div[@data-selector='price-only']")

        # Get all <span> inside the price container
        spans = price_container.find_elements(By.TAG_NAME, "span")
        price_values = []
        for sp in spans:
            text = sp.text.strip().replace("€", "").replace("$", "").replace("+", "").replace(",", ".")
            if text:
                try:
                    price_values.append(float(text))
                except:
                    continue

        if len(price_values) == 1:
            now_price = price_values[0]
            old_price = None
            discount_percentage = None
        elif len(price_values) >= 2:
            now_price = price_values[0]
            old_price = price_values[1]
            discount_percentage = round((old_price - now_price) / old_price * 100, 2)

    except Exception as e:
        print(f"[ERROR] Could not extract prices: {e}")

    return now_price, old_price, discount_percentage


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"  # English page
    driver.get(search_url)
    time.sleep(5)

    # Etsy search results: links with data-listing-id
    product_links_elements = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//a[@data-listing-id]")
    ))
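    # Earlier runs scraped each listing twice (duplicate anchors), so keep only the
    # first href per URL path (query string ignored) before applying the limit.
    unique_links = {}
    for elem in product_links_elements:
        href = elem.get_attribute("href")
        if href:
            unique_links.setdefault(href.split("?")[0], href)
    product_links = list(unique_links.values())[:limit]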

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except Exception:
            txt_reviews = None
            nbr_reviews = None

        # Variants
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                now_price, old_price, discount_percentage = extract_prices(driver)
                results.append({
                    "URL": url, "Title": title, "Rating": rating,
                    "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                    "Option": None,
                    "Old_Price": old_price, "Now_Price": now_price,
                    "Discount_Percentage": discount_percentage
                })
            else:
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                for combo in product(*all_options):
                    try:
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        now_price, old_price, discount_percentage = extract_prices(driver)

                        results.append({
                            "URL": url, "Title": title, "Rating": rating,
                            "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                            "Option": " | ".join(combo),
                            "Old_Price": old_price, "Now_Price": now_price,
                            "Discount_Percentage": discount_percentage
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/raw/raw_data.csv", index=False)
    print("[SUCCESS] CSV saved!")
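
The price parsing above replaces every comma with a dot before calling `float`, so a price shown with a thousands separator (for example `1,299.00`) fails to parse and is silently skipped. Below is a minimal sketch of a more tolerant parser. `parse_price` is a hypothetical helper, and its rule (a final separator followed by one or two digits is the decimal mark) is an assumption about how prices are displayed, not something confirmed against Etsy pages.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Convert a displayed price such as '€1.299,00' or '$15.00+' to a float."""
    # Keep only digits and the two possible separators
    cleaned = re.sub(r"[^\d.,]", "", text)
    if not re.search(r"\d", cleaned):
        return None
    # Assumption: a final separator followed by 1-2 digits is the decimal mark
    match = re.search(r"[.,](\d{1,2})$", cleaned)
    if match:
        integer_part = re.sub(r"[.,]", "", cleaned[:match.start()])
        return float(f"{integer_part or '0'}.{match.group(1)}")
    # No decimal part: treat every separator as a thousands separator
    return float(re.sub(r"[.,]", "", cleaned))


for raw in ("$15.00+", "€1.299,00", "1,299.00", "Free"):
    print(raw, "->", parse_price(raw))
```

A helper like this could replace the chained `.replace(...)` calls inside the span loop, at the cost of guessing wrong for three-decimal currencies.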


### PRODUCT INFO EXTRACTION

In [None]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product
from datetime import datetime

def extract_prices(driver):
    """Extract now_price, old_price, discount_percentage, and currency from Etsy product page."""
    now_price, old_price, discount_percentage, currency = None, None, None, None
    try:
        price_container = driver.find_element(By.XPATH, "//div[@data-selector='price-only']")
        spans = price_container.find_elements(By.TAG_NAME, "span")
        price_values = []

        for sp in spans:
            text = sp.text.strip().replace(",", ".")
            # Extract currency if present
            if not currency and re.search(r"[€$£]", text):
                currency = re.search(r"[€$£]", text).group()
            # Strip currency symbols, whitespace and a trailing "+" before parsing
            text_clean = re.sub(r"[€$£\s+]", "", text)
            if text_clean:
                try:
                    price_values.append(float(text_clean))
                except ValueError:
                    continue

        if len(price_values) == 1:
            now_price = price_values[0]
        elif len(price_values) >= 2:
            now_price = price_values[0]
            old_price = price_values[1]
            discount_percentage = round((old_price - now_price) / old_price * 100, 2)

        # Normalize currency codes
        if currency == "$":
            currency = "USD"
        elif currency == "€":
            currency = "EUR"
        elif currency == "£":
            currency = "GBP"

    except Exception as e:
        print(f"[ERROR] Could not extract prices: {e}")
    return now_price, old_price, discount_percentage, currency

def extract_description(driver):
    """Extract product description."""
    try:
        desc_elem = driver.find_element(By.XPATH, "//div[@data-id='description-text']")
        return desc_elem.text.strip()
    except Exception:
        return None

def extract_listed_date(driver):
    """Extract product listed date (if available)."""
    try:
        date_elem = driver.find_element(By.XPATH, "//div[contains(text(),'Listed on') or contains(text(),'Créé le')]")
        # Only day-first dates such as "27 November 2025" are recognised here
        match = re.search(r"(\d{1,2}\s\w+\s\d{4})", date_elem.text)
        if match:
            return datetime.strptime(match.group(1), "%d %B %Y")
    except Exception:
        return None

def extract_variations(driver):
    """Extract product variations as list of dicts."""
    variations_list = []
    try:
        variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
        if variant_sections:
            all_options = []
            for section in variant_sections:
                opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select'))]")
                option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                all_options.append(option_names)

            for combo in product(*all_options):
                variations_list.append({"variation": " | ".join(combo)})
    except Exception as e:
        print(f"[WARNING] Could not extract variations: {e}")
    return variations_list or None

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    product_links_elements = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//a[@data-listing-id]")
    ))
    product_links = [elem.get_attribute("href") for elem in product_links_elements][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Product ID from URL
        product_id_match = re.search(r"/listing/(\d+)", url)
        product_id = product_id_match.group(1) if product_id_match else None

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except Exception:
            title = None

        # Prices
        now_price, old_price, discount_percentage, currency = extract_prices(driver)

        # Description
        description = extract_description(driver)

        # Listed date
        listed_date = extract_listed_date(driver)

        # Variations
        variations = extract_variations(driver)

        results.append({
            "product_id": product_id,
            "product_title": title,
            "old_price": old_price,
            "discount_percentage": discount_percentage,
            "now_price": now_price,
            "currency": currency,
            "listed_date": listed_date,
            "product_url": url,
            "product_description": description,
            "product_variation": variations
        })

    driver.quit()
    return pd.DataFrame(results)

if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/raw/raw_data.csv", index=False)
    print("[SUCCESS] CSV saved!")
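
`extract_listed_date` matches only day-first dates such as `27 November 2025`, so a month-first string like `Nov 27, 2025` returns `None`. The sketch below tries both layouts. `parse_listed_date` and the format list are assumptions for illustration, and month names are parsed with the process locale, which is assumed to be English here.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layouts; extend the tuple if other formats appear in scraped text
DATE_PATTERNS = (
    (r"\b\d{1,2}\s+[A-Za-z]+\s+\d{4}\b", ("%d %B %Y", "%d %b %Y")),
    (r"\b[A-Za-z]+\s+\d{1,2},\s+\d{4}\b", ("%B %d, %Y", "%b %d, %Y")),
)


def parse_listed_date(text: str) -> Optional[datetime]:
    """Return the first date found in `text`, or None if nothing parses."""
    for pattern, formats in DATE_PATTERNS:
        match = re.search(pattern, text)
        if not match:
            continue
        for fmt in formats:
            try:
                return datetime.strptime(match.group(0), fmt)
            except ValueError:
                continue
    return None


print(parse_listed_date("Listed on Nov 27, 2025"))
print(parse_listed_date("Listed on 27 November 2025"))
```

French strings such as the `Créé le` variant would still need their own month-name mapping; that is left out of this sketch.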


### PRODUCT VARIATION CHANGES PRICES, URL, AND IMAGES

In [17]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from datetime import datetime

def extract_prices(driver):
    now_price, old_price, discount_percentage, currency = None, None, None, None
    try:
        price_container = driver.find_element(By.XPATH, "//div[@data-selector='price-only']")
        spans = price_container.find_elements(By.TAG_NAME, "span")
        price_values = []

        for sp in spans:
            text = sp.text.strip().replace(",", ".")
            if not currency and re.search(r"[€$£]", text):
                currency = re.search(r"[€$£]", text).group()
            # Strip currency symbols, whitespace and a trailing "+" before parsing
            text_clean = re.sub(r"[€$£\s+]", "", text)
            if text_clean:
                try:
                    price_values.append(float(text_clean))
                except ValueError:
                    continue

        if len(price_values) == 1:
            now_price = price_values[0]
        elif len(price_values) >= 2:
            now_price = price_values[0]
            old_price = price_values[1]
            discount_percentage = round((old_price - now_price) / old_price * 100, 2)

        if currency == "$":
            currency = "USD"
        elif currency == "€":
            currency = "EUR"
        elif currency == "£":
            currency = "GBP"

    except Exception as e:
        print(f"[ERROR] Could not extract prices: {e}")
    return now_price, old_price, discount_percentage, currency

def extract_description(driver):
    try:
        desc_elem = driver.find_element(By.XPATH, "//div[@data-id='description-text']")
        return desc_elem.text.strip()
    except Exception:
        return None

def extract_listed_date(driver):
    try:
        date_elem = driver.find_element(By.XPATH, "//div[contains(text(),'Listed on') or contains(text(),'Créé le')]")
        # Only day-first dates such as "27 November 2025" are recognised here
        match = re.search(r"(\d{1,2}\s\w+\s\d{4})", date_elem.text)
        if match:
            return datetime.strptime(match.group(1), "%d %B %Y")
    except Exception:
        return None

def recursive_variation_select(driver, sections, idx=0, current_combo=None, results=None):
    """Recursively select variations to cover all combinations."""
    if current_combo is None:
        current_combo = []
    if results is None:
        results = []

    if idx >= len(sections):
        # All options selected; extract prices for this combination
        now_price, old_price, discount_percentage, currency = extract_prices(driver)
        results.append({
            "variation": " | ".join(current_combo) if current_combo else None,
            "now_price": now_price,
            "old_price": old_price,
            "discount_percentage": discount_percentage,
            "currency": currency
        })
        return

    # Get current variation section and options
    section = sections[idx]
    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select'))]")
    for opt in opts:
        opt_name = opt.get_attribute("aria-label") or opt.text
        try:
            opt.click()
            time.sleep(1)  # Wait for dynamic price update
            recursive_variation_select(driver, sections, idx + 1, current_combo + [opt_name], results)
        except Exception as e:
            print(f"[WARNING] Could not click option {opt_name}: {e}")

def extract_all_variation_combinations(driver):
    results = []
    try:
        variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
        if not variant_sections:
            # No variations
            now_price, old_price, discount_percentage, currency = extract_prices(driver)
            results.append({
                "variation": None,
                "now_price": now_price,
                "old_price": old_price,
                "discount_percentage": discount_percentage,
                "currency": currency
            })
        else:
            recursive_variation_select(driver, variant_sections, results=results)
    except Exception as e:
        print(f"[WARNING] Could not extract variations: {e}")
    return results

def scrape_product(url):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    driver.get(url)
    time.sleep(5)

    # Product ID
    product_id_match = re.search(r"/listing/(\d+)", url)
    product_id = product_id_match.group(1) if product_id_match else None

    # Title
    try:
        title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
    except Exception:
        title = None

    # Description
    description = extract_description(driver)

    # Listed date
    listed_date = extract_listed_date(driver)

    # Variations with prices
    variations = extract_all_variation_combinations(driver)

    driver.quit()

    # Build final results
    final_results = []
    for var in variations:
        final_results.append({
            "product_id": product_id,
            "product_title": title,
            "old_price": var.get("old_price"),
            "discount_percentage": var.get("discount_percentage"),
            "now_price": var.get("now_price"),
            "currency": var.get("currency"),
            "listed_date": listed_date,
            "product_url": url,
            "product_description": description,
            "product_variation": var.get("variation")
        })

    return pd.DataFrame(final_results)

if __name__ == "__main__":
    test_url = "https://www.etsy.com/listing/4301871513/custom-canvas-tote-bagpersonalised-logo"
    df = scrape_product(test_url)
    df.to_csv("../data/raw/etsy_product_variations.csv", index=False)
    print("[SUCCESS] CSV saved with all variations!")


[SUCCESS] CSV saved with all variations!
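
`recursive_variation_select` reuses the `sections` elements captured before the first click. If the page re-renders its option controls after a selection, those handles can go stale (Selenium raises `StaleElementReferenceException` in that situation); whether Etsy does this has not been verified here. The sketch below shows the same depth-first walk against a small interface that looks options up by section index at every level. `walk_variations`, `FakePage`, and its methods are hypothetical names used only to illustrate the traversal, not Selenium APIs.

```python
from typing import Dict, List, Optional, Tuple


def walk_variations(page, idx: int = 0, combo: Optional[List[str]] = None,
                    rows: Optional[List[dict]] = None) -> List[dict]:
    """Depth-first walk over every option combination; one row per combination."""
    combo = [] if combo is None else combo
    rows = [] if rows is None else rows
    if idx >= page.section_count():
        rows.append({
            "product_variation": " | ".join(combo) or None,
            "now_price": page.read_price(),
        })
        return rows
    # Ask the page for section `idx` by index at this level instead of
    # reusing element handles captured before earlier clicks
    for name in page.option_names(idx):
        page.select(idx, name)
        walk_variations(page, idx + 1, combo + [name], rows)
    return rows


class FakePage:
    """Hypothetical in-memory stand-in for a listing page."""

    def __init__(self, options: List[List[str]], prices: Dict[Tuple[str, ...], float]):
        self.options = options
        self.prices = prices
        self.selected: Dict[int, str] = {}

    def section_count(self) -> int:
        return len(self.options)

    def option_names(self, section_idx: int) -> List[str]:
        return list(self.options[section_idx])

    def select(self, section_idx: int, name: str) -> None:
        self.selected[section_idx] = name

    def read_price(self) -> Optional[float]:
        key = tuple(self.selected.get(i) for i in range(len(self.options)))
        return self.prices.get(key)


page = FakePage(
    options=[["Small", "Large"], ["Black", "Natural"]],
    prices={("Small", "Black"): 10.0, ("Small", "Natural"): 10.0,
            ("Large", "Black"): 14.0, ("Large", "Natural"): 14.0},
)
rows = walk_variations(page)
print(len(rows), rows[0])
```

In a Selenium-backed version, `option_names` and `select` would each call `driver.find_elements(...)` again, which is the pattern the first scraper in this notebook already uses inside its combination loop.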


### PRODUCT VARIATIONS AS INDIVIDUAL PRODUCTS

In [19]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from datetime import datetime

def extract_prices(driver):
    now_price, old_price, discount_percentage, currency = None, None, None, None
    try:
        price_container = driver.find_element(By.XPATH, "//div[@data-selector='price-only']")
        spans = price_container.find_elements(By.TAG_NAME, "span")
        price_values = []

        for sp in spans:
            text = sp.text.strip().replace(",", ".")
            if not currency and re.search(r"[€$£]", text):
                currency = re.search(r"[€$£]", text).group()
            # Strip currency symbols, whitespace and a trailing "+" before parsing
            text_clean = re.sub(r"[€$£\s+]", "", text)
            if text_clean:
                try:
                    price_values.append(float(text_clean))
                except ValueError:
                    continue

        if len(price_values) == 1:
            now_price = price_values[0]
        elif len(price_values) >= 2:
            now_price = price_values[0]
            old_price = price_values[1]
            discount_percentage = round((old_price - now_price) / old_price * 100, 2)

        if currency == "$":
            currency = "USD"
        elif currency == "€":
            currency = "EUR"
        elif currency == "£":
            currency = "GBP"

    except Exception as e:
        print(f"[ERROR] Could not extract prices: {e}")
    return now_price, old_price, discount_percentage, currency

def extract_description(driver):
    try:
        desc_elem = driver.find_element(By.XPATH, "//div[@data-id='description-text']")
        return desc_elem.text.strip()
    except Exception:
        return None

def extract_listed_date(driver):
    try:
        date_elem = driver.find_element(By.XPATH, "//div[contains(text(),'Listed on') or contains(text(),'Créé le')]")
        # Only day-first dates such as "27 November 2025" are recognised here
        match = re.search(r"(\d{1,2}\s\w+\s\d{4})", date_elem.text)
        if match:
            return datetime.strptime(match.group(1), "%d %B %Y")
    except Exception:
        return None

def recursive_variation_select(driver, sections, idx=0, current_combo=None, results=None):
    """Recursively select variations visually and scrape price after each selection."""
    if current_combo is None:
        current_combo = []
    if results is None:
        results = []

    if idx >= len(sections):
        # All options selected, extract price for this combination
        now_price, old_price, discount_percentage, currency = extract_prices(driver)
        results.append({
            "product_variation": " | ".join(current_combo) if current_combo else None,
            "now_price": now_price,
            "old_price": old_price,
            "discount_percentage": discount_percentage,
            "currency": currency
        })
        return

    section = sections[idx]

    # Open dropdown if it exists
    try:
        dropdown = section.find_element(By.TAG_NAME, "summary")
        driver.execute_script("arguments[0].scrollIntoView(true);", dropdown)
        time.sleep(0.5)
        dropdown.click()
        time.sleep(0.5)
    except Exception:
        pass  # Some variations are already visible

    # Get all options
    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select'))]")
    for opt in opts:
        opt_name = opt.get_attribute("aria-label") or opt.text
        try:
            driver.execute_script("arguments[0].scrollIntoView(true);", opt)
            time.sleep(0.3)
            opt.click()
            time.sleep(1)  # Allow price update
            recursive_variation_select(driver, sections, idx + 1, current_combo + [opt_name], results)
        except Exception as e:
            print(f"[WARNING] Could not click option {opt_name}: {e}")

def extract_all_variation_combinations(driver):
    results = []
    try:
        variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
        if not variant_sections:
            # No variations
            now_price, old_price, discount_percentage, currency = extract_prices(driver)
            results.append({
                "product_variation": None,
                "now_price": now_price,
                "old_price": old_price,
                "discount_percentage": discount_percentage,
                "currency": currency
            })
        else:
            recursive_variation_select(driver, variant_sections, results=results)
    except Exception as e:
        print(f"[WARNING] Could not extract variations: {e}")
    return results

def scrape_product(url):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    driver.get(url)
    time.sleep(5)

    # Product ID
    product_id_match = re.search(r"/listing/(\d+)", url)
    product_id = product_id_match.group(1) if product_id_match else None

    # Title
    try:
        title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
    except Exception:
        title = None

    # Description
    description = extract_description(driver)

    # Listed date
    listed_date = extract_listed_date(driver)

    # Extract all variations with price info after selection
    variations = extract_all_variation_combinations(driver)

    driver.quit()

    # Build final DataFrame
    final_results = []
    for var in variations:
        final_results.append({
            "product_id": product_id,
            "product_title": title,
            "old_price": var.get("old_price"),
            "discount_percentage": var.get("discount_percentage"),
            "now_price": var.get("now_price"),
            "currency": var.get("currency"),
            "listed_date": listed_date,
            "product_url": url,
            "product_description": description,
            "product_variation": var.get("product_variation")
        })

    return pd.DataFrame(final_results)

if __name__ == "__main__":
    test_url = "https://www.etsy.com/listing/4301871513/custom-canvas-tote-bagpersonalised-logo"
    df = scrape_product(test_url)
    df.to_csv("../data/raw/etsy_product_variations.csv", index=False)
    print("[SUCCESS] CSV saved! Each row = one variation with its price and discount.")


[SUCCESS] CSV saved! Each row = one variation with its price and discount.
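
The discount formula in `extract_prices` assumes the second number found in the price block is the higher, pre-discount price. If the order is ever reversed, the result is a negative percentage. A small guarded version is sketched below. `compute_discount` is a hypothetical helper, and treating `old_price <= now_price` as "no discount" is an assumption.

```python
from typing import Optional


def compute_discount(now_price: Optional[float], old_price: Optional[float]) -> Optional[float]:
    """Return the discount in percent, or None when no valid discount applies."""
    if now_price is None or old_price is None:
        return None
    # Guard against division by zero and against swapped or equal prices
    if old_price <= 0 or old_price <= now_price:
        return None
    return round((old_price - now_price) / old_price * 100, 2)


print(compute_discount(15.0, 20.0))
print(compute_discount(20.0, 15.0))
```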


### STRUCTURED DATA EXTRACTION (JSON-LD AND META TAGS)

In [21]:
import time
import re
import json
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product
from datetime import datetime
from decimal import Decimal
from urllib.parse import urlparse

# ---------- HELPERS USING PAGE SOURCE (SAFE, NO EXTRA REQUESTS) ----------

def extract_structured_data(driver):
    """Parse JSON-LD Product and OpenGraph/meta tags from the HTML source."""
    html = driver.page_source

    product_ld = None
    for m in re.finditer(
        r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL | re.IGNORECASE,
    ):
        block = m.group(1)
        try:
            data = json.loads(block)
        except Exception:
            continue

        # Product may be dict or list
        if isinstance(data, dict) and data.get("@type") == "Product":
            product_ld = data
            break
        if isinstance(data, list):
            for item in data:
                if isinstance(item, dict) and item.get("@type") == "Product":
                    product_ld = item
                    break
        if product_ld:
            break

    # Basic fields from JSON-LD
    title = None
    description = None
    now_price = None
    currency = None

    if product_ld:
        title = product_ld.get("name")
        description = product_ld.get("description")
        offers = product_ld.get("offers")

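        if isinstance(offers, dict) and offers.get("@type") in ("Offer", "AggregateOffer"):
            # schema.org AggregateOffer exposes lowPrice/highPrice rather than price
            price = offers.get("price")
            if price is None:
                price = offers.get("lowPrice")
            currency = offers.get("priceCurrency")
            if price is not None:
                now_price = float(price)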
        elif isinstance(offers, list) and offers:
            first = offers[0]
            if isinstance(first, dict):
                price = first.get("price")
                currency = first.get("priceCurrency")
                if price is not None:
                    now_price = float(price)

    # Fallback title from <h1> if JSON-LD missing
    if not title:
        try:
            h1 = driver.find_element(By.XPATH, "//h1")
            title = h1.text.strip()
        except Exception:
            title = None

    # OG and product:price tags from HTML
    og_desc = None
    og_url = None
    meta_old_price = None
    meta_currency = None

    for m in re.finditer(
        r'<meta[^>]+property="([^"]+)"[^>]+content="([^"]*)"',
        html,
        flags=re.IGNORECASE,
    ):
        prop, content = m.group(1), m.group(2)
        if prop == "og:description":
            og_desc = content
        elif prop == "og:url":
            og_url = content
        elif prop == "product:price:amount":
            try:
                meta_old_price = float(content)
            except Exception:
                pass
        elif prop == "product:price:currency":
            meta_currency = content

    return {
        "product_ld": product_ld,
        "title": title,
        "description": description,
        "now_price_ld": now_price,
        "currency_ld": currency,
        "og_desc": og_desc,
        "og_url": og_url,
        "meta_old_price": meta_old_price,
        "meta_currency": meta_currency,
    }


# ---------- FIELD EXTRACTORS ----------

def extract_prices(driver):
    """
    Use JSON-LD / meta tags as primary source.
    Fallback to visible price block only if needed.
    """
    data = extract_structured_data(driver)

    now_price = data["now_price_ld"]
    old_price = None
    discount_percentage = None
    currency = data["currency_ld"] or data["meta_currency"]

    # Visible price (fallback): reuse the span-parsing logic, only if needed
    if now_price is None or not currency:
        try:
            price_container = driver.find_element(By.XPATH, "//div[@data-selector='price-only']")
            spans = price_container.find_elements(By.TAG_NAME, "span")
            price_values = []
            symbol = None

            for sp in spans:
                text = sp.text.strip().replace(",", ".")
                if not text:
                    continue
                # currency symbol
                if not symbol:
                    m = re.search(r"[€$£]", text)
                    if m:
                        symbol = m.group()
                text_clean = re.sub(r"[€$£\s]", "", text)
                if text_clean:
                    try:
                        price_values.append(float(text_clean))
                    except Exception:
                        continue

            if price_values:
                now_price = now_price or price_values[0]
                if len(price_values) >= 2:
                    old_price = price_values[1]

            if not currency and symbol:
                if symbol == "$":
                    currency = "USD"
                elif symbol == "€":
                    currency = "EUR"
                elif symbol == "£":
                    currency = "GBP"
        except Exception as e:
            print(f"[ERROR] Visible price fallback failed: {e}")

    # Old price from meta if it is higher than current
    meta_old = data["meta_old_price"]
    if meta_old is not None and now_price is not None and meta_old > now_price:
        old_price = float(meta_old)

    if old_price and now_price:
        discount_percentage = round((old_price - now_price) / old_price * 100, 2)

    return now_price, old_price, discount_percentage, currency


def extract_description(driver):
    """Prefer JSON-LD description, fallback to visible description div."""
    data = extract_structured_data(driver)
    if data["description"]:
        return data["description"]

    try:
        desc_elem = driver.find_element(By.XPATH, "//div[@data-id='description-text']")
        return desc_elem.text.strip()
    except Exception:
        return None


def extract_listed_date(driver):
    """
    Parse date from og:description ("Listed on Nov 27, 2025") which is more stable
    than scraping visible text blocks.
    """
    data = extract_structured_data(driver)
    text = data["og_desc"]
    if not text:
        return None

    marker = "Listed on "
    if marker not in text:
        return None

    part = text.split(marker, 1)[1].strip().rstrip(".")
    # handle e.g. "Nov 27, 2025"
    for fmt in ("%b %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(part, fmt)
        except Exception:
            continue
    return None


def extract_variations(driver):
    """
    Enumerate variation combinations from the option buttons, skipping empty names.
    This only reads content already loaded in the page; nothing is clicked.
    """
    variations_list = []
    try:
        variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
        if not variant_sections:
            return None

        all_options = []
        for section in variant_sections:
            # real option buttons, not the "Select an option" pseudo-button
            opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Select'))]")
            option_names = []
            for opt in opts:
                name = opt.get_attribute("aria-label") or opt.text
                name = (name or "").strip()
                if name:
                    option_names.append(name)
            if option_names:
                all_options.append(option_names)

        if not all_options:
            return None

        for combo in product(*all_options):
            variations_list.append({"variation": " | ".join(combo)})
    except Exception as e:
        print(f"[WARNING] Could not extract variations: {e}")

    return variations_list or None


# ---------- MAIN SCRAPER ----------

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    product_links_elements = wait.until(
        EC.presence_of_all_elements_located((By.XPATH, "//a[@data-listing-id]"))
    )
    product_links = []
    seen = set()
    for elem in product_links_elements:
        href = elem.get_attribute("href")
        if not href or href in seen:
            continue
        seen.add(href)
        product_links.append(href)
        if len(product_links) >= limit:
            break

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # Product ID from URL path
        product_id_match = re.search(r"/listing/(\d+)", url)
        product_id = product_id_match.group(1) if product_id_match else None

        # Structured data for this page
        sdata = extract_structured_data(driver)

        # Title
        title = sdata["title"]

        # Prices
        now_price, old_price, discount_percentage, currency = extract_prices(driver)

        # Description
        description = extract_description(driver)

        # Listed date
        listed_date = extract_listed_date(driver)

        # Variations
        variations = extract_variations(driver)

        results.append({
            "product_id": product_id,
            "product_title": title,
            "old_price": old_price,
            "discount_percentage": discount_percentage,
            "now_price": now_price,
            "currency": currency,
            "listed_date": listed_date,
            "product_url": url,
            "product_description": description,
            "product_variation": variations,
        })

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/raw/raw_test.csv", index=False)
    print("[SUCCESS] CSV saved!")


[INFO] Scraping product 1/10: https://www.etsy.com/fr/listing/4301871513/sac-fourre-tout-en-toile-personnalise?click_key=047fe661-e6b3-4012-b0b8-5f88aa112e34%3ALTda859e27057383d4ac2f65a7c2cb2f8d50cfcec9&click_sum=3daac757&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-328695-1-1&sr_prefetch=1&pf_from=search&pro=1&frs=1&pop=1&sts=1&content_source=047fe661-e6b3-4012-b0b8-5f88aa112e34%253ALTda859e27057383d4ac2f65a7c2cb2f8d50cfcec9
[INFO] Scraping product 2/10: https://www.etsy.com/fr/listing/1396764287/sac-fourre-tout-chats-et-plantes-sac?click_key=047fe661-e6b3-4012-b0b8-5f88aa112e34%3ALTfca9e0c8df7e5fbecc3bf59947ec06e6c4f683f4&click_sum=fd1f06b4&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-328695-1-2&sr_prefetch=1&pf_from=search&pro=1&etp=1&content_source=047fe661-e6b3-4012-b0b8-5f88aa112e34%253ALTfca9e0c8df7e5fbecc3bf59947ec06e6c4f683f4
[INFO] Scraping product 3

In [22]:
df.head(10)

Unnamed: 0,product_id,product_title,old_price,discount_percentage,now_price,currency,listed_date,product_url,product_description,product_variation
0,4301871513,"Sac fourre-tout en toile personnalisé, sac fou...",,,,USD,,https://www.etsy.com/fr/listing/4301871513/sac...,Sacs fourre-tout personnalisés | Votre parfait...,
1,1396764287,Sac fourre-tout chats et plantes - Sac pour am...,,,15.0,USD,,https://www.etsy.com/fr/listing/1396764287/sac...,Nous vous présentons notre nouveau sac fourre-...,
2,4339252322,Sac fourre-tout personnalisé brodé avec initia...,,,,USD,,https://www.etsy.com/fr/listing/4339252322/sac...,Sacs fourre-tout brodés personnalisés\n\nFabri...,
3,4391873405,"Sac fourre-tout en nylon matelassé brodé, cade...",,,,USD,,https://www.etsy.com/fr/listing/4391873405/sac...,Notre sac à main matelassé personnalisé est un...,
4,1716154949,Sac fourre-tout bohème à fleurs brodées en mar...,,,,USD,,https://www.etsy.com/fr/listing/1716154949/sac...,Nous sommes ravis de vous présenter une nouvel...,
5,4374771805,Personalized Embroidered Corduroy Tote — Bride...,,,,USD,,https://www.etsy.com/fr/listing/4374771805/per...,💕Personalized Embroidered Corduroy Tote — Vint...,
6,4363447940,"Sac fourre-tout personnalisé, cadeau de demois...",,,,USD,,https://www.etsy.com/fr/listing/4363447940/sac...,Nous vous présentons notre sac fourre-tout pou...,
7,4404716872,"Sac fourre-tout matelassé art chat, sac en pat...",,,,USD,,https://www.etsy.com/fr/listing/4404716872/sac...,"Apportez de la couleur, de la créativité et de...",
8,1825286680,Sac fourre-tout en velours côtelé doux à doubl...,,,,USD,,https://www.etsy.com/fr/listing/1825286680/sac...,Plus de couleurs :\nhttps://www.etsy.com/shop/...,
9,4337757198,Sac fourre-tout brodé de demoiselle d&#39;honn...,,,,USD,,https://www.etsy.com/fr/listing/4337757198/sac...,Notre sac fourre-tout matelassé à imprimé bloc...,


==================================================================================================================================
# <div align="center">DATA CLEANING & ANALYSIS</div>
==================================================================================================================================

#### 🗃️ **Raw data**

- The scraped data is stored in a DataFrame, exported to a CSV file, and uploaded to Google Drive.
- `df_url` must be a direct-download link to that CSV file on Google Drive.
- The CSV is then loaded for data cleaning and analysis.
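
A Google Drive share link usually opens a preview page rather than the file itself, so `pd.read_csv` cannot read it directly. The sketch below shows one way to derive a direct-download URL from a share link. The helper name `drive_direct_url` and the placeholder file id are hypothetical, and the `uc?export=download` URL pattern is an assumption that generally holds for small, publicly shared files (large files may return a confirmation page instead of the CSV).

```python
import re


def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download URL (hypothetical helper)."""
    # Share links typically carry the file id as .../file/d/<FILE_ID>/... or ...?id=<FILE_ID>
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"No Drive file id found in: {share_url}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?export=download&id={file_id}"


# FILE_ID_PLACEHOLDER stands in for the real file id, which is not given here
share_url = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
df_url = drive_direct_url(share_url)
```

The resulting `df_url` can then be passed to `pd.read_csv`, provided the file is shared as "anyone with the link".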

In [None]:
import pandas as pd

# Load RAW DATA CSV
df_url = 'link to the dataFrame collected from scraping as a downloadable link from google drive'
df_etsy = pd.read_csv(df_url)

print("STEP 1 : RAW CSV loaded successfully!")
df_etsy.head()


----

#### 🗃️ **Interim data**

In [None]:
# Save INTERIM DATA to CSV
df_etsy.to_csv("../data/interim/interim_data.csv", index=False)
print("STEP 2 : INTERIM CSV saved successfully!")
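
The interim cell above writes `df_etsy` as loaded, with no cleaning step shown. A first cleaning pass that could run before that `to_csv` call might look like the sketch below. The function name `to_interim` and the sample rows are made up for illustration; the column names come from the scraper output, and the specific steps (dropping the leftover index column, de-duplicating on `product_id`, stripping tracking parameters from `product_url`, coercing prices to numbers) are assumptions about what this dataset needs, not the project's confirmed pipeline.

```python
import pandas as pd


def to_interim(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a light, reversible cleaning pass (illustrative sketch)."""
    out = df.copy()
    # Drop the index column that appears when a CSV was saved with its index
    out = out.drop(columns=["Unnamed: 0"], errors="ignore")
    # Keep one row per listing
    out = out.drop_duplicates(subset="product_id")
    # Remove search-tracking query parameters, keeping only the listing URL
    out["product_url"] = out["product_url"].str.split("?", n=1).str[0]
    # Coerce price columns to numbers; unparseable values become NaN
    for col in ["now_price", "old_price", "discount_percentage"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out.reset_index(drop=True)


# Made-up rows that mimic the scraper's columns
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "product_id": [111, 111, 222],
    "product_url": [
        "https://www.etsy.com/fr/listing/111/a?ref=x",
        "https://www.etsy.com/fr/listing/111/a?ref=y",
        "https://www.etsy.com/fr/listing/222/b",
    ],
    "now_price": ["15.0", "15.0", None],
    "old_price": [None, None, None],
    "discount_percentage": [None, None, None],
})
interim = to_interim(sample)
```

In the notebook this would be applied as `df_etsy = to_interim(df_etsy)` before saving the interim CSV.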

----

#### 🗃️ **Clean data**

In [None]:
# Save CLEAN DATA to CSV
df_etsy.to_csv("../data/clean/clean_data.csv", index=False)
print("STEP 3 : CLEAN CSV saved successfully!")
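
Once a clean file exists, the keyword questions this project asks can be approached with a simple token count over `product_title`. The sketch below counts in how many titles each word appears. The function name `keyword_counts`, the stopword list, and the two sample titles are illustrative assumptions. The scraped columns contain no sales figures, so these counts describe how often sellers use a word, which is only a rough proxy for demand.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would need a fuller one
STOPWORDS = {"de", "en", "et", "the", "and", "for", "with"}


def keyword_counts(titles, min_len=3):
    """Count the number of titles each keyword appears in (illustrative sketch)."""
    counts = Counter()
    for title in titles:
        if not isinstance(title, str):
            continue  # skip missing titles (NaN / None)
        # Runs of Unicode letters, so accented French words stay intact
        tokens = re.findall(r"[^\W\d_]+", title.lower())
        # A set counts each word at most once per title
        counts.update({t for t in tokens if len(t) >= min_len and t not in STOPWORDS})
    return counts


# Made-up titles in the style of the scraped data
titles = ["Sac fourre-tout en toile", "Sac fourre-tout chats et plantes", None]
counts = keyword_counts(titles)
```

On the real data the call would be `keyword_counts(df_etsy["product_title"])`, followed by `.most_common(20)` to list the most frequent words.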

==================================================================================================================================
# <div align="center">PLOTS</div>
==================================================================================================================================

### 📊 PLOT 01:

In [None]:
# PLOT 1

### 📊 PLOT 02:

In [None]:
# PLOT 2

### 📊 PLOT 03:

In [None]:
# PLOT 3

### 📊 PLOT 04:

In [None]:
# PLOT 4

### 📊 PLOT 05:

In [None]:
# PLOT 5

==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### 🧠 INSIGHT 01:
Text

----

### 🧠 INSIGHT 02:
Text

----

### 🧠 INSIGHT 03:
Text


==================================================================================================================================