==================================================================================================================================
# <div align="center">PROJECT 03: Etsy Print-On-Demand Trends</div>
==================================================================================================================================

### 📝 BUSINESS IDEA

**Print-On-Demand (POD) Business** – What the project is about

### ⁉️ PROBLEM

No API exists to access the market data needed, requiring web scraping to gather insights – The challenge we're addressing

### 🔰 SOLUTION FRAMEWORK

Web scrape Etsy for a specific POD product

Collect the data necessary to clean & analyze


| **Development**                                                                                                                                             | **Presentation**                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **Business Idea** → **Problem Definition** → **Data Research & Visualization** → **Insights** → **Interpretation** → **Implications** → **Business Impact** | **Limitations & Considerations** |

### 📌 SECTION OVERVIEW

* **Project / Business Idea:** What the project is about
* **Problem:** The challenge we're addressing
* **Solution / Approach:** How we solve it
* **Research & Plots:** How we analyzed data visually
* **Insights:** What we discovered
* **Interpretation:** Why it matters
* **Implications:** What actions the business can take
* **Business Impact:** Expected results for the business
* **Limitations:** What constraints or gaps exist

==================================================================================================================================
# <div align="center">WEB SCRAPING</div>
==================================================================================================================================

```Etsy``` is a dynamic website, so scraping it requires careful handling.

Because ```Etsy``` loads some content with ```JavaScript```, ```requests``` + ``BeautifulSoup`` may work for the static parts (such as search results), but ``Selenium`` is more reliable for dynamic content.

I will use ``requests`` + ``BeautifulSoup`` for ```product listings``` **(title, price, link)**.

**Important note:** Etsy combines dynamic loading with anti-bot protections.

Standard HTML scraping works only as long as Etsy doesn't block the request.

If the request is blocked, custom headers, rotating proxies, or the Etsy API will be required.
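
As a rough illustration of the "if blocked" check described above, the sketch below classifies a response from its status code and HTML. The helper name `looks_blocked`, the status codes treated as blocks, and the marker strings are all assumptions made for illustration, not values documented by Etsy.

```python
# Hypothetical helper: decide whether a fetched page looks like a block page.
# The status codes and marker strings below are assumptions, not Etsy-documented values.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True when the response is probably an anti-bot page rather than results."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


if __name__ == "__main__":
    print(looks_blocked(200, "<li class='wt-list-unstyled'>Tote bag</li>"))
    print(looks_blocked(403, ""))
    print(looks_blocked(200, "<title>Access Denied</title>"))
```

A check like this could run on `response.status_code` and `response.text` before parsing, so that a block page is not mistaken for an empty result set.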

==================================================================================================================================

----

### Avoiding getting blocked
| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                    | Etsy may block request        |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Renders dynamic content, may get past basic bot checks | Slower, requires ChromeDriver |


#### 🧰 **Install for web scraping**

In [15]:
# install requests & beautifulsoup
!pip install requests beautifulsoup4 fake-useragent pandas

# install selenium (the later test cells also import undetected_chromedriver)
!pip install selenium undetected-chromedriver pandas




----

### 📌 Pagination + BeautifulSoup Version
| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                    | Etsy may block request        |

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time


def scrape_products(pages=5, max_items=10):
    base_url = "https://www.etsy.com/search?q=tote+bag&page="
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
    }

    data = []

    for page in range(1, pages + 1):
        url = base_url + str(page)
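        response = requests.get(url, headers=headers, timeout=15)
        if response.status_code != 200:
            # A non-200 status usually means the request failed or was blocked;
            # stop here instead of parsing an error page
            print(f"[WARN] Page {page} returned HTTP {response.status_code}; stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")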

        products = soup.find_all("li", class_="wt-list-unstyled")

        for item in products:
            if len(data) >= max_items:
                return pd.DataFrame(data)

            # URL
            link = item.find("a", href=True)
            if not link:
                continue
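            # The href may already be absolute; only prefix the domain for relative links
            href = link["href"]
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href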

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    pass

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        time.sleep(1)

    return pd.DataFrame(data)


# Example: save CSV
if __name__ == "__main__":
    df = scrape_products()
    df.to_csv("../data/interim/0_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")
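
The `price_tag.text.replace(",", ".")` step above assumes the comma is always a decimal mark, so a value such as `1,234.50` would fail to convert and the price would stay `None`. A minimal sketch of a more defensive conversion follows. It assumes that when both separators appear the last one is the decimal mark, and that a lone comma followed by exactly two digits is a decimal comma. `parse_price_text` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional


def parse_price_text(text: str) -> Optional[float]:
    """Convert a price string such as '12,50', '1,234.50' or '1.234,50' to a float."""
    match = re.search(r"\d[\d.,]*", text or "")
    if not match:
        return None
    num = match.group(0).rstrip(".,")
    if "," in num and "." in num:
        # Both separators present: treat whichever comes last as the decimal mark
        if num.rfind(",") > num.rfind("."):
            num = num.replace(".", "").replace(",", ".")
        else:
            num = num.replace(",", "")
    elif "," in num:
        # Comma only: assume a decimal comma when exactly two digits follow it
        head, _, tail = num.rpartition(",")
        num = f"{head.replace(',', '')}.{tail}" if len(tail) == 2 else num.replace(",", "")
    try:
        return float(num)
    except ValueError:
        return None


if __name__ == "__main__":
    for sample in ["12,50", "1,234.50", "1.234,50", "EUR 9.99", "free"]:
        print(sample, "->", parse_price_text(sample))
```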


### 📌 Selenium-Based Version (ChromeDriver)

| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Renders dynamic content, may get past basic bot checks | Slower, requires ChromeDriver |

Link to ChromeDriver: https://googlechromelabs.github.io/chrome-for-testing/#stable

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
import time
import re


def scrape_products_selenium(max_items=10):
    options = Options()
    options.add_argument("--headless")  
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-gpu")
    options.add_argument("start-maximized")
    options.add_argument("user-agent=Mozilla/5.0")

    driver = webdriver.Chrome(options=options)

    data = []
    page = 1

    while len(data) < max_items:
        url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
        driver.get(url)
        time.sleep(4)

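        soup = BeautifulSoup(driver.page_source, "html.parser")
        products = soup.find_all("li", class_="wt-list-unstyled")
        if not products:
            # No listings parsed (blocked page or changed markup): stop instead of looping forever
            break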

        for item in products:
            if len(data) >= max_items:
                break

            # URL
            link = item.find("a", href=True)
            if not link:
                continue
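            # The href may already be absolute; only prefix the domain for relative links
            href = link["href"]
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href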

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    pass

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        page += 1
        time.sleep(2)

    driver.quit()

    df = pd.DataFrame(data)
    return df


# Save CSV
if __name__ == "__main__":
    df = scrape_products_selenium()
    df.to_csv("../data/interim/1_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")
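
The `while len(data) < max_items` loop above has no cap on the number of pages it visits. One way to make the stopping rules explicit is sketched below. `collect_pages` and its `fetch_page` callback are hypothetical names, and the callback stands in for the Selenium load-and-parse step.

```python
from typing import Callable, Dict, List


def collect_pages(
    fetch_page: Callable[[int], List[Dict]],
    max_items: int = 10,
    max_pages: int = 20,
) -> List[Dict]:
    """Call fetch_page(1), fetch_page(2), ... until enough rows are collected.

    Stops early when a page returns no rows (blocked request or changed markup)
    and never visits more than max_pages pages.
    """
    rows: List[Dict] = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break  # nothing parsed on this page: give up instead of looping forever
        rows.extend(page_rows)
        if len(rows) >= max_items:
            break
    return rows[:max_items]


if __name__ == "__main__":
    # Fake fetcher standing in for the browser: two pages of three rows, then nothing
    def fake_fetch(page: int) -> List[Dict]:
        return [{"page": page, "n": i} for i in range(3)] if page <= 2 else []

    print(len(collect_pages(fake_fetch, max_items=10)))
    print(len(collect_pages(fake_fetch, max_items=4)))
```

Passing the page loader in as a callback also makes the pagination logic testable without launching a browser.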


In [None]:
"""
Etsy Tote Bag Scraper (Selenium + BeautifulSoup) with:
- Pagination
- Proxy rotation
- Random user-agents
- Class-based design
- Adjustable product limit
Saves final cleaned dataframe to ../data/clean/clean_data.csv
"""

import random
import time
import re
import os
from dataclasses import dataclass, field
from typing import List, Optional

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException, TimeoutException


@dataclass
class EtsyToteScraper:
    user_agents: List[str] = field(default_factory=lambda: [
        # A short sample; replace/extend with more UAs for real rotations
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/16.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/117 Safari/537.36"
    ])
    proxies: List[str] = field(default_factory=list)  # e.g. ["http://ip:port", "http://user:pass@ip:port"]
    chromedriver_path: Optional[str] = None  # if None assumes chromedriver is on PATH
    headless: bool = True
    page_load_wait: float = 3.5  # seconds to wait after loading a page
    max_restarts_for_errors: int = 2

    def _make_driver(self, proxy: Optional[str], user_agent: str):
        """Create a Selenium Chrome WebDriver with given proxy & user agent."""
        options = Options()
        if self.headless:
            options.add_argument("--headless=new")  # use new headless mode
        options.add_argument("--disable-blink-features=AutomationControlled")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        options.add_argument("--window-size=1400,1000")
        options.add_argument(f"--user-agent={user_agent}")

        if proxy:
            # Set proxy; Chrome expects --proxy-server argument
            options.add_argument(f'--proxy-server={proxy}')

        # Optional: reduce webdriver fingerprint
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)

        if self.chromedriver_path:
            # Selenium 4 takes the driver path through a Service object;
            # the old executable_path keyword is no longer accepted in recent releases
            from selenium.webdriver.chrome.service import Service
            driver = webdriver.Chrome(service=Service(self.chromedriver_path), options=options)
        else:
            # No path given: let Selenium locate chromedriver itself
            driver = webdriver.Chrome(options=options)
        return driver

    @staticmethod
    def _parse_price(price_text: str) -> Optional[float]:
        if not price_text:
            return None
        # Normalize and extract first price-looking token (handles "€12.50" and "12,50 €")
        price_text = price_text.strip()
        # Keep euro symbol and digits, commas, dots
        m = re.search(r"€\s*([\d\.,]+)|([\d\.,]+)\s*€", price_text)
        if m:
            num = m.group(1) or m.group(2)
        else:
            # fallback: find any number-like substring
            m2 = re.search(r"([\d]{1,3}(?:[.,]\d{1,3})+|\d+)", price_text)
            if not m2:
                return None
            num = m2.group(1)
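        # Convert to float; when both separators appear, the last one is the decimal mark
        if "," in num and "." in num:
            if num.rfind(",") > num.rfind("."):
                num = num.replace(".", "").replace(",", ".")  # "1.234,56" -> "1234.56"
            else:
                num = num.replace(",", "")  # "1,234.56" -> "1234.56"
        elif num.count(",") == 1:
            num = num.replace(",", ".")  # "12,50" -> "12.50"
        else:
            num = num.replace(",", "")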
        try:
            return float(num)
        except Exception:
            return None

    @staticmethod
    def _extract_rating(text: str) -> Optional[float]:
        if not text:
            return None
        m = re.search(r"([0-5](?:\.[0-9])?)\s*out of\s*5", text, re.I)
        if m:
            try:
                return float(m.group(1))
            except:
                return None
        # sometimes rating appears as "4.8" alone
        m2 = re.search(r"\b([0-5]\.\d)\b", text)
        if m2:
            try:
                return float(m2.group(1))
            except:
                return None
        return None

    @staticmethod
    def _extract_reviews(text: str) -> Optional[int]:
        if not text:
            return None
        # look for parentheses e.g. "(123)" or "123 reviews"
        m = re.search(r"\((\d{1,6})\)", text.replace("\xa0", " "))
        if m:
            return int(m.group(1))
        m2 = re.search(r"(\d{1,6})\s+review", text, re.I)
        if m2:
            return int(m2.group(1))
        return None

    @staticmethod
    def _clean_text(elem):
        return elem.get_text(" ", strip=True) if elem else ""

    def scrape(self, max_items: int = 10, max_pages: int = 20, start_page: int = 1) -> pd.DataFrame:
        """
        Scrape Etsy tote bag products.

        Parameters:
        - max_items: total number of product rows to collect (default 10)
        - max_pages: maximum pages to visit (safety cap)
        - start_page: which search page to start from (1-based)
        """
        data_rows = []
        page = start_page
        attempts = 0

        # We'll periodically rotate proxy & UA by restarting the driver
        while len(data_rows) < max_items and page < start_page + max_pages:
            # choose random UA & proxy
            ua = random.choice(self.user_agents)
            proxy = random.choice(self.proxies) if self.proxies else None

            restarts = 0
            while restarts <= self.max_restarts_for_errors:
                driver = None
                try:
                    driver = self._make_driver(proxy, ua)
                    search_url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
                    print(f"[INFO] Loading page {page} (collected {len(data_rows)}/{max_items}) - UA chosen, proxy={proxy is not None}")
                    driver.get(search_url)
                    time.sleep(self.page_load_wait + random.uniform(0.5, 2.0))  # allow JS to load

                    soup = BeautifulSoup(driver.page_source, "html.parser")

                    # Etsy product tiles: use `li` elements with data-search-result or a result class
                    product_items = soup.find_all("li", attrs={"data-search-result": True})
                    if not product_items:
                        # fallback heuristics (sometimes different structure)
                        product_items = soup.find_all("div", class_=re.compile(r"v2-listing-card|search-result|listing-link|wt-grid-item"), limit=60)

                    if not product_items:
                        print("[WARN] No product items found on the page. The markup might have changed.")
                        break

                    for item in product_items:
                        if len(data_rows) >= max_items:
                            break

                        # URL
                        link_tag = item.find("a", href=True)
                        if not link_tag:
                            continue
                        product_url = link_tag["href"].split("?")[0]  # remove query params

                        # Title
                        title = None
                        title_tag = item.find("h3")
                        if title_tag:
                            title = title_tag.get_text(" ", strip=True)
                        else:
                            # alternative
                            title_tag2 = item.find("h2") or item.find("p", class_=re.compile("title|text"))
                            title = title_tag2.get_text(" ", strip=True) if title_tag2 else ""

                        # Price - try several selectors
                        price = None
                        # Etsy often uses <span class="currency-value">12.00</span>
                        price_span = item.find("span", class_=re.compile(r"currency-value|listing-price"))
                        if price_span:
                            price = self._parse_price(price_span.get_text(" ", strip=True))
                        else:
                            # try to extract from any text snippet in this tile
                            combined_text = self._clean_text(item)
                            # find euro price in combined text
                            price = self._parse_price(combined_text)

                        # Rating - try screen-reader text or aria labels
                        rating = None
                        rating_span = item.find("span", class_=re.compile(r"screen-reader-only|text-body-01|sr-only"), string=re.compile(r"out of 5", re.I))
                        if rating_span:
                            rating = self._extract_rating(rating_span.get_text(" ", strip=True))
                        else:
                            # try aria-label on an element
                            rating_aria = item.find(attrs={"aria-label": re.compile(r"out of 5", re.I)})
                            if rating_aria:
                                rating = self._extract_rating(rating_aria["aria-label"])

                        # Reviews - look for parentheses or "reviews" nearby
                        reviews = None
                        # check for small count element
                        reviews_candidates = item.find_all(string=re.compile(r"\(\d+\)|\d+\s+review", re.I))
                        if reviews_candidates:
                            for cand in reviews_candidates:
                                r = self._extract_reviews(cand)
                                if r:
                                    reviews = r
                                    break
                        if reviews is None:
                            # fallback to searching whole tile text
                            reviews = self._extract_reviews(self._clean_text(item))

                        # Delivery - detect Free shipping or shipping cost
                        delivery = None
                        # Common pattern: "Free shipping", "Free standard shipping", or "Shipping: €3.00"
                        shipping_texts = item.find_all(string=re.compile(r"free shipping|shipping|delivery", re.I))
                        if shipping_texts:
                            for st in shipping_texts:
                                st_lower = st.strip().lower()
                                if "free" in st_lower:
                                    delivery = 0
                                    break
                                # try to parse euro amount
                                parsed = self._parse_price(st)
                                if parsed is not None:
                                    delivery = parsed
                                    break
                        if delivery is None:
                            # look at the product page (optional expensive step) - skip to save time

                            # default to None if unknown
                            delivery = None

                        data_rows.append({
                            "URL": product_url,
                            "Title": title,
                            "Price": price,
                            "Rating": rating,
                            "Reviews": reviews,
                            "Delivery": delivery
                        })

                    # Page completed
                    driver.quit()
                    break  # break restart loop on success

                except (WebDriverException, TimeoutException) as e:
                    print(f"[ERROR] WebDriver error: {e} - restarting driver (attempt {restarts+1})")
                    if driver:
                        try:
                            driver.quit()
                        except:
                            pass
                    restarts += 1
                    time.sleep(1 + random.random() * 2)
                except Exception as e:
                    print(f"[ERROR] Unexpected error parsing page {page}: {e}")
                    if driver:
                        try:
                            driver.quit()
                        except:
                            pass
                    restarts += 1
                    time.sleep(1 + random.random() * 2)

            page += 1
            attempts += 1
            # polite pause between page loads and to reduce detection risk
            time.sleep(1.0 + random.uniform(0.8, 2.2))

        # Build DataFrame with exactly up to max_items rows (trim if needed)
        df = pd.DataFrame(data_rows)[:max_items]

        # Normalize columns: ensure numeric types where possible
        if not df.empty:
            df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
            df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
            df['Reviews'] = pd.to_numeric(df['Reviews'], errors='coerce').astype('Int64')
            # Delivery: treat None as NaN; where 0 -> free shipping
            df['Delivery'] = pd.to_numeric(df['Delivery'], errors='coerce')

        # Save CSV as requested
        out_path = os.path.join("..", "data", "clean", "clean_data.csv")
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        df.to_csv(out_path, index=False)
        print("STEP 1 : 'Price' CLEAN and CSV saved successfully!")

        return df


if __name__ == "__main__":
    # === Example usage ===
    # Provide your proxies and optionally a larger user-agent list
    proxies = [
        # "http://user:pass@12.34.56.78:1234",
        # "http://12.34.56.79:8080",
    ]

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/16.0 Safari/605.1.15",
        # add more UAs here...
    ]

    scraper = EtsyToteScraper(
        user_agents=user_agents,
        proxies=proxies,
        chromedriver_path=None,  # or set path like "/usr/local/bin/chromedriver"
        headless=True,
        page_load_wait=3.5
    )

    print("[START] Scraping up to 10 tote bag products (Selenium + rotating UA/proxy)...")
    df = scraper.scrape(max_items=10, max_pages=30, start_page=1)
    print(df)
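
The restart loop inside `scrape` retries a page up to `max_restarts_for_errors` times with a short random pause between attempts. The same pattern, separated from Selenium, might look like the sketch below. `run_with_restarts` and its parameters are hypothetical, and doubling the delay on each retry is an assumption that goes beyond what the class does.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_restarts(
    task: Callable[[], T],
    max_restarts: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run task(); on any exception wait a jittered delay and try again.

    Makes at most max_restarts + 1 attempts and returns None if all of them fail.
    """
    for attempt in range(max_restarts + 1):
        try:
            return task()
        except Exception as exc:
            print(f"[ERROR] attempt {attempt + 1} failed: {exc}")
            if attempt < max_restarts:
                # Double the base delay on each retry and add up to 1s of jitter
                sleep(base_delay * (2 ** attempt) + random.random())
    return None
```

In the class, `task` would correspond to one "create driver, load page, parse tiles" cycle. The `sleep` parameter is injectable only so the helper can be exercised without real waiting.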


### TEST

In [3]:
import undetected_chromedriver as uc
import time

print("Launching Chrome...")

# launch browser
driver = uc.Chrome()

driver.get("https://www.google.com")

print("Page title:", driver.title)

time.sleep(5)
driver.quit()

print("Done!")

Launching Chrome...
Page title: Google
Done!


### WEB SCRAPER INTERIM

In [7]:
import time
import pandas as pd
import undetected_chromedriver as uc
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


def scrape_products(limit=10):
    """
    Scrape tote bag product data from Etsy using Selenium + BeautifulSoup.
    Includes pagination & anti-bot avoidance.
    Returns a pandas DataFrame.
    """

    # Launch undetected Chrome
    driver = uc.Chrome()
    driver.maximize_window()

    # Etsy tote bags search
    url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(url)
    time.sleep(5)

    products = []

    while len(products) < limit:
        # Scroll to load products
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)

        soup = BeautifulSoup(driver.page_source, "html.parser")

        # All product cards
        items = soup.select("li.wt-list-unstyled")  # Etsy product item containers

        for item in items:
            if len(products) >= limit:
                break

            # URL
            url_tag = item.select_one("a.listing-link")
            if not url_tag:
                continue
            product_url = url_tag.get("href")

            # Title
            title_tag = item.select_one("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.select_one(".currency-value")
            price = price_tag.get_text(strip=True) if price_tag else None

            # Rating
            rating_tag = item.select_one(".wt-screen-reader-only")
            rating = None
            if rating_tag:
                # Example text: "5 out of 5 stars"
                text = rating_tag.get_text(strip=True)
                if "out of 5 stars" in text:
                    rating = float(text.split(" out")[0])

            # Reviews count
            reviews_tag = item.select_one(".wt-text-caption")
            reviews = None
            if reviews_tag:
                text = reviews_tag.get_text(strip=True)
                # e.g. "(123)"
                if text.startswith("(") and text.endswith(")"):
                    try:
                        reviews = int(text.strip("()"))
                    except:
                        reviews = None

            # Delivery price (if available)
            delivery_tag = item.select_one(".wt-text-strikethrough, .wt-text-muted")
            delivery = None
            if delivery_tag:
                delivery_text = delivery_tag.get_text(strip=True)
                # Normalize delivery cost
                if "Free delivery" in delivery_text or "FREE" in delivery_text:
                    delivery = 0
                else:
                    delivery = delivery_text

            products.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        # Go to next page if needed
        if len(products) < limit:
            next_button = None
            try:
                next_button = driver.find_element(By.CSS_SELECTOR, "a[aria-label='Next page']")
            except:
                pass

            if next_button:
                driver.execute_script("arguments[0].click();", next_button)
                time.sleep(5)
            else:
                break

    driver.quit()
    return pd.DataFrame(products)


# -----------------------------------------------------
# EXECUTION
# -----------------------------------------------------
if __name__ == "__main__":
    df = scrape_products(limit=10)

    # SAVE CSV
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("STEP 10 : TOTE BAGS and CSV saved successfully!")
    print(df)


STEP 10 : TOTE BAGS and CSV saved successfully!
Empty DataFrame
Columns: []
Index: []
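
The run above printed the success message even though the DataFrame was empty, which usually means none of the selectors matched or the page was a block page. A small guard before writing the CSV can make that failure visible. `check_rows` and the required field names are assumptions based on the columns used in this notebook.

```python
from typing import Dict, Iterable, List


def check_rows(rows: List[Dict], required: Iterable[str] = ("URL", "Title", "Price")) -> List[Dict]:
    """Raise ValueError if nothing was scraped; otherwise drop rows missing a required field."""
    if not rows:
        raise ValueError("No products scraped: selectors may be outdated or the request was blocked.")
    required = tuple(required)
    return [row for row in rows if all(row.get(key) not in (None, "") for key in required)]


if __name__ == "__main__":
    sample = [
        {"URL": "https://www.etsy.com/listing/1/example", "Title": "Tote", "Price": "12.00"},
        {"URL": None, "Title": "Broken row", "Price": "5.00"},
    ]
    print(len(check_rows(sample)))
    try:
        check_rows([])
    except ValueError as err:
        print(f"[WARN] {err}")
```

Calling a check like this on the list of product dictionaries before building the DataFrame would stop the run instead of writing an empty CSV.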


### THIRD TEST

In [6]:
def scrape_products(limit=10):
    import time
    import pandas as pd
    import undetected_chromedriver as uc
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By

    driver = uc.Chrome()
    driver.maximize_window()

    url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(url)
    time.sleep(5)

    products = []

    while len(products) < limit:

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(4)

        soup = BeautifulSoup(driver.page_source, "html.parser")

        # NEW: Etsy item selector
        items = soup.select("li[data-listing-id]")

        for item in items:
            if len(products) >= limit:
                break

            # URL
            url_tag = item.select_one("a.listing-link")
            product_url = url_tag.get("href") if url_tag else None

            # Title
            title_tag = item.select_one("h3.wt-text-truncate")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.select_one("span.currency-value")
            price = price_tag.get_text(strip=True) if price_tag else None

            # Rating
            rating = None
            rating_tag = item.select_one("input[name='rating']")
            if rating_tag:
                rating = float(rating_tag.get("value", 0))

            # Reviews
            reviews = None
            reviews_tag = item.select_one("span.wt-text-caption span")
            if reviews_tag:
                text = reviews_tag.get_text(strip=True)
                if text.startswith("(") and text.endswith(")"):
                    try:
                        reviews = int(text[1:-1])
                    except:
                        reviews = None

            # Delivery
            delivery = None
            delivery_tag = item.select("span.wt-text-caption")
            if delivery_tag:
                text = " ".join([d.get_text(strip=True) for d in delivery_tag])
                if "Free delivery" in text:
                    delivery = 0
                else:
                    delivery = text

            products.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        # Pagination
        if len(products) < limit:
            try:
                next_button = driver.find_element(By.CSS_SELECTOR, "a[aria-label='Next page']")
                driver.execute_script("arguments[0].click();", next_button)
                time.sleep(5)
            except Exception:
                # No usable "Next page" link was found, so stop paginating
                break

    driver.quit()
    return pd.DataFrame(products)
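
The review-count parsing above passes the text between the parentheses straight to `int()`, so a count with a thousands separator or a suffix ends up as `None`. A minimal sketch of a more tolerant parser follows. The helper name `parse_review_count` is hypothetical, and the formats it accepts (such as `(1,234)` or `(2.5k)`) are assumptions about what the page might display, not confirmed Etsy formats.

```python
import re
from typing import Optional

# Matches "(87)", "(1,234)", "(1 234)" or "(2.5k)" anywhere in the text
_COUNT_RE = re.compile(r"\(\s*([\d.,\s]+)\s*(k?)\s*\)", re.IGNORECASE)


def parse_review_count(text: Optional[str]) -> Optional[int]:
    """Return the review count found in `text`, or None if there is none."""
    match = _COUNT_RE.search(text or "")
    if not match:
        return None
    number = match.group(1).strip()
    suffix = match.group(2).lower()
    if suffix == "k":
        # "2.5k" or "2,5k": treat the separator as a decimal mark
        try:
            return int(round(float(number.replace(",", ".").replace(" ", "")) * 1000))
        except ValueError:
            return None
    # Without a suffix, treat ".", "," and spaces as thousands separators
    digits = re.sub(r"[.,\s]", "", number)
    return int(digits) if digits else None
```

Inside the scraping loop this would replace the `startswith("(")` check, for example `reviews = parse_review_count(reviews_tag.get_text(strip=True))`.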


In [8]:
import time
import pandas as pd
import undetected_chromedriver as uc
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# =====================================================
#                     MAIN FUNCTION
# =====================================================
def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()

    base_url = "https://www.etsy.com/search?q=tote+bag&page="
    page_num = 1

    results = []

    while len(results) < limit:

        # Go to next page manually using URL pagination
        url = base_url + str(page_num)
        print(f"[INFO] Loading page {page_num}: {url}")
        driver.get(url)

        time.sleep(5)  # Allow dynamic content to load

        # Scroll repeatedly to force lazy-loading products
        for _ in range(5):
            driver.execute_script("window.scrollBy(0, 1500);")
            time.sleep(2)

        soup = BeautifulSoup(driver.page_source, "html.parser")

        # ===========================================
        # NEW: Selector meant to be stable; every product should have:
        # <li data-listing-id="xxxx">
        # ===========================================
        items = soup.select("li[data-listing-id]")

        print(f"[INFO] Products detected on page: {len(items)}")

        if not items:
            print("[WARNING] No products found. Etsy layout may have changed.")
            break

        for item in items:
            if len(results) >= limit:
                break

            # ========== URL ==========
            url_tag = item.select_one("a[data-listing-id]")
            product_url = url_tag["href"] if url_tag else None

            # ========== TITLE ==========
            title_tag = item.select_one("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # ========== PRICE ==========
            price_tag = item.select_one("span.currency-value")
            price = price_tag.get_text(strip=True) if price_tag else None

            # ========== RATING ==========
            rating = None
            rating_input = item.select_one("input[name='rating']")
            if rating_input:
                try:
                    rating = float(rating_input["value"])
                except (KeyError, ValueError):
                    pass

            # ========== REVIEWS ==========
            reviews = None
            reviews_span = item.select_one("span.wt-text-caption span")
            if reviews_span:
                txt = reviews_span.get_text(strip=True)
                if txt.startswith("(") and txt.endswith(")"):
                    try:
                        reviews = int(txt[1:-1])
                    except ValueError:
                        pass

            # ========== DELIVERY ==========
            delivery = None
            delivery_tags = item.select("span.wt-text-caption, p.wt-text-caption")
            if delivery_tags:
                combined = " ".join(t.get_text(strip=True) for t in delivery_tags)
                if "Free delivery" in combined:
                    delivery = 0
                else:
                    delivery = combined

            results.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        page_num += 1
        time.sleep(3)

    driver.quit()
    df = pd.DataFrame(results[:limit])
    return df


# =====================================================
#                     EXECUTION
# =====================================================
if __name__ == "__main__":
    print("[INFO] Scraping Etsy…")
    df = scrape_products(limit=10)

    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("\n[SUCCESS] CSV saved! ✔")
    print(df)


[INFO] Scraping Etsy…
[INFO] Loading page 1: https://www.etsy.com/search?q=tote+bag&page=1
[INFO] Products detected on page: 0

[SUCCESS] CSV saved! ✔
Empty DataFrame
Columns: []
Index: []
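
The run above detected no products, yet it still wrote a CSV and printed a success message. A small guard can make that failure visible. The sketch below uses only the standard library so it runs on its own; the helper name `save_rows` and its messages are hypothetical and not part of the notebook.

```python
import csv
from pathlib import Path


def save_rows(rows, path):
    """Write a list of dicts to CSV. Write nothing and return False if empty."""
    if not rows:
        print("[ERROR] No rows scraped; CSV not written.")
        return False
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    print(f"[SUCCESS] Saved {len(rows)} rows to {path}")
    return True
```

With pandas, the equivalent check is to test `df.empty` before calling `df.to_csv(...)` and to print the success message only when the frame has rows.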


### JSON-LD ATTEMPT

In [9]:
import time
import json
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

def scrape_products(limit=10):
    """
    Scrape tote bag product data from Etsy using Selenium and embedded JSON.
    Reads structured data instead of CSS classes, so it should not depend on Etsy's class names.
    """
    # Launch Chrome
    driver = uc.Chrome()
    driver.maximize_window()

    results = []
    page = 1

    while len(results) < limit:
        url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
        print(f"[INFO] Loading page {page}...")
        driver.get(url)
        time.sleep(7)  # wait for JS to load JSON

        # Extract all <script type="application/ld+json">
        scripts = driver.find_elements(By.XPATH, "//script[@type='application/ld+json']")
        json_data = None

        for s in scripts:
            try:
                text = s.get_attribute("innerHTML")
                data = json.loads(text)
                # Etsy product list is usually under "@graph" key
                if isinstance(data, dict) and "@graph" in data:
                    json_data = data["@graph"]
                    break
            except (TypeError, json.JSONDecodeError):
                continue

        if not json_data:
            print("[WARNING] No JSON found on page")
            break

        # Extract products
        for item in json_data:
            if len(results) >= limit:
                break

            if item.get("@type") != "Product":
                continue

            product_url = item.get("url")
            title = item.get("name")
            price = None
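            offers = item.get("offers")
            if isinstance(offers, list) and offers:
                offers = offers[0]  # schema.org allows a list of offers
            if isinstance(offers, dict):
                price = offers.get("price")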

            rating = None
            reviews = None
            aggregate = item.get("aggregateRating")
            if aggregate:
                try:
                    rating = float(aggregate.get("ratingValue"))
                    reviews = int(aggregate.get("reviewCount"))
                except (TypeError, ValueError):
                    pass

            # Delivery info is not always present
            delivery = None
            # Check for free shipping in description (approximation)
            description = item.get("description", "")
            if "free shipping" in description.lower():
                delivery = 0

            results.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        page += 1
        time.sleep(2)

    driver.quit()

    # Return only the first `limit` results
    df = pd.DataFrame(results[:limit])
    return df

# --------------------------
# EXECUTION
# --------------------------
if __name__ == "__main__":
    print("[INFO] Scraping Etsy tote bags...")
    df = scrape_products(limit=10)

    # Save to CSV
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved successfully!")
    print(df)


[INFO] Scraping Etsy tote bags...
[INFO] Loading page 1...
[SUCCESS] CSV saved successfully!
Empty DataFrame
Columns: []
Index: []
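
The cell above only accepts a JSON-LD block whose top level is a dict with an `@graph` key. In general, schema.org JSON-LD can also be a top-level list or nest products inside an `ItemList`; whether Etsy's search page uses any of these shapes is an assumption here. The sketch below walks whatever structure it is given and collects every node typed `Product`. The function names are hypothetical.

```python
import json


def iter_products(node):
    """Recursively yield every dict whose @type is, or includes, 'Product'."""
    if isinstance(node, list):
        for child in node:
            yield from iter_products(child)
    elif isinstance(node, dict):
        node_type = node.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Product" in types:
            yield node
        # Keep descending: products may sit under "@graph", "item", etc.
        for value in node.values():
            yield from iter_products(value)


def products_from_jsonld(raw_blocks):
    """Parse raw <script type="application/ld+json"> contents into products."""
    products = []
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue  # skip empty or malformed blocks
        products.extend(iter_products(data))
    return products
```

In the scraper, `raw_blocks` would be the `innerHTML` strings collected from the `application/ld+json` script elements. If the returned list is empty, that suggests the page does not embed product data in this form.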


In [10]:
import time
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()

    search_url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(search_url)

    wait = WebDriverWait(driver, 15)

    # Get the first `limit` product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except Exception:
            title = None

        # Price
        try:
            price = driver.find_element(By.XPATH, "//p[@class='wt-text-title-03 wt-mr-xs-2']/span[@class='currency-value']").text.strip()
        except Exception:
            price = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except Exception:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(By.XPATH, "//span[@class='wt-badge wt-mr-xs-1']")
            reviews_text = reviews_elem.text.strip()
            reviews = int(reviews_text[1:-1]) if reviews_text.startswith("(") else None
        except Exception:
            reviews = None

        # Delivery
        try:
            delivery_elem = driver.find_element(By.XPATH, "//p[contains(text(),'delivery') or contains(text(),'shipping')]")
            delivery_text = delivery_elem.text.strip()
            delivery = 0 if "free" in delivery_text.lower() else delivery_text
        except Exception:
            delivery = None

        results.append({
            "URL": url,
            "Title": title,
            "Price": price,
            "Rating": rating,
            "Reviews": reviews,
            "Delivery": delivery
        })

    driver.quit()
    return pd.DataFrame(results)

# --------------------------
# EXECUTION
# --------------------------
if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")
    print(df)


[INFO] Scraping product 1: https://www.etsy.com/fr/listing/4301871513/sac-fourre-tout-en-toile-personnalise?click_key=52dcb2c3-b26c-4fc5-a670-98e40bd4fd0c%3ALTfc57f11dca77857d34fc147a0110ef9313d6accf&click_sum=a1786af8&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-577973-1-1&sr_prefetch=1&pf_from=search&pro=1&frs=1&pop=1&sts=1&content_source=52dcb2c3-b26c-4fc5-a670-98e40bd4fd0c%253ALTfc57f11dca77857d34fc147a0110ef9313d6accf
[INFO] Scraping product 2: https://www.etsy.com/fr/listing/4301871513/sac-fourre-tout-en-toile-personnalise?click_key=52dcb2c3-b26c-4fc5-a670-98e40bd4fd0c%3ALTfc57f11dca77857d34fc147a0110ef9313d6accf&click_sum=a1786af8&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-577973-1-1&sr_prefetch=1&pf_from=search&pro=1&frs=1&pop=1&sts=1&content_source=52dcb2c3-b26c-4fc5-a670-98e40bd4fd0c%253ALTfc57f11dca77857d34fc147a0110ef9313d6accf
[INFO] Scraping pr

In [11]:
df.head(10)

Unnamed: 0,URL,Title,Price,Rating,Reviews,Delivery
0,https://www.etsy.com/fr/listing/4301871513/sac...,"Sac fourre-tout en toile personnalisé, sac fou...",,4.7464,,
1,https://www.etsy.com/fr/listing/4301871513/sac...,"Sac fourre-tout en toile personnalisé, sac fou...",,4.7464,,
2,https://www.etsy.com/fr/listing/4391873405/sac...,"Sac fourre-tout en nylon matelassé brodé, cade...",,4.6427,,
3,https://www.etsy.com/fr/listing/4391873405/sac...,"Sac fourre-tout en nylon matelassé brodé, cade...",,4.6427,,
4,https://www.etsy.com/fr/listing/4363447940/sac...,"Sac fourre-tout personnalisé, cadeau de demois...",,4.8334,,
5,https://www.etsy.com/fr/listing/4363447940/sac...,"Sac fourre-tout personnalisé, cadeau de demois...",,4.8334,,
6,https://www.etsy.com/fr/listing/4404521481/sac...,Sac fourre-tout brodé style livres Cottagecore...,,3.0,,
7,https://www.etsy.com/fr/listing/4404521481/sac...,Sac fourre-tout brodé style livres Cottagecore...,,3.0,,
8,https://www.etsy.com/fr/listing/4339252322/sac...,Sac fourre-tout personnalisé brodé avec initia...,,4.758,,
9,https://www.etsy.com/fr/listing/4339252322/sac...,Sac fourre-tout personnalisé brodé avec initia...,,4.758,,
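
Every listing in the table above appears twice, presumably because each search card exposes more than one anchor carrying `data-listing-id`. A minimal sketch of order-preserving de-duplication by listing ID follows. The helper name `unique_listing_urls` is hypothetical, and dropping the query string assumes the listing page loads without the tracking parameters.

```python
import re

# Etsy listing URLs contain ".../listing/<numeric id>/<slug>"
_LISTING_ID_RE = re.compile(r"/listing/(\d+)")


def unique_listing_urls(urls, limit=None):
    """Keep the first URL per listing ID, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        if not url:
            continue  # anchors without an href yield None
        match = _LISTING_ID_RE.search(url)
        key = match.group(1) if match else url
        if key in seen:
            continue
        seen.add(key)
        unique.append(url.split("?", 1)[0])  # drop tracking parameters
        if limit is not None and len(unique) >= limit:
            break
    return unique
```

Applied to the collected `href` values before slicing, this would return `limit` distinct listings rather than `limit` anchors.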


### FRENCH ETSY

In [13]:
import time
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)

    wait = WebDriverWait(driver, 15)

    # Get the first `limit` product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except Exception:
            title = None

        # Price
        try:
            price_elem = driver.find_element(
                By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]"
            )
            price = price_elem.text.strip()
        except Exception:
            price = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except Exception:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(
                By.XPATH, "//span[contains(@class,'wt-badge') or contains(@class,'wt-mr-xs-1')]"
            )
            reviews_text = reviews_elem.text.strip()
            reviews = int(reviews_text[1:-1]) if reviews_text.startswith("(") else None
        except Exception:
            reviews = None

        # Delivery (French & English)
        try:
            delivery_elem = driver.find_element(
                By.XPATH,
                "//span[contains(text(),'livraison') or contains(text(),'delivery') or contains(text(),'shipping')]"
            )
            delivery_text = delivery_elem.text.strip()
            delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
        except Exception:
            delivery = None

        results.append({
            "URL": url,
            "Title": title,
            "Price": price,
            "Rating": rating,
            "Reviews": reviews,
            "Delivery": delivery
        })

    driver.quit()
    return pd.DataFrame(results)

# --------------------------
# EXECUTION
# --------------------------
if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")
    print(df)
    df.head(10)


KeyboardInterrupt: 

### PRODUCT PAGE SCRAPING

In [16]:
import time
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)

    # Get first N product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except Exception:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except Exception:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(
                By.XPATH, "//span[contains(@class,'wt-badge') or contains(@class,'wt-mr-xs-1')]"
            )
            reviews_text = reviews_elem.text.strip()
            reviews = int(reviews_text[1:-1]) if reviews_text.startswith("(") else None
        except Exception:
            reviews = None

        # Delivery
        try:
            delivery_elem = driver.find_element(
                By.XPATH,
                "//span[contains(text(),'livraison') or contains(text(),'delivery') or contains(text(),'shipping')]"
            )
            delivery_text = delivery_elem.text.strip()
            delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
        except Exception:
            delivery = None

        # ======= Handle Options =======
        option_sections = driver.find_elements(By.XPATH, "//select[@id or @name]")
        if not option_sections:
            # No options, just grab price
            try:
                now_price_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]")
                now_price = now_price_elem.text.strip()
            except Exception:
                now_price = None
            try:
                old_price_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-strikethrough')]/span[contains(@class,'currency-value')]")
                old_price = old_price_elem.text.strip()
            except Exception:
                old_price = None
            results.append({
                "URL": url, "Title": title, "Rating": rating, "Reviews": reviews,
                "Delivery": delivery, "Option": None, "Old_Price": old_price, "Now_Price": now_price
            })
        else:
            # Iterate through each option
            for select in option_sections:
                options = select.find_elements(By.TAG_NAME, "option")
                for opt in options:
                    opt_value = opt.get_attribute("value")
                    if opt_value:
                        try:
                            select.click()
                            opt.click()
                            time.sleep(2)  # wait for price update
                        except Exception:
                            pass
                        try:
                            now_price_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]")
                            now_price = now_price_elem.text.strip()
                        except Exception:
                            now_price = None
                        try:
                            old_price_elem = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-strikethrough')]/span[contains(@class,'currency-value')]")
                            old_price = old_price_elem.text.strip()
                        except Exception:
                            old_price = None
                        results.append({
                            "URL": url, "Title": title, "Rating": rating, "Reviews": reviews,
                            "Delivery": delivery, "Option": opt.text.strip(), "Old_Price": old_price, "Now_Price": now_price
                        })

    driver.quit()
    return pd.DataFrame(results)

# --------------------------
# EXECUTION
# --------------------------
if __name__ == "__main__":
    df = scrape_products(limit=11)
    df.to_csv("../data/interim/products_10.csv", index=False)
    print("[SUCCESS] CSV saved!")
    print(df)


[INFO] Scraping product 1: https://www.etsy.com/fr/listing/4301871513/sac-fourre-tout-en-toile-personnalise?click_key=e9cd10eb-0d1f-4026-9c1b-e5d3e65b85ab%3ALTddef09146b44024af17cbcd12889ea0b23329c36&click_sum=f7ececfd&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-86035-1-1&sr_prefetch=1&pf_from=search&pro=1&frs=1&pop=1&sts=1&content_source=e9cd10eb-0d1f-4026-9c1b-e5d3e65b85ab%253ALTddef09146b44024af17cbcd12889ea0b23329c36


StaleElementReferenceException: Message: stale element reference: stale element not found in the current frame
  (Session info: chrome=142.0.7444.176); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#staleelementreferenceexception


### PRODUCTS 11

In [17]:
import time
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)

    # Get first N product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}: {url}")
        driver.get(url)
        time.sleep(5)

        # Title
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except Exception:
            title = None

        # Rating
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except Exception:
            rating = None

        # Reviews
        try:
            reviews_elem = driver.find_element(
                By.XPATH, "//span[contains(@class,'wt-badge') or contains(@class,'wt-mr-xs-1')]"
            )
            reviews_text = reviews_elem.text.strip()
            reviews = int(reviews_text[1:-1]) if reviews_text.startswith("(") else None
        except Exception:
            reviews = None

        # Delivery
        try:
            delivery_elem = driver.find_element(
                By.XPATH,
                "//span[contains(text(),'livraison') or contains(text(),'delivery') or contains(text(),'shipping')]"
            )
            delivery_text = delivery_elem.text.strip()
            delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
        except Exception:
            delivery = None

        # ======= Handle Options Safely =======
        try:
            select_elements = driver.find_elements(By.XPATH, "//select[@id or @name]")
            if not select_elements:
                # No options
                now_price = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]").text.strip()
                try:
                    old_price = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-strikethrough')]/span[contains(@class,'currency-value')]").text.strip()
                except Exception:
                    old_price = None
                results.append({
                    "URL": url, "Title": title, "Rating": rating, "Reviews": reviews,
                    "Delivery": delivery, "Option": None, "Old_Price": old_price, "Now_Price": now_price
                })
            else:
                # Iterate through options
                for sel_idx, select in enumerate(select_elements):
                    options = select.find_elements(By.TAG_NAME, "option")
                    for opt_idx, _ in enumerate(options):
                        try:
                            # Refetch select & option to avoid stale reference
                            select_ref = driver.find_elements(By.XPATH, "//select[@id or @name]")[sel_idx]
                            option_ref = select_ref.find_elements(By.TAG_NAME, "option")[opt_idx]
                            option_ref.click()
                            time.sleep(2)  # wait for price to update
                        except Exception:
                            continue

                        # Prices
                        try:
                            now_price = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]").text.strip()
                        except Exception:
                            now_price = None
                        try:
                            old_price = driver.find_element(By.XPATH, "//p[contains(@class,'wt-text-strikethrough')]/span[contains(@class,'currency-value')]").text.strip()
                        except Exception:
                            old_price = None

                        option_name = option_ref.text.strip()
                        results.append({
                            "URL": url, "Title": title, "Rating": rating, "Reviews": reviews,
                            "Delivery": delivery, "Option": option_name, "Old_Price": old_price, "Now_Price": now_price
                        })
        except Exception as e:
            print(f"[WARNING] Options skipped due to: {e}")

    driver.quit()
    return pd.DataFrame(results)

# --------------------------
# EXECUTION
# --------------------------
if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")
    print(df)


[INFO] Scraping product 1: https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise-ideal?click_key=b6b930c1-5125-4493-ab60-260a3d7e987e%3ALT330107eb53194702e8823e22051e66e818f686c1&click_sum=c67615f5&ls=s&ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=tote+bag&ref=search_grid-574664-1-1&sr_prefetch=1&pf_from=search&pro=1&bes=1&sts=1&local_signal_search=1&content_source=b6b930c1-5125-4493-ab60-260a3d7e987e%253ALT330107eb53194702e8823e22051e66e818f686c1
  (Session info: chrome=142.0.7444.176); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#staleelementreferenceexception

In [18]:
df.head(11)

Unnamed: 0,URL,Title,Rating,Reviews,Delivery,Option,Old_Price,Now_Price
0,https://www.etsy.com/fr/listing/1836666545/tot...,Tote Bag Petit Bazar Personnalisé - Idéal pour...,4.98,,,Sélectionner une option,,
1,https://www.etsy.com/fr/listing/1836666545/tot...,Tote Bag Petit Bazar Personnalisé - Idéal pour...,4.98,,,Sélectionner une option,,
2,https://www.etsy.com/fr/listing/4301871513/sac...,"Sac fourre-tout en toile personnalisé, sac fou...",4.7464,,,Sélectionner une option,,
3,https://www.etsy.com/fr/listing/4301871513/sac...,"Sac fourre-tout en toile personnalisé, sac fou...",4.7464,,,Sélectionner une option,,
4,https://www.etsy.com/fr/listing/1825286680/dou...,Double Pocket Soft Corduroy Tote Bag (Dark Bro...,4.9005,,,Sélectionner une option,,
5,https://www.etsy.com/fr/listing/1825286680/dou...,Double Pocket Soft Corduroy Tote Bag (Dark Bro...,4.9005,,,Sélectionner une option,,
6,https://www.etsy.com/fr/listing/1075693684/sac...,"Sac personnalisé pour Enfant, tote bag, pochon...",4.749,,,Sélectionner une option,,
7,https://www.etsy.com/fr/listing/1075693684/sac...,"Sac personnalisé pour Enfant, tote bag, pochon...",4.749,,,Sélectionner une option,,
8,https://www.etsy.com/fr/listing/4314838388/sac...,Sacs en toile de jute personnalisés/nom person...,4.6684,,,Sélectionner une option,,
9,https://www.etsy.com/fr/listing/4314838388/sac...,Sacs en toile de jute personnalisés/nom person...,4.6684,,,Sélectionner une option,,
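
Every `Option` value in the table above is the dropdown's placeholder ("Sélectionner une option"), so no real variant was recorded. One way to avoid that is to filter the option list before clicking anything. The sketch below works on plain `(value, label)` pairs so it can be tried without a browser. The helper name and the placeholder prefixes are assumptions; only the French label is confirmed by the output above.

```python
# Assumed placeholder labels: the French one appears in the scraped table,
# the English one is a guess for the non-localised site.
PLACEHOLDER_PREFIXES = ("sélectionner", "select")


def real_options(options):
    """Filter (value, label) pairs down to selectable, non-placeholder options."""
    kept = []
    for value, label in options:
        text = (label or "").strip()
        if not value or not text:
            continue  # empty value or label: nothing to select
        if text.lower().startswith(PLACEHOLDER_PREFIXES):
            continue  # "Sélectionner une option" and similar
        kept.append((value, text))
    return kept
```

In the Selenium loop, the pairs could be built from each `<option>` element with `opt.get_attribute("value")` and `opt.text`, read once before any click changes the page.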


### SINGLE PRODUCT TEST

In [19]:
import time
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Launch Chrome
driver = uc.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 15)

# Single product URL
url = "https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise"
driver.get(url)
time.sleep(5)

# Title
title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()

# Rating
try:
    rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
    rating = float(rating_elem.get_attribute("value"))
except Exception:
    rating = None

# Reviews
try:
    reviews_elem = driver.find_element(By.XPATH, "//span[contains(@class,'wt-badge')]")
    reviews_text = reviews_elem.text.strip()
    reviews = int(reviews_text[1:-1]) if reviews_text.startswith("(") else None
except Exception:
    reviews = None

# Delivery
try:
    delivery_elem = driver.find_element(
        By.XPATH,
        "//span[contains(text(),'livraison') or contains(text(),'delivery')]"
    )
    delivery_text = delivery_elem.text.strip()
    delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
except Exception:
    delivery = None

# ======= Select first real option =======
try:
    select_elements = driver.find_elements(By.XPATH, "//select[@id or @name]")
    for select in select_elements:
        options = select.find_elements(By.TAG_NAME, "option")
        for opt in options:
            if "sélectionner" not in opt.text.lower():
                opt.click()
                time.sleep(2)  # wait for price update
                break  # pick the first real option
except Exception as e:
    print("Option selection skipped:", e)

# Price
try:
    now_price = driver.find_element(
        By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]"
    ).text.strip()
except Exception:
    now_price = None

try:
    old_price = driver.find_element(
        By.XPATH, "//p[contains(@class,'wt-text-strikethrough')]/span[contains(@class,'currency-value')]"
    ).text.strip()
except Exception:
    old_price = None

print("Title:", title)
print("Rating:", rating)
print("Reviews:", reviews)
print("Delivery:", delivery)
print("Old Price:", old_price)
print("Now Price:", now_price)

driver.quit()


Option selection skipped: Message: stale element reference: stale element not found in the current frame
  (Session info: chrome=142.0.7444.176); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#staleelementreferenceexception

Title: Tote Bag Petit Bazar Personnalisé - Idéal pour Cadeaux !
Rating: 4.98
Reviews: None
Delivery: None
Old Price: None
Now Price: None


### FIX

In [20]:
import time
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Launch Chrome
driver = uc.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 15)

# Product page
url = "https://www.etsy.com/fr/listing/1836666545/tote-bag-petit-bazar-personnalise"
driver.get(url)
time.sleep(5)

# Title
title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()

# Rating
try:
    rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
    rating = float(rating_elem.get_attribute("value"))
except Exception:  # element missing or value not numeric
    rating = None

# Reviews: the badge text is expected in the form "(123)"
try:
    reviews_elem = driver.find_element(By.XPATH, "//span[contains(@class,'wt-badge')]")
    reviews_text = reviews_elem.text.strip()
    reviews = int(reviews_text[1:-1]) if reviews_text.startswith("(") else None
except Exception:  # badge missing or count not a plain integer
    reviews = None

# Delivery: 0 means free delivery, otherwise keep the raw text
try:
    delivery_elem = driver.find_element(
        By.XPATH, "//span[contains(text(),'livraison') or contains(text(),'delivery')]"
    )
    delivery_text = delivery_elem.text.strip()
    delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
except Exception:  # no delivery text found on the page
    delivery = None

# ======= Handle Options =======
results = []

try:
    # Variant buttons (color, size, etc.)
    variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
    for section in variant_sections:
        options = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
        for opt in options:
            option_name = opt.get_attribute("aria-label") or opt.text
            try:
                opt.click()
                time.sleep(2)  # wait for price to update
            except Exception:  # option not clickable; skip it
                continue

            # Prices
            try:
                now_price = driver.find_element(
                    By.XPATH, "//p[contains(@class,'wt-text-title-03')]/span[contains(@class,'currency-value')]"
                ).text.strip()
            except Exception:  # no current price found
                now_price = None
            try:
                old_price = driver.find_element(
                    By.XPATH, "//p[contains(@class,'wt-text-strikethrough')]/span[contains(@class,'currency-value')]"
                ).text.strip()
            except Exception:  # no strikethrough price found
                old_price = None

            results.append({
                "URL": url,
                "Title": title,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery,
                "Option": option_name,
                "Old_Price": old_price,
                "Now_Price": now_price
            })

except Exception as e:
    print("Variant selection skipped:", e)

driver.quit()

# Show results
for r in results:
    print(r)
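
The stale element error in the first attempt came from reusing element references after a click had re-rendered the page. One way to make clicks more tolerant of that is a small retry helper that re-runs the whole lookup. This is a minimal sketch, not part of the original notebook: `retry` is a hypothetical helper name, and the Selenium usage in the comment is an untested assumption.

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action() until it succeeds or the attempts run out.

    `exceptions` is the exception type (or tuple of types) worth retrying.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except exceptions as err:
            last_error = err
            time.sleep(delay)  # give the page time to finish re-rendering
    raise last_error


# Hypothetical usage with Selenium (not run here): the lambda looks the
# element up again on every attempt instead of reusing a stored reference.
# from selenium.common.exceptions import StaleElementReferenceException
# retry(lambda: driver.find_element(By.XPATH, xpath).click(),
#       StaleElementReferenceException)
```

The key design choice is that `action` performs the lookup and the click together, so each retry works on a fresh element.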


==================================================================================================================================
# <div align="center">DATA CLEANING & ANALYSIS</div>
==================================================================================================================================

#### 🗃️ **Raw data**

- The scraped data is stored in a DataFrame, exported to a CSV file, and uploaded to Google Drive
- `df_url` must be a direct-download link to that CSV file on Google Drive, not the regular sharing link
- The CSV is then loaded for data cleaning and analysis
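
A sharing link copied from Google Drive opens a preview page rather than the file itself. A commonly used workaround is to rebuild the link around the file ID. The sketch below assumes the file is shared as "anyone with the link" and is small enough that Drive serves it without an intermediate confirmation page; `drive_download_url` is a hypothetical helper name.

```python
import re


def drive_download_url(share_url):
    """Turn a Google Drive sharing link into a direct-download link."""
    # The file ID appears either as /d/<id>/ or as an id=<id> query parameter.
    match = re.search(r"/d/([\w-]+)", share_url) or re.search(r"[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"
```

The returned string can then be assigned to `df_url` and passed to `pd.read_csv`.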

In [None]:
import pandas as pd

# Load the CSV
df_url = 'link to the dataFrame collected from scraping as a downloadable link from google drive'
df = pd.read_csv(df_url)
df.head(3)

----

#### 🗃️ **Interim data**

In [None]:
# Save 'Price' INTERIM to CSV
df.to_csv("../data/interim/1_interim_price.csv", index=False)
print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")
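
The cleaning that turns the interim 'Price' data into the clean version is not shown in this notebook. A minimal sketch of what it might involve, assuming the scraper returns prices as text such as `12,50` (French formatting) and that a lone comma is a decimal mark; `parse_price` is a hypothetical helper name.

```python
import re


def parse_price(text):
    """Convert scraped price text such as '12,50' or '1 234,50 €' to a float."""
    if text is None:
        return None
    cleaned = re.sub(r"[^\d,.]", "", text)  # drop currency symbols and spaces
    if "," in cleaned and "." in cleaned:
        # The right-most separator is the decimal mark; the other groups thousands.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        cleaned = cleaned.replace(",", ".")  # assumption: lone comma = decimal mark
    try:
        return float(cleaned)
    except ValueError:
        return None  # empty or unparseable text
```

Applied to the scraper's output, this could look like `df["Now_Price"] = df["Now_Price"].map(parse_price)`, and the same for `Old_Price`.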

----

#### 🗃️ **Clean data**

In [None]:
# Save 'Price' CLEAN to CSV
df.to_csv("../data/clean/1_clean_price.csv", index=False)
print("STEP 1 : 'Price' CLEAN and CSV saved successfully!")

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### 🌐 **Which Are the Best-Selling POD Products on Etsy?**

I’m researching print-on-demand products to sell on Etsy that only require **digital artwork and marketing**, while the POD provider handles **printing, packaging, and shipping**.


### ⭐ Using Google Trends for POD Product Research
💡 **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020–2025).

Below is the list of product categories I’m comparing:

1. ```Custom Apparel```
    - T-shirts  
    - Hoodies  
    - Sweatshirts  
    - Tank tops 

2. ```Mug```
    - Ceramic mugs  
    - Color-changing mugs  
    - Espresso mugs  
    - Travel mugs 

3. ```Tote Bag```
    - Cotton totes  
    - All-over print totes  

4. ```Phone Case```
    - iPhone / Samsung cases  
    - Tough / Slim cases  

5. ```Stickers```
    - Die-cut stickers  
    - Kiss-cut stickers  
    - Sticker sheets 

6. ```Hats```
    - Baseball caps  
    - Trucker hats  
    - Beanies  

7. ```Pillows / Cushions```
    - Pillow covers  
    - Stuffed pillows  
    - All-over print pillow designs  

8. ```Blanket```
    - Fleece blankets  
    - Sherpa blankets  
    - Woven blankets  

9. ```Wall Art```
    - Posters  
    - Canvas prints  
    - Framed posters  
    - Metal prints  

10. ```Doormat```
    - Printed coir doormats  
    - Rubber-backed doormats 

11. ```Drinkware```
    - Stainless steel tumblers  
    - Water bottles  
    - Wine tumblers 

12. ```Calendar```
    - Custom printed wall calendars  

13. ```Yoga Mat```
    - Printed yoga mats 

14. ```Bedding```
    - Duvet covers  
    - Pillowcases  
    - All-over print bed sets

15. ```Pet Accessories```
    - Pet bandanas  
    - Pet beds  
    - Pet bowls  
    - Pet blankets  

16. ```Ornaments```
    - Ceramic ornaments
    - Wood ornaments
    - Metal ornaments 
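
Google Trends compares at most five search terms at a time, and each comparison is scaled relative to its own peak, so 16 categories cannot be ranked from a single query. One workaround is to repeat a shared anchor term in every comparison and rescale the batches against it. The sketch below assumes the mean interest per term has already been read off each comparison; `merge_trend_batches` is a hypothetical helper and the numbers in the usage lines are made up for illustration.

```python
def merge_trend_batches(batches, anchor):
    """Put several Google Trends comparisons on one scale via a shared anchor term.

    Each batch maps term -> mean interest (0-100) within that comparison.
    Every batch must include `anchor` with a non-zero value.
    """
    reference = batches[0][anchor]
    merged = {}
    for batch in batches:
        scale = reference / batch[anchor]  # align this batch with the first one
        for term, value in batch.items():
            merged[term] = value * scale
    # Highest interest first
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))


# Illustrative values only, not real Google Trends data
batches = [
    {"tote bag": 40, "mug": 80},
    {"tote bag": 20, "doormat": 10},
]
ranking = merge_trend_batches(batches, anchor="tote bag")
```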



------
### 🎯 Chosen POD product to research: tote bags

Example of the rating attribute found in the page markup: `aria-label="4.9 star rating with 398 reviews"`
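
An attribute like this carries both the rating and the review count, so a single regular expression can extract the two values. This is a minimal sketch that assumes the English wording shown above; on the French version of the site the label text is likely to differ. `parse_rating_label` is a hypothetical helper name.

```python
import re

RATING_LABEL = re.compile(r"(\d+(?:\.\d+)?) star rating with ([\d,]+) reviews?")


def parse_rating_label(label):
    """Return (rating, review_count) from an aria-label, or (None, None)."""
    match = RATING_LABEL.search(label)
    if match is None:
        return None, None
    rating = float(match.group(1))
    reviews = int(match.group(2).replace(",", ""))  # tolerate "1,204"
    return rating, reviews
```

With Selenium, the label itself could be read with `element.get_attribute("aria-label")` before being passed to this function.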

Etsy store selling print-on-demand products

Data needed:
- Product title keywords, to optimize listing titles for sales
- Product description keywords
- Niches, identified from the best-selling keywords
- Best period to sell, inferred from review dates
- Price: the best-selling price point and range
- Target audience?
- How to market it?
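
The keyword items above could start from a simple count of how many listing titles contain each word. This is a minimal sketch: the stopword list is a small illustrative assumption, and the titles in the usage lines are invented examples, not scraped data.

```python
import re
from collections import Counter

# Small illustrative stopword list; extend it for real data
STOPWORDS = {"a", "an", "and", "for", "the", "with", "of", "in", "to"}


def top_keywords(titles, n=10):
    """Rank words by the number of titles they appear in."""
    counts = Counter()
    for title in titles:
        # Letters only (accented letters included); count each word once per title
        words = set(re.findall(r"[^\W\d_]+", title.lower()))
        counts.update(words - STOPWORDS)
    return counts.most_common(n)


# Invented example titles
sample_titles = [
    "Custom Tote Bag for Teachers",
    "Floral Tote Bag",
    "Teacher Gift Tote",
]
top_two = top_keywords(sample_titles, n=2)
```

Counting each word once per title keeps a single keyword-stuffed listing from dominating the ranking.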

Chosen website for data scraping: Etsy

Data to extract:

- product_title: the keywords used in it, to analyse the niches of this POD product

- product_price: to work out the best price to sell at

- product_listing_date: the date the product was created and added to Etsy

- product_rating: to see which niche of this POD product sells the most
- product_niche_rating

- product_reviews_date: to compare the number of reviews with the number of orders, and to plot the product's rating over time. Review dates show when sales happened the most and whether they are recent: two products can have the same number of orders accumulated over very different lengths of time.
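
To see when a product's reviews (and, by proxy, its sales) were concentrated, the review dates can be grouped by month. This is a minimal sketch that assumes dates are stored as `DD/MM/YYYY` text; the dates in the usage lines are invented, and `reviews_per_month` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def reviews_per_month(review_dates, fmt="%d/%m/%Y"):
    """Count reviews per calendar month, in chronological order."""
    months = Counter(
        datetime.strptime(date_text, fmt).strftime("%Y-%m")
        for date_text in review_dates
    )
    return sorted(months.items())


# Invented example dates
monthly = reviews_per_month(["03/11/2024", "15/11/2024", "02/12/2024"])
```

Two products with the same total can then be compared by how many months their counts span and how recent the busiest months are.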

In [1]:
# product_category : t-shirt, mug, calendar,...
# product_niche : comedy, drama, horror, halloween, cartoon, anime, ... 
# product_price :  in euros
# product_listing_date: 00/00/0000 date created and added to etsy on product page
# product_rating: 0.0/5 current rating of the product to compare
# product_reviews_ratings: DataFrame with reviews ratings of each product from product page
# product_reviews_dates: DataFrame with reviews dates of each product from product page
# product_reviews_descriptions: DataFrame with reviews descriptions of each product from product page

==================================================================================================================================
# <div align="center">PLOTS</div>
==================================================================================================================================

### 📊 PLOT 01:

In [2]:
# PLOT 1

### 📊 PLOT 02:

In [3]:
# PLOT 2

### 📊 PLOT 03:

In [4]:
# PLOT 3

### 📊 PLOT 04:

In [1]:
# PLOT 4

### 📊 PLOT 05:

In [6]:
# PLOT 5

==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### 🧠 INSIGHT 01:
Text

----

### 🧠 INSIGHT 02:
Text

---

### 🧠 INSIGHT 03:
Text


==================================================================================================================================