==================================================================================================================================
# <div align="center">PROJECT 03: Etsy Print-On-Demand Trends</div>
==================================================================================================================================

### 📝 BUSINESS IDEA

**Print-On-Demand (POD) Business** – What the project is about

### ⁉️ PROBLEM

No API exposes the market data this project needs, so the insights have to be gathered by web scraping – The challenge we’re addressing

### 🔰 SOLUTION FRAMEWORK

Web scrape Etsy for a specific POD product.

Collect the data needed for cleaning and analysis.


| **Development**                                                                                                                                             | **Presentation**                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **Business Idea** → **Problem Definition** → **Data Research & Visualization** → **Insights** → **Interpretation** → **Implications** → **Business Impact** | **Limitations & Considerations** |

### 📌 SECTION OVERVIEW

* **Project / Business Idea:** What the project is about
* **Problem:** The challenge we’re addressing
* **Solution / Approach:** How we solve it
* **Research & Plots:** How we analyzed data visually
* **Insights:** What we discovered
* **Interpretation:** Why it matters
* **Implications:** What actions the business can take
* **Business Impact:** Expected results for the business
* **Limitations:** What constraints or gaps exist

==================================================================================================================================
# <div align="center">WEB SCRAPING</div>
==================================================================================================================================

`Etsy` is a dynamic website, so scraping it requires careful handling.

`Etsy` uses `JavaScript` to load some of its content. `requests` + `BeautifulSoup` may work for the static parts (such as search results), but `Selenium` is more reliable for dynamic content.

I will use `requests` + `BeautifulSoup` for the `product listings` **(title, price, link)**.

**Important note:** Etsy combines dynamic loading with anti-bot protections. Standard HTML scraping works only as long as Etsy doesn’t block the request. If the request is blocked, the fallback options are custom headers, rotating proxies, or the Etsy API.
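Before switching tools, it helps to detect that a request was blocked at all. The sketch below is a minimal illustration, not Etsy-documented behavior: the helper names, the status codes (403, 429) and the body markers are assumptions chosen for the example.

```python
# Hypothetical helpers: the status codes and body markers below are
# assumptions for illustration, not values documented by Etsy.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied")


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True when a response looks like an anti-bot block page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    body = html.lower()
    return any(marker in body for marker in BLOCK_MARKERS)


def backoff_delays(retries: int = 3, base_delay: float = 2.0) -> list[float]:
    """Exponential wait times (in seconds) to use between retries."""
    return [base_delay * (2 ** attempt) for attempt in range(retries)]


print(looks_blocked(200, "<html><li>Tote bag</li></html>"))  # False
print(looks_blocked(429, ""))                                # True
print(backoff_delays())                                      # [2.0, 4.0, 8.0]
```

A scraper could call `looks_blocked(response.status_code, response.text)` after each request and sleep for the next delay before retrying, or fall back to Selenium once the retries are used up.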

----

### 🧰 **Install for web scraping**

In [None]:
# install requests & beautifulsoup
!pip install requests beautifulsoup4 fake-useragent pandas

# install selenium (undetected-chromedriver is imported by the FR version further down)
!pip install selenium undetected-chromedriver pandas

---

### 📌 **Avoid getting BLOCKED**
| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                    | Etsy may block request        |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Loads dynamic content, less likely to be blocked | Slower, requires ChromeDriver |


----

### 📌 **Pagination + BeautifulSoup**
| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                    | Etsy may block request        |

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time


def scrape_products(pages=5, max_items=10):
    base_url = "https://www.etsy.com/search?q=tote+bag&page="
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
    }

    data = []

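    for page in range(1, pages + 1):
        url = base_url + str(page)
        response = requests.get(url, headers=headers, timeout=15)
        # Stop early if Etsy blocks or rejects the request
        if response.status_code != 200:
            print(f"[WARNING] Page {page} returned HTTP {response.status_code}; stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")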

        products = soup.find_all("li", class_="wt-list-unstyled")

        for item in products:
            if len(data) >= max_items:
                return pd.DataFrame(data)

            # URL
            link = item.find("a", href=True)
            if not link:
                continue
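            href = link["href"]
            # Hrefs may already be absolute; only prefix relative paths
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href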

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    # Leave price as None when the text is not a plain number
                    pass

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        time.sleep(1)

    return pd.DataFrame(data)


# Example: save CSV
if __name__ == "__main__":
    df = scrape_products()
    df.to_csv("../data/interim/0_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")


---

### 📌 **Selenium-Based (ChromeDriver)**

| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Loads dynamic content, less likely to be blocked | Slower, requires ChromeDriver |

Link to ChromeDriver: https://googlechromelabs.github.io/chrome-for-testing/#stable

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
import time
import re


def scrape_products_selenium(max_items=10):
    options = Options()
    options.add_argument("--headless")  
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-gpu")
    options.add_argument("start-maximized")
    options.add_argument("user-agent=Mozilla/5.0")

    driver = webdriver.Chrome(options=options)

    data = []
    page = 1

    while len(data) < max_items:
        url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
        driver.get(url)
        time.sleep(4)

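        soup = BeautifulSoup(driver.page_source, "html.parser")
        products = soup.find_all("li", class_="wt-list-unstyled")
        # Stop if the page has no listings (blocked, or past the last page);
        # otherwise the while loop would never end
        if not products:
            break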

        for item in products:
            if len(data) >= max_items:
                break

            # URL
            link = item.find("a", href=True)
            if not link:
                continue
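            href = link["href"]
            # Hrefs may already be absolute; only prefix relative paths
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href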

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    # Leave price as None when the text is not a plain number
                    pass

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        page += 1
        time.sleep(2)

    driver.quit()

    df = pd.DataFrame(data)
    return df


# Save CSV
if __name__ == "__main__":
    df = scrape_products_selenium()
    df.to_csv("../data/interim/1_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")


---

## 📌 **Product PAGE**
The main data fields to extract from Etsy's product page:

### ⭐ **Etsy Product Info**

| Field Name            | Python Data Type       | Concise Definition               | Long Definition                                                                                       |
|-----------------------|-----------------------|---------------------------------|-------------------------------------------------------------------------------------------------------|
| product_id            | `str`                   | Unique Etsy listing ID.          | Unique identifier assigned by Etsy to each product listing.                                           |
| product_title         | `str`                   | Product’s title.                 | The full title/name of the product as shown on the listing page.                                      |
| old_price             | `float` or `Decimal`      | Price before discount.           | The original price before any discounts were applied.                                                 |
| discount_percentage   | `float`                 | Discount rate in percent.        | The discount value expressed as a percentage (e.g., 20.0 for 20%).                                    |
| now_price             | `float` or `Decimal`      | Price after discount.            | The current price after applying discounts.                                                           |
| currency              | `str`                   | Currency code (e.g., USD).       | The currency code used for the product price (e.g., "USD", "EUR").                                    |
| listed_date           | `datetime`              | Date the item was listed.        | The date (and optionally time) when the product was first listed on Etsy.                             |
| product_url           | `str`                   | Link to the product page.        | The direct URL link to the Etsy product page.                                                         |
| product_description   | `str`                   | Product description text.        | The text description of the product, including details, features, and information provided by seller.|
| product_variation     | `list[dict]`            | List of available variations.    | A list of variation options (size, color, material, etc.), each represented as a dictionary.          |


---

### ⭐ **Derived Insight Data**

| Field Name                 | Python Data Type       | Concise Definition                               |
|---------------------------|-------------------------|---------------------------------------------------|
| product_niche             | `str`                     | Product theme or genre (comedy, anime, …) derived from `product_title` & `product_description`.         |
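
One simple way to derive `product_niche` is keyword matching on the title and description. The sketch below is only an illustration: the niche names and keyword lists are hypothetical and should be replaced with terms that actually appear in the scraped data.

```python
# Hypothetical niche keywords; replace with terms found in the scraped titles.
NICHE_KEYWORDS = {
    "anime": ("anime", "manga"),
    "halloween": ("halloween", "pumpkin", "spooky"),
    "comedy": ("funny", "sarcastic", "joke"),
}


def tag_niche(title: str, description: str = "") -> str:
    """Return the first niche whose keywords appear in the text, else 'other'."""
    text = f"{title} {description}".lower()
    for niche, keywords in NICHE_KEYWORDS.items():
        if any(word in text for word in keywords):
            return niche
    return "other"


print(tag_niche("Funny Cat Tote Bag"))                       # comedy
print(tag_niche("Canvas Tote Bag", "Spooky pumpkin print"))  # halloween
```

Substring matching is deliberately naive here (for example, "joke" also matches "joker"); a stricter version could match whole words only.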

---

### ⭐ **Etsy Product Reviews (Extra dataset)**

| Field Name                     | Python Data Type | Concise Definition                         |
|-------------------------------|------------------|---------------------------------------------|
| product_reviews         | `pd.DataFrame`     | One row per review: the rating, the date it was posted, and the review text.          |


---

## 📌 **CODE**

### FR VERSION

In [None]:
import time
import re
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from itertools import product

def get_prices(driver):
    """
    Extract now price, old price, and calculate percentage difference.
    Returns: now_price, old_price, percentage_difference
    """
    now_price, old_price = None, None

    try:
        # Grab all relevant price elements
        price_elements = driver.find_elements(By.XPATH, "//p[contains(@class,'wt-text-title')]/span | //span[contains(@class,'wt-text-strikethrough')]")
        for elem in price_elements:
            text = elem.text.strip().replace("€", "").replace("+", "").replace(",", ".")
            try:
                value = float(text)
            except ValueError:
                continue

            # Determine if strikethrough -> old price
            if "wt-text-strikethrough" in elem.get_attribute("class"):
                old_price = value
            else:
                now_price = value

        # Fallback if only one price found
        if now_price is None and old_price is not None:
            now_price = old_price
        if old_price is None:
            old_price = now_price

    except Exception:
        now_price, old_price = None, None

    # Calculate percentage difference
    percentage_difference_price = round((old_price - now_price) / old_price * 100, 2) if old_price and now_price and old_price != now_price else None

    return now_price, old_price, percentage_difference_price


def scrape_products(limit=10):
    driver = uc.Chrome()
    driver.maximize_window()
    wait = WebDriverWait(driver, 15)

    # Search page for tote bags
    search_url = "https://www.etsy.com/fr/search?q=tote+bag"
    driver.get(search_url)
    time.sleep(5)

    # Collect product links
    product_links = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//ul[contains(@class,'wt-grid')]/li//a[@data-listing-id]"))
    )
    product_links = [link.get_attribute("href") for link in product_links][:limit]

    results = []

    for idx, url in enumerate(product_links):
        print(f"[INFO] Scraping product {idx+1}/{len(product_links)}: {url}")
        driver.get(url)
        time.sleep(5)

        # --- Title ---
        try:
            title = wait.until(EC.presence_of_element_located((By.XPATH, "//h1"))).text.strip()
        except Exception:
            title = None

        # --- Rating ---
        try:
            rating_elem = driver.find_element(By.XPATH, "//input[@name='rating']")
            rating = float(rating_elem.get_attribute("value"))
        except Exception:
            rating = None

        # --- Reviews ---
        try:
            reviews_elem = driver.find_element(By.XPATH, "//h2[contains(@class,'review-header-text')]")
            txt_reviews = reviews_elem.text.strip()
            match = re.search(r"\((.*?)\)", txt_reviews)
            if match:
                num_text = match.group(1).strip()
                if "K" in num_text or "k" in num_text:
                    num_text = num_text.replace("K", "").replace("k", "").replace(",", ".")
                    nbr_reviews = int(float(num_text) * 1000)
                else:
                    num_text = num_text.replace(",", "").replace(" ", "").replace(".", "")
                    nbr_reviews = int(num_text)
            else:
                nbr_reviews = 0
        except Exception:
            txt_reviews = None
            nbr_reviews = None

        # --- Delivery ---
        try:
            delivery_elem = driver.find_element(By.XPATH, "//span[contains(text(),'livraison') or contains(text(),'delivery')]")
            delivery_text = delivery_elem.text.strip()
            delivery = 0 if "gratuit" in delivery_text.lower() or "free" in delivery_text.lower() else delivery_text
        except Exception:
            delivery = None

        # --- Variants ---
        try:
            variant_sections = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")
            if not variant_sections:
                # Single price
                now_price, old_price, percentage_difference_price = get_prices(driver)
                results.append({
                    "URL": url, "Title": title, "Rating": rating,
                    "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                    "Delivery": delivery, "Option": None,
                    "Old_Price": old_price, "Now_Price": now_price,
                    "Percentage_Difference_Price": percentage_difference_price
                })
            else:
                # Handle variants
                all_options = []
                for section in variant_sections:
                    opts = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                    option_names = [opt.get_attribute("aria-label") or opt.text for opt in opts]
                    all_options.append(option_names)

                # Generate all combinations
                for combo in product(*all_options):
                    try:
                        # Click each option
                        for sec_idx, option_name in enumerate(combo):
                            section = driver.find_elements(By.XPATH, "//fieldset[contains(@data-selector,'option')]")[sec_idx]
                            opt_buttons = section.find_elements(By.XPATH, ".//button[not(contains(@aria-label,'Sélectionner'))]")
                            for btn in opt_buttons:
                                btn_name = btn.get_attribute("aria-label") or btn.text
                                if btn_name == option_name:
                                    btn.click()
                                    time.sleep(2)
                                    break

                        # Extract prices
                        now_price, old_price, percentage_difference_price = get_prices(driver)

                        results.append({
                            "URL": url, "Title": title, "Rating": rating,
                            "txt_reviews": txt_reviews, "nbr_reviews": nbr_reviews,
                            "Delivery": delivery, "Option": " | ".join(combo),
                            "Old_Price": old_price, "Now_Price": now_price,
                            "Percentage_Difference_Price": percentage_difference_price
                        })
                    except Exception as e:
                        print(f"[WARNING] Could not process combo {combo}: {e}")

        except Exception as e:
            print(f"[WARNING] Variant handling skipped for product {url}: {e}")

    driver.quit()
    return pd.DataFrame(results)


if __name__ == "__main__":
    df = scrape_products(limit=10)
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("[SUCCESS] CSV saved!")
    # A bare expression inside an if-block is not displayed by the notebook
    print(df.head(10))

==================================================================================================================================
# <div align="center">DATA CLEANING & ANALYSIS</div>
==================================================================================================================================

#### 🗃️ **Raw data**

- The scraped data is stored in a DataFrame, saved as a CSV file, and uploaded to Google Drive.
- `df_url` must be a direct-download link to that CSV file on Google Drive.
- The CSV is then loaded for data cleaning and analysis.

In [None]:
import pandas as pd

# Load RAW DATA CSV
df_url = 'link to the dataFrame collected from scraping as a downloadable link from google drive'
df_etsy = pd.read_csv(df_url)

print("STEP 1 : RAW CSV loaded successfully!")
df_etsy.head()
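
A Google Drive share link usually opens a preview page rather than the file itself, so `pd.read_csv` cannot read it directly. The sketch below converts a share link into a direct-download link. The helper name is hypothetical and the URL patterns are assumptions about how Drive links are commonly structured; the file also has to be shared so that anyone with the link can view it.

```python
import re


def drive_direct_link(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link."""
    match = (re.search(r"/d/([\w-]+)", share_url)
             or re.search(r"[?&]id=([\w-]+)", share_url))
    if not match:
        raise ValueError(f"No file ID found in: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


# FILE_ID_PLACEHOLDER stands in for a real file ID.
print(drive_direct_link(
    "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
))
```

The result can then be assigned to `df_url` before calling `pd.read_csv(df_url)`.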


----

#### 🗃️ **Interim data**

In [None]:
# Save INTERIM DATA to CSV
df_etsy.to_csv("../data/interim/interim_data.csv", index=False)
print("STEP 2 : INTERIM CSV saved successfully!")
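
Between the interim and the clean save, the data needs at least a basic cleaning pass. The following is a minimal sketch, assuming the column names produced by the search-page scrapers above (`URL`, `Title`, `Price`, `Rating`, `Reviews`, `Delivery`); the function name and the sample rows are made up for illustration.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate listings and coerce numeric columns."""
    out = df.drop_duplicates(subset="URL").copy()
    for col in ("Price", "Rating", "Reviews", "Delivery"):
        if col in out.columns:
            # Non-numeric values become NaN instead of raising
            out[col] = pd.to_numeric(out[col], errors="coerce")
    # A listing without a title or price cannot be analyzed
    return out.dropna(subset=["Title", "Price"]).reset_index(drop=True)


# Made-up rows, only to show the behavior
sample = pd.DataFrame({
    "URL": ["a", "a", "b", "c"],
    "Title": ["Tote A", "Tote A", "Tote B", None],
    "Price": ["12.5", "12.5", "n/a", "9.0"],
})
print(clean_listings(sample))
```

Here only the first row survives: the second is a duplicate URL, the third has a non-numeric price, and the fourth has no title.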

----

#### 🗃️ **Clean data**

In [None]:
# Save CLEAN DATA to CSV
df_etsy.to_csv("../data/clean/clean_data.csv", index=False)
print("STEP 3 : CLEAN CSV saved successfully!")

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### 🌐 **Which Are the Best-Selling POD Products on Etsy?**

I’m researching print-on-demand products to sell on Etsy that only require **digital artwork and marketing**, while the POD provider handles **printing, packaging, and shipping**.


### ⭐ Using Google Trends for POD Product Research
💡 **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020–2025).

Below is the list of product categories I’m comparing:

1. ```Custom Apparel```
    - T-shirts  
    - Hoodies  
    - Sweatshirts  
    - Tank tops 

2. ```Mug```
    - Ceramic mugs  
    - Color-changing mugs  
    - Espresso mugs  
    - Travel mugs 

3. ```Tote Bag```
    - Cotton totes  
    - All-over print totes  

4. ```Phone Case```
    - iPhone / Samsung cases  
    - Tough / Slim cases  

5. ```Stickers```
    - Die-cut stickers  
    - Kiss-cut stickers  
    - Sticker sheets 

6. ```Hats```
    - Baseball caps  
    - Trucker hats  
    - Beanies  

7. ```Pillows / Cushions```
    - Pillow covers  
    - Stuffed pillows  
    - All-over print pillow designs  

8. ```Blanket```
    - Fleece blankets  
    - Sherpa blankets  
    - Woven blankets  

9. ```Wall Art```
    - Posters  
    - Canvas prints  
    - Framed posters  
    - Metal prints  

10. ```Doormat```
    - Printed coir doormats  
    - Rubber-backed doormats 

11. ```Drinkware```
    - Stainless steel tumblers  
    - Water bottles  
    - Wine tumblers 

12. ```Calendar```
    - Custom printed wall calendars  

13. ```Yoga Mat```
    - Printed yoga mats 

14. ```Bedding```
    - Duvet covers  
    - Pillowcases  
    - All-over print bed sets

15. ```Pet Accessories```
    - Pet bandanas  
    - Pet beds  
    - Pet bowls  
    - Pet blankets  

16. ```Ornaments```
    - Ceramic ornaments
    - Wood ornaments
    - Metal ornaments 
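
Google Trends compares at most five search terms at a time, and its scores are relative within each comparison. To compare all 16 categories, one common approach is to split them into batches that share an anchor term, so the batches can be rescaled against each other afterwards. The sketch below only builds the batches; the function name and the choice of "Mug" as anchor are assumptions for illustration.

```python
def batch_for_trends(categories, anchor, max_terms=5):
    """Split categories into batches that each start with the anchor term."""
    others = [c for c in categories if c != anchor]
    step = max_terms - 1  # one slot per batch is reserved for the anchor
    return [[anchor] + others[i:i + step] for i in range(0, len(others), step)]


categories = [
    "Custom Apparel", "Mug", "Tote Bag", "Phone Case", "Stickers", "Hats",
    "Pillows / Cushions", "Blanket", "Wall Art", "Doormat", "Drinkware",
    "Calendar", "Yoga Mat", "Bedding", "Pet Accessories", "Ornaments",
]
for batch in batch_for_trends(categories, anchor="Mug"):
    print(batch)
```

Each batch can then be entered as one Google Trends comparison, and the anchor's score in each batch gives the factor for putting the batches on a common scale.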



------
### 🎯 Chosen POD product to research: tote bags

Example rating attribute to parse: `aria-label="4.9 star rating with 398 reviews"`
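
A label in this form can be split into a rating and a review count with a regular expression. This is a minimal sketch: the function name is hypothetical, and the pattern assumes the English wording shown above.

```python
import re

# Matches e.g. "4.9 star rating with 398 reviews"
ARIA_PATTERN = re.compile(r"(\d+(?:\.\d+)?) star rating with ([\d,]+) reviews?")


def parse_rating_label(label):
    """Return (rating, review_count) from an aria-label, or None if it does not match."""
    match = ARIA_PATTERN.search(label)
    if not match:
        return None
    return float(match.group(1)), int(match.group(2).replace(",", ""))


print(parse_rating_label("4.9 star rating with 398 reviews"))  # (4.9, 398)
```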

Business model: an Etsy store selling print-on-demand products.

Data needed:
- Title keywords that help optimize sales (from the product title)
- Description keywords (from the product description)
- Niches inferred from the best-selling keywords
- Best period to sell (from the review dates)
- Price: the best-selling price point and range
- Target audience?
- How to market it?
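
The title-keyword step can be sketched as a simple word count over the scraped titles. The function name, the stop-word list and the sample titles below are made up for illustration; the real input would be the `Title` column of the scraped data.

```python
import re
from collections import Counter

# Small hand-picked stop-word list (an assumption; extend as needed).
# "tote" and "bag" are excluded because every title contains them.
STOP_WORDS = {"a", "an", "and", "for", "the", "with", "of", "bag", "tote"}


def top_title_keywords(titles, n=5):
    """Return the n most frequent non-stop-words across all titles."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS]
    return Counter(words).most_common(n)


# Made-up titles, only to show the output shape
titles = ["Funny Cat Tote Bag", "Cat Mom Tote Bag Gift", "Floral Tote Bag for Mom"]
print(top_title_keywords(titles, n=2))  # [('cat', 2), ('mom', 2)]
```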

Chosen website for data scraping: Etsy

Data to extract:

- product_title: its keywords are used to analyze the niche of this POD product

- product_price: to find the best price to sell at

- product_listing_date: the date the product was created and added to Etsy

- product_rating: to see which niche of this POD product sells the most
- product_niche_rating

- product_reviews_date: to compare nbr_review vs nbr_orders and to plot the product's rating over time. Review dates show when sales peaked and whether they are recent: two products can reach the same number of orders over very different lengths of time.
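
The timing idea above can be sketched by counting reviews per month, treating review dates as a rough proxy for when orders happened (an assumption, since not every order leaves a review). The function name and the sample dates are made up for illustration.

```python
from collections import Counter
from datetime import date


def reviews_per_month(review_dates):
    """Count reviews per 'YYYY-MM' month, sorted chronologically."""
    counts = Counter(d.strftime("%Y-%m") for d in review_dates)
    return dict(sorted(counts.items()))


# Made-up dates, only to show the output shape
dates = [date(2024, 11, 3), date(2024, 11, 20), date(2024, 12, 1)]
print(reviews_per_month(dates))  # {'2024-11': 2, '2024-12': 1}
```

Comparing these monthly counts across products shows whether a product's reviews are concentrated in a recent period or spread over several years.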

In [None]:
# From the product page
# product : t-shirt, mug, calendar,...
# product_niche : comedy, drama, horror, halloween, cartoon, anime, ... 
# currency : usd or eur
# product_price :  00.00
# listed_date: 00/00/0000 date created and added to etsy on product page
# product_rating: 0.00/5 current rating of the product to compare

# product_reviews_ratings: DataFrame with the review ratings of each product from the product page
# product_reviews_dates: DataFrame with the review dates of each product from the product page
# product_reviews_texts: DataFrame with the review texts of each product from the product page

==================================================================================================================================
# <div align="center">PLOTS</div>
==================================================================================================================================

### 📊 PLOT 01:

In [None]:
# PLOT 1

### 📊 PLOT 02:

In [None]:
# PLOT 2

### 📊 PLOT 03:

In [None]:
# PLOT 3

### 📊 PLOT 04:

In [None]:
# PLOT 4

### 📊 PLOT 05:

In [None]:
# PLOT 5

==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### 🧠 INSIGHT 01:
Text

----

### 🧠 INSIGHT 02:
Text

---

### 🧠 INSIGHT 03:
Text


==================================================================================================================================