==================================================================================================================================
# <div align="center">PROJECT 03: Etsy Print-On-Demand Trends</div>
==================================================================================================================================

### 📝 BUSINESS IDEA

**Print-On-Demand (POD) Business** – What the project is about

### ⁉️ PROBLEM

No API exists to access the market data needed, requiring web scraping to gather insights – The challenge we’re addressing

### 🔰 SOLUTION FRAMEWORK

Scrape Etsy for a specific POD product.

Collect the data needed for cleaning and analysis.


| **Development**                                                                                                                                             | **Presentation**                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **Business Idea** → **Problem Definition** → **Data Research & Visualization** → **Insights** → **Interpretation** → **Implications** → **Business Impact** | **Limitations & Considerations** |

### 📌 SECTION OVERVIEW

* **Project / Business Idea:** What the project is about
* **Problem:** The challenge we’re addressing
* **Solution / Approach:** How we solve it
* **Research & Plots:** How we analyzed data visually
* **Insights:** What we discovered
* **Interpretation:** Why it matters
* **Implications:** What actions the business can take
* **Business Impact:** Expected results for the business
* **Limitations:** What constraints or gaps exist

==================================================================================================================================
# <div align="center">WEB SCRAPING</div>
==================================================================================================================================

```Etsy``` is a dynamic website, so scraping it requires careful handling.

Because ```Etsy``` loads some content with ```JavaScript```, ```requests``` + ``BeautifulSoup`` may work for the static parts (such as search results), but ``Selenium`` is more reliable for dynamic content.

I will use ``requests`` + ``BeautifulSoup`` for ```product listings``` **(title, price, link)**.

**Important note:** Etsy combines dynamic loading with anti-bot protections.

Standard HTML scraping works only as long as Etsy doesn’t block the request.

If the request is blocked, the next options are realistic browser headers, rotating proxies, or the Etsy API for whatever data it exposes.
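
One way to notice a block early is to inspect the response before parsing it. The sketch below is a heuristic, not documented Etsy behavior: 403 and 429 are the usual HTTP signals for a refused or rate-limited request, while the `captcha` marker and the `looks_blocked` helper are assumptions made for illustration.

```python
def looks_blocked(status_code: int, html: str) -> bool:
    """Heuristic check for a blocked scraping request (illustrative only)."""
    # 403 (Forbidden) and 429 (Too Many Requests) commonly signal refusal or rate limiting
    if status_code in (403, 429):
        return True
    # Assumption: a challenge page mentions "captcha" somewhere in its markup
    if "captcha" in html.lower():
        return True
    return False


# Usage with requests (not executed here):
#   response = requests.get(url, headers=headers, timeout=15)
#   if looks_blocked(response.status_code, response.text):
#       ...  # back off, switch to Selenium, or stop
```

A check like this keeps an error page from being parsed as if it were a list of products, which would otherwise show up only later as an empty DataFrame.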

==================================================================================================================================

----

### Avoiding getting blocked
| Version                                   | Best For          | Pros                                                    | Cons                                                |
| ----------------------------------------- | ----------------- | ------------------------------------------------------- | --------------------------------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                             | Etsy may block the request                          |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Loads dynamic content, behaves more like a real browser | Slower, requires ChromeDriver, can still be blocked |


#### 🧰 **Install for web scraping**

In [2]:
# install requests & beautifulsoup
!pip install requests beautifulsoup4 fake-useragent pandas

# install selenium (plus undetected-chromedriver, which the TEST and INTERIM cells import)
!pip install selenium undetected-chromedriver




----

### 📌 Pagination + BeautifulSoup Version
| Version                                   | Best For          | Pros                                           | Cons                          |
| ----------------------------------------- | ----------------- | ---------------------------------------------- | ----------------------------- |
| **Requests + BeautifulSoup + Pagination** | Simple scraping   | Fast, clean                                    | Etsy may block request        |
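
Pagination here means requesting the same search URL with an increasing `page` parameter. As a minimal sketch, the URLs can be built with the standard library instead of string concatenation; `build_search_url` and `page_urls` are hypothetical helper names, and the URL pattern is the one this notebook's scrapers use.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.etsy.com/search"  # search endpoint used by the scrapers in this notebook


def build_search_url(query: str, page: int = 1) -> str:
    """Return the search URL for one results page."""
    # urlencode escapes spaces as "+", giving the q=tote+bag form
    return f"{BASE_URL}?{urlencode({'q': query, 'page': page})}"


def page_urls(query: str, pages: int) -> list[str]:
    """Return the URLs for pages 1..pages (1-based)."""
    return [build_search_url(query, p) for p in range(1, pages + 1)]


print(page_urls("tote bag", 2))
```

Building the query string this way also makes it easy to swap in another POD product without hand-escaping the search term.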

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time


def scrape_products(pages=5, max_items=10):
    base_url = "https://www.etsy.com/search?q=tote+bag&page="
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
    }

    data = []

    for page in range(1, pages + 1):
        url = base_url + str(page)
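        response = requests.get(url, headers=headers, timeout=15)
        if response.status_code != 200:
            # A non-200 answer usually means the request was refused; do not parse an error page
            print(f"Request for page {page} returned status {response.status_code}; stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")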

        products = soup.find_all("li", class_="wt-list-unstyled")

        for item in products:
            if len(data) >= max_items:
                return pd.DataFrame(data)

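            # URL (the href may already be absolute; only prefix relative ones)
            link = item.find("a", href=True)
            if not link:
                continue
            href = link["href"]
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    # e.g. "1,234.50" cannot be parsed this way; leave the price empty
                    pass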

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        time.sleep(1)

    return pd.DataFrame(data)


# Example: save CSV
if __name__ == "__main__":
    df = scrape_products()
    df.to_csv("../data/interim/0_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")


### 📌 Selenium-Based Version (ChromeDriver)

| Version                                   | Best For          | Pros                                                    | Cons                                                |
| ----------------------------------------- | ----------------- | ------------------------------------------------------- | --------------------------------------------------- |
| **Selenium + BeautifulSoup + Pagination** | Reliable scraping | Loads dynamic content, behaves more like a real browser | Slower, requires ChromeDriver, can still be blocked |

Link to ChromeDriver: https://googlechromelabs.github.io/chrome-for-testing/#stable

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
import time
import re


def scrape_products_selenium(max_items=10):
    options = Options()
    options.add_argument("--headless")  
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-gpu")
    options.add_argument("start-maximized")
    options.add_argument("user-agent=Mozilla/5.0")

    driver = webdriver.Chrome(options=options)

    data = []
    page = 1

    while len(data) < max_items:
        url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
        driver.get(url)
        time.sleep(4)

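        soup = BeautifulSoup(driver.page_source, "html.parser")
        products = soup.find_all("li", class_="wt-list-unstyled")
        if not products:
            # Nothing matched (blocked page or changed markup): stop instead of looping forever
            break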

        for item in products:
            if len(data) >= max_items:
                break

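            # URL (the href may already be absolute; only prefix relative ones)
            link = item.find("a", href=True)
            if not link:
                continue
            href = link["href"]
            product_url = href if href.startswith("http") else "https://www.etsy.com" + href

            # Title
            title_tag = item.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.find("span", class_="currency-value")
            price = None
            if price_tag:
                try:
                    price = float(price_tag.text.replace(",", "."))
                except ValueError:
                    # e.g. "1,234.50" cannot be parsed this way; leave the price empty
                    pass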

            # Rating
            rating_tag = item.find("span", class_="wt-screen-reader-only")
            rating = None
            if rating_tag:
                match_rating = re.search(r"([\d.]+) out of 5", rating_tag.text)
                if match_rating:
                    rating = float(match_rating.group(1))

            # Reviews
            reviews_tag = item.find("span", class_="wt-text-body-01")
            reviews = None
            if reviews_tag:
                match_reviews = re.search(r"\((\d+)\)", reviews_tag.text)
                if match_reviews:
                    reviews = int(match_reviews.group(1))

            # Delivery
            delivery = None
            delivery_tag = item.find(string=re.compile("delivery", re.I))
            if delivery_tag:
                txt = delivery_tag.lower()
                if "free" in txt:
                    delivery = 0
                else:
                    match_del = re.search(r"€\s?([\d.,]+)", delivery_tag)
                    if match_del:
                        delivery = float(match_del.group(1).replace(",", "."))

            data.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        page += 1
        time.sleep(2)

    driver.quit()

    df = pd.DataFrame(data)
    return df


# Save CSV
if __name__ == "__main__":
    df = scrape_products_selenium()
    df.to_csv("../data/interim/1_interim_price.csv", index=False)
    print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")


In [None]:
"""
Etsy Tote Bag Scraper (Selenium + BeautifulSoup) with:
- Pagination
- Proxy rotation
- Random user-agents
- Class-based design
- Adjustable product limit
Saves final cleaned dataframe to ../data/clean/clean_data.csv
"""

import random
import time
import re
import os
from dataclasses import dataclass, field
from typing import List, Optional

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException, TimeoutException


@dataclass
class EtsyToteScraper:
    user_agents: List[str] = field(default_factory=lambda: [
        # A short sample; replace/extend with more UAs for real rotations
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/16.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/117 Safari/537.36"
    ])
    proxies: List[str] = field(default_factory=list)  # e.g. ["http://ip:port", "http://user:pass@ip:port"]
    chromedriver_path: Optional[str] = None  # if None assumes chromedriver is on PATH
    headless: bool = True
    page_load_wait: float = 3.5  # seconds to wait after loading a page
    max_restarts_for_errors: int = 2

    def _make_driver(self, proxy: Optional[str], user_agent: str):
        """Create a Selenium Chrome WebDriver with given proxy & user agent."""
        options = Options()
        if self.headless:
            options.add_argument("--headless=new")  # use new headless mode
        options.add_argument("--disable-blink-features=AutomationControlled")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        options.add_argument("--window-size=1400,1000")
        options.add_argument(f"--user-agent={user_agent}")

        if proxy:
            # Set proxy; Chrome expects --proxy-server argument
            options.add_argument(f'--proxy-server={proxy}')

        # Optional: reduce webdriver fingerprint
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)

        if self.chromedriver_path:
            # Selenium 4 takes the driver path through a Service object;
            # the old executable_path keyword is no longer accepted by recent releases
            from selenium.webdriver.chrome.service import Service
            driver = webdriver.Chrome(service=Service(self.chromedriver_path), options=options)
        else:
            # With no explicit path, Selenium looks for a matching driver itself
            driver = webdriver.Chrome(options=options)
        return driver

    @staticmethod
    def _parse_price(price_text: str) -> Optional[float]:
        if not price_text:
            return None
        # Normalize and extract first price-looking token (handles "€12.50" and "12,50 €")
        price_text = price_text.strip()
        # Keep euro symbol and digits, commas, dots
        m = re.search(r"€\s*([\d\.,]+)|([\d\.,]+)\s*€", price_text)
        if m:
            num = m.group(1) or m.group(2)
        else:
            # fallback: find any number-like substring
            m2 = re.search(r"([\d]{1,3}(?:[.,]\d{1,3})+|\d+)", price_text)
            if not m2:
                return None
            num = m2.group(1)
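        # convert to float: decide which separator is the decimal mark
        last_comma, last_dot = num.rfind(","), num.rfind(".")
        if last_comma > last_dot and num.count(",") == 1:
            # comma is the decimal mark ("12,50", "1.234,50")
            num = num.replace(".", "").replace(",", ".")
        else:
            # dot is the decimal mark, or commas only group thousands ("1,234.50", "1,234,567")
            num = num.replace(",", "")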
        try:
            return float(num)
        except Exception:
            return None

    @staticmethod
    def _extract_rating(text: str) -> Optional[float]:
        if not text:
            return None
        m = re.search(r"([0-5](?:\.[0-9])?)\s*out of\s*5", text, re.I)
        if m:
            try:
                return float(m.group(1))
            except:
                return None
        # sometimes rating appears as "4.8" alone
        m2 = re.search(r"\b([0-5]\.\d)\b", text)
        if m2:
            try:
                return float(m2.group(1))
            except:
                return None
        return None

    @staticmethod
    def _extract_reviews(text: str) -> Optional[int]:
        if not text:
            return None
        # look for parentheses e.g. "(123)" or "123 reviews"
        m = re.search(r"\((\d{1,6})\)", text.replace("\xa0", " "))
        if m:
            return int(m.group(1))
        m2 = re.search(r"(\d{1,6})\s+review", text, re.I)
        if m2:
            return int(m2.group(1))
        return None

    @staticmethod
    def _clean_text(elem):
        return elem.get_text(" ", strip=True) if elem else ""

    def scrape(self, max_items: int = 10, max_pages: int = 20, start_page: int = 1) -> pd.DataFrame:
        """
        Scrape Etsy tote bag products.

        Parameters:
        - max_items: total number of product rows to collect (default 10)
        - max_pages: maximum pages to visit (safety cap)
        - start_page: which search page to start from (1-based)
        """
        data_rows = []
        page = start_page
        attempts = 0

        # We'll periodically rotate proxy & UA by restarting the driver
        while len(data_rows) < max_items and page < start_page + max_pages:
            # choose random UA & proxy
            ua = random.choice(self.user_agents)
            proxy = random.choice(self.proxies) if self.proxies else None

            restarts = 0
            while restarts <= self.max_restarts_for_errors:
                driver = None
                try:
                    driver = self._make_driver(proxy, ua)
                    search_url = f"https://www.etsy.com/search?q=tote+bag&page={page}"
                    print(f"[INFO] Loading page {page} (collected {len(data_rows)}/{max_items}), UA chosen, proxy={proxy is not None}")
                    driver.get(search_url)
                    time.sleep(self.page_load_wait + random.uniform(0.5, 2.0))  # allow JS to load

                    soup = BeautifulSoup(driver.page_source, "html.parser")

                    # Etsy product tiles: use `li` elements with data-search-result or a result class
                    product_items = soup.find_all("li", attrs={"data-search-result": True})
                    if not product_items:
                        # fallback heuristics (sometimes different structure)
                        product_items = soup.find_all("div", class_=re.compile(r"v2-listing-card|search-result|listing-link|wt-grid-item"), limit=60)

                    if not product_items:
                        print("[WARN] No product items found on the page. The markup might have changed.")
                        driver.quit()  # close the browser before leaving the retry loop
                        break

                    for item in product_items:
                        if len(data_rows) >= max_items:
                            break

                        # URL
                        link_tag = item.find("a", href=True)
                        if not link_tag:
                            continue
                        product_url = link_tag["href"].split("?")[0]  # remove query params

                        # Title
                        title = None
                        title_tag = item.find("h3")
                        if title_tag:
                            title = title_tag.get_text(" ", strip=True)
                        else:
                            # alternative
                            title_tag2 = item.find("h2") or item.find("p", class_=re.compile("title|text"))
                            title = title_tag2.get_text(" ", strip=True) if title_tag2 else ""

                        # Price - try several selectors
                        price = None
                        # Etsy often uses <span class="currency-value">12.00</span>
                        price_span = item.find("span", class_=re.compile(r"currency-value|listing-price"))
                        if price_span:
                            price = self._parse_price(price_span.get_text(" ", strip=True))
                        else:
                            # try to extract from any text snippet in this tile
                            combined_text = self._clean_text(item)
                            # find euro price in combined text
                            price = self._parse_price(combined_text)

                        # Rating - try screen-reader text or aria labels
                        rating = None
                        rating_span = item.find("span", class_=re.compile(r"screen-reader-only|text-body-01|sr-only"), string=re.compile(r"out of 5", re.I))
                        if rating_span:
                            rating = self._extract_rating(rating_span.get_text(" ", strip=True))
                        else:
                            # try aria-label on an element
                            rating_aria = item.find(attrs={"aria-label": re.compile(r"out of 5", re.I)})
                            if rating_aria:
                                rating = self._extract_rating(rating_aria["aria-label"])

                        # Reviews - look for parentheses or "reviews" nearby
                        reviews = None
                        # check for small count element
                        reviews_candidates = item.find_all(string=re.compile(r"\(\d+\)|\d+\s+review", re.I))
                        if reviews_candidates:
                            for cand in reviews_candidates:
                                r = self._extract_reviews(cand)
                                if r:
                                    reviews = r
                                    break
                        if reviews is None:
                            # fallback to searching whole tile text
                            reviews = self._extract_reviews(self._clean_text(item))

                        # Delivery - detect Free shipping or shipping cost
                        delivery = None
                        # Common pattern: "Free shipping", "Free standard shipping", or "Shipping: €3.00"
                        shipping_texts = item.find_all(string=re.compile(r"free shipping|shipping|delivery", re.I))
                        if shipping_texts:
                            for st in shipping_texts:
                                st_lower = st.strip().lower()
                                if "free" in st_lower:
                                    delivery = 0
                                    break
                                # try to parse euro amount
                                parsed = self._parse_price(st)
                                if parsed is not None:
                                    delivery = parsed
                                    break
                        # Delivery stays None when the tile shows no shipping information;
                        # visiting each product page could find it, but is skipped to save time

                        data_rows.append({
                            "URL": product_url,
                            "Title": title,
                            "Price": price,
                            "Rating": rating,
                            "Reviews": reviews,
                            "Delivery": delivery
                        })

                    # Page completed
                    driver.quit()
                    break  # break restart loop on success

                except (WebDriverException, TimeoutException) as e:
                    print(f"[ERROR] WebDriver error: {e}; restarting driver (attempt {restarts+1})")
                    if driver:
                        try:
                            driver.quit()
                        except Exception:
                            pass
                    restarts += 1
                    time.sleep(1 + random.random() * 2)
                except Exception as e:
                    print(f"[ERROR] Unexpected error parsing page {page}: {e}")
                    if driver:
                        try:
                            driver.quit()
                        except Exception:
                            pass
                    restarts += 1
                    time.sleep(1 + random.random() * 2)

            page += 1
            attempts += 1
            # polite pause between page loads and to reduce detection risk
            time.sleep(1.0 + random.uniform(0.8, 2.2))

        # Build DataFrame with exactly up to max_items rows (trim if needed)
        df = pd.DataFrame(data_rows)[:max_items]

        # Normalize columns: ensure numeric types where possible
        if not df.empty:
            df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
            df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
            df['Reviews'] = pd.to_numeric(df['Reviews'], errors='coerce').astype('Int64')
            # Delivery: treat None as NaN; where 0 -> free shipping
            df['Delivery'] = pd.to_numeric(df['Delivery'], errors='coerce')

        # Save CSV as requested
        out_path = os.path.join("..", "data", "clean", "clean_data.csv")
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        df.to_csv(out_path, index=False)
        print("STEP 1 : 'Price' CLEAN and CSV saved successfully!")

        return df


if __name__ == "__main__":
    # === Example usage ===
    # Provide your proxies and optionally a larger user-agent list
    proxies = [
        # "http://user:pass@12.34.56.78:1234",
        # "http://12.34.56.79:8080",
    ]

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/16.0 Safari/605.1.15",
        # add more UAs here...
    ]

    scraper = EtsyToteScraper(
        user_agents=user_agents,
        proxies=proxies,
        chromedriver_path=None,  # or set path like "/usr/local/bin/chromedriver"
        headless=True,
        page_load_wait=3.5
    )

    print("[START] Scraping up to 10 tote bag products (Selenium + rotating UA/proxy)...")
    df = scraper.scrape(max_items=10, max_pages=30, start_page=1)
    print(df)


### TEST

In [3]:
import undetected_chromedriver as uc
import time

print("Launching Chrome...")

# launch browser
driver = uc.Chrome()

driver.get("https://www.google.com")

print("Page title:", driver.title)

time.sleep(5)
driver.quit()

print("Done!")

Launching Chrome...
Page title: Google
Done!


### WEB SCRAPER INTERIM

In [4]:
import time
import pandas as pd
import undetected_chromedriver as uc
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By


def scrape_products(limit=10):
    """
    Scrape tote bag product data from Etsy using Selenium + BeautifulSoup.
    Includes pagination & anti-bot avoidance.
    Returns a pandas DataFrame.
    """

    # Launch undetected Chrome
    driver = uc.Chrome()
    driver.maximize_window()

    # Etsy tote bags search
    url = "https://www.etsy.com/search?q=tote+bag"
    driver.get(url)
    time.sleep(5)

    products = []

    while len(products) < limit:
        # Scroll to load products
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)

        soup = BeautifulSoup(driver.page_source, "html.parser")

        # All product cards
        items = soup.select("li.wt-list-unstyled")  # Etsy product item containers

        for item in items:
            if len(products) >= limit:
                break

            # URL
            url_tag = item.select_one("a.listing-link")
            if not url_tag:
                continue
            product_url = url_tag.get("href")

            # Title
            title_tag = item.select_one("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = item.select_one(".currency-value")
            price = price_tag.get_text(strip=True) if price_tag else None

            # Rating
            rating_tag = item.select_one(".wt-screen-reader-only")
            rating = None
            if rating_tag:
                # Example text: "5 out of 5 stars"
                text = rating_tag.get_text(strip=True)
                if "out of 5 stars" in text:
                    try:
                        rating = float(text.split(" out")[0])
                    except ValueError:
                        # The text before " out" was not a plain number
                        rating = None

            # Reviews count
            reviews_tag = item.select_one(".wt-text-caption")
            reviews = None
            if reviews_tag:
                text = reviews_tag.get_text(strip=True)
                # e.g. "(123)"
                if text.startswith("(") and text.endswith(")"):
                    try:
                        reviews = int(text.strip("()"))
                    except ValueError:
                        reviews = None

            # Delivery price (if available)
            delivery_tag = item.select_one(".wt-text-strikethrough, .wt-text-muted")
            delivery = None
            if delivery_tag:
                delivery_text = delivery_tag.get_text(strip=True)
                # Normalize delivery cost
                if "Free delivery" in delivery_text or "FREE" in delivery_text:
                    delivery = 0
                else:
                    delivery = delivery_text

            products.append({
                "URL": product_url,
                "Title": title,
                "Price": price,
                "Rating": rating,
                "Reviews": reviews,
                "Delivery": delivery
            })

        # Go to next page if needed
        if len(products) < limit:
            next_button = None
            try:
                next_button = driver.find_element(By.CSS_SELECTOR, "a[aria-label='Next page']")
            except Exception:
                # No next-page link found: treat this as the last page
                pass

            if next_button:
                driver.execute_script("arguments[0].click();", next_button)
                time.sleep(5)
            else:
                break

    driver.quit()
    return pd.DataFrame(products)


# -----------------------------------------------------
# EXECUTION
# -----------------------------------------------------
if __name__ == "__main__":
    df = scrape_products(limit=10)

    # SAVE CSV
    df.to_csv("../data/clean/clean_tote_bags.csv", index=False)
    print("STEP 10 : TOTE BAGS and CSV saved successfully!")
    print(df)


STEP 10 : TOTE BAGS and CSV saved successfully!
Empty DataFrame
Columns: []
Index: []


==================================================================================================================================
# <div align="center">DATA CLEANING & ANALYSIS</div>
==================================================================================================================================

#### 🗃️ **Raw data**

- The scraped data is stored in a DataFrame, saved as a CSV file, and uploaded to Google Drive
- The `df_url` has to be a direct-download link to that CSV file on Google Drive
- The CSV is then loaded for data cleaning and analysis
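
A Google Drive share link opens a preview page rather than the file itself, so `pd.read_csv` needs a direct-download form of the link. The sketch below rewrites a share link using the commonly used `uc?export=download&id=...` pattern; it assumes a publicly shared, reasonably small file, and both `drive_download_url` and the file ID shown are hypothetical.

```python
import re


def drive_download_url(share_url: str) -> str:
    """Convert a Google Drive share link into a direct-download link."""
    # Share links usually look like https://drive.google.com/file/d/<FILE_ID>/view?usp=sharing
    match = re.search(r"/d/([\w-]+)", share_url)
    if not match:
        raise ValueError("No file ID found in the share link")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


# FILE_ID_PLACEHOLDER is a made-up ID, not a real file
share_link = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
df_url = drive_download_url(share_link)
print(df_url)
```

The resulting `df_url` can then be passed to `pd.read_csv`.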

In [None]:
import pandas as pd

# Load the CSV
df_url = 'link to the dataFrame collected from scraping as a downloadable link from google drive'
df = pd.read_csv(df_url)
df.head(3)

----

#### 🗃️ **Interim data**

In [None]:
# Save 'Price' INTERIM to CSV
df.to_csv("../data/interim/1_interim_price.csv", index=False)
print("STEP 1 : 'Price' INTERIM and CSV saved successfully!")

----

#### 🗃️ **Clean data**

In [None]:
# Save 'Price' CLEAN to CSV
df.to_csv("../data/clean/1_clean_price.csv", index=False)
print("STEP 1 : 'Price' CLEAN and CSV saved successfully!")

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### 🌐 **Which Are the Best-Selling POD Products on Etsy?**

I’m researching print-on-demand products to sell on Etsy that only require **digital artwork and marketing**, while the POD provider handles **printing, packaging, and shipping**.


### ⭐ Using Google Trends for POD Product Research
💡 **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020–2025).

Below is the list of product categories I’m comparing:

1. ```Custom Apparel```
    - T-shirts  
    - Hoodies  
    - Sweatshirts  
    - Tank tops 

2. ```Mug```
    - Ceramic mugs  
    - Color-changing mugs  
    - Espresso mugs  
    - Travel mugs 

3. ```Tote Bag```
    - Cotton totes  
    - All-over print totes  

4. ```Phone Case```
    - iPhone / Samsung cases  
    - Tough / Slim cases  

5. ```Stickers```
    - Die-cut stickers  
    - Kiss-cut stickers  
    - Sticker sheets 

6. ```Hats```
    - Baseball caps  
    - Trucker hats  
    - Beanies  

7. ```Pillows / Cushions```
    - Pillow covers  
    - Stuffed pillows  
    - All-over print pillow designs  

8. ```Blanket```
    - Fleece blankets  
    - Sherpa blankets  
    - Woven blankets  

9. ```Wall Art```
    - Posters  
    - Canvas prints  
    - Framed posters  
    - Metal prints  

10. ```Doormat```
    - Printed coir doormats  
    - Rubber-backed doormats 

11. ```Drinkware```
    - Stainless steel tumblers  
    - Water bottles  
    - Wine tumblers 

12. ```Calendar```
    - Custom printed wall calendars  

13. ```Yoga Mat```
    - Printed yoga mats 

14. ```Bedding```
    - Duvet covers  
    - Pillowcases  
    - All-over print bed sets

15. ```Pet Accessories```
    - Pet bandanas  
    - Pet beds  
    - Pet bowls  
    - Pet blankets  

16. ```Ornaments```
    - Ceramic ornaments
    - Wood ornaments
    - Metal ornaments 
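
Once the Google Trends interest values for the categories above are exported, the comparison reduces to averaging each series and sorting. The sketch below assumes the values are already loaded into a dict of lists on Google Trends' 0 to 100 scale; the category names and numbers are made-up placeholders, not real Trends data, and `rank_by_mean_interest` is a hypothetical helper.

```python
from statistics import mean


def rank_by_mean_interest(interest: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Sort categories by average search interest, highest first."""
    # Skip categories with no data points so mean() never sees an empty list
    averages = {name: mean(values) for name, values in interest.items() if values}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only (NOT real Google Trends data)
sample_interest = {
    "Category A": [10, 20, 15],
    "Category B": [40, 50, 60],
    "Category C": [30, 30, 30],
}

for name, avg in rank_by_mean_interest(sample_interest):
    print(f"{name}: {avg:.1f}")
```

Google Trends scores are relative within a single comparison, and the web interface compares up to five terms at a time, so series are only directly comparable when they come from the same comparison.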



------
### 🎯 Chosen POD product to research: tote bags

aria-label="4.9 star rating with 398 reviews"
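
An `aria-label` of this shape carries both the rating and the review count, so one regular expression can pull out both. This is a minimal sketch that assumes the wording stays "N star rating with M reviews"; `parse_rating_label` is a hypothetical helper, not part of the scrapers above.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "4.9 star rating with 398 reviews" (wording assumed from the sample label)
ARIA_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+star rating with\s+([\d,]+)\s+reviews?", re.I)


def parse_rating_label(label: str) -> Tuple[Optional[float], Optional[int]]:
    """Extract (rating, review_count) from an aria-label string."""
    match = ARIA_PATTERN.search(label)
    if not match:
        return None, None
    rating = float(match.group(1))
    reviews = int(match.group(2).replace(",", ""))  # tolerate "1,234"
    return rating, reviews


print(parse_rating_label("4.9 star rating with 398 reviews"))  # (4.9, 398)
```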

Etsy store selling print-on-demand products

Data needed:
- product title keywords that help optimize sales (from the title)
- product description keywords
- insight into the niches, based on the best-selling keywords
- the best period to sell (from the reviews)
- price: the best-selling price point and range
- target audience?
- how to market it?
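
For the title keywords, a simple first pass is to count how often each word appears across the scraped titles. The sketch below uses only the standard library; the stop-word list and the sample titles are made up for illustration, and `top_title_keywords` is a hypothetical helper.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; extend as needed)
STOP_WORDS = {"a", "an", "and", "for", "the", "with", "of", "in", "to"}


def top_title_keywords(titles: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Count how often each word appears across listing titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        # Ignore stop words and very short tokens
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)


# Made-up titles for illustration (not scraped data)
sample_titles = [
    "Floral Canvas Tote Bag",
    "Personalized Tote Bag for Teachers",
    "Cat Lover Canvas Tote",
]
print(top_title_keywords(sample_titles, n=3))
```

Words shared by every listing (here "tote" and "bag") say little about the niche, so in practice they would likely be added to the stop words as well.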

Chosen website for data scraping: Etsy

Data to extract:

- product_title: the keywords used in it, to analyze the niche of this POD product

- product_price: to work out the best price to sell at

- product_listing_date: the date the product was created and added to Etsy

- product_rating: to see which niche of this POD product sells the most
- product_niche_rating

- product_reviews_date: to compare the number of reviews with the number of orders, and to plot the product's rating over time. Review dates show when sales happened the most and whether they are recent: two products can have the same number of orders but reach it over different lengths of time.
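
The point about timing can be made concrete by counting reviews per month. This sketch assumes review dates arrive as ISO `YYYY-MM-DD` strings, which may not match how Etsy displays them (pass a different `fmt` if so); the dates are made up, and `reviews_per_month` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime


def reviews_per_month(review_dates: list[str], fmt: str = "%Y-%m-%d") -> dict[str, int]:
    """Count reviews per calendar month, returned in chronological order."""
    months = Counter(datetime.strptime(d, fmt).strftime("%Y-%m") for d in review_dates)
    return dict(sorted(months.items()))


# Made-up dates for two hypothetical products with the same number of reviews
steady_seller = ["2024-01-10", "2024-06-02", "2024-11-20", "2025-03-15"]
recent_seller = ["2025-02-01", "2025-02-14", "2025-03-03", "2025-03-28"]

print(reviews_per_month(steady_seller))
print(reviews_per_month(recent_seller))
```

Both products have four reviews, but the second collected them within two months, which is the difference the review dates are meant to expose.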

In [1]:
# product_category : t-shirt, mug, calendar,...
# product_niche : comedy, drama, horror, halloween, cartoon, anime, ... 
# product_price :  in euros
# product_listing_date: 00/00/0000 date created and added to etsy on product page
# product_rating: 0.0/5 current rating of the product to compare
# product_reviews_ratings: DataFrame with reviews ratings of each product from product page
# product_reviews_dates: DataFrame with reviews dates of each product from product page
# product_reviews_descriptions: DataFrame with reviews descriptions of each product from product page

==================================================================================================================================
# <div align="center">PLOTS</div>
==================================================================================================================================

### 📊 PLOT 01:

In [2]:
# PLOT 1

### 📊 PLOT 02:

In [3]:
# PLOT 2

### 📊 PLOT 03:

In [4]:
# PLOT 3

### 📊 PLOT 04:

In [1]:
# PLOT 4

### 📊 PLOT 05:

In [6]:
# PLOT 5

==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### 🧠 INSIGHT 01:
Text

----

### 🧠 INSIGHT 02:
Text

----

### 🧠 INSIGHT 03:
Text


==================================================================================================================================