==================================================================================================================================
# <div align="center">Project 03: Etsy POD Trends Web Scraping & Analysis</div>
==================================================================================================================================

### BUSINESS IDEA: ```POD BUSINESS``` 

### PROBLEM:

### SOLUTION:

BUSINESS IDEA -> PROBLEM -> RESEARCH + CHART/PLOTS -> INSIGHTS -> INTERPRETATIONS -> IMPLICATIONS -> BUSINESS IMPACT |-> LIMITATION

==================================================================================================================================
# <div align="center">CODE</div>
==================================================================================================================================

Etsy is a dynamic website: some of its content is loaded with JavaScript, so scraping it requires careful handling.

``requests`` + ``BeautifulSoup`` may be enough for the static parts (such as search results), but for dynamically loaded content ``Selenium`` is more reliable.

I will use ``requests`` + ``BeautifulSoup`` for **product listings** (title, price, link).

==================================================================================================================================

#### **INSTALL LIBRARIES**

In [None]:
# install requests (with SOCKS support, needed for the socks5:// Tor proxy), beautifulsoup and helpers
!pip install "requests[socks]" beautifulsoup4 fake-useragent stem pandas

# install selenium
!pip install selenium

In [None]:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import pandas as pd
import time
import random
import stem.process
from stem import Signal
from stem.control import Controller

# 1. CONFIGURATION --------------------------------------------------------------------------------------

SEARCH_QUERY = "handmade bag"
BASE_URL = f"https://www.etsy.com/search?q={SEARCH_QUERY.replace(' ', '+')}"

USE_TOR = True              # Turn TOR IP rotation on/off
MAX_RETRIES = 5
MIN_DELAY = 4               # Minimum delay between requests
MAX_DELAY = 12              # Maximum delay
PROXY_LIST = [
    # "http://user:pass@ip:port",
    # "http://ip:port",
]

ua = UserAgent()


# 2. TOR SETUP (optional but recommended) --------------------------------------------------------------
def start_tor():
    print("Launching TOR...")
    return stem.process.launch_tor_with_config(
        config={
            "SocksPort": "9050",
            "ControlPort": "9051",
            "CookieAuthentication": "1",
        },
        take_ownership=True,
    )

def new_tor_identity():
    try:
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)
        print("üîÑ TOR: New identity requested.")
    except Exception as e:
        print("TOR identity change failed:", e)

if USE_TOR:
    tor_process = start_tor()
    proxies = {
        "http": "socks5://127.0.0.1:9050",
        "https": "socks5://127.0.0.1:9050",
    }
else:
    proxies = None  # Will use rotating HTTP proxies later

# 3. RANDOMIZED HEADERS ---------------------------------------------------------------------------
def random_headers():
    return {
        "User-Agent": ua.random,
        "Accept-Language": random.choice([
            "en-US,en;q=0.8",
            "en-GB,en;q=0.7",
            "fr-FR,fr;q=0.9",
            "de-DE,de;q=0.8"
        ]),
        "Referer": random.choice([
            "https://www.google.com",
            "https://www.bing.com",
            "https://duckduckgo.com",
        ]),
        "DNT": str(random.choice([0, 1])),
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Mode": random.choice(["navigate", "same-origin"]),
        "Sec-Fetch-Dest": "document",
    }

# 4. SAFE REQUEST FUNCTION ------------------------------------------------------------------------------

def safe_get(url):
    for attempt in range(1, MAX_RETRIES + 1):
        
        # Random proxy from list (if not using TOR); use the same proxy for both schemes
        if PROXY_LIST and not USE_TOR:
            chosen = random.choice(PROXY_LIST)
            proxy = {"http": chosen, "https": chosen}
        else:
            proxy = proxies

        try:
            headers = random_headers()
            print(f"[Attempt {attempt}] Fetching: {url}")
            print("Headers:", headers["User-Agent"])

            response = requests.get(
                url,
                headers=headers,
                proxies=proxy,
                timeout=15
            )

            if response.status_code == 200:
                print("✔ Success")
                return response

            print(f"⚠ Status {response.status_code}")

        except requests.RequestException as e:
            # Network-level failures (timeouts, refused connections, proxy errors)
            print("❌ Error:", e)

        # Request new TOR identity after failures
        if USE_TOR:
            new_tor_identity()

        # Backoff delay: the random window grows with the attempt number
        delay = random.uniform(MIN_DELAY * attempt, MAX_DELAY * attempt)
        print(f"⏳ Waiting {delay:.2f}s before retry...\n")
        time.sleep(delay)

    raise RuntimeError("Failed after max retries.")


# 5. SCRAPING -------------------------------------------------------------------------------------

response = safe_get(BASE_URL)
soup = BeautifulSoup(response.text, "html.parser")

titles, prices, links = [], [], []

products = soup.find_all("li", class_="wt-list-unstyled")

for p in products:
    t = p.find("h3")
    c = p.find("span", class_="currency-value")
    a = p.find("a", href=True)

    if t and c and a:
        titles.append(t.get_text(strip=True))
        prices.append(c.get_text(strip=True))
        links.append(a["href"])


# 6. SAVE CSV & DISPLAY -------------------------------------------------------------------

df = pd.DataFrame({"Title": titles, "Price": prices, "Link": links})
print(df.head())

df.to_csv("etsy_scrape_protected.csv", index=False)
print("\nSaved as etsy_scrape_protected.csv")

Launching TOR...


OSError: 'tor' isn't available on your system. Maybe it's not in your PATH?
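
The `OSError` above means that `stem` could not find a `tor` executable on the `PATH`. A minimal sketch of a pre-flight check, using only the standard library; `tor_available` is a hypothetical helper name, not part of `stem`:

```python
import shutil


def tor_available(binary: str = "tor") -> bool:
    """Return True if an executable named `binary` can be found on PATH."""
    return shutil.which(binary) is not None


if not tor_available():
    # Assumption: falling back to USE_TOR = False is acceptable for a first test run.
    print("Tor executable not found on PATH: install Tor or set USE_TOR = False.")
```

Running this before `start_tor()` would turn the crash into a readable message.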

In [11]:
# Install required packages
!pip install requests beautifulsoup4 fake-useragent pandas

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import pandas as pd
import random
import time

# -------------------------
# Random headers function
# -------------------------
ua = UserAgent()
def random_headers():
    return {
        "User-Agent": ua.random,
        "Accept-Language": random.choice(["en-US,en;q=0.8","fr-FR,fr;q=0.9"]),
        "Referer": "https://www.google.com"
    }

# -------------------------
# Tor proxy (must run Tor manually first)
# -------------------------
proxies = {
    "http": "socks5://127.0.0.1:9050",
    "https": "socks5://127.0.0.1:9050",
}

# -------------------------
# Etsy search URL
# -------------------------
search_query = "t-shirt"
url = f"https://www.etsy.com/search?q={search_query.replace(' ', '+')}"

# Random delay
time.sleep(random.uniform(3, 8))

# -------------------------
# Send request
# -------------------------
response = requests.get(url, headers=random_headers(), proxies=proxies, timeout=15)

titles = []
prices = []
ratings = []

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")

    # Each product is in a <li> tag with specific class
    products = soup.find_all("li", class_="wt-list-unstyled")  # adjust as needed

    for product in products:
        # Title
        title_tag = product.find("h3")
        title = title_tag.get_text(strip=True) if title_tag else None

        # Price
        price_tag = product.find("span", class_="currency-value")
        price = price_tag.get_text(strip=True) if price_tag else None

        # Rating (optional)
        rating_tag = product.find("span", class_="screen-reader-only")
        rating = rating_tag.get_text(strip=True) if rating_tag else None

        if title:
            titles.append(title)
            prices.append(price)
            ratings.append(rating)

    # -------------------------
    # Create DataFrame
    # -------------------------
    df = pd.DataFrame({
        "Title": titles,
        "Price": prices,
        "Rating": ratings
    })

    print(df.head())

else:
    print("Failed to fetch page. Status code:", response.status_code)




ConnectionError: SOCKSHTTPSConnectionPool(host='www.etsy.com', port=443): Max retries exceeded with url: /search?q=t-shirt (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x000002D8D81CF890>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
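
`WinError 10061` is a refused connection, which most likely means nothing was listening on the SOCKS port `127.0.0.1:9050` (Tor was not running). A minimal sketch of a check to run before sending requests through the proxy; `port_is_open` is a hypothetical helper and the default port is taken from the proxy settings above:

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 9050, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts
        return False


if not port_is_open():
    print("Nothing is listening on 127.0.0.1:9050: start Tor before using the SOCKS proxy.")
```

This only confirms that a process accepts connections on the port; it does not verify that the process is actually Tor.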

In [12]:
# Install required packages if needed
!pip install requests beautifulsoup4 fake-useragent pandas

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import pandas as pd
import random
import time

# -------------------------
# Random headers function
# -------------------------
ua = UserAgent()
def random_headers():
    return {
        "User-Agent": ua.random,
        "Accept-Language": random.choice(["en-US,en;q=0.8","fr-FR,fr;q=0.9"]),
        "Referer": "https://www.google.com"
    }

# -------------------------
# Tor proxy (Tor must be running manually)
# -------------------------
proxies = {
    "http": "socks5://127.0.0.1:9050",  # check your Tor log if different
    "https": "socks5://127.0.0.1:9050",
}

# -------------------------
# Search query
# -------------------------
search_query = "t-shirt"
base_url = "https://www.etsy.com/search"

# Data storage
titles = []
prices = []
ratings = []

# -------------------------
# Scrape multiple pages
# -------------------------
for page in range(1, 4):  # first 3 pages, change as needed
    print(f"Scraping page {page}...")
    params = {
        "q": search_query,
        "ref": "search_bar",
        "page": page
    }

    # Random delay to reduce blocking
    time.sleep(random.uniform(3, 7))

    try:
        response = requests.get(base_url, headers=random_headers(), proxies=proxies, params=params, timeout=15)
        if response.status_code != 200:
            print(f"Failed to fetch page {page}. Status code: {response.status_code}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")

        # Each product
        products = soup.find_all("li", class_="wt-list-unstyled")  # Etsy product container

        for product in products:
            # Title
            title_tag = product.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            # Price
            price_tag = product.find("span", class_="currency-value")
            price = price_tag.get_text(strip=True) if price_tag else None

            # Rating (optional)
            rating_tag = product.find("span", class_="screen-reader-only")
            rating = rating_tag.get_text(strip=True) if rating_tag else None

            if title:
                titles.append(title)
                prices.append(price)
                ratings.append(rating)

    except Exception as e:
        print(f"Error on page {page}: {e}")
        continue

# -------------------------
# Create DataFrame
# -------------------------
df = pd.DataFrame({
    "Title": titles,
    "Price": prices,
    "Rating": ratings
})

print(df.head())
# Save to CSV
df.to_csv("etsy_tshirt_products.csv", index=False)
print("Saved results to etsy_tshirt_products.csv")


Scraping page 1...
Failed to fetch page 1. Status code: 403
Scraping page 2...
Failed to fetch page 2. Status code: 403
Scraping page 3...
Failed to fetch page 3. Status code: 403
Empty DataFrame
Columns: [Title, Price, Rating]
Index: []
Saved results to etsy_tshirt_products.csv


In [None]:
import requests
import random
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()

def random_headers():
    return {
        "User-Agent": ua.random,
        "Accept-Language": random.choice(["en-US,en;q=0.8", "fr-FR,fr;q=0.9"]),
        "Referer": "https://www.google.com"
    }

proxies = {
    "http": "socks5://127.0.0.1:9050",
    "https": "socks5://127.0.0.1:9050",
}

url = "https://www.etsy.com/search?q=handmade+bag"

# Random delay
import time
time.sleep(random.uniform(3, 8))

response = requests.get(url, headers=random_headers(), proxies=proxies, timeout=15)

soup = BeautifulSoup(response.text, "html.parser")

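# A blocked or empty response may have no <title>, so guard the lookup
print(soup.title.text if soup.title else "No <title> in response")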


In [16]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import time
import random

# -------------------------
# Tor proxy settings
# -------------------------
tor_socks_port = 9050  # check your Tor log

chrome_options = Options()
chrome_options.add_argument("--headless")  # run without opening browser
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--proxy-server=socks5://127.0.0.1:{}".format(tor_socks_port))

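# Path to ChromeDriver: replace the placeholder below with the real location of chromedriver.exe.
# With Selenium 4.6+ you can instead drop `service=` and let Selenium Manager locate the driver.
service = Service(r"C:\Path\To\chromedriver.exe")  # change to your path
driver = webdriver.Chrome(service=service, options=chrome_options)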

# -------------------------
# Etsy search
# -------------------------
search_query = "t-shirt"
url = f"https://www.etsy.com/search?q={search_query.replace(' ', '+')}"

driver.get(url)
time.sleep(random.uniform(3,6))  # random delay

# -------------------------
# Scrape product info
# -------------------------
titles = []
prices = []
ratings = []

# Each search result is an <li> element with a data-search-result attribute
products = driver.find_elements(By.CSS_SELECTOR, "li[data-search-result]")

for product in products:
    # A missing element raises an exception; record None instead of stopping
    try:
        title = product.find_element(By.TAG_NAME, "h3").text
    except Exception:
        title = None
    try:
        price = product.find_element(By.CSS_SELECTOR, "span.currency-value").text
    except Exception:
        price = None
    try:
        rating = product.find_element(By.CSS_SELECTOR, "span.screen-reader-only").text
    except Exception:
        rating = None

    if title:
        titles.append(title)
        prices.append(price)
        ratings.append(rating)

# -------------------------
# Save to DataFrame
# -------------------------
df = pd.DataFrame({
    "Title": titles,
    "Price": prices,
    "Rating": ratings
})

print(df.head())

# Optional: save to CSV
df.to_csv("etsy_tshirt_selenium.csv", index=False)

driver.quit()


NoSuchDriverException: Message: Unable to obtain driver for chrome; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location


==================================================================================================================================
# <div align="center">INSIGHTS</div>
==================================================================================================================================

### INSIGHT 01:

==================================================================================================================================
# <div align="center">RESEARCH</div>
==================================================================================================================================

### 🌐 **Which Are the Best-Selling POD Products on Etsy?**

I'm researching print-on-demand products to sell on Etsy that only require **digital artwork and marketing**, while the POD provider handles **printing, packaging, and shipping**.


### ⭐ Using Google Trends for POD Product Research
💡 **Goal:** Identify which POD product category has been searched the most on Google over the past 5 years (2020–2025).

Below is the list of product categories I'm comparing:

1. ```Custom Apparel```
    - T-shirts  
    - Hoodies  
    - Sweatshirts  
    - Tank tops 

2. ```Mug```
    - Ceramic mugs  
    - Color-changing mugs  
    - Espresso mugs  
    - Travel mugs 

3. ```Tote Bag```
    - Cotton totes  
    - All-over print totes  

4. ```Phone Case```
    - iPhone / Samsung cases  
    - Tough / Slim cases  

5. ```Stickers```
    - Die-cut stickers  
    - Kiss-cut stickers  
    - Sticker sheets 

6. ```Hats```
    - Baseball caps  
    - Trucker hats  
    - Beanies  

7. ```Pillows / Cushions```
    - Pillow covers  
    - Stuffed pillows  
    - All-over print pillow designs  

8. ```Blanket```
    - Fleece blankets  
    - Sherpa blankets  
    - Woven blankets  

9. ```Wall Art```
    - Posters  
    - Canvas prints  
    - Framed posters  
    - Metal prints  

10. ```Doormat```
    - Printed coir doormats  
    - Rubber-backed doormats 

11. ```Drinkware```
    - Stainless steel tumblers  
    - Water bottles  
    - Wine tumblers 

12. ```Calendar```
    - Custom printed wall calendars  

13. ```Yoga Mat```
    - Printed yoga mats 

14. ```Bedding```
    - Duvet covers  
    - Pillowcases  
    - All-over print bed sets

15. ```Pet Accessories```
    - Pet bandanas  
    - Pet beds  
    - Pet bowls  
    - Pet blankets  

16. ```Ornaments```
    - Ceramic ornaments
    - Wood ornaments
    - Metal ornaments 

------
### The POD product I chose to research is: custom calendars

Example of the rating markup noted on a listing: ``aria-label="4.9 star rating with 398 reviews"``
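
The rating and the review count can both be pulled out of a label like this one. A minimal sketch, assuming every label follows the same wording as the single example noted here; `parse_rating_label` and `RATING_PATTERN` are hypothetical names:

```python
import re

# Assumption: labels read "<rating> star rating with <count> review(s)"
RATING_PATTERN = re.compile(
    r"(?P<rating>\d+(?:\.\d+)?) star rating with (?P<reviews>[\d,]+) reviews?"
)


def parse_rating_label(label: str):
    """Return (rating, review_count), or None if the label does not match."""
    match = RATING_PATTERN.search(label)
    if match is None:
        return None
    rating = float(match.group("rating"))
    reviews = int(match.group("reviews").replace(",", ""))  # drop thousands separators
    return rating, reviews


print(parse_rating_label("4.9 star rating with 398 reviews"))  # (4.9, 398)
```

Returning `None` for non-matching labels keeps products without ratings in the dataset instead of raising an error.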

Context: an Etsy store selling print-on-demand products.

Data needed:
- product title keywords that help optimize sales (from listing titles)
- product description keywords
- the best-selling niches, inferred from the most frequent keywords
- the best period to sell (from review dates)
- price: the most common price point and price range
- target audience?
- how to market it?
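
The keyword items in the list above could start from a simple word count over scraped titles. A minimal sketch, assuming titles are plain strings; the sample titles and the stopword list are invented for illustration, and `keyword_counts` is a hypothetical helper:

```python
import re
from collections import Counter

# Assumption: a tiny hand-written stopword list is enough for a first pass
STOPWORDS = {"a", "an", "and", "for", "the", "of", "with", "in", "to"}


def keyword_counts(titles):
    """Count how often each non-stopword appears across all titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9]+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


# Invented sample titles, not scraped data
sample_titles = [
    "Custom Photo Calendar 2025",
    "Personalized Family Calendar with Photos",
    "Custom Wall Calendar for Family",
]
print(keyword_counts(sample_titles).most_common(3))
```

The same function could be applied to the `Title` column of the scraped DataFrame once the scraper returns data.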

Chosen website for data scraping: Etsy

Data to extract:

- product_title: the keywords in the title, used to analyse the niche of this POD product

- product_price: to work out the best price to sell at

- product_listing_date: the date the product was created and added to Etsy

- product_rating: to see which niche of this POD product sells the most
- product_niche_rating

- product_reviews_date: to compare the number of reviews with the number of orders, and to plot the product's rating over time. It shows when sales happened most and whether they were recent: two products can have the same number of orders accumulated over very different lengths of time.
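
The point about review dates can be illustrated with two products that have the same number of reviews spread over different periods. A minimal sketch with invented dates; `reviews_per_month` and `span_days` are hypothetical helpers:

```python
from collections import Counter
from datetime import date


def reviews_per_month(review_dates):
    """Count reviews per (year, month), sorted chronologically."""
    counts = Counter((d.year, d.month) for d in review_dates)
    return dict(sorted(counts.items()))


def span_days(review_dates):
    """Number of days between the first and the last review."""
    return (max(review_dates) - min(review_dates)).days


# Invented data: both products have 4 reviews
product_a = [date(2024, 11, 3), date(2024, 11, 20), date(2024, 12, 1), date(2024, 12, 15)]
product_b = [date(2022, 1, 5), date(2022, 12, 9), date(2023, 12, 2), date(2024, 12, 11)]

for name, dates in [("A", product_a), ("B", product_b)]:
    print(name, len(dates), "reviews over", span_days(dates), "days", reviews_per_month(dates))
```

Product A collects its reviews within a few weeks while product B needs several years, so review counts alone would hide the difference in recent demand.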

In [1]:
# product_category : t-shirt, mug, calendar,...
# product_niche : comedy, drama, horror, halloween, cartoon, anime, ... 
# product_price :  in euros
# product_listing_date: 00/00/0000 date created and added to etsy on product page
# product_rating: 0.0/5 current rating of the product to compare
# product_reviews_ratings: DataFrame with reviews ratings of each product from product page
# product_reviews_dates: DataFrame with reviews dates of each product from product page
# product_reviews_descriptions: DataFrame with reviews descriptions of each product from product page
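
The fields listed above could be held in one record per product. A minimal sketch, assuming plain Python lists in place of the DataFrames mentioned in the comments; `ProductRecord` is a hypothetical name, and `product_reviews_descriptions` is an assumed name for the review-text field:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProductRecord:
    product_category: str                   # t-shirt, mug, calendar, ...
    product_niche: str                      # comedy, horror, anime, ...
    product_price: float                    # in euros
    product_listing_date: Optional[date]    # None if not shown on the product page
    product_rating: Optional[float]         # 0.0 to 5.0, None if unrated
    # Assumption: lists stand in for the per-product DataFrames
    product_reviews_ratings: List[int] = field(default_factory=list)
    product_reviews_dates: List[date] = field(default_factory=list)
    product_reviews_descriptions: List[str] = field(default_factory=list)


# Invented example values
record = ProductRecord("calendar", "family", 19.99, None, 4.9)
print(record)
```

A list of such records can be converted to a DataFrame later with `pd.DataFrame([vars(r) for r in records])`.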

==================================================================================================================================
# <div align="center">DATA VISUALIZATION (CHARTS/PLOTS)</div>
==================================================================================================================================

In [2]:
# PLOT 1

In [3]:
# PLOT 2

In [4]:
# PLOT 3

In [5]:
# PLOT 4

In [6]:
# PLOT 5