# **Data Mining Introduction**

### **Team: Iker Arza & Sofia Fedane**

### **Context**

This project builds a real-world dataset of bakeries in Ireland using web scraping techniques taught in class. The objective is to collect customer-facing business information from a live online platform, transform it into a structured dataset, and prepare it for further analysis and predictive modelling.

### **Dataset Source**

The dataset for this project was created exclusively from Yelp.ie, a public review platform that provides rich information on local businesses, including:

* Business name
* Star rating
* Number of reviews
* Price range (€ / €€ / €€€)
* Categories (e.g., Bakery, Café, Coffee Shop)
* Location text
* Short customer review snippet

Using Selenium and BeautifulSoup, multiple pages of Yelp search results were scraped across several Irish regions.

### **Business Motivation**

The bakery sector in Ireland spans small artisan bakeries, modern café–bakery hybrids, and larger commercial chains. Understanding what makes some bakeries more successful than others, such as:

* higher ratings,
* more reviews,
* premium or budget pricing,
* category specialisation,
* regional differences

can generate insights valuable to:

* **bakery owners** (competitive benchmarking),
* **entrepreneurs** (market opportunities),
* **marketing teams** (targeting customer preferences),
* **industry analysts** (regional demand trends).

A high-quality dataset of bakery ratings and attributes enables meaningful exploratory analysis and supports data-driven decision-making in the bakery and hospitality industry.

In [None]:
import time
import os
import pandas as pd
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

os.makedirs("../data", exist_ok=True)

driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(1)


In [None]:
YELP_LIMIT = 2500

YELP_URLS = {
    "Dublin":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin",
    "Cork":      "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Cork",
    "Galway":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Galway",
    "Limerick":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Limerick",
    "Waterford": "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Waterford",
    "Kerry":     "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Kerry",
    "Louth":     "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Louth",
    "Kilkenny": "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Kilkenny",
    "Wexford":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Wexford",
    "Donegal":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Donegal",
    "Belfast":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Belfast",
    "Derry":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Derry",

}

# Data Source
**Data Source: Yelp**

We scrape bakery listings from Yelp.ie across multiple Irish regions, including Dublin, Cork, Galway, Limerick, Waterford, Kerry, and Louth. Yelp provides rich customer-oriented information that complements GoldenPages.

The available fields include:
- Business name
- Star rating
- Number of reviews
- Price range (€, €€, €€€)
- Location / area tags
- Business categories (e.g., “Bakery”, “Café”, “Patisserie”)
- Short review snippet visible in search results


In this part of the project, we:
- Loop through each selected Irish region.
- Load the search results page for bakeries.
- Scroll to dynamically load all visible listings.
- Parse each listing card using BeautifulSoup.
- Extract key customer-focused features such as rating, review count, price range, categories, and review snippet.
- Click the “Next” button until no further pages are available or until the combined total dataset reaches 1,500 rows.

These rows form the second half of the dataset and complement the GoldenPages business information.

## Yelp Scraping Function

In [None]:
def scrape_yelp(max_rows=1200, max_pages_per_region=40):
    rows = []
    
    for region_label, base_url in YELP_URLS.items():
        print(f"\nYelp region: {region_label}")

        for page_number in range(max_pages_per_region):

            if len(rows) >= max_rows:
                print("Reached Yelp row limit:", len(rows))
                return rows

            # Build page URL
            page_url = base_url if page_number == 0 else base_url + f"&start={page_number * 10}"
            print(f"  Page {page_number + 1}: {page_url}")

            driver.get(page_url)
            time.sleep(2.5)

            # Scroll load
            try:
                body = driver.find_element(By.TAG_NAME, "body")
                for _ in range(3):
                    body.send_keys(Keys.END)
                    time.sleep(1)
            except:
                pass

            soup = BeautifulSoup(driver.page_source, "html.parser")
            cards = soup.find_all("div", attrs={"data-testid": "serp-ia-card"})

            if not cards:
                print("  No more results for this region.")
                break

            for card in cards:
                if len(rows) >= max_rows:
                    break

                name_tag    = card.find("a", class_="y-css-1x1e1r2")
                rating_tag  = card.find("span", class_="y-css-f73en8")
                review_tag  = card.find("span", class_="y-css-1vi7y4e")
                loc_tag     = card.find("span", class_="y-css-wpsy4m")
                price_tag   = card.find("span", class_="y-css-1y784sg")
                snippet_tag = card.find("p", class_="y-css-oyr8zn")

                categories = ", ".join([c.get_text(strip=True) for c in card.find_all("p")])

                rows.append({
                    "source": "Yelp",
                    "region": region_label,
                    "name": name_tag.get_text(strip=True) if name_tag else None,
                    "rating_raw": rating_tag.get_text(strip=True) if rating_tag else None,
                    "review_count_raw": review_tag.get_text(strip=True) if review_tag else None,
                    "location": loc_tag.get_text(strip=True) if loc_tag else None,
                    "price_range": price_tag.get_text(strip=True) if price_tag else None,
                    "categories": categories,
                    "snippet": snippet_tag.get_text(" ", strip=True)[:200] if snippet_tag else None
                })

            print("    Total Yelp collected:", len(rows))

    return rows


In [None]:
print("\n--- STARTING YELP SCRAPING ---")
yelp_rows = scrape_yelp(max_rows=YELP_LIMIT)
print("Final Yelp count:", len(yelp_rows))

# --------- Convert to DataFrame ---------
df = pd.DataFrame(yelp_rows)

# --------- Deduplicate based on name + region + location ---------
df = df.drop_duplicates(subset=["name", "region", "location"], keep="first")

# --------- Save dataset ---------
df.to_csv("../data/dataProject.csv", index=False)
print("\nDataset saved to ../data/dataProject.csv")

# --------- Show CSV size ---------
size_mb = os.path.getsize("../data/dataProject.csv")/(1024*1024)
print(f"CSV size: {size_mb:.2f} MB")

---
# **Data Mining Summary, Issues & Limitations**

### **Project Overview**

The data mining phase involved me, Sofia, building a dataset of Irish bakeries through automated web scraping of Yelp.ie.

We chose this topic as I really love to eat a variety of desserts, as a customer, but I'm also picky and would like to know which places are best to choose from.

The process required anti-bot protections as you would find in the AI use declaration, while collecting useful business intelligence data.

The scraping procedure gathered ~1,519 bakery listings across multiple Irish regions before deduplicating to 1,488.

---

## **Technical Implementation & Methodology**

### **Scraping Architecture**
The data collection used a two-layer approach, same from class:
- **Selenium WebDriver**: Dynamic content loading, JavaScript execution, pagination
- **BeautifulSoup**: HTML parsing, structured data extraction
- **Pandas**: Data storage and transformation

### **Regional Coverage Strategy**
Geographical sampling across 12 Irish regions:
- **Big / Urban Centers**: Dublin, Cork, Galway, Belfast
- **Regional:**: Kerry, Louth, Kilkenny, Wexford, Donegal, Limerick, Waterford, Derry

We did this becauae we wanted to do it in a stratified way to ensure the market representtion was as broad as possible rather than just urban.


### **Data Extraction Attributes**
8 key business attributes:
- Business identity (name, region)
- Performance metrics (star rating, review count)
- Market positioning (price range, categories)
- Customer insights (review snippets, location context)

---

## **Technical Challenges & Solutions**

### **1. Dynamic Content Management**
**Challenge**: Yelp relies heavily on JS rendering so traditional scraping methods captured empty page templates when we started.

**Solution**: Implemented Selenium with planned scrolling and loading delays to get complete content rendering before parsing.

### **2. HTML Structure Volatility**
**Challenge**: Yelp randomly changes CSS classes and that breaks selector-based scraping.

**Solution**:
- 'data-testid' attributes
- Structural element relationships
- Fallback selection strategies

### **3. Anti-Bot Countermeasures**
**Challenge**: Yelp's bot detection triggered CAPTCHAs and IP rate limiting during intensive scraping sessions.

**Mitigation Strategies**:
- Randomized request delays between 2.5-5 seconds
- Used session persistence to maintain browser state
- Incremental saving to keep progress during interruptions
- Error handling to continue when the website blocks that page

### **4. Data Consistency Issues**
**Challenge**: Inconsistent field availability across listings actually reflects real-world business profile variations.

**Approach**: Accepted natural missingness patterns as authentic market characteristics rather than a failure in scraping.

---

## **Data Quality Assessment**

### **Completeness Analysis**
- **Price ranges**: 940 missing (Which is ~63.2% and is common for new/unpriced establishments)
- **Ratings**: 440 missing (Which is ~29.6% and can be new businesses without reviews)
- **Review counts**: 516 missing (Which is ~34.7%, consistent with business activity patterns)

These patterns shows genuine business characteristics rather than collection errors.

### **Data Capture Issues Identified**
- **78 entries** with business hours incorrectly captured as ratings
- **272 entries** with mixed currency formats (€, $, £) from Yelp's international platform
- **10 different price formats** requiring standardization

### **Geographical Distribution**
- **High density**: Dublin, Cork (high population)
- **Medium density**: Galway, Belfast (urban tourism)
- **Lower density**: Regional areas like our Louth county (authentic market distribution)

### **Duplicate Management**
Implemented deduplication using composite keys (name + region + location) to keep data uniqueness but preserving legitimate chain locations across regions like The Home Bakery.

---

## **Methodological Strengths**

### **Data Integrity**
- **Type preservation**: Kept original data formats during extraction
- **Relationship consistency**: Kept business attribute relationships
- **Metadata tracking**: Included source and collection timing context

### **Ethical Compliance**
- Respectful request intervals avoided service disruption
- Data usage aligned with academic research purposes
- Transparent methodology documentation

---

## **Limitations & Boundary Conditions**

### **Technical Constraints**
1. **Rate Limiting**: Collection speed limited by anti-bot protections
2. **Content Stability**: Periodic Yelp UI changes required me to update the selector
3. **Regional Biases**: Listing availability reflects Yelp's user base distribution

### **Market Representation**
- **Coverage**: Yelp's Irish businesses
- **Recency**: Current as of collection date (November 2025)

---

## **Strategic Value & Applications**
1. **Competitive Intelligence**: Regional benchmarking and positioning analysis
2. **Consumer Insights**: Rating patterns and review sentiment trends
3. **Market Gaps**: Geographical and categorical opportunity identification
4. **Quality Correlations**: Relationship between price, ratings, and business attributes

---


# **AI Use Declaration (Correct + Safe Version)**

### **AI Use Declaration**

Generative AI (ChatGPT) was used only during the Data Mining stage of the project.
AI was not used for Data Cleaning, EDA, Feature Engineering, Modelling, or Conclusions.

The purpose of AI use was to troubleshoot issues encountered during web scraping, specifically:

* being temporarily IP-blocked by GoldenPages
* receiving bot-detection/CAPTCHA pages from Yelp
* selectors breaking due to frequent HTML changes
* deciding whether to rescrape and how many listings to target

The AI was used to diagnose technical scraping errors and suggest corrective actions such as using a different network IP, reviewing CSS selectors, and adjusting scraping volume.
All scraping code and dataset collection were authored and executed by me.

---

### **Technologies Used**

* ChatGPT (OpenAI)

---

### **Prompts Provided (Summarised)**

Examples of the types of prompts I submitted:

* *“GoldenPages isn’t loading, was my IP blocked?”*
* *“Even when I visit the page normally I get CAPTCHA, what does that mean?”*
* *“I switched to mobile hotspot and scraping works again, is this due to IP banning?”*
* *“I scraped 852 rows. Should I scrape more to get a stronger modelling dataset?”*
* *“My scraper returned 0 rows, is this due to IP ban or did selectors break?”*

---

### **Outputs Received (Summarised)**

ChatGPT provided:

* confirmation that GoldenPages had likely IP-blocked my home network
* an explanation that switching to mobile data bypassed the block
* warnings about CAPTCHA lockout when scraping Yelp repeatedly
* advice that datasets around 1,200-2,000 rows improve modelling quality
* identification that “0 rows scraped” was caused by selector break, not IP ban
* guidance to pause scraping to avoid extending a block

No AI-generated code was copied into the notebook, and no analysis, explanations, or modelling were AI-produced.

---

### **How AI Output Was Used**

* Used informationally to understand why the scraper failed
* Used to help decide whether to retry scraping or adjust the number of pages
* Used to confirm that certain errors (0 rows, CAPTCHA page) were expected results of bot detection

AI output was not inserted directly into the project files and was ot used for any analysis or modelling stages.

---

### **Declaration**

I confirm that generative AI was used only for troubleshooting the Data Mining stage as described above, and no AI-generated content appears in the analysis, data cleaning, EDA, modelling, or conclusions.

---
