# **Data Mining Introduction**

### **Team: Iker Arza & Sofia Fedane**

### **Context**

This project builds a real-world dataset of bakeries in Ireland using web scraping techniques taught in class. The objective is to collect customer-facing business information from a live online platform, transform it into a structured dataset, and prepare it for further analysis and predictive modelling.

### **Dataset Source**

The dataset for this project was created exclusively from Yelp.ie, a public review platform that provides rich information on local businesses, including:

* Business name
* Star rating
* Number of reviews
* Price range (€ / €€ / €€€)
* Categories (e.g., Bakery, Café, Coffee Shop)
* Location text
* Short customer review snippet

Using Selenium and BeautifulSoup, multiple pages of Yelp search results were scraped across several Irish regions.

### **Business Motivation**

The bakery sector in Ireland spans small artisan bakeries, modern café–bakery hybrids, and larger commercial chains. Understanding what sets the more successful bakeries apart, in terms of:

* higher ratings,
* more reviews,
* premium or budget pricing,
* category specialisation,
* regional differences

can generate insights valuable to:

* **bakery owners** (competitive benchmarking),
* **entrepreneurs** (market opportunities),
* **marketing teams** (targeting customer preferences),
* **industry analysts** (regional demand trends).

A high-quality dataset of bakery ratings and attributes enables meaningful exploratory analysis and supports data-driven decision-making in the bakery and hospitality industry.

In [36]:
import time
import os
import pandas as pd
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

os.makedirs("../data", exist_ok=True)

driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(1)


In [37]:
YELP_LIMIT = 2500

YELP_URLS = {
    "Dublin":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin",
    "Cork":      "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Cork",
    "Galway":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Galway",
    "Limerick":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Limerick",
    "Waterford": "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Waterford",
    "Kerry":     "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Kerry",
    "Louth":     "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Louth",
    "Kilkenny": "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Kilkenny",
    "Wexford":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Wexford",
    "Donegal":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Donegal",
    "Belfast":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Belfast",
    "Derry":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Derry",

}

# Data Source
**Data Source: Yelp**

We scrape bakery listings from Yelp.ie across multiple Irish regions, including Dublin, Cork, Galway, Limerick, Waterford, Kerry, and Louth (the full list is defined in `YELP_URLS`). Yelp provides rich customer-oriented information and is the only source used for this dataset.

The available fields include:
- Business name
- Star rating
- Number of reviews
- Price range (€, €€, €€€)
- Location / area tags
- Business categories (e.g., “Bakery”, “Café”, “Patisserie”)
- Short review snippet visible in search results


In this part of the project, we:
- Loop through each selected Irish region.
- Load the search results page for bakeries.
- Scroll to dynamically load all visible listings.
- Parse each listing card using BeautifulSoup.
- Extract key customer-focused features such as rating, review count, price range, categories, and review snippet.
- Move to the next results page by increasing the `start=` query parameter, until a page returns no listings or the row limit set by `YELP_LIMIT` is reached.

These rows form the complete dataset, since Yelp.ie is the only source used in this project.
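
The pagination step can be sketched in isolation. This is a minimal illustration, not the scraper's own code: `build_page_url` and `PAGE_SIZE` are hypothetical names, and the page size of 10 is an assumption taken from the `start=` step used in the scraping function.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

# Assumption: ten listings per results page, matching the scraper's `start=` step.
PAGE_SIZE = 10


def build_page_url(base_url: str, page_number: int, page_size: int = PAGE_SIZE) -> str:
    """Return the search URL for a zero-based page number."""
    if page_number == 0:
        return base_url  # the first page needs no offset
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    # Set (or overwrite) the result offset for this page.
    query["start"] = [str(page_number * page_size)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


base = "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin"
for page in range(3):
    print(build_page_url(base, page))
```

Building the URL through `urllib.parse` rather than string concatenation keeps the query string valid even if the base URL already carries a `start` value.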

## Yelp Scraping Function

In [38]:
def scrape_yelp(max_rows=2500, max_pages_per_region=200):
    rows = []
    
    for region_label, base_url in YELP_URLS.items():
        print(f"\nYelp region: {region_label}")

        for page_number in range(max_pages_per_region):

            if len(rows) >= max_rows:
                print("Reached Yelp row limit:", len(rows))
                return rows

            # --------- Build page URL ---------
            page_url = base_url if page_number == 0 else base_url + f"&start={page_number * 10}"
            print(f"  Page {page_number + 1}: {page_url}")

            driver.get(page_url)
            time.sleep(2.5)

            # --------- Scroll to load content ---------
            try:
                body = driver.find_element(By.TAG_NAME, "body")
                for _ in range(3):
                    body.send_keys(Keys.END)
                    time.sleep(1)
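            except Exception:
                # Scrolling is best-effort: if it fails, parse whatever has rendered.
                # (A bare `except:` would also swallow KeyboardInterrupt.)
                pass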

            soup = BeautifulSoup(driver.page_source, "html.parser")
            cards = soup.find_all("div", attrs={"data-testid": "serp-ia-card"})

            if not cards:
                print("  No more results for this region.")
                break

            # --------- Extract card data ---------
            for card in cards:
                if len(rows) >= max_rows:
                    break

                name_tag    = card.find("a", class_="y-css-1x1e1r2")
                rating_tag  = card.find("span", class_="y-css-f73en8")
                review_tag  = card.find("span", class_="y-css-1vi7y4e")
                loc_tag     = card.find("span", class_="y-css-wpsy4m")
                price_tag   = card.find("span", class_="y-css-1y784sg")
                snippet_tag = card.find("p", class_="y-css-oyr8zn")

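                # NOTE: this joins the text of every <p> in the card, so the result
                # may contain the review snippet as well as the category labels.
                categories = ", ".join([c.get_text(strip=True) for c in card.find_all("p")])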

                rows.append({
                    "source": "Yelp",
                    "region": region_label,
                    "name": name_tag.get_text(strip=True) if name_tag else None,
                    "rating_raw": rating_tag.get_text(strip=True) if rating_tag else None,
                    "review_count_raw": review_tag.get_text(strip=True) if review_tag else None,
                    "location": loc_tag.get_text(strip=True) if loc_tag else None,
                    "price_range": price_tag.get_text(strip=True) if price_tag else None,
                    "categories": categories,
                    "snippet": snippet_tag.get_text(" ", strip=True)[:200] if snippet_tag else None
                })

            print("    Total Yelp collected:", len(rows))

    return rows

In [39]:
print("\n--- STARTING YELP SCRAPING ---")
yelp_rows = scrape_yelp(max_rows=YELP_LIMIT)
print("Final Yelp count:", len(yelp_rows))

# --------- Convert to DataFrame ---------
df = pd.DataFrame(yelp_rows)

# --------- Deduplicate based on name + region + location ---------
df = df.drop_duplicates(subset=["name", "region", "location"], keep="first")

# --------- Save dataset ---------
df.to_csv("../data/dataProject.csv", index=False)
print("\nDataset saved to ../data/dataProject.csv")

# --------- Show CSV size ---------
size_mb = os.path.getsize("../data/dataProject.csv")/(1024*1024)
print(f"CSV size: {size_mb:.2f} MB")


--- STARTING YELP SCRAPING ---

Yelp region: Dublin
  Page 1: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin
    Total Yelp collected: 10
  Page 2: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=10
    Total Yelp collected: 20
  Page 3: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=20
    Total Yelp collected: 30
  Page 4: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=30
    Total Yelp collected: 40
  Page 5: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=40
    Total Yelp collected: 50
  Page 6: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=50
    Total Yelp collected: 60
  Page 7: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=60
    Total Yelp collected: 70
  Page 8: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&start=70
    Total Yelp collected: 80
  Page 9: https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin&star

---
# **Data Mining Summary, Issues & Limitations**

### **Overview**

The data mining phase used web scraping of Yelp.ie to build a real-world dataset of bakeries in Ireland. Selenium was used to automate browser navigation, scroll dynamically loaded content, and paginate through search results. BeautifulSoup parsed each page to extract structured business information such as name, rating, review count, price range, categories, location, and review snippets.

The scraping process successfully gathered **1,519 bakery listings**, which provides a strong foundation for meaningful exploratory data analysis and regression modelling.

---

## **Successful Aspects of the Data Mining Process**

### **Yelp Scraping**

* Successfully scraped bakery listings across multiple regions in Ireland (e.g., Dublin, Cork, Galway, and Limerick).
* Pagination using the `start=` query parameter worked reliably for multi-page navigation.
* Selenium scrolling ensured dynamically loaded content (cards, snippets, ratings) was fully rendered before parsing.
* Extracted the following key fields:

  * business name
  * region
  * rating
  * review count
  * price range (€ / €€ / €€€)
  * category labels
  * location
  * review snippet

### **High Row Count Achieved**

* Final Yelp-only scrape returned **1,519 listings**, exceeding the target of 1,500 entries.
* A sample of this size is large enough to support exploratory analysis and regression modelling across regions.

---

## **Challenges & Limitations Encountered**

### **1. Dynamic Content Loading**

Yelp loads many page elements via JavaScript. Without scrolling, not all listings were visible. Selenium scroll events were required to ensure complete capture of business cards.

### **2. Changing HTML Structure**

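Yelp frequently updates its CSS classes (e.g., dynamic `y-css-*` names).
To improve stability, parsing relied where possible on:

* `data-testid` attributes (used to locate each listing card)
* generic tags (all `p` elements, used for the category text)

Individual fields such as name, rating, review count, and price are still located through `y-css-*` class names, so those selectors may need updating whenever Yelp changes its markup.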

### **3. Optional / Missing Fields**

Some fields are not present for every business:

* `price_range` is missing for low-detail listings
* new businesses may have no `rating_raw` or `review_count_raw`
* `snippet` is sometimes unavailable in search results

These missing values are expected for real-world Yelp data.
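
Before analysis, the raw text fields need converting to numbers while keeping missing values as missing. The sketch below is one possible approach, not part of the scraper: `parse_rating` and `parse_review_count` are hypothetical helpers, and the example strings are assumptions about how Yelp formats these fields, since the raw values are not shown in this notebook.

```python
import re
from typing import Optional


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Return the first decimal number found in `raw`, or None if there is none."""
    if not raw:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def parse_review_count(raw: Optional[str]) -> Optional[int]:
    """Return the first integer in `raw` (thousands separators removed), or None."""
    if not raw:
        return None
    match = re.search(r"\d[\d,]*", raw)
    return int(match.group().replace(",", "")) if match else None


# Assumed example inputs; real Yelp strings may be formatted differently.
print(parse_rating("4.5"))
print(parse_review_count("(1,204 reviews)"))
print(parse_rating(None))
```

Returning `None` for absent or unparseable values keeps new or low-detail listings in the dataset without inventing a rating for them.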

### **4. Regional Variation in Listings**

Regions such as Kerry and Louth had fewer bakery listings than Dublin.
This creates an uneven distribution across regions but does not prevent analysis.
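
The imbalance can be quantified by counting listings per region. The following is a small sketch using made-up sample rows shaped like the dictionaries built by `scrape_yelp()`; `region_shares` is a hypothetical helper, and the counts shown are not results from the real dataset.

```python
from collections import Counter


def region_shares(rows):
    """Map each region to (listing count, share of all listings)."""
    counts = Counter(row["region"] for row in rows)
    total = sum(counts.values())
    return {region: (n, n / total) for region, n in counts.most_common()}


# Made-up sample rows for illustration only.
sample_rows = [
    {"region": "Dublin", "name": "Bakery A"},
    {"region": "Dublin", "name": "Bakery B"},
    {"region": "Dublin", "name": "Bakery C"},
    {"region": "Kerry", "name": "Bakery D"},
]

for region, (count, share) in region_shares(sample_rows).items():
    print(f"{region:<10} {count:>4} {share:6.1%}")
```

Reporting shares alongside raw counts makes it easier to decide whether small regions should be grouped or weighted in later modelling.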

---

## **Final Output**

* **Total records extracted:** 1,519
* **Total records after deduplication:** 1,519 (Yelp listings rarely duplicate names+locations)
* **Dataset saved to:** `../data/dataProject.csv`
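* **Columns collected:** 9 fields per record (`source`, `region`, `name`, `rating_raw`, `review_count_raw`, `location`, `price_range`, `categories`, `snippet`), all from Yelp.ie; optional fields are empty where Yelp shows no value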
* **Technologies used:**

  * Selenium
  * BeautifulSoup
  * pandas
* **Data source:** **Yelp.ie (sole source)**

---
