# Data Mining Introduction
### Team: Iker Arza & Sofia Fedane

**Context:**
This project aims to build a real-world dataset of bakeries in Ireland using web scraping techniques taught in class. The goal is to collect business and customer-facing information from multiple online platforms, consolidate it into a single dataset, clean it, and prepare it for analysis and modelling.

**Dataset:**
Two public online platforms were used:
1. GoldenPages.ie: provides contact information such as name, address, phone number, business categories, and short summaries.
2. Yelp.ie: provides customer-driven information such as ratings, review counts, price ranges, location tags, and customer review snippets.

**Business Motivation:**
The bakery sector in Ireland is diverse, ranging from small artisan bakeries to large commercial chains. Understanding which attributes are associated with successful bakeries, such as ratings, category labels, price range, location, and customer reviews, may help identify trends in consumer preferences and regional differences.

In [None]:
import time
import os
import pandas as pd
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

os.makedirs("../data", exist_ok=True)

driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(1)


In [None]:
MAX_ROWS = 1500

GOLDENPAGES_URLS = {
    "bakery":       "https://www.goldenpages.ie/q/business/advanced/what/bakery/",
    "cake_shop":    "https://www.goldenpages.ie/q/business/advanced/what/cake%20shop/",
    "coffee_shop":  "https://www.goldenpages.ie/q/business/advanced/what/coffee%20shop/",
    "dessert_shop": "https://www.goldenpages.ie/q/business/advanced/what/dessert%20shop/",
    "pastry_shop":  "https://www.goldenpages.ie/q/business/advanced/what/pastry%20shop/",
}

YELP_URLS = {
    "Dublin":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin",
    "Cork":      "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Cork",
    "Galway":    "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Galway",
    "Limerick":  "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Limerick",
    "Waterford": "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Waterford",
    "Kerry":     "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Kerry",
    "Louth":     "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Louth",
}

## Data Source 1
**Data Source: GoldenPages**

We scrape business listings from GoldenPages.ie, focusing on bakery-related search terms such as bakery, cake shop, coffee shop, dessert shop, and pastry shop.
GoldenPages provides structured business-oriented information, including:
- Business name
- Physical address
- Phone number
- Business category
- Short business summary / description

For each search term, GoldenPages displays listings across multiple pages using numeric pagination (/1, /2, /3, …).
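
As a minimal sketch of this pagination scheme, the helper below builds the URL for a given page. The name `goldenpages_page_url` is hypothetical; the logic mirrors the scraper in this notebook, which requests the base URL for the first page and appends the page number from page 2 onwards.

```python
def goldenpages_page_url(base_url: str, page_number: int) -> str:
    """Return the results URL for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    if page_number == 1:
        # The first page is served at the base search URL itself.
        return base_url
    # Later pages append the page number as a path segment.
    return base_url.rstrip("/") + f"/{page_number}"


base = "https://www.goldenpages.ie/q/business/advanced/what/bakery/"
first_three = [goldenpages_page_url(base, n) for n in range(1, 4)]
for url in first_three:
    print(url)
```

Keeping the URL logic in one small function makes it easy to check by hand before any browser is launched.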

In this part of the project, we:
- Loop through each bakery-related search term.
- Paginate automatically through all available result pages.
- Scroll the page to load dynamically-rendered content (using Selenium).
- Parse each listing card using BeautifulSoup.
- Extract key information fields such as name, address, phone number, category, and summary.
- Stop scraping GoldenPages when we reach the global project limit of 1,500 rows or run out of pages.

These rows form the first part of our combined bakery dataset.
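
The loop and its two stopping conditions (no more pages, or the global row limit) can be sketched without a browser. Everything here is illustrative: `collect_rows`, `fake_fetch`, and `FAKE_SITE` are hypothetical stand-ins, and the fake pages replace the Selenium and BeautifulSoup steps.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def collect_rows(
    search_terms: List[str],
    fetch_page: Callable[[str, int], List[Row]],
    max_rows: int,
) -> List[Row]:
    """Walk each search term page by page until a page is empty or max_rows is hit."""
    rows: List[Row] = []
    for term in search_terms:
        page_number = 1
        while True:
            page_rows = fetch_page(term, page_number)
            if not page_rows:
                # Ran out of pages for this search term.
                break
            for row in page_rows:
                if len(rows) >= max_rows:
                    # Global limit reached: stop everything.
                    return rows
                rows.append(row)
            page_number += 1
    return rows


# Stand-in for the website: two pages for "bakery", one for "cake_shop".
FAKE_SITE = {
    ("bakery", 1): [{"name": "A"}, {"name": "B"}],
    ("bakery", 2): [{"name": "C"}],
    ("cake_shop", 1): [{"name": "D"}],
}


def fake_fetch(term: str, page_number: int) -> List[Row]:
    return FAKE_SITE.get((term, page_number), [])


all_rows = collect_rows(["bakery", "cake_shop"], fake_fetch, max_rows=1500)
capped_rows = collect_rows(["bakery", "cake_shop"], fake_fetch, max_rows=3)
print(len(all_rows), len(capped_rows))
```

Passing the page fetcher in as a function keeps the stopping logic testable on its own, separate from the network and parsing code.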

## GoldenPages Scraper Function

In [None]:
def scrape_goldenpages():
    rows = []

    for search_label, base_url in GOLDENPAGES_URLS.items():
        print(f"\nGoldenPages search: {search_label}")

        page_number = 1
        last_url = None

        while True:
            if len(rows) >= MAX_ROWS:
                print("Reached MAX_ROWS in GoldenPages:", len(rows))
                return rows

            # Build the URL of the next page
            if page_number == 1:
                page_url = base_url
            else:
                page_url = base_url.rstrip("/") + f"/{page_number}"

            print(f"  Page {page_number}: {page_url}")
            driver.get(page_url)
            time.sleep(2.5)

            current_url = driver.current_url

            # Stop if the site loops back to the same page
            if last_url is not None and current_url == last_url:
                print("  Same URL encountered; stopping this search term.")
                break
            last_url = current_url

            # Scroll to load dynamic content
            try:
                body = driver.find_element(By.TAG_NAME, "body")
                for _ in range(3):
                    body.send_keys(Keys.END)
                    time.sleep(1)
            except Exception:
                pass

            soup = BeautifulSoup(driver.page_source, "html.parser")
            cards = soup.find_all("div", class_="listing_container")
            print("    Cards found:", len(cards))

            if len(cards) == 0:
                print("  No listings on this page; stopping this search term.")
                break

            # Extract the details
            for c in cards:
                if len(rows) >= MAX_ROWS:
                    print("Reached MAX_ROWS while processing cards.")
                    return rows

                name_tag = c.find("a", class_="listing_title_link")
                name = name_tag.get_text(" ", strip=True) if name_tag else None

                addr_tag = c.find("div", class_="listing_address")
                address = addr_tag.get_text(" ", strip=True) if addr_tag else None

                phone_tag = c.find("a", class_="link_listing_number")
                phone = phone_tag.get_text(strip=True) if phone_tag else None

                category_from_page = None
                cat_div = c.find("div", class_="listing_categories")
                if cat_div:
                    li = cat_div.find("li")
                    if li:
                        category_from_page = li.get_text(" ", strip=True)

                summary = None
                summary_div = c.find("div", class_="listing_summary")
                if summary_div:
                    p = summary_div.find("p")
                    if p:
                        summary = p.get_text(" ", strip=True)

                rows.append({
                    "source": "GoldenPages",
                    "category_search": search_label,
                    "name": name,
                    "address": address,
                    "phone": phone,
                    "category_from_page": category_from_page,
                    "summary": summary,
                })

            print("    Total collected:", len(rows))
            page_number += 1

    return rows


## Data Source 2
**Data Source: Yelp**

We scrape bakery listings from Yelp.ie across multiple Irish regions, including Dublin, Cork, Galway, Limerick, Waterford, Kerry, and Louth. Yelp provides rich customer-oriented information that complements GoldenPages.

The available fields include:
- Business name
- Star rating
- Number of reviews
- Price range (€, €€, €€€)
- Location / area tags
- Business categories (e.g., “Bakery”, “Café”, “Patisserie”)
- Short review snippet visible in search results


In this part of the project, we:
- Loop through each selected Irish region.
- Load the search results page for bakeries.
- Scroll to dynamically load all visible listings.
- Parse each listing card using BeautifulSoup.
- Extract key customer-focused features such as rating, review count, price range, categories, and review snippet.
- Move to the next results page by adding a `start` offset to the search URL, until a page returns no listings, the per-region cap of 10 pages is reached, or the combined dataset reaches 1,500 rows.

These rows form the second part of the dataset and complement the GoldenPages business information.
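
The scraping function in this notebook pages through Yelp results by adding a `start` offset to the search URL. A minimal sketch of that offset arithmetic follows; the helper name `yelp_page_url` is hypothetical, and the page size of 10 is an assumption taken from the step used in the scraper rather than from any Yelp documentation.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

RESULTS_PER_PAGE = 10  # assumption: matches the step used in scrape_yelp


def yelp_page_url(base_url: str, page_index: int,
                  page_size: int = RESULTS_PER_PAGE) -> str:
    """Return the search URL for a 0-based page index."""
    if page_index == 0:
        # The first page is the plain search URL.
        return base_url
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    # Offset of the first listing on the requested page.
    query["start"] = [str(page_index * page_size)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


dublin = "https://www.yelp.ie/search?find_desc=Bakeries&find_loc=Dublin"
print(yelp_page_url(dublin, 0))
print(yelp_page_url(dublin, 2))
```

Building the query with `urllib.parse` avoids a malformed URL if the base URL ever gains or loses parameters.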

## Yelp Scraping Function

In [None]:
def scrape_yelp(max_rows_remaining, max_pages_per_region=10):
    rows = []

    for region_label, base_url in YELP_URLS.items():
        print(f"\nYelp region: {region_label}")

        page_number = 0

        while page_number < max_pages_per_region:

            if len(rows) >= max_rows_remaining:
                print("Reached Yelp limit:", len(rows))
                return rows

            # Build the page URL
            if page_number == 0:
                page_url = base_url
            else:
                page_url = base_url + f"&start={page_number * 10}"

            print(f"  Page {page_number + 1}: {page_url}")
            driver.get(page_url)
            time.sleep(3)

            # Scroll to ensure full content loads
            try:
                body = driver.find_element(By.TAG_NAME, "body")
                for _ in range(3):
                    body.send_keys(Keys.END)
                    time.sleep(1.5)
            except Exception:
                pass

            soup = BeautifulSoup(driver.page_source, "html.parser")
            cards = soup.find_all("div", attrs={"data-testid": "serp-ia-card"})

            if not cards:
                print("  No further results for this region.")
                break

            # Extract fields from each card
            for card in cards:
                if len(rows) >= max_rows_remaining:
                    print("Reached remaining limit inside card loop.")
                    return rows

                name_tag = card.find("a", class_="y-css-1x1e1r2")
                name = name_tag.get_text(strip=True) if name_tag else None

                rating_tag = card.find("span", class_="y-css-f73en8")
                rating_raw = rating_tag.get_text(strip=True) if rating_tag else None

                reviews_tag = card.find("span", class_="y-css-1vi7y4e")
                review_count_raw = reviews_tag.get_text(strip=True) if reviews_tag else None

                loc_tag = card.find("span", class_="y-css-wpsy4m")
                location = loc_tag.get_text(strip=True) if loc_tag else None

                price_tag = card.find("span", class_="y-css-1y784sg")
                price_range = price_tag.get_text(strip=True) if price_tag else None

                categories = None
                cat_container = card.find("div", attrs={"data-testid": "serp-ia-categories"})
                if cat_container:
                    categories = ", ".join(
                        p.get_text(strip=True) for p in cat_container.find_all("p")
                    )

                snippet_tag = card.find("p", class_="y-css-oyr8zn")
                snippet = snippet_tag.get_text(" ", strip=True) if snippet_tag else None

                rows.append({
                    "source": "Yelp",
                    "region": region_label,
                    "name": name,
                    "rating_raw": rating_raw,
                    "review_count_raw": review_count_raw,
                    "location": location,
                    "price_range": price_range,
                    "categories": categories,
                    "snippet": snippet,
                })

            print("    Total collected:", len(rows))
            page_number += 1

    return rows


In [None]:
# Collect GoldenPages data
goldenpages_rows = scrape_goldenpages()
current_total = len(goldenpages_rows)
print("\nGoldenPages collected:", current_total, "rows")

# Then collect Yelp data if needed to reach the MAX_ROWS target
if current_total < MAX_ROWS:
    remaining = MAX_ROWS - current_total
    print("Additional rows required:", remaining)
    yelp_rows = scrape_yelp(max_rows_remaining=remaining)
else:
    print("Reached MAX_ROWS from GoldenPages; skipping Yelp")
    yelp_rows = []

# Combine both sources
all_rows = goldenpages_rows + yelp_rows
df = pd.DataFrame(all_rows)

print("\nRows before duplicate removal:", len(df))

# Remove duplicates on name and address.
# Yelp rows have no address (NaN), so they are compared with each other on name alone.
df = df.drop_duplicates(subset=["name", "address"], keep="first")

print("Rows after duplicate removal:", len(df))

# Save dataset
df.to_csv("../data/dataProject.csv", index=False)
print("\nDataset saved to ../data/dataProject.csv")

# Close the browser now that scraping is finished
driver.quit()


## Data Cleaning and Exploration (EDA)

In [1]:
import pandas as pd

df = pd.read_csv("../data/dataProject.csv")
df.head()

df.info()
df.describe(include='all')
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1091 entries, 0 to 1090
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   source              1091 non-null   object 
 1   category_search     1071 non-null   object 
 2   name                1091 non-null   object 
 3   address             1071 non-null   object 
 4   phone               1066 non-null   object 
 5   category_from_page  1071 non-null   object 
 6   summary             397 non-null    object 
 7   region              20 non-null     object 
 8   rating_raw          20 non-null     float64
 9   review_count_raw    20 non-null     object 
 10  location            20 non-null     object 
 11  price_range         18 non-null     object 
 12  categories          20 non-null     object 
 13  snippet             20 non-null     object 
dtypes: float64(1), object(13)
memory usage: 119.5+ KB


source                   0
category_search         20
name                     0
address                 20
phone                   25
category_from_page      20
summary                694
region                1071
rating_raw            1071
review_count_raw      1071
location              1071
price_range           1073
categories            1071
snippet               1071
dtype: int64

### Data Quality Summary

The combined dataset contains 1,091 rows and 14 columns. These rows come from two different sources (GoldenPages and Yelp), each providing different types of information. Because of this, the pattern of missing values is expected and consistent with the nature of each source.

**GoldenPages rows**
GoldenPages listings include:
- business name
- category used in the search and category shown on the listing
- address
- phone number
- short business summary (when available)

They **do not contain**:
- ratings
- review counts
- price range
- categories list
- customer review snippets
- region

This explains why these columns are missing in all 1,071 GoldenPages rows (1,073 for `price_range`, which is also absent from two Yelp rows).

**Yelp rows**
Yelp listings include:
- name
- rating
- review count
- location text
- price range
- category tags
- customer review snippet
- region label

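They **do not contain**:
- address
- phone number
- GoldenPages summary
- GoldenPages category info

This explains why `address`, `category_search`, and `category_from_page` are missing in exactly 20 rows, one per Yelp listing. `phone` (25 missing) and `summary` (694 missing) are also absent from some GoldenPages listings.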

### Interpretation of Missing Values

The missing values are *not* data errors; they are a structural consequence of combining two datasets with different schemas. Each row contains only the attributes provided by its source.

- GoldenPages contributes **1,071 rows** after deduplication
- Yelp contributes **20 rows** after deduplication
- Columns that belong to only one source appear as `NaN` for the other

This is expected behaviour in a multi-source scraping project. The dataset is suitable for exploration and basic analysis, although any analysis of the Yelp-only fields rests on just 20 rows.
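
To see why this pattern arises, here is a small pure-Python sketch of what happens when rows with different keys are stacked into one table, which is in effect what `pd.DataFrame(all_rows)` does. The two rows are made-up examples, not entries from the scraped dataset.

```python
# Illustrative rows only; the values are invented for this sketch.
goldenpages_row = {
    "source": "GoldenPages",
    "name": "Example Bakery",
    "address": "1 Main Street",
    "phone": "01 000 0000",
}
yelp_row = {
    "source": "Yelp",
    "name": "Example Cafe",
    "rating_raw": "4.5",
    "region": "Dublin",
}
rows = [goldenpages_row, yelp_row]

# Column set = ordered union of all keys seen across rows.
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# Each row is padded with None for columns its source never provides.
table = [{col: row.get(col) for col in columns} for row in rows]

# Count missing values per column, split by source.
missing_by_source = {}
for row in table:
    counts = missing_by_source.setdefault(
        row["source"], {col: 0 for col in columns}
    )
    for col in columns:
        if row[col] is None:
            counts[col] += 1

for source, counts in missing_by_source.items():
    print(source, counts)
```

Splitting the missing-value counts by source, rather than looking only at column totals, separates structural gaps from fields that are missing within a single source.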

In [None]:
# Visualisation

import matplotlib.pyplot as plt

# Counting Bakeries by Region (YELP data only)
df[df['source']=="Yelp"]['region'].value_counts().plot(kind='bar')
plt.title("Number of Bakeries per Region (Yelp)")
plt.xlabel("Region")
plt.ylabel("Count")
plt.show()

# Ratings Distribution (also YELP data only)
df['rating_raw'] = pd.to_numeric(df['rating_raw'], errors='coerce')

df[df['source']=="Yelp"]['rating_raw'].plot(kind="hist", bins=10)
plt.title("Distribution of Bakery Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()

### Business Insights
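- Dublin has the highest number of bakeries listed on Yelp in this sample, which may reflect stronger competition and demand there, although the Yelp portion of the dataset contains only 20 rows.
- The rating distribution suggests that the Yelp-listed bakeries in the sample mostly receive positive reviews; 20 rows are too few to generalise to all bakeries in Ireland.
- GoldenPages data focuses on basic business listings, whereas Yelp captures customer perception.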