---
# 1. Business Understanding

### Business Insights (COME BACK TO THIS)
- Dublin has the highest number of bakeries listed on Yelp, which suggests strong demand and competition there.
- The rating distribution suggests that most bakeries in Ireland receive positive reviews (median rating of 4.1 among the 260 complete-case listings).
---

# 2. Data Mining Summary

### **Read Further Analysis in *DataMining.ipynb***

The dataset used in this project was created entirely through a web-scraping process implemented in the `DataMining.ipynb` notebook. The data comes exclusively from Yelp.ie, where Selenium and BeautifulSoup were used to:

* automate browser navigation,
* paginate through multiple search result pages across Irish regions,
* scroll dynamically loaded content, and
* extract structured information such as business names, ratings, reviews, price ranges, categories, locations, and review snippets.

The `DataMining.ipynb` notebook documents the full scraping workflow, including technical challenges (dynamic HTML, pagination limits, missing optional fields), and the rationale behind the chosen approach.

The dataset contains approximately 1,519 bakery listings, depending on the scrape limits used (the version loaded in Section 3 has 1,492 rows). It is saved as `dataProject.csv` for all subsequent cleaning, EDA, feature engineering, and modelling.


---
# 3. Data Cleaning


In this step, we classify all variables by type and purpose, handle missing values,
convert raw scraped text fields into numeric form, and prepare the dataset for
exploratory analysis and modelling.

Since the dataset comes solely from Yelp.ie, missing values occur because some
businesses do not list a price range, have no reviews yet, or lack a visible
preview snippet. These are expected and not scraping errors.

The cleaning process includes:
- assigning variable types (numerical or categorical)
- assigning variable purpose (response / explanatory)
- converting rating and review counts to numeric variables
- encoding price range (€ / €€ / €€€) as an ordinal variable
- creating additional useful features for modelling
- removing unusable rows (e.g., missing ratings for regression)

In [10]:
import pandas as pd

df = pd.read_csv("../data/dataProject.csv")

df.head()
df.info()
df.isna().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1492 entries, 0 to 1491
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   source            1492 non-null   object
 1   region            1492 non-null   object
 2   name              1492 non-null   object
 3   rating_raw        1057 non-null   object
 4   review_count_raw  982 non-null    object
 5   location          1492 non-null   object
 6   price_range       545 non-null    object
 7   categories        1492 non-null   object
 8   snippet           978 non-null    object
dtypes: object(9)
memory usage: 105.0+ KB


source                0
region                0
name                  0
rating_raw          435
review_count_raw    510
location              0
price_range         947
categories            0
snippet             514
dtype: int64

In [11]:
# Convert rating_raw to numeric
df['rating_raw'] = pd.to_numeric(df['rating_raw'], errors='coerce')

# Extract numeric review count
df['review_count_raw'] = (
    df['review_count_raw']
    .astype(str)
    .str.extract(r'(\d+)')[0]
    .astype(float)
)

# Encode price range (€=1, €€=2, €€€=3)
price_map = {'€': 1, '€€': 2, '€€€': 3}
df['price_encoded'] = df['price_range'].map(price_map)
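
The pattern `(\d+)` used for `review_count_raw` captures only the first run of digits, so a count written with a thousands separator (for example `1,234`) would be read as `1`. The notebook does not show whether Yelp.ie renders counts that way in this scrape, so the following is only a defensive sketch. `parse_review_count` is a hypothetical helper and the sample strings are invented.

```python
import math
import re

def parse_review_count(raw):
    """Return the review count found in `raw` as a float, or NaN if none is found."""
    if not isinstance(raw, str):
        return math.nan
    # Match digits optionally grouped by commas, e.g. "81" or "1,234".
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return math.nan
    return float(match.group(0).replace(",", ""))

# Invented sample strings, not taken from the scraped data.
samples = ["(81 reviews)", "1,234 reviews", None, "no reviews yet"]
parsed = [parse_review_count(s) for s in samples]
print(parsed)
```

If this behaviour is wanted, the helper could be applied with `df['review_count_raw'].apply(parse_review_count)` in place of the `str.extract` chain.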


In [12]:
bakery_keywords = {
    "bakery", "bakeries", "bread", "pastry", "pastries", "cake", "cakes",
    "coffee", "tea", "cafe", "café", "donut", "donuts", "patisserie",
    "dessert", "sweets", "cupcake", "brownie", "sandwich", "croissant"
}

def extract_categories_from_snippet(text):
    if not isinstance(text, str):
        return None
    text = text.lower()
    matches = [word for word in bakery_keywords if word in text]
    if matches:
        return ", ".join(sorted(set(matches)))
    else:
        return "bakery"   # fallback category

# Create new category column
df['categories_fixed'] = df['snippet'].apply(extract_categories_from_snippet)

# Category count
df['category_count'] = df['categories_fixed'].apply(
    lambda x: len(x.split(", ")) if isinstance(x, str) else 0
)
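
The substring test `word in text` counts singular and plural forms separately (the preview shows `donut, donuts` with a `category_count` of 2) and can also match inside longer words, such as `tea` inside `steak`. A possible alternative, sketched below, matches whole words and groups singular and plural forms under one label. The grouping, the reduced keyword set, and the helper name `match_keywords` are assumptions for illustration. Adopting it would change the `categories_fixed` and `category_count` values shown in this notebook.

```python
import re

# Each canonical label maps to the surface forms that should count as one match.
# This grouping is an illustrative assumption, not the notebook's keyword set.
KEYWORD_FORMS = {
    "bakery": ["bakery", "bakeries"],
    "bread": ["bread", "breads"],
    "cake": ["cake", "cakes"],
    "donut": ["donut", "donuts"],
    "pastry": ["pastry", "pastries"],
    "tea": ["tea", "teas"],
}

# Precompile one whole-word pattern per label.
PATTERNS = {
    label: re.compile(r"\b(?:" + "|".join(map(re.escape, forms)) + r")\b")
    for label, forms in KEYWORD_FORMS.items()
}

def match_keywords(text):
    """Return the sorted canonical labels that occur as whole words in `text`."""
    if not isinstance(text, str):
        return []
    lowered = text.lower()
    return sorted(label for label, pat in PATTERNS.items() if pat.search(lowered))

print(match_keywords("They do a selection of vegan donuts, and great steak"))
print(match_keywords("Yummy cakes like éclairs and breads freshly baked"))
```

With whole-word matching, `steak` no longer triggers `tea`, and `donuts` contributes a single `donut` label.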

In [13]:
# Drop rows missing essential modelling variables
df = df.dropna(subset=['rating_raw', 'review_count_raw', 'price_encoded'])

# One-hot encode regions
df = pd.get_dummies(df, columns=['region'], drop_first=True)
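
Two effects of this cell are easy to miss: `dropna` can remove a large share of rows, and `drop_first=True` silently turns one region into the baseline that the remaining dummies are compared against. The sketch below shows both on a toy frame. All values and region names in it are invented, so it says nothing about which region is the baseline in the real data.

```python
import pandas as pd

# Toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "region": ["Cork", "Dublin", "Dublin", "Galway", "Cork"],
    "rating_raw": [4.5, 4.0, None, 3.5, 4.2],
    "review_count_raw": [10.0, 81.0, 5.0, None, 3.0],
    "price_encoded": [1.0, 2.0, 1.0, 2.0, None],
})

# Keep only rows where every modelling variable is present.
required = ["rating_raw", "review_count_raw", "price_encoded"]
complete = toy.dropna(subset=required)
print(f"kept {len(complete)} of {len(toy)} rows")

# With drop_first=True the first level in sorted order (among the rows
# that remain) gets no dummy column and acts as the baseline.
baseline = sorted(complete["region"].unique())[0]
encoded = pd.get_dummies(complete, columns=["region"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("region_")]
print("baseline region:", baseline)
print("dummy columns:", dummy_cols)
```

Recording the baseline before encoding makes later coefficient interpretation easier, since each `region_*` coefficient is relative to that omitted region.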


In [14]:
# -------------------------------------------
# Preview ALL columns after cleaning
# -------------------------------------------

pd.set_option('display.max_columns', None)   # show all columns
pd.set_option('display.max_colwidth', None)  # show full text (optional)

df.head(15)  # show first 15 rows for better inspection


Unnamed: 0,source,name,rating_raw,review_count_raw,location,price_range,categories,snippet,price_encoded,categories_fixed,category_count,region_Dublin,region_Galway,region_Kerry,region_Limerick,region_Louth
0,Yelp,Bread 41,4.7,81.0,South Inner City,€€,"“Wow. If you want some butter ladened richfreshly bakedpastries, this is your place!”more, Cafes, Bakeries","“Wow. If you want some butter ladened rich freshly baked pastries, this is your place!” more",2.0,pastries,1,True,False,False,False,False
1,Yelp,The Bakery,4.2,26.0,Temple Bar,€,"“It is an entirely different species of food altogether. The bread isfreshly bakedevery morning and...”more, Bakeries, Coffee & Tea Shops",“It is an entirely different species of food altogether. The bread is freshly baked every morning and...” more,1.0,bread,1,True,False,False,False,False
2,Yelp,The Bakehouse,4.3,285.0,North Inner City,€,"“Thefreshly bakedbread was amazing, black pudding was delicious and well everything was just...”more, Bakeries, Coffee & Tea Shops, Breakfast & Brunch","“The freshly baked bread was amazing, black pudding was delicious and well everything was just...” more",1.0,bread,1,True,False,False,False,False
3,Yelp,The Rolling Donut,4.3,78.0,North Inner City,€,"“I can think of bettersweet treats. Having said that, they do a selection of vegan donuts, and...”more, Donuts, Bakeries","“I can think of better sweet treats . Having said that, they do a selection of vegan donuts, and...” more",1.0,"donut, donuts",2,True,False,False,False,False
4,Yelp,Queen of Tarts,4.4,543.0,South Inner City,€€,"“Best place for anyone who likes a bit of cake or chocolate!Delicious pastries, great coffee and...”more, Desserts, Tea Rooms, Breakfast & Brunch","“Best place for anyone who likes a bit of cake or chocolate! Delicious pastries , great coffee and...” more",2.0,"cake, coffee, pastries",3,True,False,False,False,False
5,Yelp,Krüst Bakery,3.9,47.0,South Inner City,€,"“Great coffee and absolutelydelicious pastries, if wholeheartedly recommend a cronut.”more, Coffee & Tea Shops, Bakeries, Sandwiches","“Great coffee and absolutely delicious pastries , if wholeheartedly recommend a cronut.” more",1.0,"coffee, pastries",2,True,False,False,False,False
6,Yelp,The Bretzel Bakery,4.4,37.0,Harcourt,€€,"“I love coming here at the weekend to buy fresh bread rolls andsweet treats.”more, Bakeries, Breakfast & Brunch",“I love coming here at the weekend to buy fresh bread rolls and sweet treats .” more,2.0,bread,1,True,False,False,False,False
7,Yelp,Hansel and Gretel Bakery & Patisserie,4.0,20.0,South Inner City,€€,"“Quaint little shop! No seating, butdelicious pastries& breads! Very nice, clean, tiny shop!”more, Bakeries, Cake Shop & Patisserie Shops","“Quaint little shop! No seating, but delicious pastries & breads! Very nice, clean, tiny shop!” more",2.0,"bread, pastries",2,True,False,False,False,False
8,Yelp,Queen of Tarts,4.4,287.0,Temple Bar,€€,"“Amazing breakfast,delicious pastries, and great staff! I had the hearty breakfast and added beans.”more, Desserts, Coffee & Tea Shops","“Amazing breakfast, delicious pastries , and great staff! I had the hearty breakfast and added beans.” more",2.0,pastries,1,True,False,False,False,False
10,Yelp,Ann's Bakery,3.3,34.0,North Inner City,€,"“The bakery at the front is still quite similar - yummy cakes like éclairs and breadsfreshly baked.”more, Bakeries, Irish",“The bakery at the front is still quite similar - yummy cakes like éclairs and breads freshly baked .” more,1.0,"bakery, bread, cake, cakes",4,True,False,False,False,False


In [15]:
df.head()
df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
Index: 260 entries, 0 to 833
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   source            260 non-null    object 
 1   name              260 non-null    object 
 2   rating_raw        260 non-null    float64
 3   review_count_raw  260 non-null    float64
 4   location          260 non-null    object 
 5   price_range       260 non-null    object 
 6   categories        260 non-null    object 
 7   snippet           260 non-null    object 
 8   price_encoded     260 non-null    float64
 9   categories_fixed  260 non-null    object 
 10  category_count    260 non-null    int64  
 11  region_Dublin     260 non-null    bool   
 12  region_Galway     260 non-null    bool   
 13  region_Kerry      260 non-null    bool   
 14  region_Limerick   260 non-null    bool   
 15  region_Louth      260 non-null    bool   
dtypes: bool(5), float64(3), int64(1), object(7)
memor

       rating_raw  review_count_raw  price_encoded  category_count
count  260.000000        260.000000     260.000000      260.000000
mean     4.058462         30.642308       1.661538        1.346154
std      0.627157         67.971597       0.596694        0.683072
min      1.400000          1.000000       1.000000        1.000000
25%      3.700000          4.000000       1.000000        1.000000
50%      4.100000          8.000000       2.000000        1.000000
75%      4.500000         26.000000       2.000000        1.250000
max      5.000000        543.000000       3.000000        4.000000


In [16]:
# import pandas as pd

# # -----------------------------------------------------------
# # 1. Load RAW dataset
# # -----------------------------------------------------------
# df_raw = pd.read_csv("../data/dataProject.csv")

# print("=== RAW DATA PREVIEW ===")
# display(df_raw.head())
# print("\n=== RAW DATA INFO ===")
# print(df_raw.info())
# print("\n=== RAW MISSING VALUES ===")
# print(df_raw.isna().sum())

# # -----------------------------------------------------------
# # 2. Create a copy for cleaning
# # -----------------------------------------------------------
# df_clean = df_raw.copy()

# # -----------------------------------------------------------
# # 3. Clean rating_raw (convert to float)
# # -----------------------------------------------------------
# df_clean['rating_raw'] = pd.to_numeric(df_clean['rating_raw'], errors='coerce')

# # -----------------------------------------------------------
# # 4. Clean review_count_raw (extract digits only)
# # -----------------------------------------------------------
# df_clean['review_count_raw'] = (
#     df_clean['review_count_raw']
#     .astype(str)
#     .str.extract(r'(\d+)')[0]
#     .astype(float)
# )

# # -----------------------------------------------------------
# # 5. Encode price_range (€, €€, €€€ → 1,2,3)
# # -----------------------------------------------------------
# price_map = {'€': 1, '€€': 2, '€€€': 3}
# df_clean['price_encoded'] = df_clean['price_range'].map(price_map)

# # -----------------------------------------------------------
# # 6. Process categories
# # -----------------------------------------------------------
# # Split into list
# df_clean['categories_list'] = df_clean['categories'].astype(str).str.split(', ')

# # Count categories
# df_clean['category_count'] = df_clean['categories_list'].apply(
#     lambda x: len(x) if isinstance(x, list) else 0
# )

# # -----------------------------------------------------------
# # 7. Drop rows missing modelling-critical values
# # -----------------------------------------------------------
# df_model = df_clean.dropna(subset=['rating_raw', 'review_count_raw', 'price_encoded'])

# # -----------------------------------------------------------
# # 8. One-hot encode region column
# # -----------------------------------------------------------
# df_model = pd.get_dummies(df_model, columns=['region'], drop_first=True)

# # -----------------------------------------------------------
# # 9. FINAL OUTPUTS
# # -----------------------------------------------------------
# print("\n=== CLEANED DATA (for EDA) ===")
# display(df_clean.head())
# print(df_clean.info())

# print("\n=== MODELLING DATA (complete cases only) ===")
# display(df_model.head())
# print(df_model.info())

# print("\n=== MODELLING DATA SUMMARY ===")
# display(df_model.describe())


### **Data Quality Summary (Yelp)**

The final dataset contains only Yelp.ie listings, so all columns share a consistent structure. Missing values occur because some Yelp fields are optional for businesses, not because of merging multiple sources.

---

### **Variables and Data Availability**

**High-completeness fields (no missing values in this scrape)**

* `source`
* `name`
* `region`
* `location`
* `categories`

**Moderate-completeness fields (sometimes missing)**

* `rating_raw` (435 of 1,492 missing; empty when a business has no reviews yet)
* `review_count_raw` (510 missing)

**Lower-completeness fields (optional on Yelp)**

* `price_range` (947 missing)
* `snippet` (514 missing)
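
The completeness of each field can be expressed as a share of rows using the `df.isna().sum()` and `df.info()` output from Section 3. The short sketch below does only that arithmetic; the counts are copied from this notebook, and nothing is read from disk.

```python
# Counts copied from the notebook output in Section 3.
TOTAL_ROWS = 1492          # rows in the raw file
COMPLETE_CASE_ROWS = 260   # rows left after dropna on the modelling variables
missing = {
    "source": 0, "region": 0, "name": 0, "location": 0, "categories": 0,
    "rating_raw": 435, "review_count_raw": 510, "snippet": 514,
    "price_range": 947,
}

# Percentage of rows missing each field, rounded to one decimal place.
shares = {col: round(100 * n / TOTAL_ROWS, 1) for col, n in missing.items()}
for col, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<18}{pct:>5.1f}% missing")

# Share of the raw listings that survive complete-case filtering.
retained = round(100 * COMPLETE_CASE_ROWS / TOTAL_ROWS, 1)
print(f"complete cases: {retained}% of rows")
```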

---

### **Why Missing Values Occur**

* Yelp does not require all businesses to list a price range.
* New businesses may have no reviews, so `rating_raw` and `review_count_raw` can be empty.
* Some listings have no preview snippet in search results.
* Category tags vary greatly across businesses, although `categories` itself has no missing values in this scrape.

Missing values therefore appear to reflect incomplete business profiles on Yelp rather than scraping issues or source inconsistencies.

---

### **Interpretation**

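* Missingness is natural and expected for real-world web data.
* Since all rows come from one platform, a missing field means the same thing in every region and price level, which keeps comparisons consistent.
* Complete-case filtering is costly, though: only 260 of the 1,492 scraped listings have a rating, a review count, and a price range, so the modelling sample under-represents new or sparsely profiled businesses.
* After type conversion and basic NA handling, the dataset is suitable for EDA and regression modelling on that subset.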
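
One way to check whether missingness is comparable across regions is to compute the share of missing values per region before any rows are dropped. The sketch below illustrates the idea on a toy frame. The rows are invented, and on the real data the same expression would be applied to the raw `df` while it still has its `region` column.

```python
import pandas as pd

# Toy rows; regions and values are invented for illustration only.
toy = pd.DataFrame({
    "region": ["Dublin", "Dublin", "Dublin", "Kerry", "Kerry", "Kerry"],
    "rating_raw": [4.5, 4.0, None, None, None, 3.9],
    "price_range": ["€€", "€", None, None, None, None],
})

# Share of missing values per region for each field of interest.
fields = ["rating_raw", "price_range"]
missing_by_region = toy[fields].isna().groupby(toy["region"]).mean()
print(missing_by_region.round(2))
```

Large differences between rows of this table would indicate that complete-case filtering removes some regions more heavily than others.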


---
# 4. Exploratory Data Analysis

In [17]:
# #Visualisation

# import matplotlib.pyplot as plt

# # Counting Bakeries by Region (YELP data only)
# df[df['source']=="Yelp"]['region'].value_counts().plot(kind='bar')
# plt.title("Number of Bakeries per Region (Yelp)")
# plt.xlabel("Region")
# plt.ylabel("Count")
# plt.show()

# # Ratings Distribution (also YELP data only)
# df['rating_raw'] = pd.to_numeric(df['rating_raw'], errors='coerce')

# df[df['source']=="Yelp"]['rating_raw'].plot(kind="hist", bins=10)
# plt.title("Distribution of Bakery Ratings")
# plt.xlabel("Rating")
# plt.ylabel("Frequency")
# plt.show()

---
# 5. Feature Engineering

---
# 6. Predictive Modelling

---
# 7. Findings and Conclusions

---

# Work Split per Member
Sofia Fedane
- Improved and documented the Data Mining Summary
- Performed all Data Cleaning tasks:
  - variable typing
  - variable purpose assignment
  - missing value handling
  - conversions (rating, reviews, price range, etc.)
  - outlier treatment
- Completed all Univariate Analysis (numerical + categorical)
- Completed 3 Bivariate Analysis questions
- Performed all baseline regression modelling, including:
  - feature engineering
  - one-hot encoding
  - train/test split
  - Linear Regression model
  - coefficient interpretation
- Wrote the Findings & Conclusions section


Iker Arza
- Wrote the Business Understanding section
- Completed the remaining 3 Bivariate Analysis questions
- Performed the full Multivariate Analysis:
  - correlation matrix
  - region × price_range heatmap
  - top 10% high-rating analysis
- Implemented the advanced regression models:
  - Random Forest Regressor
  - Gradient Boosting Regressor
- Produced the model comparison table
- Selected and justified the final recommended model
- Wrote documentation for advanced modelling and interpretations


Shared Responsibilities
- Wrote the Modelling Introduction:
  - justified regression choice
  - defined the response variable
  - listed predictor variables
  - stated modelling limitations