---
# 1. Business Understanding

### Business Insights (COME BACK TO THIS)
- Dublin has the highest number of bakeries listed on Yelp, indicating strong competition and demand.
- Rating distribution suggests that most bakeries in Ireland receive positive reviews.
---

# 2. Data Mining Summary

### **Read Further Analysis in *DataMining.ipynb***

The dataset used in this project was created entirely through a web-scraping process implemented in the `DataMining.ipynb` notebook. The data comes exclusively from Yelp.ie, where Selenium and BeautifulSoup were used to:

* automate browser navigation,
* paginate through multiple search result pages across Irish regions,
* scroll dynamically loaded content, and
* extract structured information such as business names, ratings, reviews, price ranges, categories, locations, and review snippets.

The `DataMining.ipynb` notebook documents the full scraping workflow, including technical challenges (dynamic HTML, pagination limits, missing optional fields), and the rationale behind the chosen approach.
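
The pagination step described above can be sketched as follows. This is a minimal illustration, not the code from `DataMining.ipynb`: the base URL, the query parameter names (`find_desc`, `find_loc`, `start`), the page size, and the region list are all assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumptions (not taken from DataMining.ipynb): the base URL, the query
# parameter names, the page size and the region list are illustrative only.
BASE_URL = "https://www.yelp.ie/search"
RESULTS_PER_PAGE = 10


def build_search_urls(regions, max_pages, term="bakeries"):
    """Return one search URL per (region, page) pair."""
    urls = []
    for region in regions:
        for page in range(max_pages):
            query = urlencode({
                "find_desc": term,
                "find_loc": f"{region}, Ireland",
                # Offset of the first result shown on this page.
                "start": page * RESULTS_PER_PAGE,
            })
            urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_search_urls(["Dublin", "Cork"], max_pages=3)
for url in urls:
    print(url)
```

In a Selenium-based scraper, each URL in such a list would typically be opened in the automated browser, scrolled, and then handed to BeautifulSoup for parsing.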

The dataset contains approximately 1,519 bakery listings (the exact count depends on the scrape limits used) and is saved as `dataProject.csv` for all subsequent cleaning, EDA, feature engineering, and modelling.


---
# 3. Data Cleaning


In [2]:
import pandas as pd

df = pd.read_csv("../data/dataProject.csv")
df.head()

df.info()
df.describe(include='all')
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1091 entries, 0 to 1090
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   source              1091 non-null   object 
 1   category_search     1071 non-null   object 
 2   name                1091 non-null   object 
 3   address             1071 non-null   object 
 4   phone               1066 non-null   object 
 5   category_from_page  1071 non-null   object 
 6   summary             397 non-null    object 
 7   region              20 non-null     object 
 8   rating_raw          20 non-null     float64
 9   review_count_raw    20 non-null     object 
 10  location            20 non-null     object 
 11  price_range         18 non-null     object 
 12  categories          20 non-null     object 
 13  snippet             20 non-null     object 
dtypes: float64(1), object(13)
memory usage: 119.5+ KB


source                   0
category_search         20
name                     0
address                 20
phone                   25
category_from_page      20
summary                694
region                1071
rating_raw            1071
review_count_raw      1071
location              1071
price_range           1073
categories            1071
snippet               1071
dtype: int64

### **Data Quality Summary (Yelp)**

The final dataset contains only Yelp.ie listings, so all columns share a consistent structure. Missing values occur because some Yelp fields are optional for businesses, not because of merging multiple sources.

---

### **Variables and Data Availability**

**High-completeness fields (rarely missing)**

* `name`
* `region`
* `location`

**Moderate-completeness fields (sometimes missing)**

* `rating_raw` (missing when a business has no reviews yet)
* `review_count_raw`
* `categories`

**Lower-completeness fields (optional on Yelp)**

* `price_range`
* `snippet`
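
One way to check such a grouping against the loaded file is to compute the share of non-missing values per column and bucket it into tiers. The sketch below uses a made-up toy frame, and the 90% and 50% cut-offs are hypothetical thresholds chosen for illustration, not values used in this project.

```python
import pandas as pd

# Toy frame standing in for the scraped data; the values are made up.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating_raw": [4.5, None, 4.0, 3.5],
    "price_range": ["€€", None, None, None],
})

# Share of non-missing values per column (1.0 means fully populated).
completeness = toy.notna().mean()


def tier(share):
    """Map a completeness share to a tier using hypothetical cut-offs."""
    if share >= 0.9:
        return "high"
    if share >= 0.5:
        return "moderate"
    return "low"


tiers = completeness.map(tier)
print(pd.DataFrame({"completeness": completeness, "tier": tiers}))
```

Running the same two steps on `df` in place of `toy` would show which tier each real column falls into.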

---

### **Why Missing Values Occur**

* Yelp does not require all businesses to list a price range.
* New businesses may have no reviews, so `rating_raw` and `review_count_raw` can be empty.
* Some listings have no preview snippet in search results.
* Category tags vary greatly across businesses.

No columns have missing values due to scraping issues or source inconsistencies; missingness stems only from incomplete business profiles on Yelp.
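
If ratings and review counts are missing for the same reason (a listing with no reviews yet), the two fields should be missing together. A minimal sketch of that check, using a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy frame standing in for the scraped data; the values are made up.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating_raw": [4.5, None, 4.0, None],
    "review_count_raw": ["12", None, "3", None],
    "price_range": ["€€", None, None, "€"],
})

# Cross-tabulate the missingness of the two review fields.
both = pd.crosstab(
    toy["rating_raw"].isna(),
    toy["review_count_raw"].isna(),
    rownames=["rating missing"],
    colnames=["review count missing"],
)
print(both)

# True when the two fields are always missing (or present) together.
always_together = (
    toy["rating_raw"].isna() == toy["review_count_raw"].isna()
).all()
print("Missing together in every row:", always_together)
```

Rows that break this pattern in the real data would be worth inspecting, since they would point to a cause other than an unreviewed listing.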

---

### **Interpretation**

* Missingness is natural and expected for real-world web data.
* Since all rows come from one platform, the missingness is not structural and does not distort comparisons across regions or price levels.
* After type conversion and basic NA handling, the dataset is suitable for reliable EDA and regression modelling.
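
The type conversion mentioned above could look roughly like the sketch below. The raw formats shown (review counts stored as text such as `"(1,204 reviews)"`, price bands stored as runs of `€`) are assumptions for illustration, and the output column names `rating`, `review_count`, and `price_level` are hypothetical.

```python
import pandas as pd

# Toy frame with assumed raw formats; the values are made up.
toy = pd.DataFrame({
    "rating_raw": [4.5, None, 3.0],
    "review_count_raw": ["12", "(1,204 reviews)", None],
    "price_range": ["€€", None, "€€€"],
})

clean = pd.DataFrame({
    # Ratings are already numeric here; coerce guards against stray text.
    "rating": pd.to_numeric(toy["rating_raw"], errors="coerce"),
    # Keep only the digits, then convert; missing text becomes NaN.
    "review_count": pd.to_numeric(
        toy["review_count_raw"].str.replace(r"\D", "", regex=True),
        errors="coerce",
    ),
    # Encode the price band as the number of currency symbols.
    "price_level": toy["price_range"].str.len(),
})
print(clean)
```

Missing values are left as NaN here, so the choice between dropping and imputing them stays a separate, explicit step.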


---
# 4. Exploratory Data Analysis

In [4]:
# #Visualisation

# import matplotlib.pyplot as plt

# # Counting Bakeries by Region (YELP data only)
# df[df['source']=="Yelp"]['region'].value_counts().plot(kind='bar')
# plt.title("Number of Bakeries per Region (Yelp)")
# plt.xlabel("Region")
# plt.ylabel("Count")
# plt.show()

# # Ratings Distribution (also YELP data only)
# df['rating_raw'] = pd.to_numeric(df['rating_raw'], errors='coerce')

# df[df['source']=="Yelp"]['rating_raw'].plot(kind="hist", bins=10)
# plt.title("Distribution of Bakery Ratings")
# plt.xlabel("Rating")
# plt.ylabel("Frequency")
# plt.show()

---
# 5. Feature Engineering

---
# 6. Predictive Modelling

---
# 7. Findings and Conclusions

---

# Work Split per Member
Sofia Fedane
- Improved and documented the Data Mining Summary
- Performed all Data Cleaning tasks:
  - variable typing
  - variable purpose assignment
  - missing value handling
  - conversions (rating, reviews, price range, etc.)
  - outlier treatment
- Completed all Univariate Analysis (numerical + categorical)
- Completed 3 Bivariate Analysis questions
- Performed all baseline regression modelling, including:
  - feature engineering
  - one-hot encoding
  - train/test split
  - Linear Regression model
  - coefficient interpretation
- Wrote the Findings & Conclusions section


Iker Arza
- Wrote the Business Understanding section
- Completed the remaining 3 Bivariate Analysis questions
- Performed the full Multivariate Analysis:
  - correlation matrix
  - region × price_range heatmap
  - top 10% high-rating analysis
- Implemented the advanced regression models:
  - Random Forest Regressor
  - Gradient Boosting Regressor
- Produced the model comparison table
- Selected and justified the final recommended model
- Wrote documentation for advanced modelling and interpretations

Shared Responsibilities
- Wrote the Modelling Introduction:
  - justified regression choice
  - defined the response variable
  - listed predictor variables
  - stated modelling limitations