# 0. Problem Setup  

## a. Problem Statement  
The objective of this project is to **predict house prices in Hanoi** based on a variety of features such as:  
- Address  
- Number of bedrooms  
- House orientation  
- Price per square meter  
- …  

This approach differs from the simple calculation *Price = Unit Price × Area*,  
because property prices also depend on other factors such as:  
- Location (proximity to the city center)  
- Property type (apartment, landed house, etc.)  

For instance, when querying the model with “the price of an apartment in Hoan Kiem District,” the model will estimate the price for that type of property with the given features in that specific area.  
In other words, the model **does not return a list of available houses** but instead provides a **predicted price value**.  



## b. Features of Interest  
The main features considered in this study are:  
- **Property type**  
- **Area (m²)**  
- **Number of bedrooms**  
- **Number of bathrooms**  
- **Number of floors**  
- **District**  
- **Project name** (optional; included to test whether it improves model performance)  
- **Longitude and latitude** (to determine distance from the city center)  
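
As a minimal sketch of how coordinates can be turned into a distance-to-center feature, the block below uses the haversine (great-circle) formula with only the standard library. This is an illustrative approximation, not the notebook's method: the cleaning code later relies on `geopy`'s `geodesic`, which models the Earth as an ellipsoid, so the two results differ slightly. The names `haversine_km` and `distance_to_center_km` are hypothetical helpers; the center coordinates are the ones used in the cleaning step.

```python
import math

HANOI_CENTER = (21.0285, 105.8542)  # reference point used in the cleaning step
EARTH_RADIUS_KM = 6371.0088         # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_center_km(lat, lon):
    """Distance from a listing to the assumed city center, in kilometres."""
    return haversine_km(lat, lon, *HANOI_CENTER)


# Coordinates of the first listing in the raw preview
print(round(distance_to_center_km(21.095906, 105.711798), 1))
```

A spherical formula is usually close enough for city-scale distances; the ellipsoidal `geodesic` is the more precise choice when `geopy` is available.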



## c. Project Workflow  
1. **Dataset**  
   - Data collection process  
   - Raw dataset  
   - Cleaned dataset  
   - Final dataset prepared for modeling  

2. **Exploratory Data Analysis (EDA)**  
   - Examine correlations between features  
   - Extract key insights  

3. **Modeling**  
   - Build and train regression models

4. **Demo**  
   - Perform sample predictions with the trained models  

5. **Evaluation**  
   - Compare model performance metrics  

6. **Web Application**  
   - Deploy results through a **Flask-based web app** for interactive demonstration  


# 1. Dataset Information  

The dataset was scraped from **batdongsan.com.vn** and contains Hanoi real estate listings available at the time of scraping.  

I first attempted the scraping in Colab, then moved to a local Selenium setup in **VS Code**. However, the website's **Cloudflare protection** and complex page structure made this approach difficult to maintain.  

As a solution, I switched to using the **Apify cloud platform**, which already provides a pre-built tool for scraping this website.  

The scraping run is available here:  
[Apify Run Results](https://console.apify.com/view/runs/UC46Zkd7fm5gYef49)  

The **raw dataset** obtained from this process is stored in the corresponding folder.  

# 2. Data Cleaning and Preprocessing  

Now I will clean the dataset and prepare the data for **EDA** and subsequent **model training**.  

In [116]:
import pandas as pd
import re
import numpy as np
from geopy.distance import geodesic

In [None]:
# Data inspection: per-column null count, fill rate, and dtype
df = pd.read_csv('data/dataset_raw.csv')
column_summary = pd.DataFrame({
    'null_count': df.isna().sum(),  # isna().sum() counts missing values, so 0 means fully filled
    'percent_filled': df.notnull().sum() / len(df) * 100,
    'dtype': df.dtypes
}).sort_values(by='percent_filled', ascending=False)

pd.set_option('display.max_rows', None)
column_summary.head(50)

Unnamed: 0,null_count,percent_filled,dtype
url,0,100.0,object
expiredDate,0,100.0,object
postedDate,0,100.0,object
price,0,100.0,object
imageUrls/4,0,100.0,object
imageUrls/3,0,100.0,object
imageUrls/2,0,100.0,object
address,0,100.0,object
imageUrls/0,0,100.0,object
priceMil,0,100.0,float64


### Observation:
Most irrelevant fields are either **images** or **HTML-related metadata**.  
Therefore, I will only keep columns where more than **50% of the values are filled**, and selectively drop or keep columns based on their usefulness.  
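
The 50% rule can be sketched as a small filter. This is an illustration under assumptions, not the notebook's own code (the columns are ultimately picked by hand): `keep_mostly_filled` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd


def keep_mostly_filled(frame: pd.DataFrame, min_fill: float = 0.5) -> pd.DataFrame:
    """Return only the columns whose share of non-null values exceeds min_fill."""
    fill_ratio = frame.notna().mean()  # fraction of non-null values per column
    return frame.loc[:, fill_ratio > min_fill]


# Toy frame with hypothetical values: 100%, 25%, and 75% filled columns
toy = pd.DataFrame({
    "priceVnd": [1.0, 2.0, 3.0, 4.0],
    "direction": [None, None, None, "Đông"],
    "areaM2": [50.0, None, 70.0, 80.0],
})

kept = keep_mostly_filled(toy)
print(list(kept.columns))
```

The threshold only narrows the candidates; each remaining column still needs a keep-or-drop decision based on its usefulness.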

### Column decisions:  

1. **postedDate** → Drop (not using time series)  
2. **expiredDate** → Drop (irrelevant)  
3. **price** → Drop (text format, replaced by `priceVnd`)  
4. **address** → Keep  
5. **priceMil** → Drop  
6. **priceVnd** → Keep (this will be the **target variable**)  
7. **title** → Drop  
8. **type** → Drop  
9. **area** → Drop (replaced by `areaM2`)  
10. **long** → Keep  
11. **lat** → Keep  
12. **pricePerM2** → Drop  
13. **areaM2** → Keep  
14. **legal** → Keep  
15. **priceExt** → Drop  
16. **priceBil** → Drop  
17. **bedroom** → Drop  
18. **bedroomCount** → Keep  
19. **bathroomCount** → Keep  
20. **furniture** → Keep  
21. **direction** → Drop  
22. **frontage** → Drop  
23. **balconyDirection** → Drop  
24. **road** → Drop  
25. **floorCount** → Keep  
26. **url** → Keep (to extract property type if needed)  

In [None]:
# Create a new dataframe with only the needed features.
# 'Số phòng tắm, vệ sinh' ("number of bathrooms/toilets") is the raw bathroom-count column.
df_new = df[['url', 'priceVnd', 'areaM2', 'bedroomCount', 'Số phòng tắm, vệ sinh',
    'floorCount', 'legal', 'address', 'furniture', 'lat', 'long']].copy()

# Rename features to shorter, consistent names
df_new.rename(columns={
    'priceVnd': 'price',
    'areaM2': 'area',
    'bedroomCount': 'bedrooms',
    'Số phòng tắm, vệ sinh': 'bathrooms',
    'floorCount': 'floors',
    'legal': 'legal_status',
    'lat': 'latitude',
    'long': 'longitude'
}, inplace=True)

In [120]:
df_new.head(2)

Unnamed: 0,url,price,area,bedrooms,bathrooms,floors,legal_status,address,furniture,latitude,longitude
0,https://batdongsan.com.vn/ban-nha-biet-thu-lie...,14000000000.0,88.0,4 phòng,4 phòng,4 tầng,Sổ đỏ/ Sổ hồng,"Vinhomes Wonder City, Tân Hội, Đan Phượng, Hà Nội",,21.095906,105.711798
1,https://batdongsan.com.vn/ban-nha-mat-pho-phuo...,-1.0,202.0,,,,Sổ đỏ/ Sổ hồng,"Dự án An Phú Shop Villa, Phường Dương Nội, Hà ...",,20.984206,105.750963


In [None]:
# Drop rows with price == -1 (an error value found in the data) and rows with
# a missing price or area (these features are too important to impute)
df_new = df_new[df_new['price'] != -1]
df_new = df_new[df_new['price'].notnull() & df_new['area'].notnull()]

# Extract property type from url
def extract_property_type_from_url(url):
    url = url.lower()
    if 'can-ho' in url or 'chung-cu' in url:
        return 'apartment'
    elif 'nha-mat-pho' in url:
        return 'street_house'
    elif 'nha-rieng' in url:
        return 'private_house'
    elif 'nha-biet-thu' in url or 'biet-thu' in url:
        return 'villa'
    elif 'shophouse' in url:
        return 'shophouse'
    elif re.search(r'\bdat\b', url):
        return 'land'
    else:
        return 'other'
    
# Apply the function to extract property from url
df_new['property_type'] = df_new['url'].apply(extract_property_type_from_url)

# Log transform Price 
df_new['log_price'] = np.log1p(df_new['price'])

# Clean Furniture and rewrite labels
df_new['furniture_clean'] = df_new['furniture'].astype(str).str.lower().str.strip()

df_new['furniture_grouped'] = df_new['furniture_clean'].apply(
    lambda x: 'full' if 'đầy đủ' in x or 'full' in x or 'cao cấp' in x else
              'basic' if 'cơ bản' in x else
              'none' if 'không' in x else 'other'
)

# Clean legal_status and group the labels
df_new['legal_status_clean'] = df_new['legal_status'].astype(str).str.lower().str.strip()

df_new['legal_status_grouped'] = df_new['legal_status_clean'].apply(
    lambda x: 'has_title' if 'sổ' in x else  # 'sổ' also covers 'sổ đỏ' and 'sổ hồng'
              'contract' if 'hợp đồng' in x or 'mua bán' in x else
              'none' if 'không' in x or 'chưa rõ' in x else 'other')

# Cast area to integer (truncates any fractional part)
df_new['area'] = df_new['area'].astype(int)

# Bedrooms: keep the leading number (e.g. '4 phòng' -> 4), cast to float,
# fill missing values with the median, then cast to int
df_new['bedrooms'] = df_new['bedrooms'].str.split(' ').str[0].str.strip()
df_new['bedrooms'] = df_new['bedrooms'].astype(float)
df_new['bedrooms'] = df_new['bedrooms'].fillna(df_new['bedrooms'].median()).astype(int)

# Bathrooms: same steps as bedrooms
df_new['bathrooms'] = df_new['bathrooms'].str.split(' ').str[0].str.strip()
df_new['bathrooms'] = df_new['bathrooms'].astype(float)
df_new['bathrooms'] = df_new['bathrooms'].fillna(df_new['bathrooms'].median()).astype(int)

# Floors: same parsing and median fill, but the result is left as float
df_new['floors'] = df_new['floors'].str.split(' ').str[0].str.strip()
df_new['floors'] = df_new['floors'].astype(float)
df_new['floors'] = df_new['floors'].fillna(df_new['floors'].median())

# Keep only the columns needed for modeling
df_new = df_new[['log_price', 'area', 'bedrooms', 'bathrooms', 'floors', 'latitude', 'longitude', 'property_type', 'furniture_grouped','legal_status_grouped', 'price']]

# Drop any rows that still contain missing values
df_new = df_new.dropna(axis = 0)

# Use geopy's geodesic distance on latitude/longitude to get the distance (km) from the city center
hanoi_center = (21.0285, 105.8542)
def calc_distance_from_center(lat, lon):
    return geodesic((lat, lon), hanoi_center).km

df_new['distance_to_center'] = df_new.apply(lambda row: calc_distance_from_center(row['latitude'], row['longitude']), axis=1)

# Restore the short names for the grouped columns
df_new = df_new.rename(columns = {'furniture_grouped': 'furniture', 'legal_status_grouped': 'legal_status'})
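
The URL-based property typing above depends on the order of its checks: the first matching keyword wins. A minimal sketch of the same idea as an ordered rule table is shown below. `PROPERTY_RULES` and `classify_url` are hypothetical names, and the example URLs are made up for illustration rather than taken from the dataset.

```python
import re

# Ordered (pattern, label) pairs; the first match wins, mirroring the if/elif chain.
PROPERTY_RULES = [
    (r"can-ho|chung-cu", "apartment"),
    (r"nha-mat-pho", "street_house"),
    (r"nha-rieng", "private_house"),
    (r"biet-thu", "villa"),
    (r"shophouse", "shophouse"),
    (r"\bdat\b", "land"),  # word boundaries stop 'dat' matching inside longer words
]


def classify_url(url: str) -> str:
    """Return the label of the first rule whose pattern occurs in the URL."""
    slug = url.lower()
    for pattern, label in PROPERTY_RULES:
        if re.search(pattern, slug):
            return label
    return "other"


# Hypothetical URLs, for illustration only
print(classify_url("https://example.com/ban-can-ho-chung-cu-demo"))
print(classify_url("https://example.com/ban-dat-demo"))
```

Keeping the rules in a list makes the precedence explicit and lets new property types be added without touching the matching logic.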

In [122]:
df_new.head()

Unnamed: 0,log_price,area,bedrooms,bathrooms,floors,latitude,longitude,property_type,furniture,legal_status,price,distance_to_center
0,23.362323,88,4,4,4.0,21.095906,105.711798,villa,other,has_title,14000000000.0,16.574719
2,23.544645,96,3,2,4.0,21.096608,105.711918,villa,other,has_title,16800000000.0,16.598702
3,23.035801,65,1,1,1.0,21.009765,105.743733,shophouse,other,has_title,10100000000.0,11.669648
5,23.520547,96,3,2,5.0,21.096257,105.712943,villa,other,other,16400000000.0,16.486178
7,23.550579,75,6,7,5.0,21.03933,105.872671,private_house,full,has_title,16900000000.0,2.263692
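
Because the target is stored as `log_price = log1p(price)`, any prediction made on the log scale has to be mapped back with the inverse, `expm1`, before it can be read as a price in VND. The sketch below uses the standard library's `math` functions for a single value (with NumPy, `np.expm1` is the vectorized counterpart of `np.log1p`). The helper names are hypothetical.

```python
import math


def to_log_price(price_vnd: float) -> float:
    """Forward transform used for the target: log(1 + price)."""
    return math.log1p(price_vnd)


def to_price_vnd(log_price: float) -> float:
    """Inverse transform: exp(log_price) - 1, back to VND."""
    return math.expm1(log_price)


price = 14_000_000_000.0           # price of the first row in the cleaned preview
log_price = to_log_price(price)    # matches the log_price column for that row
recovered = to_price_vnd(log_price)

print(round(log_price, 6))
```

The round trip recovers the original price up to floating-point error, which is a quick sanity check that the forward and inverse transforms are paired correctly.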


In [123]:
df_new.to_csv("dataset_cleaned.csv", index=False, encoding="utf-8-sig")