# Real Estate Data Analysis – ImmoEliza

## Project Overview

The goal of this challenge is to support the real estate company *ImmoEliza* in its ambition to become the leading real estate player in Belgium. To do so, the company needs a strong pricing strategy based on data.

Before building a machine learning model, we will perform a thorough data analysis to:

- Understand the structure and content of the dataset
- Clean and prepare the data
- Extract key insights for business decision-making
- Visualize patterns and trends in the Belgian real estate market

This project is carried out as part of the `challenge-data-analysis` challenge.

## Team Members
- [Evi]
- [Moussa]
- [Yves]

## Notebook Structure
1. Data loading and exploration  
2. Data cleaning  
3. Exploratory data analysis (EDA)  
4. Guided analysis and visual questions  
5. Interpretation and business insights  
6. Optional bonus visualizations  
7. Export and documentation  

# 1. Data loading and exploration  
- 1.1. Import Required Libraries
- 1.2. Load the Dataset
- 1.3. First Glance at the Data
  - Dataset shape
  - Column names
  - Data types
  - First rows (`.head()`)

## 1.1. Import Required Libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 1.2. Load the Dataset


In [2]:
csv_path = "../data/zimmo_real_estate_jgchoti.csv"
df = pd.read_csv(csv_path)

## 1.3. First Glance at the Data
  - Dataset shape
  - Column names
  - Data types
  - First rows (`.head()`)

In [3]:
# Display dataset shape: number of rows and columns
print("Dataset shape:", df.shape)

# Display column names
print("\nColumn names:")
print(df.columns.tolist())

# Display data types
print("\nData types:")
print(df.dtypes)

# Display the first 5 rows
df.head()

Dataset shape: (25403, 18)

Column names:
['zimmo code', 'type', 'price', 'street', 'number', 'postcode', 'city', 'living area(m²)', 'ground area(m²)', 'bedroom', 'bathroom', 'garage', 'garden', 'EPC(kWh/m²)', 'renovation obligation', 'year built', 'mobiscore', 'url']

Data types:
zimmo code                object
type                      object
price                    float64
street                    object
number                    object
postcode                  object
city                      object
living area(m²)          float64
ground area(m²)          float64
bedroom                  float64
bathroom                 float64
garage                   float64
garden                      bool
EPC(kWh/m²)              float64
renovation obligation     object
year built               float64
mobiscore                float64
url                       object
dtype: object


| | zimmo code | type | price | street | number | postcode | city | living area(m²) | ground area(m²) | bedroom | bathroom | garage | garden | EPC(kWh/m²) | renovation obligation | year built | mobiscore | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L97OB | Vakantiewoning (Huis) | 25000.0 | NaN | NaN | 8620 | Nieuwpoort | 35.0 | 128.0 | 2.0 | 1.0 | NaN | False | NaN | False | NaN | 7.0 | https://www.zimmo.be/nl/nieuwpoort-8620/te-koo... |
| 1 | L9SVC | Appartement | 45000.0 | NaN | NaN | 5570 | Beauraing | 62.0 | NaN | 2.0 | 1.0 | NaN | False | NaN | NaN | NaN | NaN | https://www.zimmo.be/nl/beauraing-5570/te-koop... |
| 2 | LA02N | Rijwoning (Huis) | 45000.0 | Oudestraat | 94.0 | 9600 | Ronse | NaN | 232.0 | 2.0 | 1.0 | NaN | False | 716.0 | True | 1850.0 | 7.3 | https://www.zimmo.be/nl/ronse-9600/te-koop/hui... |
| 3 | L4X2D | Vakantiewoning (Huis) | 40000.0 | Molenheidestraat | 7.0 | 3530 | Helchteren | 45.0 | NaN | NaN | NaN | NaN | False | NaN | False | NaN | 5.3 | https://www.zimmo.be/nl/helchteren-3530/te-koo... |
| 4 | L9KJ7 | Eengezinswoning (Huis) | 49900.0 | Route Napoléon | 10.0 | 4400 | Ivoz-Ramet | 123.0 | 8885.0 | 2.0 | 1.0 | NaN | False | 569.0 | NaN | NaN | NaN | https://www.zimmo.be/nl/ivoz-ramet-4400/te-koo... |


# 2. Data cleaning 
- 2.1. Remove Duplicates
- 2.2. Handle Missing Values
- 2.3. Clean Whitespace and Fix Formatting
- 2.4. Save Cleaned Dataset (Optional)

In [None]:
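# Remove duplicate rows
df = df.drop_duplicates()

# Strip whitespace from column names
df.columns = df.columns.str.strip()
# Strip whitespace from all string cells.
# Caveat: astype(str) turns NaN into the literal string "nan", so missing values
# in object columns (street, number, ...) are no longer counted as missing below.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].astype(str).str.strip()

# Replace empty strings with NA
df.replace("", pd.NA, inplace=True)
df.replace(" ", pd.NA, inplace=True)

# Quick diagnostic: count missing values per column
df.isna().sum().sort_values(ascending=False)
# Result: many missing values for garage, ground area and EPC; price has none.
# The zero counts for object columns (street, number, ...) come from the "nan"
# strings created above, not from complete data.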




garage                   16544
ground area(m²)          10627
EPC(kWh/m²)               7936
year built                7642
mobiscore                 5300
bathroom                  4630
bedroom                   3116
living area(m²)           2467
type                         0
zimmo code                   0
city                         0
street                       0
number                       0
postcode                     0
price                        0
garden                       0
renovation obligation        0
url                          0
dtype: int64
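
The zero counts for `street` and `number` above contradict the `NaN` values visible in the first rows of the dataset, because `astype(str)` converts missing values to the string `"nan"`. A minimal sketch of a whitespace cleanup that keeps missing values missing is shown here. The helper name `strip_strings` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd


def strip_strings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with whitespace stripped from string cells only."""
    out = frame.copy()
    out.columns = out.columns.str.strip()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric and boolean columns hold no strings
        # Only real strings are stripped; NaN, None and booleans pass through.
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    # Whitespace-only strings are now "", so mark them as missing.
    return out.replace("", pd.NA)


# Toy data (invented for illustration, not taken from the dataset)
demo = pd.DataFrame({
    " street ": [" Oudestraat ", None, "   "],
    "flag": [True, None, False],
    "price": [45000.0, 25000.0, 40000.0],
})
cleaned = strip_strings(demo)
print(cleaned.isna().sum())
```

Applied to the real dataset, the call would be `df = strip_strings(df)` in place of the `astype(str)` loop; the missing-value counts for object columns would then have to be recomputed.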

# 3. Exploratory data analysis (EDA) 
- 3.1. Variable Types: Quantitative vs Qualitative
- 3.2. Missing Values Overview
- 3.3. Descriptive Statistics (Mean, Median, etc.)
- 3.4. Distribution Visualizations
  - Histograms
  - Boxplots
- 3.5. Correlation Matrix & Heatmap
- 3.6. Outlier Detection
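
Steps 3.3, 3.5 and 3.6 can be sketched with pandas alone. The example below is one possible approach, not the notebook's final analysis: it uses Tukey's 1.5 × IQR rule for outliers, the helper name `iqr_outlier_mask` is hypothetical, and the sample values are invented.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy sample (invented numbers, not taken from the dataset)
sample = pd.DataFrame({
    "price": [200000.0, 250000.0, 300000.0, 350000.0, 5000000.0],
    "living area(m²)": [80.0, 100.0, 120.0, 140.0, 150.0],
})

summary = sample.describe()                    # 3.3: count, mean, std, quartiles
corr = sample.corr(numeric_only=True)          # 3.5: Pearson correlation matrix
outliers = iqr_outlier_mask(sample["price"])   # 3.6: boolean mask of outliers
print(sample[outliers])
```

The `corr` matrix can be passed to `sns.heatmap(corr, annot=True)` to draw the heatmap, and `sample[~outliers]` gives the rows that pass the rule.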

# 4. Guided analysis and visual questions  
- 4.1. Most & Least Expensive Municipalities
  - Belgium, Wallonia, Flanders
  - Avg / Median / Price per m²
- 4.1b. (see 4.1 above for regional breakdown)
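
For 4.1, a region label and a price per m² are needed before municipalities can be ranked. The sketch below derives the region from the postcode using the usual Belgian postcode ranges (1000-1299 Brussels, 1300-1499 and 4000-7999 Wallonia, 1500-3999 and 8000-9999 Flanders). The helper `region_from_postcode`, the derived column names and the prices are assumptions made for illustration.

```python
import pandas as pd


def region_from_postcode(postcode) -> str:
    """Map a Belgian postcode to its region; unparseable values give 'Unknown'."""
    try:
        code = int(float(str(postcode).strip()))
    except (ValueError, OverflowError):
        return "Unknown"
    if 1000 <= code <= 1299:
        return "Brussels"
    if 1300 <= code <= 1499 or 4000 <= code <= 7999:
        return "Wallonia"
    if 1500 <= code <= 3999 or 8000 <= code <= 9999:
        return "Flanders"
    return "Unknown"


# Toy listings (prices and areas invented for illustration)
listings = pd.DataFrame({
    "city": ["Nieuwpoort", "Nieuwpoort", "Beauraing", "Ronse"],
    "postcode": ["8620", "8620", "5570", "9600"],
    "price": [300000.0, 200000.0, 150000.0, 180000.0],
    "living area(m²)": [100.0, 80.0, 100.0, 120.0],
})
listings["region"] = listings["postcode"].map(region_from_postcode)
listings["price_per_m2"] = listings["price"] / listings["living area(m²)"]

# Average price, median price and median price per m² per municipality
by_city = (
    listings.groupby(["region", "city"])
    .agg(
        mean_price=("price", "mean"),
        median_price=("price", "median"),
        median_price_per_m2=("price_per_m2", "median"),
    )
    .sort_values("median_price_per_m2", ascending=False)
)
print(by_city)
```

`by_city.head()` and `by_city.tail()` then give the most and least expensive municipalities; filtering on the `region` index level restricts the ranking to Wallonia or Flanders.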

# 5. Interpretation and Business Insights
- 5.1. Summary of Key Findings
- 5.2. Business Recommendations for ImmoEliza
- 5.3. Data Limitations

# 6. Optional bonus visualizations 
- 6.1. Geo Mapping (price per region/municipality)
- 6.2. Trendlines or Regression Analysis
- 6.3. Clustering or Time Evolution (if available)
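
For 6.2, a simple starting point is a straight-line fit of price against living area. The sketch below uses `numpy.polyfit` with degree 1 on invented sample values; it is only an illustration of a trendline, not a pricing model.

```python
import numpy as np
import pandas as pd

# Toy sample (invented numbers, not taken from the dataset)
sample = pd.DataFrame({
    "living area(m²)": [50.0, 100.0, 150.0, 200.0],
    "price": [150000.0, 250000.0, 350000.0, 450000.0],
})

# Drop rows where either value is missing, then fit price = slope * area + intercept
pairs = sample[["living area(m²)", "price"]].dropna()
slope, intercept = np.polyfit(pairs["living area(m²)"], pairs["price"], deg=1)
print(f"price ≈ {slope:.0f} * area + {intercept:.0f}")
```

With seaborn, `sns.regplot(data=pairs, x="living area(m²)", y="price")` draws the scatter plot together with a fitted line.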

# 7. Export and documentation  
- 7.1. Export Final Clean Dataset
- 7.2. Save Visuals and Aggregated Tables
- 7.3. Final README Content
  - Project description
  - Installation
  - Usage
  - Visual examples
  - Team & timeline
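
Step 7.1 can be as small as one `to_csv` call. The sketch below wraps it in a hypothetical helper, `export_clean`, that also creates the output folder; the file name and folder are placeholders, and a temporary directory stands in for the project's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_clean(frame: pd.DataFrame, out_path) -> Path:
    """Write the frame to CSV without the index, creating parent folders."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Toy frame standing in for the cleaned dataset
demo = pd.DataFrame({"city": ["Ronse"], "price": [45000.0]})

with tempfile.TemporaryDirectory() as tmp:
    saved = export_clean(demo, Path(tmp) / "output" / "cleaned.csv")
    reloaded = pd.read_csv(saved)  # read back to check the round trip

print(reloaded)
```

Writing with `index=False` avoids an extra unnamed index column when the file is loaded again.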