# 02 - Data Cleaning & Exploratory Data Analysis (EDA)
This notebook loads the raw data from SQLite, cleans it, enriches it with construction-period categories, and performs exploratory data analysis.

We proceed in four stages:
1. Load raw data from SQLite
2. Clean and standardize scraped variables
3. Map listings to BFS Bauperiode categories
4. Compute derived variables and perform EDA

# Load Raw Data
From the SQLite database created in Notebook 01, we load two tables:
- BFS construction period categories (`bfs_bauperiode_categories`)
- Raw scraped rental listings (`rental_listings_raw`)

In [1]:
import sqlite3

import pandas as pd

# Open the SQLite database created in Notebook 01
conn = sqlite3.connect("../Data/apartment_database.db")

# Read both tables into DataFrames
df_bauperiode = pd.read_sql("SELECT * FROM bfs_bauperiode_categories", conn)
df_raw = pd.read_sql("SELECT * FROM rental_listings_raw", conn)

conn.close()

# Preview the first five raw listings
df_raw.head()


| | web_scraper_order | id | web_scraper_start_url | link_pages | link_listings | area_m2_raw | area_m2 | rooms_raw | rent_chf_raw | rent_chf | year_built_raw | address | canton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1764272678-4 | 1 | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/4002530569 | 134 m2 | 134 | 5.5 | 2,530.– | 2530 | 1984 | Kesslernmattstr. 14, 8965 Berikon | Zurich |
| 1 | 1764272681-5 | 2 | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/4002542635 | 98 m2 | 98 | 3.5 | 2,255.– | 2255 | 1989 | Im Spitzler 21, 8902 Urdorf | Zurich |
| 2 | 1764272684-6 | 3 | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/4002551340 | 68 m2 | 68 | 3.5 | 1,945.– | 1945 | 1973 | Ferdinand Hodler-Str. 14, 8049 Zürich | Zurich |
| 3 | 1764272687-7 | 4 | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/4002567338 | 134 m2 | 134 | 5.5 | 5,590.– | 5590 | 1906 | Weinbergstrasse 72, 8006 Zürich | Zurich |
| 4 | 1764272699-10 | 5 | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/4002572292 | 94 m2 | 94 | 4.5 | 2,360.– | 2360 | 1971 | Bahnhofstr. 73, 8957 Spreitenbach | Zurich |
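
If either table is missing from the database, the `pd.read_sql` calls raise an error. One way to check beforehand is to list the tables through SQLite's `sqlite_master` catalog. The following is a minimal sketch using only the standard library; the helper name `list_tables` is hypothetical, and the in-memory database is a stand-in for `../Data/apartment_database.db`.

```python
import sqlite3
from contextlib import closing


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database, sorted."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    return [row[0] for row in conn.execute(query)]


# Demo on an in-memory database (assumption: stands in for the real file).
# closing() guarantees conn.close() runs even if a query raises.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE bfs_bauperiode_categories (id INTEGER)")
    conn.execute("CREATE TABLE rental_listings_raw (id INTEGER)")
    tables = list_tables(conn)

# Compare against the two table names this notebook reads
expected = {"bfs_bauperiode_categories", "rental_listings_raw"}
missing = expected - set(tables)
print("missing tables:", sorted(missing))
```

An empty list means both tables are present and the load can proceed.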


# Inspect & Understand Raw Data
We examine the structure of the raw scraped data to plan the cleaning steps.

In [2]:
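df_raw.info()  # column dtypes and non-null counts
df_raw.head(10)  # not rendered: a cell displays only its last expression
df_raw.describe(include="all")  # summary statistics for all columns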


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   web_scraper_order      120 non-null    object 
 1   id                     120 non-null    int64  
 2   web_scraper_start_url  120 non-null    object 
 3   link_pages             120 non-null    object 
 4   link_listings          120 non-null    object 
 5   area_m2_raw            120 non-null    object 
 6   area_m2                120 non-null    int64  
 7   rooms_raw              120 non-null    float64
 8   rent_chf_raw           120 non-null    object 
 9   rent_chf               120 non-null    int64  
 10  year_built_raw         120 non-null    int64  
 11  address                120 non-null    object 
 12  canton                 120 non-null    object 
dtypes: float64(1), int64(4), object(8)
memory usage: 12.3+ KB


| | web_scraper_order | id | web_scraper_start_url | link_pages | link_listings | area_m2_raw | area_m2 | rooms_raw | rent_chf_raw | rent_chf | year_built_raw | address | canton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 120 | 120.0 | 120 | 120 | 120 | 120 | 120.0 | 120.0 | 120 | 120.0 | 120.0 | 120 | 120 |
| unique | 120 | | 5 | 10 | 120 | 59 | | | 102 | | | 110 | 3 |
| top | 1764272678-4 | | https://www.homegate.ch/rent/real-estate/city-... | https://www.homegate.ch/rent/real-estate/canto... | https://www.homegate.ch/rent/4002530569 | 70 m2 | | | 2,860.– | | | Stuckweg 4, 8305 Dietlikon | Zurich |
| freq | 1 | | 40 | 20 | 1 | 6 | | | 2 | | | 2 | 40 |
| mean | | 60.5 | | | | | 90.191667 | 3.654167 | | 2599.975 | 1989.466667 | | |
| std | | 34.785054 | | | | | 25.034194 | 0.774585 | | 1037.84266 | 31.661533 | | |
| min | | 1.0 | | | | | 40.0 | 2.5 | | 1180.0 | 1890.0 | | |
| 25% | | 30.75 | | | | | 71.0 | 3.5 | | 1800.0 | 1964.75 | | |
| 50% | | 60.5 | | | | | 86.0 | 3.5 | | 2277.5 | 1994.0 | | |
| 75% | | 90.25 | | | | | 102.0 | 4.5 | | 3027.5 | 2016.5 | | |
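
The summary reports 110 unique addresses across 120 listings, while every `link_listings` URL is unique, so some addresses occur more than once. Whether these are separate units in one building or repeated postings is worth checking before cleaning. A minimal sketch of that check in plain Python follows; the short address list is an illustrative sample assembled from values shown above, not the full dataset.

```python
from collections import Counter

# Illustrative sample (assumption): a few addresses from the preview, plus the
# most frequent address from the summary, which is reported with freq 2.
addresses = [
    "Kesslernmattstr. 14, 8965 Berikon",
    "Im Spitzler 21, 8902 Urdorf",
    "Stuckweg 4, 8305 Dietlikon",
    "Weinbergstrasse 72, 8006 Zürich",
    "Stuckweg 4, 8305 Dietlikon",
]

counts = Counter(addresses)
# Keep only the addresses that occur more than once
repeated = {addr: n for addr, n in counts.items() if n > 1}
print(repeated)
```

On the full DataFrame, `df_raw["address"].value_counts()` yields the same per-address counts.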


# Clean Numerical Fields
We clean three fields:
- rent (CHF)
- area (m²)
- rooms

String values (e.g., "2,530.–" or "134 m2") are converted to numeric types.
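
The conversion can be sketched as a small parsing helper. This is a minimal example, assuming the raw strings follow the formats visible in the preview (`"2,530.–"` for rent, `"134 m2"` for area) or the Swiss apostrophe style (`"1'900.-"`), and that commas are thousands separators rather than decimal marks. The function name `parse_number` is hypothetical and not part of the notebook.

```python
import re


def parse_number(raw):
    """Extract the leading number from a scraped string; None if there is none."""
    if raw is None:
        return None
    # Drop thousands separators: comma, apostrophe, typographic apostrophe
    text = re.sub(r"[,'’]", "", str(raw))
    # First run of digits, optionally followed by a decimal part
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(parse_number("2,530.–"))     # rent with comma separator
print(parse_number("1'900.-"))     # rent with apostrophe separator
print(parse_number("134 m2"))      # area with unit suffix
print(parse_number("5.5"))         # rooms
print(parse_number("on request"))  # no number present
```

Applied to a column, this would look like `df_raw["rent_chf_raw"].map(parse_number)`. Because the raw table already carries numeric `rent_chf` and `area_m2` columns, the parsed values can be cross-checked against them.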