# 01 - Data Collection 

This notebook handles:
1. Loading BFS building-age datasets
2. Scraping approx. 50 rental listings (Homegate or ImmoScout24)
3. Saving raw scraped data
4. Writing everything into a SQLite database

In [2]:
# Standard library
import sqlite3
import time

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load BFS Age Datasets

We load the four BFS CSV files and concatenate them into a single DataFrame.

In [3]:
df_bfs_1 = pd.read_csv("../Data/bau515od5155.csv")
df_bfs_2 = pd.read_csv("../Data/bau515od5156.csv")
df_bfs_3 = pd.read_csv("../Data/bau515od5157.csv")
df_bfs_4 = pd.read_csv("../Data/bau515od5158.csv")

df_bfs = pd.concat([df_bfs_1, df_bfs_2, df_bfs_3, df_bfs_4], ignore_index=True)
df_bfs.head()

Unnamed: 0,Stichtagdatjahr,DatenstandCd,HAArtLevel1Sort,HAArtLevel1Cd,HAArtLevel1Lang,HASTWESort,HASTWECd,HASTWELang,RaumSort,RaumCd,...,AnzZimmerLevel2Sort_noDM,AnzZimmerLevel2Cd_noDM,AnzZimmerLevel2Lang_noDM,AnzHA,HAPreisWohnflaeche,HAMedianPreis,HASumPreis,BaualterSort_noDM,BaualterCd_noDM,BaualterLang_noDM
0,2009,D,1,22,Kauf,1,J,Ja,0.0,0.0,...,1.0,1.0,1-Zimmer,31,8552.0,265000.0,11965926,,,
1,2009,D,1,22,Kauf,1,J,Ja,0.0,0.0,...,2.0,2.0,2-Zimmer,89,7800.0,505000.0,54240051,,,
2,2009,D,1,22,Kauf,1,J,Ja,0.0,0.0,...,3.0,3.0,3-Zimmer,143,7389.0,698550.0,116057305,,,
3,2009,D,1,22,Kauf,1,J,Ja,0.0,0.0,...,4.0,4.0,4-Zimmer,208,7577.0,855750.0,203086012,,,
4,2009,D,1,22,Kauf,1,J,Ja,0.0,0.0,...,5.0,5.0,5-Zimmer,83,9117.0,1312500.0,113148986,,,


## Inspect BFS Data
Check which building-age categories are available in `BaualterLang_noDM`. Rows without an age label show up as `nan`.

In [7]:
df_bfs["BaualterLang_noDM"].unique()


array([nan, 'Neubauten (0–1 Jahre)', '2–9 Jahre', '10–19 Jahre',
       'Altbauten (umgebaut)', 'Altbauten (nicht umgebaut)', 'Total'],
      dtype=object)
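
To see how many rows fall into each category, including rows without an age label, `value_counts(dropna=False)` is one option. The following is a minimal sketch on a toy frame; the values are made up for illustration and only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_bfs; only the column name matches the real data.
toy = pd.DataFrame(
    {"BaualterLang_noDM": [np.nan, "2–9 Jahre", "2–9 Jahre", "Total", np.nan, np.nan]}
)

# Count rows per category and keep missing labels as their own bucket.
counts = toy["BaualterLang_noDM"].value_counts(dropna=False)
print(counts)

# Number of rows that carry no age label at all.
n_missing = int(toy["BaualterLang_noDM"].isna().sum())
print("rows without age label:", n_missing)
```

On the real data, the same two calls on `df_bfs` would show whether the rows without an age label are a small remainder or the bulk of the table.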

## Save BFS Data to SQLite

We store the BFS building-age dataset in a SQLite database so that it can be joined, analyzed, or referenced later during the cleaning and modeling steps.

In [8]:
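# sqlite3 is already imported in the first cell
conn = sqlite3.connect("../Data/apartment_database.db")
try:
    # Replace the table on re-runs so the cell can be executed repeatedly
    df_bfs.to_sql("bfs_buildings", conn, if_exists="replace", index=False)
finally:
    # Close the connection even if the write fails
    conn.close()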
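
A quick read-back can confirm that the table was written as expected. The sketch below uses an in-memory database and a toy frame instead of `../Data/apartment_database.db`, so it runs without the real files; the toy values are placeholders.

```python
import sqlite3

import pandas as pd

# Toy stand-in for df_bfs with two of its columns.
toy = pd.DataFrame({"Stichtagdatjahr": [2009, 2009, 2010], "AnzHA": [31, 89, 143]})

# ":memory:" keeps the example self-contained; the notebook uses a file path.
conn = sqlite3.connect(":memory:")
try:
    toy.to_sql("bfs_buildings", conn, if_exists="replace", index=False)

    # Row count as stored in SQLite.
    row_count = int(
        pd.read_sql_query("SELECT COUNT(*) AS n FROM bfs_buildings", conn)["n"].iloc[0]
    )

    # Column names as stored in SQLite (second field of each table_info row).
    columns = [row[1] for row in conn.execute("PRAGMA table_info(bfs_buildings)")]
finally:
    conn.close()

print(row_count, columns)
```

Comparing `row_count` with `len(df_bfs)` after the real write is a cheap check that no rows were lost.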


## Subset BFS Data to Selected Cantons

In [None]:
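selected_cantons = [1, 12, 2]   # Canton codes to keep; replace with the ones you need
# This filters only the in-memory DataFrame; the bfs_buildings table saved above still holds all rows.
df_bfs = df_bfs[df_bfs["KantonCd"].isin(selected_cantons)]
df_bfs.shape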


# Web Scraping

We now scrape ~50 rental listings from ImmoScout24 (or Homegate) for the
cantons we selected. We collect:

- Rent (CHF)
- Area (m²)
- Rooms
- Address
- Year built (if available)
- Canton

The raw data will later be cleaned and mapped to BFS age categories.
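
The mapping from a scraped construction year to a BFS age category could look roughly like the sketch below. The function name `age_category`, the `reference_year` parameter, and the generic `"Altbauten"` label are assumptions made for illustration: the BFS data splits old buildings into renovated and non-renovated, which a construction year alone cannot distinguish, and treating every building of 20 years or more as an old building is inferred from the category labels rather than confirmed by the source.

```python
import math


def age_category(year_built, reference_year):
    """Map a construction year to a BFS-style age label (hypothetical helper)."""
    # Missing year: listings often do not state when the building was built.
    if year_built is None or (isinstance(year_built, float) and math.isnan(year_built)):
        return None

    age = reference_year - int(year_built)
    if age < 0:
        # Construction year lies after the reference year; leave unmapped.
        return None
    if age <= 1:
        return "Neubauten (0–1 Jahre)"
    if age <= 9:
        return "2–9 Jahre"
    if age <= 19:
        return "10–19 Jahre"
    # Assumption: renovation status is unknown, so no umgebaut / nicht umgebaut split.
    return "Altbauten"


for year in (2024, 2020, 2010, 1990, None):
    print(year, "->", age_category(year, reference_year=2025))
```

Passing the reference year explicitly keeps the mapping reproducible, since the BFS figures refer to a fixed reporting year rather than to the day the notebook is run.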


## Scraper Function

We define a function that extracts rent, area, address, and (if available) rooms 
from one ImmoScout24 results page. The structure of the HTML may vary between 
listings, so some entries may not be scraped successfully.
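
The parsing pattern, including the handling of missing fields, can be illustrated on a hand-written HTML snippet. Everything site-specific here is an assumption: the class names (`listing`, `price`, `area`, `rooms`, `address`) and the sample listings are invented and do not reflect ImmoScout24's real markup, and the helper names `text_or_none` and `parse_listings` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample; class names and listings are invented for illustration.
SAMPLE_HTML = """
<div class="listing">
  <span class="price">CHF 1'850</span>
  <span class="area">62 m²</span>
  <span class="rooms">2.5</span>
  <span class="address">Musterstrasse 1, 8000 Zürich</span>
</div>
<div class="listing">
  <span class="price">CHF 2'400</span>
  <span class="address">Beispielweg 7, 3000 Bern</span>
</div>
"""


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching span, or None if absent."""
    tag = parent.find("span", class_=css_class)
    return tag.get_text(strip=True) if tag else None


def parse_listings(html):
    """Extract one dict per listing card; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="listing"):
        rows.append(
            {
                "rent": text_or_none(card, "price"),
                "area": text_or_none(card, "area"),
                "rooms": text_or_none(card, "rooms"),
                "address": text_or_none(card, "address"),
            }
        )
    return rows


listings = parse_listings(SAMPLE_HTML)
for row in listings:
    print(row)
```

Returning `None` for a missing field keeps incomplete listings in the raw data, so the decision to drop or impute them can be made during cleaning.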
