# Download OSM buildings using Pyrosm

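This notebook downloads OpenStreetMap (OSM) building data for Singapore with Pyrosm and checks how much address information the buildings carry.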

See https://pyrosm.readthedocs.io/en/latest/basics.html

Only about 16% of buildings have a postal code (24,979 of 154,241).

In [3]:
%load_ext kedro.ipython
%reload_kedro --env=test
%load_ext autoreload
%autoreload 2
%config IPCompleter.use_jedi=False
from IPython.core.interactiveshell import InteractiveShell
import os
import pandas as pd
InteractiveShell.ast_node_interactivity = "all"
os.chdir(context.project_path)
catalog = context.catalog
params = context.params

The kedro.ipython extension is already loaded. To reload it, use:
  %reload_ext kedro.ipython


## Download data

In [8]:
import pyrosm
fp = pyrosm.get_data("Singapore", directory="data/01_raw/") # guessing...
osm = pyrosm.OSM(fp)
buildings = osm.get_buildings()

In [9]:
buildings.shape

[1m([0m[1;36m154241[0m, [1;36m41[0m[1m)[0m

In [25]:
buildings.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 154241 entries, 0 to 154240
Data columns (total 42 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   addr:city           49730 non-null   object  
 1   addr:country        37283 non-null   object  
 2   addr:housenumber    70622 non-null   object  
 3   addr:housename      1274 non-null    object  
 4   addr:postcode       24979 non-null   object  
 5   addr:place          238 non-null     object  
 6   addr:street         71637 non-null   object  
 7   email               132 non-null     object  
 8   name                10105 non-null   object  
 9   opening_hours       319 non-null     object  
 10  operator            411 non-null     object  
 11  phone               322 non-null     object  
 12  ref                 93 non-null      object  
 13  url                 4 non-null       object  
 14  visible             153051 non-null  object  
 15  website  

## Inspect postcodes

In [19]:
buildings = buildings.assign(postcode = buildings["addr:postcode"].astype("string").str.zfill(6))
print(buildings["postcode"].info())
print("\n Is NaN: ", buildings["postcode"].isna().sum() / buildings.shape[0])

<class 'pandas.core.series.Series'>
RangeIndex: 154241 entries, 0 to 154240
Series name: postcode
Non-Null Count  Dtype 
--------------  ----- 
24979 non-null  string
dtypes: string(1)
memory usage: 1.2 MB
None

 Is NaN:  0.8380521391847823
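
The cell above pads postcodes with `zfill(6)` but does not check that the result is a well-formed postcode. A minimal sketch of such a check follows, assuming a valid Singapore postcode is exactly six digits. The toy series and the names `raw`, `padded` and `is_valid` are illustrative and not part of the notebook; in practice `raw` would be `buildings["addr:postcode"]`.

```python
import pandas as pd

# Toy stand-in for buildings["addr:postcode"]; the values are made up.
raw = pd.Series(["18956", "238823", "S 123456", None, "12345678"], dtype="string")

# Same padding as in the notebook: restores a leading zero lost upstream.
padded = raw.str.zfill(6)

# Assumption: a valid postcode is exactly six digits.
# Missing values are treated as invalid.
is_valid = padded.str.fullmatch(r"\d{6}").fillna(False).astype(bool)

summary = pd.DataFrame({"raw": raw, "padded": padded, "is_valid": is_valid})
print(summary)

# Share of well-formed postcodes among the records that have one at all.
print("Share valid among non-null:", is_valid[raw.notna()].mean())
```

Note that `zfill` only pads values shorter than six characters, so over-long or non-numeric entries pass through unchanged and are only caught by the pattern check.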


## Other fields

Maybe the address information is contained in other fields? Let's count how many of the address-related columns are filled in for each building.

In [27]:
# Specify the columns to check
columns_to_check = ["addr:housenumber", "addr:housename", "addr:place", "addr:postcode", "addr:street"]

# Check for non-empty values and count them
buildings = buildings.assign(**{"non_empty_count": buildings[columns_to_check].notna().sum(axis=1)})

# Group by the count of non-empty values
grouped = buildings.groupby('non_empty_count').size()

# Print the results
print(grouped / buildings.shape[0])

non_empty_count
0    0.496574
1    0.050447
2    0.321503
3    0.125356
4    0.006055
5    0.000065
dtype: float64


About 50% of records have no address-related information at all. Note that there are over 120,000 postcodes, far more than the 24,979 buildings that carry one, so a lot of address information is likely missing from the OSM data.
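
The per-record count shows how many address fields are filled, but not which ones occur together. A possible follow-up, sketched here on a made-up toy frame, is to label each record with the combination of fields that are present and tabulate those labels. The names `toy`, `present` and `pattern` are hypothetical; with the real data, `toy` would be replaced by `buildings`.

```python
import pandas as pd

columns_to_check = [
    "addr:housenumber", "addr:housename", "addr:place",
    "addr:postcode", "addr:street",
]

# Toy stand-in for the buildings GeoDataFrame; the values are made up.
toy = pd.DataFrame({
    "addr:housenumber": ["1", "2", None, None],
    "addr:housename": [None, None, None, None],
    "addr:place": [None, None, None, None],
    "addr:postcode": [None, "018956", None, None],
    "addr:street": ["A St", "B Rd", None, "C Ave"],
})

# Boolean mask: True where a field is filled in.
present = toy[columns_to_check].notna()

# Label each record with the names of its non-empty fields.
pattern = present.apply(
    lambda row: " + ".join(col for col, filled in row.items() if filled) or "(none)",
    axis=1,
)

# Share of records per combination of present fields.
print(pattern.value_counts(normalize=True))
```

On the real data this would show, for example, whether the records with two filled fields are mostly house number plus street, which would make them candidates for looking up the postcode from another source.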