# Methodology

This notebook shows the methodology for the paper "Measuring OpenStreetMap building footprint completeness using human settlement layers".

## Setup

We import the required packages and download the datasets.

For reference, here are the original download links for the datasets:
1. High Resolution Settlement Layer (HRSL) ([Philippines](https://data.humdata.org/dataset/philippines-high-resolution-population-density-maps-demographic-estimates)) ([Madagascar](https://data.humdata.org/dataset/highresolutionpopulationdensitymaps-mdg))
2. Administrative Boundaries ([Philippines](https://data.humdata.org/dataset/philippines-administrative-levels-0-to-3)) ([Madagascar](https://data.humdata.org/dataset/madagascar-administrative-level-0-4-boundaries))
3. OpenStreetMap (OSM) ([Philippines](https://download.geofabrik.de/asia/philippines.html)) ([Madagascar](https://download.geofabrik.de/africa/madagascar.html))

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import shapely
import geopandas as gpd
import rasterio
import rasterio.features

import wget

import os
import glob
from zipfile import ZipFile

In [2]:
# Create the download directory if it does not already exist.
os.makedirs("../download_data", exist_ok=True)

### HRSL download

Skip the cells below if you have already downloaded the HRSL datasets.

In [5]:
hrsl_phl_men_url = "https://data.humdata.org/dataset/6d9f35c0-4764-49ee-b364-329db0b7a47d/resource/5a13bb60-4506-42a5-a08a-7ccf20413179/download/phl_men_2019-06-01_geotiff.zip"
hrsl_phl_women_url = "https://data.humdata.org/dataset/6d9f35c0-4764-49ee-b364-329db0b7a47d/resource/4aff438c-43d9-47d0-853f-5a6b6ae28223/download/phl_women_2019-06-01_geotiff.zip"

In [6]:
wget.download(hrsl_phl_men_url, '../download_data/phl_hrsl_men_jun_2019.zip')

'../download_data/phl_hrsl_men_jun_2019.zip'

In [7]:
wget.download(hrsl_phl_women_url, '../download_data/phl_hrsl_women_jun_2019.zip')

'../download_data/phl_hrsl_women_jun_2019.zip'

### Admin boundary download

The level 4 (barangay) admin boundary dataset for the Philippines is not available on the HDX website. We provide an external download link so that our results can still be reproduced.

Skip the cells below if you have already downloaded the admin boundary datasets.

In [None]:
adm_phl_url = "https://data.humdata.org/dataset/caf116df-f984-4deb-85ca-41b349d3f313/resource/12457689-6a86-4474-8032-5ca9464d38a8/download/phl_adm_psa_namria_20200529_shp.zip"

In [None]:
wget.download(adm_phl_url, '../download_data/phl_adm_all.zip')

In [None]:
adm4_phl_url = "https://storage.googleapis.com/osm-completeness-thinkingmachines/phl_adm_2015_level4_barangay.gpkg.zip"

In [None]:
wget.download(adm4_phl_url, '../download_data/phl_adm_2015_level4_barangay.gpkg.zip')

### OSM download

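Skip the cells below if you have already downloaded the OSM datasets. The cells for the 2020 dataset are left commented out; the active cells download the Geofabrik `latest` extract, whose contents change as OSM is updated.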

In [None]:
##### 2020 OSM Dataset
# osm_phl_url = "https://storage.googleapis.com/osm-completeness-thinkingmachines/phl_osm_jan_2020_buildings.gpkg.zip"

In [None]:
##### 2021 OSM Dataset
osm_phl_url = "https://download.geofabrik.de/asia/philippines-latest-free.shp.zip"

In [None]:
# wget.download(osm_phl_url, '../download_data/phl_osm_jan_2020_buildings.gpkg.zip')

In [None]:
wget.download(osm_phl_url, '../download_data/philippines-latest-free.shp.zip')

### Unzip all datasets

In [None]:
for zip_path in glob.glob("../download_data/*.zip"):
    extract_dir = os.path.splitext(zip_path)[0]
    # Skip archives that have already been extracted.
    if not os.path.isdir(extract_dir):
        with ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

## What do we want our final output to look like?

In [15]:
wget.download('https://storage.googleapis.com/osm-completeness-thinkingmachines/mapthegap-phl-adm2-2021-05-29.csv', '../download_data/mapthegap-phl-adm2-2021-05-29.csv')
wget.download('https://storage.googleapis.com/osm-completeness-thinkingmachines/mapthegap-phl-adm3-2021-05-29.csv', '../download_data/mapthegap-phl-adm3-2021-05-29.csv')
wget.download('https://storage.googleapis.com/osm-completeness-thinkingmachines/mapthegap-phl-adm4-2021-05-29.csv', '../download_data/mapthegap-phl-adm4-2021-05-29.csv')

'../download_data/mapthegap-phl-adm4-2021-05-29.csv'

In [16]:
province_output = pd.read_csv('../download_data/mapthegap-phl-adm2-2021-05-29.csv')

In [17]:
province_output.head()

| Unnamed: 0 | ADM2_EN | ADM2_PCODE | ADM2_REF | ADM2ALT1EN | ADM2ALT2EN | pixels_withbuilding_june2020 | pixels_nobuilding_june2020 | percentage_completeness_june2020 | pixels_withbuilding_may2021 | pixels_nobuilding_may2021 | percentage_completeness_may2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abra | PH140100000 | | | | 13790 | 5152 | 72.801183 | 13862 | 5080 | 73.18129 |
| 1 | Agusan del Norte | PH160200000 | | | | 4202 | 35688 | 10.533968 | 15586 | 24304 | 39.072449 |
| 2 | Agusan del Sur | PH160300000 | | | | 2953 | 38147 | 7.184915 | 3422 | 37678 | 8.326034 |
| 3 | Aklan | PH060400000 | | | | 14693 | 31978 | 31.482077 | 21519 | 25152 | 46.107861 |
| 4 | Albay | PH050500000 | | | | 27168 | 31532 | 46.282794 | 32756 | 25944 | 55.802385 |


In [18]:
citymuni_output = pd.read_csv('../download_data/mapthegap-phl-adm3-2021-05-29.csv')

In [19]:
citymuni_output.head()

| Unnamed: 0 | ADM3_EN | ADM3_PCODE | ADM3_REF | ADM3ALT1EN | ADM3ALT2EN | pixels_withbuilding_june2020 | pixels_nobuilding_june2020 | percentage_completeness_june2020 | pixels_withbuilding_may2021 | pixels_nobuilding_may2021 | percentage_completeness_may2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aborlan | PH175301000 | | | | 594 | 3599 | 14.166468 | 652 | 3541 | 15.549726 |
| 1 | Abra de Ilog | PH175101000 | | | | 1315 | 729 | 64.334638 | 1316 | 728 | 64.383562 |
| 2 | Abucay | PH030801000 | | | | 1996 | 646 | 75.548827 | 1993 | 649 | 75.435276 |
| 3 | Abulug | PH021501000 | | | | 3396 | 927 | 78.556558 | 3405 | 918 | 78.764747 |
| 4 | Abuyog | PH083701000 | | | | 1456 | 1005 | 59.162942 | 1473 | 988 | 59.853718 |


In [20]:
brgy_output = pd.read_csv('../download_data/mapthegap-phl-adm4-2021-05-29.csv')

In [21]:
brgy_output.head()

| Unnamed: 0 | Reg_Code | Reg_Name | Pro_Code | Pro_Name | Mun_Code | Mun_Name | Bgy_Code | Bgy_Name | RURBAN | ADM4_PCODE_NAME | pixels_withbuilding_june2020 | pixels_nobuilding_june2020 | percentage_completeness_june2020 | ADM4_PCODE | pixels_withbuilding_may2021 | pixels_nobuilding_may2021 | percentage_completeness_may2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 110000000 | REGION XI (DAVAO REGION) | 118200000 | COMPOSTELA VALLEY | 118206000.0 | MAWAB | 118206010.0 | Sawangan | R | 118206010_Sawangan | 3 | 71 | 4.054054 | 118206010.0 | 6 | 68 | 8.108108 |
| 1 | 180000000 | NEGROS ISLAND REGION (NIR) | 184500000 | NEGROS OCCIDENTAL | 184501000.0 | BACOLOD CITY (Capital) | 184501048.0 | Felisa | U | 184501048_Felisa | 2 | 272 | 0.729927 | 184501048.0 | 3 | 271 | 1.094891 |
| 2 | 120000000 | REGION XII (SOCCSKSARGEN) | 126300000 | SOUTH COTABATO | 126311000.0 | NORALA | 126311020.0 | Simsiman | R | 126311020_Simsiman | 116 | 90 | 56.31068 | 126311020.0 | 147 | 59 | 71.359223 |
| 3 | 150000000 | AUTONOMOUS REGION IN MUSLIM MINDANAO (ARMM) | 157000000 | TAWI-TAWI | 157001000.0 | PANGLIMA SUGALA (BALIMBING) | 157001001.0 | Balimbing Proper | R | 157001001_Balimbing Proper | 0 | 93 | 0.0 | 157001001.0 | 5 | 88 | 5.376344 |
| 4 | 150000000 | AUTONOMOUS REGION IN MUSLIM MINDANAO (ARMM) | 157000000 | TAWI-TAWI | 157001000.0 | PANGLIMA SUGALA (BALIMBING) | 157001002.0 | Batu-batu (Pob.) | R | 157001002_Batu-batu (Pob.) | 11 | 182 | 5.699482 | 157001002.0 | 25 | 168 | 12.953368 |


## Get intersection of HRSL pixels and OSM buildings

We use Facebook's High Resolution Settlement Layer (HRSL) as a proxy ground truth for building footprints. We measure data completeness as the "percentage completeness": the percentage of HRSL pixels that intersect OSM building footprints, out of all HRSL pixels.

![](../assets/formula.png)

Pixels that intersect OSM buildings are *mapped*.

Pixels that do not intersect OSM buildings are *unmapped*.
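
As a small illustration of the definition above, the sketch below computes percentage completeness from a count of mapped and unmapped pixels. The function name `percentage_completeness` is made up for this example and is not part of the notebook; the pixel counts are the May 2021 values for Abra from the province-level table shown earlier.

```python
def percentage_completeness(mapped_pixels: int, unmapped_pixels: int) -> float:
    """Share of settlement pixels that intersect an OSM building, in percent."""
    total = mapped_pixels + unmapped_pixels
    if total == 0:
        # No settlement pixels in the area: completeness is undefined.
        return float("nan")
    return mapped_pixels / total * 100


# Pixel counts for Abra (May 2021 columns of the province-level table).
abra_may2021 = percentage_completeness(13862, 5080)
print(round(abra_may2021, 5))
```

Returning NaN for an area without settlement pixels is a design choice of this sketch, not something the notebook specifies.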

### Philippines

Uncomment the cells down below if you have not yet saved the `with_buildings` and `no_buildings` GPKG files.

#### Load HRSL dataset

In [None]:
# hrsl_phl_men = rasterio.open(
#     "../download_data/phl_hrsl_men_jun_2019/PHL_men_2019-06-01.tif"
# )

# hrsl_phl_women = rasterio.open(
#     "../download_data/phl_hrsl_women_jun_2019/PHL_women_2019-06-01.tif"
# )

In [None]:
# hrsl_phl_crs = hrsl_phl_men.crs

#### Combine male and female masks to find all populated pixels

In [None]:
# hrsl_phl_men_band1_mask = hrsl_phl_men.read_masks(1)

In [None]:
# hrsl_phl_women_band1_mask = hrsl_phl_women.read_masks(1)

In [None]:
# hrsl_phl_band1_mask = hrsl_phl_men_band1_mask + hrsl_phl_women_band1_mask
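
`read_masks` returns `uint8` arrays in which 0 marks nodata and 255 marks valid pixels, so adding two masks wraps around where both are valid. The result is still non-zero there and works as a mask, but a bitwise OR is one possible alternative that keeps the 0/255 convention. The sketch below uses small made-up arrays in place of the real masks.

```python
import numpy as np

# Toy stand-ins for the two uint8 validity masks (0 = nodata, 255 = valid).
men_mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)
women_mask = np.array([[0, 255], [0, 0]], dtype=np.uint8)

# Element-wise addition wraps around in uint8: 255 + 255 becomes 254.
added = men_mask + women_mask

# Bitwise OR marks a pixel valid if either mask is valid and keeps 0/255 values.
combined = men_mask | women_mask

print(added)
print(combined)
```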

#### Convert HRSL dataset from raster to vector

In [None]:
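# # rasterio.features.shapes merges connected pixels that share a value, so a
# # raster of random values keeps (almost surely) each pixel as its own polygon.
# hrsl_phl_rand = np.random.rand(
#     np.shape(hrsl_phl_band1_mask)[0], np.shape(hrsl_phl_band1_mask)[1]
# )
# hrsl_phl_rand = hrsl_phl_rand.astype("float32")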

In [None]:
# hrsl_phl_band1_poly = list(
#     rasterio.features.shapes(
#         hrsl_phl_rand, transform=hrsl_phl_men.transform, mask=hrsl_phl_band1_mask
#     )
# )

In [None]:
# hrsl_phl_geom = []
# for geom, value in hrsl_phl_band1_poly:
#     geom = shapely.geometry.shape(geom)
#     hrsl_phl_geom.append(geom)

In [None]:
# hrsl_phl_gdf = pd.DataFrame(hrsl_phl_geom)
# hrsl_phl_gdf = gpd.GeoDataFrame(hrsl_phl_gdf, geometry=hrsl_phl_gdf[0], crs="EPSG:4326")
# hrsl_phl_gdf.drop(columns=[0], inplace=True)
# hrsl_phl_gdf.reset_index(level=0, inplace=True)

In [None]:
# hrsl_phl_gdf.to_file('../data/hrsl_phl.gpkg', driver='GPKG')

#### Load OSM dataset

In [None]:
# osm_phl = gpd.read_file(
#     "../download_data/phl_osm_jan_2020_buildings.gpkg/phl_osm_jan_2020_buildings.gpkg",
#     driver="GPKG",
# )

In [None]:
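# The Geofabrik extract is a shapefile, so let the driver be inferred from the
# file extension instead of forcing GPKG.
osm_phl = gpd.read_file(
    "../download_data/philippines-latest-free.shp/gis_osm_buildings_a_free_1.shp"
)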

#### Get mapped pixels

In [None]:
# # Just run this so you don't have to rerun everything up top
# hrsl_phl_gdf = gpd.read_file('../data/hrsl_phl.gpkg', driver='GPKG')

In [None]:
# # `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10.
# phl_pixels_with_buildings = gpd.sjoin(
#     hrsl_phl_gdf, osm_phl, how="inner", predicate="intersects"
# )

In [None]:
# phl_pixels_with_buildings = phl_pixels_with_buildings.drop_duplicates(subset='index')
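
The de-duplication step matters because a pixel that intersects several buildings comes out of the spatial join once per building. A minimal sketch with a made-up join result (plain pandas, no geometry) shows the effect; the `osm_id` values are placeholders.

```python
import pandas as pd

# Toy join result: pixel 7 touches two buildings, so it appears twice.
joined = pd.DataFrame({"index": [7, 7, 8], "osm_id": ["a", "b", "c"]})

# Keep one row per pixel so that each pixel is counted once.
mapped_pixels = joined.drop_duplicates(subset="index")

print(len(joined), len(mapped_pixels))
```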

In [None]:
# phl_pixels_with_buildings.drop(columns=['index_right', 'osm_id', 'code', 'fclass', 'name', 'type'], inplace=True)

In [None]:
# phl_pixels_with_buildings.to_file('../data/phl_pixels_with_buildings.gpkg', driver='GPKG')

#### Get unmapped pixels

In [None]:
# phl_pixels_no_buildings = pd.merge(hrsl_phl_gdf, phl_pixels_with_buildings, how='outer', indicator=True)

In [None]:
# phl_pixels_no_buildings = phl_pixels_no_buildings[phl_pixels_no_buildings['_merge'] == 'left_only']

In [None]:
# phl_pixels_no_buildings.drop(columns=['_merge'], inplace=True)
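
The unmapped pixels are an anti-join: every HRSL pixel that is absent from the mapped set. The sketch below reproduces the outer-merge-with-indicator pattern used above on toy data and compares it with a filter on the `index` column, which is one possible alternative that avoids comparing every shared column. The data frames here are made up and carry no geometry.

```python
import pandas as pd

all_pixels = pd.DataFrame({"index": [0, 1, 2, 3]})
mapped = pd.DataFrame({"index": [1, 3]})

# Pattern used in the notebook: outer merge, then keep rows found only on the left.
merged = all_pixels.merge(mapped, how="outer", indicator=True)
unmapped_via_merge = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Alternative: keep pixels whose id is not in the mapped set.
unmapped_via_isin = all_pixels.loc[~all_pixels["index"].isin(mapped["index"])]

print(sorted(unmapped_via_merge["index"]), sorted(unmapped_via_isin["index"]))
```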

In [None]:
# phl_pixels_no_buildings.to_file('../data/phl_pixels_no_buildings.gpkg', driver='GPKG')

#### Calculate percentage completeness

In 2020, the percentage completeness for the Philippines was 31.39%. The cell below recomputes it with the newer OSM data.

In [None]:
# len(phl_pixels_with_buildings) / (len(phl_pixels_with_buildings) + len(phl_pixels_no_buildings)) * 100

## Aggregate to different admin boundaries

### Philippines

#### Turn mapped pixels from a polygon layer to a point layer

In [None]:
# Just run this so you don't have to rerun everything up top
phl_pixels_with_buildings = gpd.read_file(
    "../data/phl_pixels_with_buildings.gpkg", driver="GPKG"
)

In [None]:
phl_pixels_with_buildings["geometry"] = phl_pixels_with_buildings["geometry"].centroid

#### Turn unmapped pixels from a polygon layer to a point layer

In [None]:
# Just run this so you don't have to rerun everything up top
phl_pixels_no_buildings = gpd.read_file(
    "../data/phl_pixels_no_buildings.gpkg", driver="GPKG"
)

In [None]:
phl_pixels_no_buildings["geometry"] = phl_pixels_no_buildings["geometry"].centroid

#### Load level 4 admin boundary

In [None]:
phl_adm4 = gpd.read_file(
    "../download_data/phl_adm_2015_level4_barangay.gpkg/phl_adm_2015_level4_barangay.gpkg"
)

#### Create index for level 4 admin boundary

In [None]:
phl_adm4["ADM4_PCODE_NAME"] = phl_adm4["Bgy_Code"] + "_" + phl_adm4["Bgy_Name"]

#### Find intersection of mapped pixels and level 4 admin boundary

In [None]:
# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10.
phl_pixels_with_buildings_sjoin_adm4 = gpd.sjoin(
    phl_pixels_with_buildings, phl_adm4, how="left", predicate="within"
)

In [None]:
phl_pixels_with_buildings_sjoin_adm4.drop(
    columns=[
        "index_right",
        "Reg_Code",
        "Reg_Name",
        "Pro_Code",
        "Pro_Name",
        "Mun_Code",
        "Mun_Name",
        "Bgy_Code",
        "Bgy_Name",
    ],
    inplace=True,
)

In [None]:
phl_pixels_with_buildings_sjoin_adm4.to_file(
    "../data/phl_pixels_with_buildings_sjoin_adm4.gpkg", driver="GPKG"
)

#### Find intersection of unmapped pixels and level 4 admin boundary

In [None]:
# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10.
phl_pixels_no_buildings_sjoin_adm4 = gpd.sjoin(
    phl_pixels_no_buildings, phl_adm4, how="left", predicate="within"
)

In [None]:
phl_pixels_no_buildings_sjoin_adm4.drop(
    columns=[
        "index_right",
        "Reg_Code",
        "Reg_Name",
        "Pro_Code",
        "Pro_Name",
        "Mun_Code",
        "Mun_Name",
        "Bgy_Code",
        "Bgy_Name",
    ],
    inplace=True,
)

In [None]:
phl_pixels_no_buildings_sjoin_adm4.to_file(
    "../data/phl_pixels_no_buildings_sjoin_adm4.gpkg", driver="GPKG"
)

#### Load level 3 admin boundary

In [None]:
phl_adm3 = gpd.read_file(
    "../download_data/phl_adm_all/phl_admbnda_adm3_psa_namria_20200529.shp"
)

#### Find intersection of mapped pixels and level 3 admin boundary

In [None]:
# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10.
phl_pixels_with_buildings_sjoin_adm3 = gpd.sjoin(
    phl_pixels_with_buildings, phl_adm3, how="left", predicate="within"
)

In [None]:
phl_pixels_with_buildings_sjoin_adm3.drop(
    columns=[
        "index_right",
        "Shape_Leng",
        "Shape_Area",
        "ADM3_EN",
        "ADM3_REF",
        "ADM3ALT1EN",
        "ADM3ALT2EN",
        "ADM2_EN",
        "ADM2_PCODE",
        "ADM1_EN",
        "ADM1_PCODE",
        "ADM0_EN",
        "ADM0_PCODE",
        "date",
        "validOn",
        "validTo",
    ],
    inplace=True,
)

In [None]:
phl_pixels_with_buildings_sjoin_adm3.to_file(
    "../data/phl_pixels_with_buildings_sjoin_adm3.gpkg", driver="GPKG"
)

#### Find intersection of unmapped pixels and level 3 admin boundary

In [None]:
# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10.
phl_pixels_no_buildings_sjoin_adm3 = gpd.sjoin(
    phl_pixels_no_buildings, phl_adm3, how="left", predicate="within"
)

In [None]:
phl_pixels_no_buildings_sjoin_adm3.drop(
    columns=[
        "index_right",
        "Shape_Leng",
        "Shape_Area",
        "ADM3_EN",
        "ADM3_REF",
        "ADM3ALT1EN",
        "ADM3ALT2EN",
        "ADM2_EN",
        "ADM2_PCODE",
        "ADM1_EN",
        "ADM1_PCODE",
        "ADM0_EN",
        "ADM0_PCODE",
        "date",
        "validOn",
        "validTo",
    ],
    inplace=True,
)

In [None]:
phl_pixels_no_buildings_sjoin_adm3.to_file(
    "../data/phl_pixels_no_buildings_sjoin_adm3.gpkg", driver="GPKG"
)

#### Find intersection of unmapped pixels and level 2 admin boundary

TODO: this step is not yet implemented. The level 2 join should be added for both mapped and unmapped pixels.
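
One possible way to fill this gap without another spatial join: the level 3 layer loaded above carries `ADM2_PCODE` alongside `ADM3_PCODE` (both appear in the column lists dropped after the level 3 join), so level 2 codes could be attached through an attribute lookup. The sketch below is an assumption-laden illustration on toy tables, not the authors' method; all codes are made up, and it assumes each level 3 unit belongs to exactly one level 2 unit.

```python
import pandas as pd

# Toy attribute table standing in for the level 3 layer (codes are made up).
adm3_attrs = pd.DataFrame(
    {
        "ADM3_PCODE": ["M1", "M2", "M3"],
        "ADM2_PCODE": ["P1", "P1", "P2"],
    }
)

# Toy pixel table standing in for a level 3 join result; the last pixel fell
# outside every level 3 polygon and therefore has no code.
pixels = pd.DataFrame(
    {"index": [0, 1, 2, 3], "ADM3_PCODE": ["M1", "M3", "M2", None]}
)

# validate="many_to_one" raises if a level 3 code occurs more than once
# in the lookup table.
pixels_adm2 = pixels.merge(
    adm3_attrs, how="left", on="ADM3_PCODE", validate="many_to_one"
)

print(pixels_adm2)
```

Pixels without a level 3 code end up without a level 2 code as well, which matches how the later `dropna` step treats unmatched pixels.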

#### Merge and concatenate dataframes

Uncomment the cell below if you have not yet loaded these variables.

In [None]:
# # You can run this so you don't have to rerun everything up top
# phl_pixels_with_buildings_sjoin_adm4 = gpd.read_file(
#     "../data/phl_pixels_with_buildings_sjoin_adm4.gpkg", driver="GPKG"
# )
# phl_pixels_with_buildings_sjoin_adm3 = gpd.read_file(
#     "../data/phl_pixels_with_buildings_sjoin_adm3.gpkg", driver="GPKG"
# )
# phl_pixels_no_buildings_sjoin_adm4 = gpd.read_file(
#     "../data/phl_pixels_no_buildings_sjoin_adm4.gpkg", driver="GPKG"
# )
# phl_pixels_no_buildings_sjoin_adm3 = gpd.read_file(
#     "../data/phl_pixels_no_buildings_sjoin_adm3.gpkg", driver="GPKG"
# )

TODO: merge and concatenate for level 2 admin boundaries as well.

In [None]:
phl_pixels_with_buildings_sjoin = phl_pixels_with_buildings_sjoin_adm3.merge(
    phl_pixels_with_buildings_sjoin_adm4[["index", "RURBAN", "ADM4_PCODE_NAME"]],
    how="left",
    left_on="index",
    right_on="index",
)

In [None]:
phl_pixels_with_buildings_sjoin.dropna(
    subset=["ADM3_PCODE", "RURBAN", "ADM4_PCODE_NAME"], inplace=True
)

In [None]:
phl_pixels_with_buildings_sjoin["status"] = "mapped"

In [None]:
phl_pixels_no_buildings_sjoin = phl_pixels_no_buildings_sjoin_adm3.merge(
    phl_pixels_no_buildings_sjoin_adm4[["index", "RURBAN", "ADM4_PCODE_NAME"]],
    how="left",
    left_on="index",
    right_on="index",
)

In [None]:
phl_pixels_no_buildings_sjoin.dropna(
    subset=["ADM3_PCODE", "RURBAN", "ADM4_PCODE_NAME"], inplace=True
)

In [None]:
phl_pixels_no_buildings_sjoin["status"] = "unmapped"

In [None]:
phl_pixels_all = pd.concat(
    [phl_pixels_with_buildings_sjoin, phl_pixels_no_buildings_sjoin]
)

In [None]:
phl_pixels_all.to_file("../data/phl_pixels_all.gpkg", driver="GPKG")

In [None]:
phl_pixels_all.drop(columns=["geometry"], inplace=True)

In [None]:
phl_pixels_all.to_csv("../data/phl_pixels_all.csv", index=False)
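
To get from the per-pixel table to summary tables like those in "What do we want our final output to look like?", the pixels can be counted per admin unit and status. The sketch below is a minimal illustration on toy data, assuming a `status` column with the values `mapped` and `unmapped` as set above. The admin codes are made up, and the output column names follow the final tables without their date suffix.

```python
import pandas as pd

# Toy stand-in for phl_pixels_all: one row per HRSL pixel (codes are made up).
pixels_all = pd.DataFrame(
    {
        "ADM3_PCODE": ["M1", "M1", "M1", "M2", "M2"],
        "status": ["mapped", "unmapped", "mapped", "unmapped", "unmapped"],
    }
)

# Count pixels per admin unit and status, with one column per status.
counts = (
    pixels_all.groupby(["ADM3_PCODE", "status"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["mapped", "unmapped"], fill_value=0)
    .rename(columns={"mapped": "pixels_withbuilding", "unmapped": "pixels_nobuilding"})
)

# Percentage completeness = mapped / (mapped + unmapped) * 100.
counts["percentage_completeness"] = (
    counts["pixels_withbuilding"]
    / (counts["pixels_withbuilding"] + counts["pixels_nobuilding"])
    * 100
)
summary = counts.reset_index()

print(summary)
```

The `reindex` call guarantees that both count columns exist even if one status never occurs in the data.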