# Methodology

This notebook shows the methodology for the paper "Measuring OpenStreetMap building footprint completeness using human settlement layers".

The methodology has four main parts:
1. Download relevant data
2. Get intersection of HRSL pixels and OSM buildings
3. Aggregate mapped/unmapped pixels to admin boundaries
4. Calculate percent completeness on admin boundaries

## Setup

We import all of the relevant packages as well as download the datasets.

For reference, here are the original download links for the datasets:
1. *High Resolution Settlement Layer* (HRSL) ([Philippines](https://data.humdata.org/dataset/philippines-high-resolution-population-density-maps-demographic-estimates)) ([Madagascar](https://data.humdata.org/dataset/highresolutionpopulationdensitymaps-mdg))
2. *Administrative Boundaries* ([Philippines](https://data.humdata.org/dataset/philippines-administrative-levels-0-to-3)) ([Madagascar](https://data.humdata.org/dataset/madagascar-administrative-level-0-4-boundaries))
- To find the latest HRSL and admin boundary datasets for other countries, search the Humanitarian Data Exchange [website](https://data.humdata.org/).
3. *OpenStreetMap (OSM)* ([Philippines](https://download.geofabrik.de/asia/philippines.html)) ([Madagascar](https://download.geofabrik.de/africa/madagascar.html))

The notebook will use Madagascar/Philippines data for the download links. You may change this for your own use case!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import shapely
import geopandas as gpd
import rasterio
import rasterio.features

import wget

import os
import glob
from zipfile import ZipFile

from datetime import datetime

Create two folders in your working directory: `download_data`, which will contain all raw downloaded data, and `data`, which will contain all processed data.

In [3]:
# Create the folders if they do not exist yet
os.makedirs("../download_data", exist_ok=True)
os.makedirs("../data", exist_ok=True)

### Filename set-up

In the following cell, define filename labels for the country, admin level, month, and year. This will be reflected in the output files.

In [129]:
# Change these to your desired filename labels
country_code = "mdg"
adm_code = "adm3"

# Time labels
# Add leading zero as needed
month = "november"
month_num = "11" #Two digits expected
year = "2021" 
day = "23" # Two digits expected
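
If you prefer not to type the time labels by hand, they can be derived from a single date. The sketch below is one way to do it, assuming the snapshot date is known; `make_time_labels` and the example date are hypothetical and not part of the original notebook. Note that `%B` follows the process locale, which is English by default.

```python
from datetime import date


def make_time_labels(snapshot):
    """Return the month name, zero-padded month and day, and year as strings."""
    return {
        "month": snapshot.strftime("%B").lower(),  # e.g. "november"
        "month_num": snapshot.strftime("%m"),      # two digits
        "year": snapshot.strftime("%Y"),
        "day": snapshot.strftime("%d"),            # two digits
    }


# Hypothetical snapshot date matching the labels used in this notebook
labels = make_time_labels(date(2021, 11, 23))
month, month_num, year, day = (
    labels["month"], labels["month_num"], labels["year"], labels["day"]
)
```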

### HRSL download

For the HRSL layer, we use the distribution of the overall population of Madagascar for 2018. Change this to the relevant HRSL layer for your target country.

Run the cells below if you have not yet downloaded the HRSL datasets.

In [8]:
# Change this url for your own country
# hrsl_url = "https://data.humdata.org/dataset/9e7ff424-7b9c-42cc-b869-5756fcad0956/resource/1fafdd04-8e0b-4c2a-b4dc-8f3ff39e3015/download/population_mdg_2018-10-01.zip"

# Update for November 23, 2021
hrsl_url = "https://data.humdata.org/dataset/9e7ff424-7b9c-42cc-b869-5756fcad0956/resource/1fafdd04-8e0b-4c2a-b4dc-8f3ff39e3015/download/mdg_general_2020_geotiff.zip"

In [9]:
'../download_data/hrsl_{}_{}_{}.zip'.format(country_code,month,year)

'../download_data/hrsl_mdg_november_2021.zip'

In [10]:
# Download and save to ../download_data
wget.download(hrsl_url, '../download_data/hrsl_{}_{}_{}.zip'.format(country_code,month,year))

'../download_data/hrsl_mdg_november_2021.zip'
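
The download cells fetch the archive again every time they run. A minimal sketch of a guard that skips files already on disk is shown below; `download_if_missing` and its `fetch` parameter are hypothetical names, and the default uses the standard library's `urlretrieve` rather than `wget`.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        # fetch is any callable that accepts (url, destination_path)
        fetch(url, dest)
    return dest
```

In this notebook the call could look like `download_if_missing(hrsl_url, '../download_data/hrsl_{}_{}_{}.zip'.format(country_code, month, year))`. Passing `fetch=wget.download` should keep the original downloader, since it also takes the URL and output path as its first two arguments.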

### Admin boundary download

Run the cells below if you have not yet downloaded the admin boundary datasets.

In [11]:
# Change this url for your own country
adm_url = "https://data.humdata.org/dataset/26fa506b-0727-4d9d-a590-d2abee21ee22/resource/ed94d52e-349e-41be-80cb-62dc0435bd34/download/mdg_adm_bngrc_ocha_20181031_shp.zip"

In [12]:
# Download and save to ../download_data
wget.download(adm_url, '../download_data/{}_adm_all.zip'.format(country_code))

'../download_data/mdg_adm_all.zip'

### OSM download

Run the cells below if you have not yet downloaded the OSM datasets.

In [13]:
# Change this url for your own country
osm_url = "https://download.geofabrik.de/africa/madagascar-latest-free.shp.zip"

In [14]:
# Download and save to ../download_data
wget.download(osm_url, '../download_data/{}-latest-free.shp.zip'.format(country_code))

'../download_data/mdg-latest-free.shp.zip'

In [11]:
## 210701 update, for sense-checking
# Change this url for your own country
osm_url = "https://download.geofabrik.de/africa/madagascar-210101-free.shp.zip"

# Download and save to ../download_data
wget.download(osm_url, '../download_data/{}-210701-free.shp.zip'.format(country_code))

'../download_data/mdg-210701-free.shp.zip'

### Unzip all datasets

In [12]:
for i in glob.glob("../download_data/*.zip"):
    if os.path.isdir(os.path.splitext(i)[0]):
        pass
    else:
        with ZipFile(i) as myzip:
            myzip.extractall(os.path.splitext(i)[0])

## Get intersection of HRSL pixels and OSM buildings

We use Facebook’s High Resolution Settlement Layer (HRSL) as a proxy ground truth for building footprints. We then measure data completeness as the “percentage completeness”: the percentage of HRSL pixels that intersect at least one OSM building footprint.

![](../assets/formula.png)

Pixels that intersect OSM buildings are *mapped*.

Pixels that do not intersect OSM buildings are *unmapped*.
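
Because the formula above is only available as an image, here it is restated as code, consistent with the calculation used in the "Calculate percent completeness" cell of this notebook. `percent_completeness` is a hypothetical helper name.

```python
def percent_completeness(n_mapped, n_unmapped):
    """Share of settlement pixels that intersect an OSM building, in percent."""
    total = n_mapped + n_unmapped
    if total == 0:
        raise ValueError("no settlement pixels to evaluate")
    return n_mapped / total * 100


# Counts taken from the MG11101001 row of this notebook's pivot table
print(round(percent_completeness(3241, 965), 2))  # 77.06
```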

In [3]:
# # Open tiff file
# hrsl_layer = rasterio.open(
#     "../download_data/hrsl_{}_{}_{}/population_mdg_2018-10-01.tif"\
#     .format(country_code,month,year)
# )

# # Update for November 23, 2021
# hrsl_layer = rasterio.open(
#     "../download_data/.tif"\
#     .format(country_code,month,year)
# )

In [3]:
# Older HRSL file from 2018 (debug)
hrsl_layer = rasterio.open(
    "../download_data/2021-07-13-osm-completeness-madagascar_population_mdg_2018-10-01.tif"
)

In [4]:
# Read the validity mask of the first band (0 = nodata, 255 = valid data)
hrsl_band1_mask = hrsl_layer.read_masks(1)

### Convert HRSL dataset from raster to vector

In [5]:
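# Give every pixel a random value: rasterio.features.shapes merges connected
# pixels that share a value, so (almost surely) distinct values keep one
# polygon per pixel
hrsl_rand = np.random.rand(*hrsl_band1_mask.shape).astype("float32")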

In [6]:
hrsl_band1_poly = list(
    rasterio.features.shapes(
        hrsl_rand, transform=hrsl_layer.transform, mask=hrsl_band1_mask
    )
)
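
Each polygon produced here is the footprint of one raster pixel, positioned by the layer's affine transform. As a rough illustration of that mapping (not rasterio's implementation), the sketch below computes the four corners of a pixel from its row and column, assuming the six affine coefficients `(a, b, c, d, e, f)`. `pixel_to_polygon` and the toy grid are hypothetical.

```python
def pixel_to_polygon(row, col, transform):
    """Return the four (x, y) corners of one raster pixel.

    transform holds the affine coefficients (a, b, c, d, e, f), where
    x = a * col + b * row + c and y = d * col + e * row + f.
    """
    a, b, c, d, e, f = transform

    def to_xy(r, k):
        return (a * k + b * r + c, d * k + e * r + f)

    # Upper-left, upper-right, lower-right, lower-left
    return [
        to_xy(row, col),
        to_xy(row, col + 1),
        to_xy(row + 1, col + 1),
        to_xy(row + 1, col),
    ]


# Hypothetical north-up grid: 0.5-degree pixels with origin at (40.0, -10.0)
toy_transform = (0.5, 0.0, 40.0, 0.0, -0.5, -10.0)
corners = pixel_to_polygon(2, 3, toy_transform)
```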

In [7]:
hrsl_geom = []
for geom, value in hrsl_band1_poly:
    geom = shapely.geometry.shape(geom)
    hrsl_geom.append(geom)

In [8]:
# One row per pixel polygon; the "index" column serves as a pixel ID
hrsl_gdf = gpd.GeoDataFrame(geometry=hrsl_geom, crs="EPSG:4326")
hrsl_gdf.reset_index(level=0, inplace=True)

In [9]:
hrsl_gdf.to_file('../data/hrsl_{}.gpkg'.format(country_code), driver='GPKG')

### Load OSM dataset

In [32]:
# The file format is inferred from the .shp extension
osm_gdf = gpd.read_file(
    "../download_data/{}-latest-free.shp/gis_osm_buildings_a_free_1.shp".format(country_code)
)

In [33]:
osm_gdf.to_file('../data/{}-latest-free.gpkg'.format(country_code),driver = 'GPKG')

In [13]:
# # 210701
# osm_gdf = gpd.read_file(
#     "../download_data/{}-210701-free.shp/gis_osm_buildings_a_free_1.shp".format(country_code),
#     driver="shp",
# )

# osm_gdf.to_file('../data/{}-210701-free.gpkg'.format(country_code),driver = 'GPKG')

### Get mapped pixels

In [10]:
# Just run this so you don't have to rerun everything up top
hrsl_gdf = gpd.read_file('../data/hrsl_{}.gpkg'.format(country_code), driver='GPKG')

In [11]:
hrsl_gdf

Unnamed: 0,index,geometry
0,0,"POLYGON ((49.24486 -11.95569, 49.24486 -11.955..."
1,1,"POLYGON ((49.27153 -11.95597, 49.27153 -11.956..."
2,2,"POLYGON ((49.26042 -11.97458, 49.26042 -11.974..."
3,3,"POLYGON ((49.26458 -11.97986, 49.26458 -11.980..."
4,4,"POLYGON ((49.26458 -11.98014, 49.26458 -11.980..."
...,...,...
1583561,1583561,"POLYGON ((45.28903 -25.57153, 45.28903 -25.571..."
1583562,1583562,"POLYGON ((45.30958 -25.57208, 45.30958 -25.572..."
1583563,1583563,"POLYGON ((45.31486 -25.57319, 45.31486 -25.573..."
1583564,1583564,"POLYGON ((45.21764 -25.57431, 45.21764 -25.574..."


In [12]:
# 2021 OSM File
osm_gdf = gpd.read_file('../data/{}-latest-free.gpkg'.format(country_code),driver = 'GPKG')

In [15]:
# # 210701 
# osm_gdf = gpd.read_file('../data/{}-210701-free.gpkg'.format(country_code),driver = 'GPKG')

In [13]:
pixels_with_buildings = gpd.sjoin(
    hrsl_gdf, osm_gdf, how="inner", predicate="intersects"
)

In [14]:
pixels_with_buildings = pixels_with_buildings.drop_duplicates(subset='index')

In [15]:
# Show the result
pixels_with_buildings.head(5)

Unnamed: 0,index,geometry,index_right,osm_id,code,fclass,name,type
1,1,"POLYGON ((49.27153 -11.95597, 49.27153 -11.956...",574,169708515,1500,building,,
5,5,"POLYGON ((49.25264 -11.99097, 49.25264 -11.991...",242796,499544846,1500,building,,
6,6,"POLYGON ((49.22847 -12.04597, 49.22847 -12.046...",714005,717030538,1500,building,,
7,7,"POLYGON ((49.22847 -12.04625, 49.22847 -12.046...",714005,717030538,1500,building,,
9,9,"POLYGON ((49.23264 -12.05403, 49.23264 -12.054...",293027,541953095,1500,building,,


In [16]:
pixels_with_buildings.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 323972 entries, 1 to 1583555
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   index        323972 non-null  int64   
 1   geometry     323972 non-null  geometry
 2   index_right  323972 non-null  int64   
 3   osm_id       323972 non-null  object  
 4   code         323972 non-null  int64   
 5   fclass       323972 non-null  object  
 6   name         3215 non-null    object  
 7   type         8985 non-null    object  
dtypes: geometry(1), int64(3), object(4)
memory usage: 22.2+ MB


In [17]:
# Drop unnecessary columns
pixels_with_buildings.drop(columns=['index_right', 'osm_id', 'code', 'fclass', 'name', 'type'], inplace=True)

In [18]:
# Save to file
pixels_with_buildings.to_file('../data/{}_pixels_with_buildings.gpkg'.format(country_code), driver='GPKG')

### Get unmapped pixels

In [19]:
pixels_no_buildings = pd.merge(hrsl_gdf, pixels_with_buildings, how='outer', indicator=True)

In [20]:
pixels_no_buildings

Unnamed: 0,index,geometry,_merge
0,0,"POLYGON ((49.24486 -11.95569, 49.24486 -11.955...",left_only
1,1,"POLYGON ((49.27153 -11.95597, 49.27153 -11.956...",both
2,2,"POLYGON ((49.26042 -11.97458, 49.26042 -11.974...",left_only
3,3,"POLYGON ((49.26458 -11.97986, 49.26458 -11.980...",left_only
4,4,"POLYGON ((49.26458 -11.98014, 49.26458 -11.980...",left_only
...,...,...,...
1583561,1583561,"POLYGON ((45.28903 -25.57153, 45.28903 -25.571...",left_only
1583562,1583562,"POLYGON ((45.30958 -25.57208, 45.30958 -25.572...",left_only
1583563,1583563,"POLYGON ((45.31486 -25.57319, 45.31486 -25.573...",left_only
1583564,1583564,"POLYGON ((45.21764 -25.57431, 45.21764 -25.574...",left_only


In [21]:
pixels_no_buildings = pixels_no_buildings[pixels_no_buildings['_merge'] == 'left_only'].copy()

In [22]:
pixels_no_buildings.drop(columns=['_merge'], inplace=True)


In [44]:
# 2021
pixels_no_buildings.to_file('../data/{}_pixels_no_buildings.gpkg'.format(country_code), driver='GPKG')
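
The unmapped pixels are the HRSL pixels whose `index` never appears among the mapped pixels. The plain-Python sketch below shows that anti-join on hypothetical toy IDs; `split_pixels` is an illustrative helper and not the `pd.merge` call used above.

```python
def split_pixels(all_ids, joined_ids):
    """Split pixel IDs into mapped and unmapped sets.

    joined_ids may contain repeats: a pixel that touches several
    buildings appears once per building in a spatial join.
    """
    all_ids = set(all_ids)
    mapped = set(joined_ids) & all_ids
    unmapped = all_ids - mapped
    return mapped, unmapped


# Hypothetical toy data: six pixels, pixel 1 touches two buildings
mapped, unmapped = split_pixels(range(6), [1, 1, 5])
print(sorted(mapped), sorted(unmapped))  # [1, 5] [0, 2, 3, 4]
```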

### Calculate percent completeness

In [23]:
len(pixels_with_buildings) / (len(pixels_with_buildings) + len(pixels_no_buildings)) * 100

20.458383168115507

## Aggregate to different admin boundaries

In this section, we label each pixel with the administrative region it belongs to. This allows us to calculate percent completeness per region.
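
The cells in this section reduce each pixel polygon to its centroid before joining it to the admin boundaries. One effect is that a pixel straddling a boundary is assigned to only one region. The sketch below illustrates this with a simple ray-casting point-in-polygon test on hypothetical toy regions; it is not how geopandas implements `sjoin`.

```python
def centroid_of_box(xmin, ymin, xmax, ymax):
    """Centroid of an axis-aligned rectangular pixel."""
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)


def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies strictly inside polygon.

    polygon is a list of (x, y) vertices without a repeated closing vertex.
    Points exactly on an edge are not handled specially.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray to the right of the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical regions sharing the border x = 1, and a pixel straddling it
region_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
region_b = [(1, 0), (2, 0), (2, 1), (1, 1)]
pixel_centre = centroid_of_box(0.75, 0.25, 1.5, 0.75)  # (1.125, 0.5)
in_a = point_in_polygon(pixel_centre, region_a)
in_b = point_in_polygon(pixel_centre, region_b)
```

Here the pixel overlaps both regions, but its centroid falls only in `region_b`.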

### Turn mapped pixels from a polygon layer to a point layer

In [47]:
# # Just run this so you don't have to rerun everything up top
pixels_with_buildings = gpd.read_file(
    "../data/{}_pixels_with_buildings.gpkg".format(country_code), driver="GPKG"
)

In [48]:
pixels_with_buildings["geometry"] = pixels_with_buildings["geometry"].centroid



### Turn unmapped pixels from a polygon layer to a point layer

In [49]:
# # Just run this so you don't have to rerun everything up top
pixels_no_buildings = gpd.read_file(
    "../data/{}_pixels_no_buildings.gpkg".format(country_code), driver="GPKG"
)

In [50]:
pixels_no_buildings["geometry"] = pixels_no_buildings["geometry"].centroid



### Load admin boundary

In this section, we aggregate the pixels using the level 4 admin boundaries.

Note: It is generally preferable to use the **lowest/most granular** boundary level. For admin boundaries available in HDX, the lowest-level boundary file also records which higher-level boundaries each unit belongs to. This allows us to tag each pixel with multiple admin levels in one `.sjoin()`. However, the code will take much longer to run than with higher levels.

In [51]:
# Change this link depending on the file name of your 
# chosen admin boundary
adm_gdf = gpd.read_file(
    "../download_data/{}_adm_all/mdg_admbnda_adm4_BNGRC_OCHA_20181031.shp".format(country_code)
)

In [53]:
# Show the admin boundaries structure
adm_gdf.head()

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,ADM3_PCODE,ADM3_EN,ADM3_TYPE,ADM4_PCODE,ADM4_EN,ADM4_TYPE,PROV_CODE_,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
0,MG,Madagascar,MG22,Amoron I Mania,Region,MG22203,Ambositra,District,MG22203090,Andina,Commune,MG22203090008,Ampasina,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((47.10540 -20.51447, 47.11485 -20.514..."
1,MG,Madagascar,MG22,Amoron I Mania,Region,MG22203,Ambositra,District,MG22203090,Andina,Commune,MG22203090011,Antanifotsy,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((47.13580 -20.54133, 47.13600 -20.544..."
2,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192006,Amboloando,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.58350 -22.53003, 45.59102 -22.534..."
3,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192002,Vatambe Nanarena,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.51825 -22.27433, 45.51900 -22.274..."
4,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192003,Vohimary,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.62246 -22.40084, 45.62173 -22.414..."


In [54]:
adm_gdf[adm_gdf.ADM1_EN.isin(['Sava','Vatovavy-Fitovanany'])]

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,ADM3_PCODE,ADM3_EN,ADM3_TYPE,ADM4_PCODE,ADM4_EN,ADM4_TYPE,PROV_CODE_,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
84,MG,Madagascar,MG72,Sava,Region,MG72711,Sambava,District,MG72711330,Andrahanjo,Commune,MG72711330006,Ambohimanarina,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.81025 -14.28598, 49.81690 -14.281..."
85,MG,Madagascar,MG72,Sava,Region,MG72711,Sambava,District,MG72711330,Andrahanjo,Commune,MG72711330007,Tananambo,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.84363 -14.29923, 49.83757 -14.307..."
86,MG,Madagascar,MG72,Sava,Region,MG72711,Sambava,District,MG72711330,Andrahanjo,Commune,MG72711330008,Ambavala,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.86035 -14.28788, 49.86468 -14.292..."
94,MG,Madagascar,MG72,Sava,Region,MG72711,Sambava,District,MG72711230,Anjangoveratra,Commune,MG72711230003,Mananjaran'ifosa,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((50.10627 -14.01733, 50.10921 -14.024..."
95,MG,Madagascar,MG72,Sava,Region,MG72711,Sambava,District,MG72711230,Anjangoveratra,Commune,MG72711230004,Maevatananan'ifosa,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((50.11735 -14.05690, 50.08215 -14.067..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17295,MG,Madagascar,MG72,Sava,Region,MG72712,Andapa,District,MG72712070,Belaoka Marovato,Commune,MG72712070001,Belaoka Marovato,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.61970 -14.59737, 49.62428 -14.606..."
17296,MG,Madagascar,MG72,Sava,Region,MG72712,Andapa,District,MG72712070,Belaoka Marovato,Commune,MG72712070003,Morafeno,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.61117 -14.59053, 49.60895 -14.594..."
17297,MG,Madagascar,MG72,Sava,Region,MG72712,Andapa,District,MG72712070,Belaoka Marovato,Commune,MG72712070002,Antanambaobe,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.59015 -14.58973, 49.59274 -14.590..."
17298,MG,Madagascar,MG72,Sava,Region,MG72712,Andapa,District,MG72712070,Belaoka Marovato,Commune,MG72712070005,Tsaratanana,Fokontany,7,Antsiranana,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((49.62116 -14.63339, 49.62068 -14.635..."


In [55]:
# Check for unique values
adm_gdf.nunique()

ADM0_PCODE        1
ADM0_EN           1
ADM1_PCODE       22
ADM1_EN          22
ADM1_TYPE         1
ADM2_PCODE      119
ADM2_EN         119
ADM2_TYPE         1
ADM3_PCODE     1579
ADM3_EN        1429
ADM3_TYPE         1
ADM4_PCODE    17465
ADM4_EN       10977
ADM4_TYPE         1
PROV_CODE_        6
OLD_PROVIN        6
PROV_TYPE         1
NOTES             1
SOURCE           10
geometry      17465
dtype: int64

Note: Make sure that the target PCODEs for the geometries are unique or 1:1. If not, you may need to create a unique index to label the boundaries.
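
A quick way to check this is to list any codes that occur more than once. The sketch below uses a hypothetical helper, `duplicated_codes`, on toy PCODEs.

```python
from collections import Counter


def duplicated_codes(codes):
    """Return the codes that occur more than once, with their counts."""
    return {code: n for code, n in Counter(codes).items() if n > 1}


# Hypothetical PCODE list with one repeated code
toy_codes = ["MG001", "MG002", "MG002", "MG003"]
print(duplicated_codes(toy_codes))  # {'MG002': 2}
```

On this notebook's data, `duplicated_codes(adm_gdf["ADM4_PCODE"])` should come back empty, since `nunique()` above reports 17465 unique `ADM4_PCODE` values for 17465 geometries. pandas also offers `Series.is_unique` for the same check.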

### Find intersection of mapped pixels and level 4 admin boundary

In [56]:
pixels_with_buildings_sjoin_adm = gpd.sjoin(
    pixels_with_buildings, adm_gdf, how="left", predicate="within"
)


In [57]:
# Drop all columns except for geometry and PCODEs
pixels_with_buildings_sjoin_adm.drop(
    columns=[
        'index_right', 
        'ADM0_PCODE',
        'ADM0_EN',
        'ADM1_EN',
        'ADM1_TYPE',
        'ADM2_EN',
        'ADM2_TYPE',
        'ADM3_EN',
        'ADM3_TYPE',
        'ADM4_EN',
        'ADM4_TYPE',
        'PROV_CODE_',
        'OLD_PROVIN',
        'PROV_TYPE',
        'NOTES',
        'SOURCE'
    ],
    inplace=True,
)

In [58]:
# Show the output
pixels_with_buildings_sjoin_adm.head()

Unnamed: 0,index,geometry,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE
0,1,POINT (49.27167 -11.95611),MG71,MG71713,MG71713170,MG71713170002
1,5,POINT (49.25278 -11.99111),MG71,MG71713,MG71713170,MG71713170002
2,6,POINT (49.22861 -12.04611),MG71,MG71713,MG71713170,MG71713170001
3,7,POINT (49.22861 -12.04639),MG71,MG71713,MG71713170,MG71713170001
4,9,POINT (49.23278 -12.05417),MG71,MG71713,MG71713170,MG71713170001


In [59]:
pixels_with_buildings_sjoin_adm.to_file(
    "../data/{}_pixels_with_buildings_sjoin_{}.gpkg".format(country_code,adm_code), driver="GPKG"
)

### Find intersection of unmapped pixels and level 4 admin boundary

In [60]:
pixels_no_buildings_sjoin_adm = gpd.sjoin(
    pixels_no_buildings, adm_gdf, how="left", predicate="within"
)


In [61]:
# Drop all columns except for geometry and PCODEs
pixels_no_buildings_sjoin_adm.drop(
    columns=[
        'index_right', 
        'ADM0_PCODE',
        'ADM0_EN',
        'ADM1_EN',
        'ADM1_TYPE',
        'ADM2_EN',
        'ADM2_TYPE',
        'ADM3_EN',
        'ADM3_TYPE',
        'ADM4_EN',
        'ADM4_TYPE',
        'PROV_CODE_',
        'OLD_PROVIN',
        'PROV_TYPE',
        'NOTES',
        'SOURCE'
    ],
    inplace=True,
)

In [62]:
# Show the output
pixels_no_buildings_sjoin_adm.head()

Unnamed: 0,index,geometry,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE
0,0,POINT (49.24500 -11.95583),MG71,MG71713,MG71713170,MG71713170002
1,2,POINT (49.26056 -11.97472),MG71,MG71713,MG71713170,MG71713170002
2,3,POINT (49.26472 -11.98000),MG71,MG71713,MG71713170,MG71713170002
3,4,POINT (49.26472 -11.98028),MG71,MG71713,MG71713170,MG71713170002
4,8,POINT (49.22889 -12.04639),MG71,MG71713,MG71713170,MG71713170001


In [63]:
pixels_no_buildings_sjoin_adm.to_file(
    "../data/{}_pixels_no_buildings_sjoin_{}.gpkg".format(country_code,adm_code), driver="GPKG"
)

## Saving pixel output

The final pixel output will have each row as a pixel, labelled by admin boundary PCODE and mapped/unmapped status.

In [64]:
# # Just run this so you don't have to rerun everything up top
# pixels_with_buildings_sjoin_adm.to_file(
#     "../data/{}_pixels_with_buildings_sjoin_{}.gpkg".format(country_code,adm_code), driver="GPKG"
# )

In [65]:
# # Just run this so you don't have to rerun everything up top
# pixels_no_buildings_sjoin_adm.to_file(
#     "../data/{}_pixels_no_buildings_sjoin_{}.gpkg".format(country_code,adm_code), driver="GPKG"
# )

In [66]:
# Drop nan values in the dataframe
# Specify columns you want to check
pixels_no_buildings_sjoin_adm.dropna(
    subset=["ADM4_PCODE","ADM3_PCODE","ADM2_PCODE","ADM1_PCODE"], inplace=True
)

pixels_with_buildings_sjoin_adm.dropna(
    subset=["ADM4_PCODE","ADM3_PCODE","ADM2_PCODE","ADM1_PCODE"], inplace=True
)

In [67]:
# Adding a column to indicate that these pixels are mapped/unmapped
pixels_with_buildings_sjoin_adm["status"] = "mapped"
pixels_no_buildings_sjoin_adm["status"] = "unmapped"

In [68]:
# Join the mapped and unmapped pixels together
pixels_all = pd.concat(
    [pixels_with_buildings_sjoin_adm, pixels_no_buildings_sjoin_adm]
)

The final output looks like this:

In [69]:
pixels_all.head()

Unnamed: 0,index,geometry,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE,status
0,1,POINT (49.27167 -11.95611),MG71,MG71713,MG71713170,MG71713170002,mapped
1,5,POINT (49.25278 -11.99111),MG71,MG71713,MG71713170,MG71713170002,mapped
2,6,POINT (49.22861 -12.04611),MG71,MG71713,MG71713170,MG71713170001,mapped
3,7,POINT (49.22861 -12.04639),MG71,MG71713,MG71713170,MG71713170001,mapped
4,9,POINT (49.23278 -12.05417),MG71,MG71713,MG71713170,MG71713170001,mapped


In [70]:
# 2020
# Save all columns (including geometry) to gpkg
pixels_all.to_file("../data/{}_pixels_all_{}_{}.gpkg".format(country_code,month,year), driver="GPKG")

In [71]:
# 2020
# Drop geometry column
pixels_all.drop(columns=["geometry"], inplace=True)

In [72]:
# 2020
# Save remaining columns as csv 
pixels_all.to_csv("../data/{}_pixels_all_{}_{}.csv".format(country_code,month,year), index=False)

In [73]:
type(pixels_all)

geopandas.geodataframe.GeoDataFrame

## Calculate percent completeness per admin boundary level

In the following code snippets, we use a pivot table to get the number of mapped and unmapped pixels for each administrative region, then calculate the percent completeness.

We use `ADM3_PCODE` as the index for each admin region, matching `adm_code`. Change this depending on your own use case.
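
Because every pixel carries the PCODEs of all admin levels, the same counting works at any level by changing the key. The plain-Python sketch below shows the logic on hypothetical toy pixels; `completeness_by_region` is an illustrative helper, not the pandas pivot used in the cells that follow.

```python
from collections import defaultdict


def completeness_by_region(pixels, level):
    """Count mapped/unmapped pixels per region and compute percent completeness.

    pixels is an iterable of dicts holding a PCODE under `level` and a
    "status" of either "mapped" or "unmapped".
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for pixel in pixels:
        counts[pixel[level]][pixel["status"]] += 1
    return {
        code: {**c, "percent": c["mapped"] / (c["mapped"] + c["unmapped"]) * 100}
        for code, c in counts.items()
    }


# Hypothetical toy pixels tagged at two admin levels
toy_pixels = [
    {"ADM3_PCODE": "A1", "ADM2_PCODE": "A", "status": "mapped"},
    {"ADM3_PCODE": "A1", "ADM2_PCODE": "A", "status": "unmapped"},
    {"ADM3_PCODE": "A2", "ADM2_PCODE": "A", "status": "unmapped"},
    {"ADM3_PCODE": "A2", "ADM2_PCODE": "A", "status": "unmapped"},
]
by_adm3 = completeness_by_region(toy_pixels, "ADM3_PCODE")
by_adm2 = completeness_by_region(toy_pixels, "ADM2_PCODE")
```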

In [132]:
adm_code

'adm3'

In [133]:
# Just run this so you don't have to rerun everything up top
pixels_all = pd.read_csv("../data/{}_pixels_all_{}_{}.csv".format(country_code,month,year))

In [134]:
pixels_all.head()

Unnamed: 0,index,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE,status
0,1,MG71,MG71713,MG71713170,MG71713170002,mapped
1,5,MG71,MG71713,MG71713170,MG71713170002,mapped
2,6,MG71,MG71713,MG71713170,MG71713170001,mapped
3,7,MG71,MG71713,MG71713170,MG71713170001,mapped
4,9,MG71,MG71713,MG71713170,MG71713170001,mapped


In [135]:
# Load admin boundaries with relevant information
adm_gdf = gpd.read_file(
    "../download_data/{}_adm_all/mdg_admbnda_adm3_BNGRC_OCHA_20181031.shp".format(country_code)
)

In [136]:
adm_gdf.head()

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,ADM3_PCODE,ADM3_EN,ADM3_TYPE,PROV_CODE_,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
0,MG,Madagascar,MG11,Analamanga,Region,MG11101001A,1er Arrondissement,District,MG11101001,1er Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891..."
1,MG,Madagascar,MG11,Analamanga,Region,MG11101002A,2e Arrondissement,District,MG11101002,2e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911..."
2,MG,Madagascar,MG11,Analamanga,Region,MG11101003A,3e Arrondissement,District,MG11101003,3e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879..."
3,MG,Madagascar,MG11,Analamanga,Region,MG11101004A,4e Arrondissement,District,MG11101004,4e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910..."
4,MG,Madagascar,MG11,Analamanga,Region,MG11101005A,5e Arrondissement,District,MG11101005,5e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854..."


### Pivot table

In [137]:
adm_pivot = pd.pivot_table(
    pixels_all[["ADM3_PCODE", "status", "index"]],
    index=["ADM3_PCODE"],
    columns=["status"],
    aggfunc="count",
    fill_value=0,
)

In [138]:
adm_pivot.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM3_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001,3241,965
MG11101002,1516,3914
MG11101003,1458,1740
MG11101004,2803,2290
MG11101005,3043,6098


In [139]:
adm_pivot.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1579 entries, MG11101001 to MG72716350
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   (index, mapped)    1579 non-null   int64
 1   (index, unmapped)  1579 non-null   int64
dtypes: int64(2)
memory usage: 37.0+ KB


The resulting dataframe has MultiIndex columns, which we flatten next.

In [140]:
# Flatten the column MultiIndex and turn ADM3_PCODE back into a regular column
adm_pivot.columns = adm_pivot.columns.get_level_values(1)
adm_pivot = adm_pivot.reset_index()

In [142]:
# Dataframe now has a regular index!
adm_pivot.head()

status,ADM3_PCODE,mapped,unmapped
0,MG11101001,3241,965
1,MG11101002,1516,3914
2,MG11101003,1458,1740
3,MG11101004,2803,2290
4,MG11101005,3043,6098


In [143]:
# Renaming mapped and unmapped columns
suffix = '{}{}'.format(month, year)
mapped_col = 'pixels_withbuilding_' + suffix
unmapped_col = 'pixels_nobuilding_' + suffix
adm_pivot.rename(columns={'mapped': mapped_col, 'unmapped': unmapped_col}, inplace=True)

# Adding a column for percent completeness
adm_pivot['percentage_completeness_' + suffix] = (
    adm_pivot[mapped_col] / (adm_pivot[mapped_col] + adm_pivot[unmapped_col])
) * 100

In [144]:
adm_pivot.head()

status,ADM3_PCODE,pixels_withbuilding_november2021,pixels_nobuilding_november2021,percentage_completeness_november2021
0,MG11101001,3241,965,77.056586
1,MG11101002,1516,3914,27.918969
2,MG11101003,1458,1740,45.590994
3,MG11101004,2803,2290,55.036324
4,MG11101005,3043,6098,33.289574


By this point, we've successfully calculated the percentage completeness for each region! Each row corresponds to a unique boundary, with columns for the mapped and unmapped pixel counts and the corresponding percent completeness.

Finally, we will join the calculated data with the admin boundaries information. Take note of which columns you would require for the final output (usually PCODE, name, and geometry).
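
A left join keeps every admin region, including any that received no pixels; in pandas, those rows would hold missing values in the joined columns. The plain-Python sketch below illustrates this on hypothetical toy records. `left_join` is an illustrative helper, not the notebook's `pd.merge` call.

```python
def left_join(regions, stats, key):
    """Attach per-region stats to every region; unmatched regions get None."""
    fields = sorted({name for row in stats.values() for name in row})
    joined = []
    for region in regions:
        row = stats.get(region[key], {})
        joined.append({**region, **{name: row.get(name) for name in fields}})
    return joined


# Hypothetical regions; "A2" has no pixel counts
toy_regions = [
    {"ADM3_PCODE": "A1", "ADM3_EN": "Alpha"},
    {"ADM3_PCODE": "A2", "ADM3_EN": "Beta"},
]
toy_stats = {"A1": {"mapped": 1, "unmapped": 1, "percent": 50.0}}
joined = left_join(toy_regions, toy_stats, "ADM3_PCODE")
```

In the Madagascar run shown in this notebook, all 1579 communes have non-null counts after the join, so this only matters when some regions contain no HRSL pixels.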

In [145]:
# Create a new dataframe that will store adm_gdf
# with the percentage completeness output,
# keeping only the columns we need to identify each region.
# Add/subtract from the columns list as needed
adm_gdf_with_output = adm_gdf[[
    'ADM3_EN',
    'ADM3_PCODE',
    'geometry'
]].copy()

adm_gdf_with_output.head()

Unnamed: 0,ADM3_EN,ADM3_PCODE,geometry
0,1er Arrondissement,MG11101001,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891..."
1,2e Arrondissement,MG11101002,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911..."
2,3e Arrondissement,MG11101003,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879..."
3,4e Arrondissement,MG11101004,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910..."
4,5e Arrondissement,MG11101005,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854..."


In [146]:
# Left joining percentage completeness values
# to their respective regions
adm_gdf_with_output = pd.merge(adm_gdf_with_output,adm_pivot,how="left", on = "ADM3_PCODE")

In [147]:
adm_gdf_with_output.head()

Unnamed: 0,ADM3_EN,ADM3_PCODE,geometry,pixels_withbuilding_november2021,pixels_nobuilding_november2021,percentage_completeness_november2021
0,1er Arrondissement,MG11101001,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",3241,965,77.056586
1,2e Arrondissement,MG11101002,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1516,3914,27.918969
2,3e Arrondissement,MG11101003,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1458,1740,45.590994
3,4e Arrondissement,MG11101004,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",2803,2290,55.036324
4,5e Arrondissement,MG11101005,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",3043,6098,33.289574


In [148]:
adm_gdf_with_output.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 6 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   ADM3_EN                               1579 non-null   object  
 1   ADM3_PCODE                            1579 non-null   object  
 2   geometry                              1579 non-null   geometry
 3   pixels_withbuilding_november2021      1579 non-null   int64   
 4   pixels_nobuilding_november2021        1579 non-null   int64   
 5   percentage_completeness_november2021  1579 non-null   float64 
dtypes: float64(1), geometry(1), int64(2), object(2)
memory usage: 86.4+ KB


### Saving percent completeness per region

In [149]:
# To gpkg
filename = "../data/mapthegap-{}-{}-{}-{}-{}.gpkg"\
.format(country_code,adm_code,year,month_num,day)
adm_gdf_with_output.to_file(filename,driver='GPKG')
filename

'../data/mapthegap-mdg-adm3-2021-11-23.gpkg'

In [150]:
# To csv
filename = "../data/mapthegap-{}-{}-{}-{}-{}.csv"\
.format(country_code,adm_code,year,month_num,day)
adm_gdf_with_output.to_csv(filename,index=False)
filename

'../data/mapthegap-mdg-adm3-2021-11-23.csv'

# Finish!
By this point, you should already have the percent completeness for the whole country and per admin region.

To convert this into tilesets for plotting, refer to `2_CreateTileset.ipynb`

In [2]:
! gsutil cp -n ../data/mapthegap-mdg-adm3-2021-11-23.gpkg gs://tm-ardie

Copying file://../data/mapthegap-mdg-adm3-2021-11-23.gpkg [Content-Type=application/octet-stream]...
- [1 files][ 23.9 MiB/ 23.9 MiB]                                                
Operation completed over 1 objects/23.9 MiB.                                     


