# Methodology (MDG)

This notebook shows the methodology for the paper "Measuring OpenStreetMap building footprint completeness using human settlement layers", applied specifically to Madagascar.

Note: the notebook processes OSM completeness for two dates: January 2020 and July 12, 2021, and combines these results together.

## Setup

We import the relevant packages and download the datasets.

For reference, here are the original download links for the datasets:
1. High Resolution Settlement Layer (HRSL) ([Philippines](https://data.humdata.org/dataset/philippines-high-resolution-population-density-maps-demographic-estimates)) ([Madagascar](https://data.humdata.org/dataset/highresolutionpopulationdensitymaps-mdg))
2. Administrative Boundaries ([Philippines](https://data.humdata.org/dataset/philippines-administrative-levels-0-to-3)) ([Madagascar](https://data.humdata.org/dataset/madagascar-administrative-level-0-4-boundaries))
3. OpenStreetMap (OSM) ([Philippines](https://download.geofabrik.de/asia/philippines.html)) ([Madagascar](https://download.geofabrik.de/africa/madagascar.html))

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import shapely
import geopandas as gpd
import rasterio
import rasterio.features

import wget

import os
import glob
from zipfile import ZipFile

In [3]:
# Create the download folder if it does not exist yet
os.makedirs("../download_data", exist_ok=True)

### HRSL download

Run the cells below only if you have not yet downloaded the HRSL dataset.

In [8]:
hrsl_mdg_url = "https://data.humdata.org/dataset/9e7ff424-7b9c-42cc-b869-5756fcad0956/resource/1fafdd04-8e0b-4c2a-b4dc-8f3ff39e3015/download/population_mdg_2018-10-01.zip"

In [15]:
wget.download(hrsl_mdg_url, '../download_data/mdg_hrsl_oct_2018.zip')

'../download_data/mdg_hrsl_oct_2018.zip'

### Admin boundary download

Run the cells below only if you have not yet downloaded the admin boundary dataset.

In [40]:
adm_mdg_url = "https://data.humdata.org/dataset/26fa506b-0727-4d9d-a590-d2abee21ee22/resource/ed94d52e-349e-41be-80cb-62dc0435bd34/download/mdg_adm_bngrc_ocha_20181031_shp.zip"

In [43]:
wget.download(adm_mdg_url, '../download_data/mdg_adm_all.zip')

'../download_data/mdg_adm_all.zip'

### OSM download

Run the cells below only if you have not yet downloaded the OSM datasets. We download both the 2020 and 2021 datasets.

In [65]:
##### 2020 OSM Dataset
osm_mdg_url = "https://storage.googleapis.com/osm-completeness-thinkingmachines/mdg_osm_jan_2020_buildings.gpkg.zip"

In [66]:
wget.download(osm_mdg_url, '../download_data/mdg_osm_jan_2020_buildings.gpkg.zip')

'../download_data/mdg_osm_jan_2020_buildings.gpkg.zip'

In [10]:
##### 2021 OSM Dataset
osm_mdg_url = "https://download.geofabrik.de/africa/madagascar-latest-free.shp.zip"

In [11]:
wget.download(osm_mdg_url, '../download_data/madagascar-latest-free.shp.zip')

'../download_data/madagascar-latest-free.shp.zip'

### Unzip all datasets

In [67]:
for i in glob.glob("../download_data/*.zip"):
    if os.path.isdir(os.path.splitext(i)[0]):
        pass
    else:
        with ZipFile(i) as myzip:
            myzip.extractall(os.path.splitext(i)[0])
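
The loop above skips any archive whose target folder already exists, so re-running the cell does not extract everything again. The same idea can be sketched as a small helper using only the standard library; `extract_once` and the `demo.zip` archive are hypothetical names used for illustration and are not part of the notebook.

```python
import os
import tempfile
from zipfile import ZipFile


def extract_once(zip_path):
    """Extract zip_path into a folder named after the archive, unless it exists."""
    target_dir = os.path.splitext(zip_path)[0]  # "x.zip" -> "x"
    if os.path.isdir(target_dir):
        return False  # already extracted, nothing to do
    with ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return True


# Demonstrate on a throwaway archive in a temporary folder
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo.zip")
with ZipFile(demo_zip, "w") as archive:
    archive.writestr("hello.txt", "hello")

first_run = extract_once(demo_zip)   # extracts into <workdir>/demo/
second_run = extract_once(demo_zip)  # skipped because the folder exists
```

Because `os.path.splitext` only strips the final `.zip`, an archive such as `mdg_osm_jan_2020_buildings.gpkg.zip` is extracted into a folder that itself ends in `.gpkg`, which is why the 2020 OSM path later in this notebook contains that folder name.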

### Load HRSL dataset

Note: unlike the Philippines, Madagascar has no separate male and female layers, so the "Add male and female population to get total population" step is skipped.

In [2]:
hrsl_mdg = rasterio.open(
    "../download_data/mdg_hrsl_oct_2018/population_mdg_2018-10-01.tif"
)

In [3]:
hrsl_mdg_band1_mask = hrsl_mdg.read_masks(1)

### Convert HRSL dataset from raster to vector

In [4]:
hrsl_mdg_rand = np.random.rand(
    np.shape(hrsl_mdg_band1_mask)[0], np.shape(hrsl_mdg_band1_mask)[1]
)
hrsl_mdg_rand = hrsl_mdg_rand.astype("float32")

In [5]:
hrsl_mdg_band1_poly = list(
    rasterio.features.shapes(
        hrsl_mdg_rand, transform=hrsl_mdg.transform, mask=hrsl_mdg_band1_mask
    )
)
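
The notebook does not say why a random raster is vectorized instead of the population band itself. A plausible reading is that `rasterio.features.shapes` groups connected pixels sharing the same value into a single polygon, so random values keep each valid pixel as its own polygon. The NumPy-only sketch below illustrates that reasoning by counting adjacent equal-valued pixel pairs; `count_equal_neighbours` is a hypothetical helper and the 8x8 rasters are toy data.

```python
import numpy as np


def count_equal_neighbours(raster):
    """Count horizontally or vertically adjacent pixel pairs with equal values."""
    horizontal = np.sum(raster[:, 1:] == raster[:, :-1])
    vertical = np.sum(raster[1:, :] == raster[:-1, :])
    return int(horizontal + vertical)


rng = np.random.default_rng(0)
constant_raster = np.ones((8, 8), dtype="float32")    # every neighbour matches
random_raster = rng.random((8, 8)).astype("float32")  # matches are very unlikely

pairs_constant = count_equal_neighbours(constant_raster)
pairs_random = count_equal_neighbours(random_raster)
```

With a constant raster every one of the 112 neighbouring pairs matches, so the whole grid would merge into one region; with random values, matching neighbours are extremely unlikely.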

In [6]:
hrsl_mdg_geom = []
for geom, value in hrsl_mdg_band1_poly:
    geom = shapely.geometry.shape(geom)
    hrsl_mdg_geom.append(geom)

In [7]:
hrsl_mdg_gdf = pd.DataFrame(hrsl_mdg_geom)
hrsl_mdg_gdf = gpd.GeoDataFrame(hrsl_mdg_gdf, geometry=hrsl_mdg_gdf[0], crs="EPSG:4326")
hrsl_mdg_gdf.drop(columns=[0], inplace=True)
hrsl_mdg_gdf.reset_index(level=0, inplace=True)

In [74]:
hrsl_mdg_gdf.to_file('../data/hrsl_mdg.gpkg', driver='GPKG')

Note: At this point in the code, processing will be split into separate portions for 2020 and 2021 respectively. 

# 2021

### Get intersection of HRSL pixels and OSM Buildings

#### Load OSM dataset

In [28]:
# The 2021 zip file contains many shapefiles (points of interest, land use, waterways, etc.).
# We only load the building footprint layer; the format is inferred from the .shp extension.
osm_mdg = gpd.read_file(
    "../download_data/madagascar-latest-free.shp/gis_osm_buildings_a_free_1.shp"
)

In [29]:
osm_mdg.to_file('../data/madagascar-latest-free.gpkg',driver = 'GPKG')

#### Get mapped pixels

In [30]:
# Just run this so you don't have to rerun everything up top
hrsl_mdg_gdf = gpd.read_file('../data/hrsl_mdg.gpkg', driver='GPKG')

In [30]:
# 2021 OSM File
# Just run this so you don't have to rerun everything up top
osm_mdg = gpd.read_file('../data/madagascar-latest-free.gpkg',driver = 'GPKG')

In [76]:
mdg_pixels_with_buildings = gpd.sjoin(
    hrsl_mdg_gdf, osm_mdg, how="inner", op="intersects"
)

In [77]:
mdg_pixels_with_buildings = mdg_pixels_with_buildings.drop_duplicates(subset='index')
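
The spatial join returns one row per pixel-building pair, so a pixel that intersects several buildings appears several times. Dropping duplicates on the pixel `index` keeps each mapped pixel once. A minimal pandas sketch with made-up rows (the `osm_id` values are hypothetical):

```python
import pandas as pd

# Hypothetical join result: pixel 1 intersects two buildings, pixel 5 one
joined = pd.DataFrame(
    {
        "index": [1, 1, 5],
        "osm_id": ["b10", "b11", "b20"],
    }
)

# Keep the first row for each pixel id
pixels = joined.drop_duplicates(subset="index")
```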

In [78]:
mdg_pixels_with_buildings.drop(columns=['index_right', 'osm_id', 'code', 'fclass', 'name', 'type'], inplace=True)

In [46]:
# 2021
mdg_pixels_with_buildings.to_file('../data/mdg_pixels_with_buildings.gpkg', driver='GPKG')

#### Get unmapped pixels

In [80]:
mdg_pixels_no_buildings = pd.merge(hrsl_mdg_gdf, mdg_pixels_with_buildings, how='outer', indicator=True)

In [81]:
mdg_pixels_no_buildings = mdg_pixels_no_buildings[mdg_pixels_no_buildings['_merge'] == 'left_only']

In [82]:
mdg_pixels_no_buildings.drop(columns=['_merge'], inplace=True)
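
The three cells above form an anti-join: an outer merge with `indicator=True` tags each row with its origin, and keeping only `left_only` rows leaves the pixels that never matched a building. A toy sketch of the same pattern, assuming a single key column named `index` (the notebook merges on all shared columns):

```python
import pandas as pd

all_pixels = pd.DataFrame({"index": [0, 1, 2, 3, 4]})
mapped = pd.DataFrame({"index": [1, 3]})

# The _merge column records whether a row came from left, right, or both
merged = pd.merge(all_pixels, mapped, how="outer", on="index", indicator=True)

# Rows found only in the full pixel set are the unmapped pixels
unmapped = merged[merged["_merge"] == "left_only"].drop(columns=["_merge"])
```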

In [38]:
# 2021
mdg_pixels_no_buildings.to_file('../data/mdg_pixels_no_buildings.gpkg', driver='GPKG')

#### Calculate percent completeness

In [84]:
len(mdg_pixels_with_buildings) / (len(mdg_pixels_with_buildings) + len(mdg_pixels_no_buildings)) * 100

10.89030131380777
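
The expression above is the completeness metric used throughout this notebook: mapped pixels divided by all pixels, times 100. As a small sketch, the same formula as a function; `percent_completeness` is a hypothetical name, and the example counts (3216 mapped, 990 unmapped) are taken from a row of the per-region tables further down.

```python
def percent_completeness(n_mapped, n_unmapped):
    """Share of pixels that contain at least one OSM building, in percent."""
    total = n_mapped + n_unmapped
    if total == 0:
        return float("nan")  # no pixels in the region, completeness undefined
    return n_mapped / total * 100


example = percent_completeness(3216, 990)
```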

### Aggregate to different admin boundaries

#### Turn mapped pixels from a polygon layer to a point layer

In [8]:
# # 2021
# # Just run this so you don't have to rerun everything up top
mdg_pixels_with_buildings = gpd.read_file(
    "../data/mdg_pixels_with_buildings.gpkg", driver="GPKG"
)

In [85]:
mdg_pixels_with_buildings["geometry"] = mdg_pixels_with_buildings["geometry"].centroid


#### Turn unmapped pixels from a polygon layer to a point layer

In [10]:
# # 2021
# # Just run this so you don't have to rerun everything up top
mdg_pixels_no_buildings = gpd.read_file(
    "../data/mdg_pixels_no_buildings.gpkg", driver="GPKG"
)

In [86]:
mdg_pixels_no_buildings["geometry"] = mdg_pixels_no_buildings["geometry"].centroid


#### Load level 4 admin boundary

In [12]:
mdg_adm4 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm4_BNGRC_OCHA_20181031.shp"
)

In [44]:
mdg_adm4.head()

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,ADM3_PCODE,ADM3_EN,ADM3_TYPE,ADM4_PCODE,ADM4_EN,ADM4_TYPE,PROV_CODE_,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
0,MG,Madagascar,MG22,Amoron I Mania,Region,MG22203,Ambositra,District,MG22203090,Andina,Commune,MG22203090008,Ampasina,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((47.10540 -20.51447, 47.11485 -20.514..."
1,MG,Madagascar,MG22,Amoron I Mania,Region,MG22203,Ambositra,District,MG22203090,Andina,Commune,MG22203090011,Antanifotsy,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((47.13580 -20.54133, 47.13600 -20.544..."
2,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192006,Amboloando,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.58350 -22.53003, 45.59102 -22.534..."
3,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192002,Vatambe Nanarena,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.51825 -22.27433, 45.51900 -22.274..."
4,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192003,Vohimary,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.62246 -22.40084, 45.62173 -22.414..."


In [87]:
mdg_adm4.nunique()

ADM0_PCODE        1
ADM0_EN           1
ADM1_PCODE       22
ADM1_EN          22
ADM1_TYPE         1
ADM2_PCODE      119
ADM2_EN         119
ADM2_TYPE         1
ADM3_PCODE     1579
ADM3_EN        1429
ADM3_TYPE         1
ADM4_PCODE    17465
ADM4_EN       10977
ADM4_TYPE         1
PROV_CODE_        6
OLD_PROVIN        6
PROV_TYPE         1
NOTES             1
SOURCE           10
geometry      17465
dtype: int64

Note: Unlike the PH boundaries, Madagascar has unique PCODES for each ADM4 region, so no need to create a unique index for level 4
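
In the `nunique()` output above, `ADM4_PCODE` has 17465 distinct values, the same as the number of geometries, while `ADM4_EN` has only 10977, so names repeat but codes do not. A quick way to check this kind of property is `Series.is_unique`; the sketch below uses three PCODEs from the table above, with a deliberately repeated name as made-up data.

```python
import pandas as pd

adm4 = pd.DataFrame(
    {
        "ADM4_PCODE": ["MG22203090008", "MG22203090011", "MG24216192006"],
        # The repeated name is invented here to illustrate non-unique names
        "ADM4_EN": ["Ampasina", "Antanifotsy", "Ampasina"],
    }
)

pcode_is_key = adm4["ADM4_PCODE"].is_unique  # codes can serve as an index
name_is_key = adm4["ADM4_EN"].is_unique      # names cannot
```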

#### Find intersection of mapped pixels and level 4 admin boundary

In [88]:
mdg_pixels_with_buildings_sjoin_adm4 = gpd.sjoin(
    mdg_pixels_with_buildings, mdg_adm4, how="left", op="within"
)

In [89]:
mdg_pixels_with_buildings_sjoin_adm4.drop(
    columns=[
        'index_right', 
        'ADM0_PCODE',
        'ADM0_EN',
        'ADM1_EN',
        'ADM1_TYPE',
        'ADM2_EN',
        'ADM2_TYPE',
        'ADM3_EN',
        'ADM3_TYPE',
        'ADM4_EN',
        'ADM4_TYPE',
        'PROV_CODE_',
        'OLD_PROVIN',
        'PROV_TYPE',
        'NOTES',
        'SOURCE'
    ],
    inplace=True,
)

In [90]:
mdg_pixels_with_buildings_sjoin_adm4.head()

Unnamed: 0,index,geometry,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE
1,1,POINT (49.27167 -11.95611),MG71,MG71713,MG71713170,MG71713170002
5,5,POINT (49.25278 -11.99111),MG71,MG71713,MG71713170,MG71713170002
6,6,POINT (49.22861 -12.04611),MG71,MG71713,MG71713170,MG71713170001
7,7,POINT (49.22861 -12.04639),MG71,MG71713,MG71713170,MG71713170001
9,9,POINT (49.23278 -12.05417),MG71,MG71713,MG71713170,MG71713170001


In [26]:
# # 2021
mdg_pixels_with_buildings_sjoin_adm4.to_file(
    "../data/mdg_pixels_with_buildings_sjoin_adm4.gpkg", driver="GPKG"
)

#### Find intersection of unmapped pixels and level 4 admin boundary

In [92]:
mdg_pixels_no_buildings_sjoin_adm4 = gpd.sjoin(
    mdg_pixels_no_buildings, mdg_adm4, how="left", op="within"
)

In [93]:
mdg_pixels_no_buildings_sjoin_adm4.drop(
    columns=[
        'index_right', 
        'ADM0_PCODE',
        'ADM0_EN',
        'ADM1_EN',
        'ADM1_TYPE',
        'ADM2_EN',
        'ADM2_TYPE',
        'ADM3_EN',
        'ADM3_TYPE',
        'ADM4_EN',
        'ADM4_TYPE',
        'PROV_CODE_',
        'OLD_PROVIN',
        'PROV_TYPE',
        'NOTES',
        'SOURCE'
    ],
    inplace=True,
)

In [94]:
mdg_pixels_no_buildings_sjoin_adm4.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1411110 entries, 0 to 1583564
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype   
---  ------      --------------    -----   
 0   index       1411110 non-null  int64   
 1   geometry    1411110 non-null  geometry
 2   ADM1_PCODE  1410004 non-null  object  
 3   ADM2_PCODE  1410004 non-null  object  
 4   ADM3_PCODE  1410004 non-null  object  
 5   ADM4_PCODE  1410004 non-null  object  
dtypes: geometry(1), int64(1), object(4)
memory usage: 75.4+ MB


In [27]:
# # 2021
mdg_pixels_no_buildings_sjoin_adm4.to_file(
    "../data/mdg_pixels_no_buildings_sjoin_adm4.gpkg", driver="GPKG"
)

### Saving final output

In [34]:
# # 2021
# # Just run this so you don't have to rerun everything up top
mdg_pixels_no_buildings_sjoin_adm4 = gpd.read_file(
    "../data/mdg_pixels_no_buildings_sjoin_adm4.gpkg", driver="GPKG"
)
mdg_pixels_with_buildings_sjoin_adm4 = gpd.read_file(
    "../data/mdg_pixels_with_buildings_sjoin_adm4.gpkg", driver="GPKG"
)

In [12]:
# Drop nan values in the dataframe
mdg_pixels_no_buildings_sjoin_adm4.dropna(
    subset=["ADM4_PCODE","ADM3_PCODE","ADM2_PCODE","ADM1_PCODE"], inplace=True
)

mdg_pixels_with_buildings_sjoin_adm4.dropna(
    subset=["ADM4_PCODE","ADM3_PCODE","ADM2_PCODE","ADM1_PCODE"], inplace=True
)
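
The left join leaves the PCODE columns empty for any centroid that does not fall within an ADM4 polygon. In the 2021 run above, `info()` reports 1411110 unmapped pixels but only 1410004 non-null PCODEs, so 1106 pixels are removed by this step. A toy sketch of the same bookkeeping (the rows are made up):

```python
import pandas as pd

pixels = pd.DataFrame(
    {
        "index": [0, 1, 2],
        # None stands for a centroid that matched no admin polygon
        "ADM4_PCODE": ["MG71713170002", None, "MG71713170001"],
    }
)

n_before = len(pixels)
pixels = pixels.dropna(subset=["ADM4_PCODE"])
n_dropped = n_before - len(pixels)
```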

In [13]:
# Adding a column to indicate that these pixels are mapped/unmapped
mdg_pixels_with_buildings_sjoin_adm4["status"] = "mapped"
mdg_pixels_no_buildings_sjoin_adm4["status"] = "unmapped"

In [14]:
# Join the mapped and unmapped pixels together
mdg_pixels_all = pd.concat(
    [mdg_pixels_with_buildings_sjoin_adm4, mdg_pixels_no_buildings_sjoin_adm4]
)

Final output for Madagascar should be as follows:

In [15]:
mdg_pixels_all.head()

Unnamed: 0,index,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE,geometry,status
0,1,MG71,MG71713,MG71713170,MG71713170002,POINT (49.27167 -11.95611),mapped
1,5,MG71,MG71713,MG71713170,MG71713170002,POINT (49.25278 -11.99111),mapped
2,6,MG71,MG71713,MG71713170,MG71713170001,POINT (49.22861 -12.04611),mapped
3,7,MG71,MG71713,MG71713170,MG71713170001,POINT (49.22861 -12.04639),mapped
4,9,MG71,MG71713,MG71713170,MG71713170001,POINT (49.23278 -12.05417),mapped


In [62]:
# 2021
# Save all columns (including geometry) to gpkg
mdg_pixels_all.to_file("../data/mdg_pixels_all.gpkg", driver="GPKG")

In [63]:
# 2021
# Drop geometry column
mdg_pixels_all.drop(columns=["geometry"], inplace=True)

In [64]:
# 2021
# Save remaining columns as csv
mdg_pixels_all.to_csv("../data/mdg_pixels_all.csv", index=False)

## Calculate percent completeness per admin boundary level

In the following code snippets, we use a pivot table to get the number of mapped and unmapped pixels for each administrative region, then calculate the percent completeness

We use `ADM2_PCODE`/`ADM3_PCODE` as the index for each admin region.
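
As a minimal sketch of this step on toy data (region codes `A` and `B` are placeholders, not real PCODEs), the pivot counts pixels per region and status, `fill_value=0` covers regions with no pixels of one status, and the two-level column header is flattened before computing the percentage:

```python
import pandas as pd

pixels = pd.DataFrame(
    {
        "index": [0, 1, 2, 3],
        "ADM2_PCODE": ["A", "A", "A", "B"],
        "status": ["mapped", "mapped", "unmapped", "unmapped"],
    }
)

# Count pixels per region and status; region B has no mapped pixels
counts = pd.pivot_table(
    pixels[["ADM2_PCODE", "status", "index"]],
    index=["ADM2_PCODE"],
    columns=["status"],
    aggfunc="count",
    fill_value=0,
)

# Columns are ("index", "mapped") and ("index", "unmapped"); keep the status level
counts.columns = counts.columns.get_level_values(1)
counts = counts.reset_index()

counts["pct"] = counts["mapped"] / (counts["mapped"] + counts["unmapped"]) * 100
```

Without `fill_value=0`, region `B` would have a missing mapped count and its percentage would come out as NaN instead of 0.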

### Loading mapped/unmapped pixels

In [37]:
# Just run this so you don't have to rerun everything up top
mdg_pixels_all = pd.read_csv("../data/mdg_pixels_all.csv")

### Loading admin boundaries

In [35]:
mdg_adm2 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm2_BNGRC_OCHA_20181031.shp"
)

In [36]:
mdg_adm3 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm3_BNGRC_OCHA_20181031.shp"
)

#### Level 2 Boundaries (districts)

In [45]:
# fill_value=0 replaces NaN counts with 0.
# A region with no mapped (or no unmapped) pixels would otherwise get NaN,
# which would break the percent completeness calculation.
adm2_df = pd.pivot_table(mdg_pixels_all[["ADM2_PCODE","status", "index"]], index = ["ADM2_PCODE"], columns = ["status"], aggfunc="count", fill_value=0)

In [46]:
adm2_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM2_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001A,3216,990
MG11101002A,1519,3911
MG11101003A,1456,1742
MG11101004A,2789,2304
MG11101005A,3037,6104


The resulting dataframe has two-level (MultiIndex) columns, which we flatten in the next code blocks.

In [47]:
adm2_df.columns = adm2_df.columns.get_level_values(1)
adm2_df = adm2_df.reset_index().reset_index()

In [48]:
# Dataframe now has a regular index!
adm2_df.head()

status,index,ADM2_PCODE,mapped,unmapped
0,0,MG11101001A,3216,990
1,1,MG11101002A,1519,3911
2,2,MG11101003A,1456,1742
3,3,MG11101004A,2789,2304
4,4,MG11101005A,3037,6104


In [49]:
# Drop the index column 
adm2_df.drop(["index"],axis = 1,inplace=True)

In [50]:
# Renaming columns to fit previous convention
adm2_df.rename(columns = {
    'mapped':'pixels_withbuilding_july2021',
    'unmapped':'pixels_nobuilding_july2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm2_df['percentage_completeness_july2021'] = (adm2_df['pixels_withbuilding_july2021']/(adm2_df['pixels_withbuilding_july2021'] + adm2_df['pixels_nobuilding_july2021'])) * 100

In [51]:
adm2_df.head()

status,ADM2_PCODE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001A,3216,990,76.462197
1,MG11101002A,1519,3911,27.974217
2,MG11101003A,1456,1742,45.528455
3,MG11101004A,2789,2304,54.761437
4,MG11101005A,3037,6104,33.223936


In [56]:
# Create a new dataframe that will store mdg_adm2
# with the percentage completeness output
mdg_adm2_with_output = mdg_adm2

# Get only the columns we need to identify region
mdg_adm2_with_output = mdg_adm2_with_output[[
    'ADM2_EN',
    'ADM2_PCODE',
 ]]

In [57]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm2_with_output = pd.merge(mdg_adm2_with_output,adm2_df,how="left", on = "ADM2_PCODE")

In [58]:
# Left joining the new percent completeness values
# to the admin boundaries using ADM2_PCODE as the key
# For the second dataframe, we only keep wanted columns
district_output_july2021 = pd.merge(
    mdg_adm2[['ADM2_PCODE','ADM2_EN','ADM2_TYPE','geometry']],
    mdg_adm2_with_output[['ADM2_PCODE',\
                          'pixels_withbuilding_july2021',\
                          'pixels_nobuilding_july2021',\
                          'percentage_completeness_july2021',\
                         ]],
    how="left", 
    on = "ADM2_PCODE"
)

In [59]:
district_output_july2021.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM2_PCODE                        119 non-null    object  
 1   ADM2_EN                           119 non-null    object  
 2   ADM2_TYPE                         119 non-null    object  
 3   geometry                          119 non-null    geometry
 4   pixels_withbuilding_july2021      119 non-null    int64   
 5   pixels_nobuilding_july2021        119 non-null    int64   
 6   percentage_completeness_july2021  119 non-null    float64 
dtypes: float64(1), geometry(1), int64(2), object(3)
memory usage: 7.4+ KB


In [60]:
# Save to .gpkg file
filename = "../data/mapthegap-mdg-adm2-2021-07-12.gpkg"
district_output_july2021.to_file(filename, driver="GPKG")

#### Level 3 (Commune)

In [61]:
# Pivot table of mapped and unmapped pixels
# fill_value=0 replaces NaN counts with 0.
# A region with no mapped (or no unmapped) pixels would otherwise get NaN,
# which would break the percent completeness calculation.
adm3_df = pd.pivot_table(mdg_pixels_all[["ADM3_PCODE","status", "index"]], index = ["ADM3_PCODE"], columns = ["status"], aggfunc="count", fill_value=0)

# Fix dataframe index
adm3_df.columns = adm3_df.columns.get_level_values(1)
adm3_df = adm3_df.reset_index().reset_index()

In [62]:
# Dataframe now has a regular index!
adm3_df.head()

status,index,ADM3_PCODE,mapped,unmapped
0,0,MG11101001,3216,990
1,1,MG11101002,1519,3911
2,2,MG11101003,1456,1742
3,3,MG11101004,2789,2304
4,4,MG11101005,3037,6104


In [63]:
# Drop the index column 
adm3_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm3_df.rename(columns = {
    'mapped':'pixels_withbuilding_july2021',
    'unmapped':'pixels_nobuilding_july2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm3_df['percentage_completeness_july2021'] = (adm3_df['pixels_withbuilding_july2021']/(adm3_df['pixels_withbuilding_july2021'] + adm3_df['pixels_nobuilding_july2021'])) * 100

In [64]:
adm3_df.head()

status,ADM3_PCODE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001,3216,990,76.462197
1,MG11101002,1519,3911,27.974217
2,MG11101003,1456,1742,45.528455
3,MG11101004,2789,2304,54.761437
4,MG11101005,3037,6104,33.223936


In [68]:
# Create a new dataframe that will store mdg_adm3
# with the percentage completeness output
mdg_adm3_with_output = mdg_adm3

# Get only the columns we need to identify region
mdg_adm3_with_output = mdg_adm3_with_output[[
    'ADM3_EN',
    'ADM3_PCODE',
    'ADM3_TYPE'
 ]]

In [69]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm3_with_output = pd.merge(mdg_adm3_with_output,adm3_df,how="left", on = "ADM3_PCODE")

In [70]:
mdg_adm3_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ADM3_EN                           1579 non-null   object 
 1   ADM3_PCODE                        1579 non-null   object 
 2   ADM3_TYPE                         1579 non-null   object 
 3   pixels_withbuilding_july2021      1579 non-null   int64  
 4   pixels_nobuilding_july2021        1579 non-null   int64  
 5   percentage_completeness_july2021  1579 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 86.4+ KB


In [71]:
mdg_adm3_with_output.head()

Unnamed: 0,ADM3_EN,ADM3_PCODE,ADM3_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,1er Arrondissement,MG11101001,Commune,3216,990,76.462197
1,2e Arrondissement,MG11101002,Commune,1519,3911,27.974217
2,3e Arrondissement,MG11101003,Commune,1456,1742,45.528455
3,4e Arrondissement,MG11101004,Commune,2789,2304,54.761437
4,5e Arrondissement,MG11101005,Commune,3037,6104,33.223936


The last step will be to join it to the previous output and save it to file.

In [72]:
# Left joining the new percent completeness values
# to the admin boundaries using ADM3_PCODE as the key
# For the second dataframe, we only keep wanted columns
commune_output_july2021 = pd.merge(
    mdg_adm3[['ADM3_PCODE','ADM3_EN','ADM3_TYPE','geometry']],
    mdg_adm3_with_output[['ADM3_PCODE',\
                          'pixels_withbuilding_july2021',\
                          'pixels_nobuilding_july2021',\
                          'percentage_completeness_july2021'\
                         ]],
    how="left", 
    on = "ADM3_PCODE"
)

In [73]:
# Inspect the July 2021 completeness values per commune
commune_output_july2021.head()

Unnamed: 0,ADM3_PCODE,ADM3_EN,ADM3_TYPE,geometry,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001,1er Arrondissement,Commune,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",3216,990,76.462197
1,MG11101002,2e Arrondissement,Commune,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1519,3911,27.974217
2,MG11101003,3e Arrondissement,Commune,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1456,1742,45.528455
3,MG11101004,4e Arrondissement,Commune,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",2789,2304,54.761437
4,MG11101005,5e Arrondissement,Commune,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",3037,6104,33.223936


In [74]:
# Save to .gpkg file
filename = "../data/mapthegap-mdg-adm3-2021-07-12.gpkg"
commune_output_july2021.to_file(filename, driver="GPKG")

# Checkpoint

At this point, we should have calculated the latest percent completeness as of July 12, 2021 for the whole country and per admin region. The outputs are:

1. `mdg_pixels_all.csv` and `mdg_pixels_all(_jan2020).gpkg` which contains mapped/unmapped pixels, labelled by admin boundary PCODES
2. `mapthegap-mdg-adm<2/3>-2021-07-12.gpkg`, which contains percent completeness values per administrative region, for Jul 2021 **only**

The succeeding code is for running the same methodology for Jan 2020, with the same steps. Run the following code as needed. 

# 2020

### Get intersection of HRSL pixels and OSM Buildings

#### Load OSM dataset

In [8]:
# 2020 OSM File
osm_mdg = gpd.read_file(
    "../download_data/mdg_osm_jan_2020_buildings.gpkg/mdg_osm_jan_2020_buildings.gpkg",
    driver="GPKG",
)

#### Get mapped pixels

In [30]:
# Just run this so you don't have to rerun everything up top
hrsl_mdg_gdf = gpd.read_file('../data/hrsl_mdg.gpkg', driver='GPKG')

In [9]:
mdg_pixels_with_buildings = gpd.sjoin(
    hrsl_mdg_gdf, osm_mdg, how="inner", op="intersects"
)

In [10]:
mdg_pixels_with_buildings = mdg_pixels_with_buildings.drop_duplicates(subset='index')

In [11]:
mdg_pixels_with_buildings.drop(columns=['index_right', 'osm_id', 'code', 'fclass', 'name', 'type'], inplace=True)

In [12]:
# 2020
mdg_pixels_with_buildings.to_file('../data/mdg_pixels_with_buildings_jan2020.gpkg', driver='GPKG')

#### Get unmapped pixels

In [13]:
mdg_pixels_no_buildings = pd.merge(hrsl_mdg_gdf, mdg_pixels_with_buildings, how='outer', indicator=True)

In [14]:
mdg_pixels_no_buildings = mdg_pixels_no_buildings[mdg_pixels_no_buildings['_merge'] == 'left_only']

In [15]:
mdg_pixels_no_buildings.drop(columns=['_merge'], inplace=True)

In [16]:
# 2020
# mdg_pixels_no_buildings.to_file('../data/mdg_pixels_no_buildings_jan2020.gpkg', driver='GPKG')

#### Calculate percent completeness

In [17]:
len(mdg_pixels_with_buildings) / (len(mdg_pixels_with_buildings) + len(mdg_pixels_no_buildings)) * 100

10.890343831145868

### Aggregate to different admin boundaries

#### Turn mapped pixels from a polygon layer to a point layer

In [8]:
# # 2020
# Just run this so you don't have to rerun everything up top
# mdg_pixels_with_buildings = gpd.read_file(
#     "../data/mdg_pixels_with_buildings_jan2020.gpkg", driver="GPKG"
# )

In [18]:
mdg_pixels_with_buildings["geometry"] = mdg_pixels_with_buildings["geometry"].centroid


#### Turn unmapped pixels from a polygon layer to a point layer

In [10]:
# # 2020
# # Just run this so you don't have to rerun everything up top
# mdg_pixels_no_buildings = gpd.read_file(
#     "../data/mdg_pixels_no_buildings_jan2020.gpkg", driver="GPKG"
# )

In [19]:
mdg_pixels_no_buildings["geometry"] = mdg_pixels_no_buildings["geometry"].centroid


#### Load level 4 admin boundary

In [20]:
mdg_adm4 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm4_BNGRC_OCHA_20181031.shp"
)

In [44]:
mdg_adm4.head()

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,ADM3_PCODE,ADM3_EN,ADM3_TYPE,ADM4_PCODE,ADM4_EN,ADM4_TYPE,PROV_CODE_,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
0,MG,Madagascar,MG22,Amoron I Mania,Region,MG22203,Ambositra,District,MG22203090,Andina,Commune,MG22203090008,Ampasina,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((47.10540 -20.51447, 47.11485 -20.514..."
1,MG,Madagascar,MG22,Amoron I Mania,Region,MG22203,Ambositra,District,MG22203090,Andina,Commune,MG22203090011,Antanifotsy,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((47.13580 -20.54133, 47.13600 -20.544..."
2,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192006,Amboloando,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.58350 -22.53003, 45.59102 -22.534..."
3,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192002,Vatambe Nanarena,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.51825 -22.27433, 45.51900 -22.274..."
4,MG,Madagascar,MG24,Ihorombe,Region,MG24216,Ihosy,District,MG24216192,Andiolava,Commune,MG24216192003,Vohimary,Fokontany,2,Fianarantsoa,Old Provinces/Faritany dissolved in 2007,,BNGRC (National Disaster Management Office) Fo...,"POLYGON ((45.62246 -22.40084, 45.62173 -22.414..."


In [87]:
mdg_adm4.nunique()

ADM0_PCODE        1
ADM0_EN           1
ADM1_PCODE       22
ADM1_EN          22
ADM1_TYPE         1
ADM2_PCODE      119
ADM2_EN         119
ADM2_TYPE         1
ADM3_PCODE     1579
ADM3_EN        1429
ADM3_TYPE         1
ADM4_PCODE    17465
ADM4_EN       10977
ADM4_TYPE         1
PROV_CODE_        6
OLD_PROVIN        6
PROV_TYPE         1
NOTES             1
SOURCE           10
geometry      17465
dtype: int64

Note: Unlike the PH boundaries, Madagascar has unique PCODES for each ADM4 region, so no need to create a unique index for level 4

#### Find intersection of mapped pixels and level 4 admin boundary

In [21]:
mdg_pixels_with_buildings_sjoin_adm4 = gpd.sjoin(
    mdg_pixels_with_buildings, mdg_adm4, how="left", op="within"
)

In [22]:
mdg_pixels_with_buildings_sjoin_adm4.drop(
    columns=[
        'index_right', 
        'ADM0_PCODE',
        'ADM0_EN',
        'ADM1_EN',
        'ADM1_TYPE',
        'ADM2_EN',
        'ADM2_TYPE',
        'ADM3_EN',
        'ADM3_TYPE',
        'ADM4_EN',
        'ADM4_TYPE',
        'PROV_CODE_',
        'OLD_PROVIN',
        'PROV_TYPE',
        'NOTES',
        'SOURCE'
    ],
    inplace=True,
)

In [23]:
mdg_pixels_with_buildings_sjoin_adm4.head()

Unnamed: 0,index,geometry,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE
1,1,POINT (49.27167 -11.95611),MG71,MG71713,MG71713170,MG71713170002
5,5,POINT (49.25278 -11.99111),MG71,MG71713,MG71713170,MG71713170002
6,6,POINT (49.22861 -12.04611),MG71,MG71713,MG71713170,MG71713170001
7,7,POINT (49.22861 -12.04639),MG71,MG71713,MG71713170,MG71713170001
9,9,POINT (49.23278 -12.05417),MG71,MG71713,MG71713170,MG71713170001


In [91]:
# 2020
mdg_pixels_with_buildings_sjoin_adm4.to_file(
    "../data/mdg_pixels_with_buildings_sjoin_adm4_jan2020.gpkg", driver="GPKG"
)

#### Find intersection of unmapped pixels and level 4 admin boundary

In [24]:
mdg_pixels_no_buildings_sjoin_adm4 = gpd.sjoin(
    mdg_pixels_no_buildings, mdg_adm4, how="left", op="within"
)

In [25]:
mdg_pixels_no_buildings_sjoin_adm4.drop(
    columns=[
        'index_right', 
        'ADM0_PCODE',
        'ADM0_EN',
        'ADM1_EN',
        'ADM1_TYPE',
        'ADM2_EN',
        'ADM2_TYPE',
        'ADM3_EN',
        'ADM3_TYPE',
        'ADM4_EN',
        'ADM4_TYPE',
        'PROV_CODE_',
        'OLD_PROVIN',
        'PROV_TYPE',
        'NOTES',
        'SOURCE'
    ],
    inplace=True,
)

In [26]:
mdg_pixels_no_buildings_sjoin_adm4.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1411112 entries, 0 to 1583567
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype   
---  ------      --------------    -----   
 0   index       1411112 non-null  int64   
 1   geometry    1411112 non-null  geometry
 2   ADM1_PCODE  1410006 non-null  object  
 3   ADM2_PCODE  1410006 non-null  object  
 4   ADM3_PCODE  1410006 non-null  object  
 5   ADM4_PCODE  1410006 non-null  object  
dtypes: geometry(1), int64(1), object(4)
memory usage: 75.4+ MB


In [95]:
# 2020
# This file is read back in the "Saving final output" cell below,
# so it must be written at least once.
mdg_pixels_no_buildings_sjoin_adm4.to_file(
    "../data/mdg_pixels_no_buildings_sjoin_adm4_jan2020.gpkg", driver="GPKG"
)

### Saving final output

In [11]:
# # 2020
# # Just run this so you don't have to rerun everything up top
mdg_pixels_no_buildings_sjoin_adm4 = gpd.read_file(
    "../data/mdg_pixels_no_buildings_sjoin_adm4_jan2020.gpkg", driver="GPKG"
)
mdg_pixels_with_buildings_sjoin_adm4 = gpd.read_file(
    "../data/mdg_pixels_with_buildings_sjoin_adm4_jan2020.gpkg", driver="GPKG"
)

In [27]:
# Drop nan values in the dataframe
mdg_pixels_no_buildings_sjoin_adm4.dropna(
    subset=["ADM4_PCODE","ADM3_PCODE","ADM2_PCODE","ADM1_PCODE"], inplace=True
)

mdg_pixels_with_buildings_sjoin_adm4.dropna(
    subset=["ADM4_PCODE","ADM3_PCODE","ADM2_PCODE","ADM1_PCODE"], inplace=True
)

In [28]:
# Adding a column to indicate that these pixels are mapped/unmapped
mdg_pixels_with_buildings_sjoin_adm4["status"] = "mapped"
mdg_pixels_no_buildings_sjoin_adm4["status"] = "unmapped"

In [29]:
# Join the mapped and unmapped pixels together
mdg_pixels_all = pd.concat(
    [mdg_pixels_with_buildings_sjoin_adm4, mdg_pixels_no_buildings_sjoin_adm4]
)

Final output for Madagascar should be as follows:

In [30]:
mdg_pixels_all.head()

Unnamed: 0,index,geometry,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE,status
1,1,POINT (49.27167 -11.95611),MG71,MG71713,MG71713170,MG71713170002,mapped
5,5,POINT (49.25278 -11.99111),MG71,MG71713,MG71713170,MG71713170002,mapped
6,6,POINT (49.22861 -12.04611),MG71,MG71713,MG71713170,MG71713170001,mapped
7,7,POINT (49.22861 -12.04639),MG71,MG71713,MG71713170,MG71713170001,mapped
9,9,POINT (49.23278 -12.05417),MG71,MG71713,MG71713170,MG71713170001,mapped


In [16]:
# 2020
# Save all columns (including geometry) to gpkg
mdg_pixels_all.to_file("../data/mdg_pixels_all_jan2020.gpkg", driver="GPKG")

In [17]:
# 2020
# Drop geometry column
mdg_pixels_all.drop(columns=["geometry"], inplace=True)

In [18]:
# 2020
# Save remaining columns as csv 
mdg_pixels_all.to_csv("../data/mdg_pixels_all_jan2020.csv", index=False)

## Calculate percent completeness per admin boundary level

In the following code snippets, we use a pivot table to count the mapped and unmapped pixels in each administrative region, then calculate the percent completeness from those counts.

We use `ADM2_PCODE`/`ADM3_PCODE` as the index for each admin region.
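
As a minimal sketch of this step on toy data, the snippet below counts pixels per region and status, then derives percent completeness as mapped / (mapped + unmapped) * 100. The region codes, pixel rows, and the `toy_*` and `pct` names are made up for illustration and are not taken from the Madagascar data. It also passes `values="index"` as a scalar, which keeps the columns flat and so differs slightly from the notebook's own call.

```python
import pandas as pd

# Toy pixel table: one row per settlement pixel (hypothetical values)
toy_pixels = pd.DataFrame(
    {
        "index": [0, 1, 2, 3, 4, 5],
        "ADM2_PCODE": ["A", "A", "A", "B", "B", "C"],
        "status": ["mapped", "unmapped", "unmapped", "mapped", "mapped", "unmapped"],
    }
)

# Count pixels per region and status; fill_value=0 covers regions
# that have no pixels in one of the two categories
toy_counts = pd.pivot_table(
    toy_pixels,
    index="ADM2_PCODE",
    columns="status",
    values="index",
    aggfunc="count",
    fill_value=0,
)

# Percent completeness = mapped / (mapped + unmapped) * 100
toy_counts["pct"] = (
    toy_counts["mapped"] / (toy_counts["mapped"] + toy_counts["unmapped"]) * 100
)
print(toy_counts)
```

Region `C` has no mapped pixels, so without `fill_value=0` its `mapped` count would be NaN and the completeness value would be NaN as well.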

### Loading mapped/unmapped pixels

In [81]:
# Just run this so you don't have to rerun everything up top
mdg_pixels_all_jan2020 = pd.read_csv("../data/mdg_pixels_all_jan2020.csv")

### Loading admin boundaries

In [76]:
mdg_adm2 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm2_BNGRC_OCHA_20181031.shp"
)

In [77]:
mdg_adm3 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm3_BNGRC_OCHA_20181031.shp"
)

### Loading previous output

These were the files generated by the 2021 section above.

In [78]:
district_output_july2021 = gpd.read_file(
    '../data/mapthegap-mdg-adm2-2021-07-12.gpkg',
    driver='GPKG'
)
district_output_july2021.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM2_PCODE                        119 non-null    object  
 1   ADM2_EN                           119 non-null    object  
 2   ADM2_TYPE                         119 non-null    object  
 3   pixels_withbuilding_july2021      119 non-null    int64   
 4   pixels_nobuilding_july2021        119 non-null    int64   
 5   percentage_completeness_july2021  119 non-null    float64 
 6   geometry                          119 non-null    geometry
dtypes: float64(1), geometry(1), int64(2), object(3)
memory usage: 6.6+ KB


In [79]:
commune_output_july2021 = gpd.read_file(
    "../data/mapthegap-mdg-adm3-2021-07-12.gpkg",
    driver="GPKG"
)
commune_output_july2021.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1579 entries, 0 to 1578
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM3_PCODE                        1579 non-null   object  
 1   ADM3_EN                           1579 non-null   object  
 2   ADM3_TYPE                         1579 non-null   object  
 3   pixels_withbuilding_july2021      1579 non-null   int64   
 4   pixels_nobuilding_july2021        1579 non-null   int64   
 5   percentage_completeness_july2021  1579 non-null   float64 
 6   geometry                          1579 non-null   geometry
dtypes: float64(1), geometry(1), int64(2), object(3)
memory usage: 86.5+ KB


#### Level 2 Boundaries (districts)

In [82]:
# fill_value=0 replaces NaN with 0: a region with no mapped pixels would
# otherwise get NaN, which would break the percent completeness calculation
adm2_df = pd.pivot_table(
    mdg_pixels_all_jan2020[["ADM2_PCODE", "status", "index"]],
    index=["ADM2_PCODE"], columns=["status"], aggfunc="count", fill_value=0,
)

In [83]:
adm2_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM2_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001A,2472,1734
MG11101002A,1439,3991
MG11101003A,1424,1774
MG11101004A,1646,3447
MG11101005A,2153,6988


The resulting dataframe has MultiIndex columns, which we will flatten in the next code blocks.
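
The flattening done in the next cells can be sketched on a toy dataframe as follows. The values and the `toy` name are hypothetical; only the column layout is meant to resemble the pivot table output. In this sketch a single `reset_index()` is used, so no extra `index` column is created.

```python
import pandas as pd

# Hypothetical two-level columns, similar in shape to the pivot table output
toy = pd.DataFrame(
    [[3, 1], [0, 2]],
    index=pd.Index(["A", "B"], name="ADM2_PCODE"),
    columns=pd.MultiIndex.from_tuples(
        [("index", "mapped"), ("index", "unmapped")], names=[None, "status"]
    ),
)

# Keep only the second level ("mapped"/"unmapped") as the column labels
toy.columns = toy.columns.get_level_values(1)

# Move ADM2_PCODE from the row index into a regular column
toy = toy.reset_index()
print(toy.columns.tolist())
```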

In [84]:
adm2_df.columns = adm2_df.columns.get_level_values(1)
adm2_df = adm2_df.reset_index().reset_index()

In [85]:
# Dataframe now has a regular index!
adm2_df.head()

status,index,ADM2_PCODE,mapped,unmapped
0,0,MG11101001A,2472,1734
1,1,MG11101002A,1439,3991
2,2,MG11101003A,1424,1774
3,3,MG11101004A,1646,3447
4,4,MG11101005A,2153,6988


In [86]:
# Drop the index column 
adm2_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm2_df.rename(columns = {
    'mapped':'pixels_withbuilding_jan2020',
    'unmapped':'pixels_nobuilding_jan2020'
    },
    inplace=True
)

# Adding a column for percent completeness
adm2_df['percentage_completeness_jan2020'] = (adm2_df['pixels_withbuilding_jan2020']/(adm2_df['pixels_withbuilding_jan2020'] + adm2_df['pixels_nobuilding_jan2020'])) * 100

In [87]:
adm2_df.head()

status,ADM2_PCODE,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001A,2472,1734,58.773181
1,MG11101002A,1439,3991,26.500921
2,MG11101003A,1424,1774,44.52783
3,MG11101004A,1646,3447,32.318869
4,MG11101005A,2153,6988,23.553222


We've successfully calculated the percentage completeness for level 2! Next, we will join the data with the admin boundaries information.
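
As a hedged illustration of the join, the sketch below uses made-up region names and values (`toy_adm`, `toy_result`, and the unsuffixed `percentage_completeness` column are assumptions). A left merge keeps every boundary row, so a region without pixel counts stays in the table with NaN values, and `validate="one_to_one"` is an optional guard against duplicated PCODEs.

```python
import pandas as pd

# Hypothetical boundary attributes and per-region results
toy_adm = pd.DataFrame(
    {"ADM2_PCODE": ["A", "B", "C"], "ADM2_EN": ["Alpha", "Beta", "Gamma"]}
)
toy_result = pd.DataFrame(
    {"ADM2_PCODE": ["A", "B"], "percentage_completeness": [25.0, 80.0]}
)

# how="left" keeps every boundary row; validate="one_to_one" raises
# an error if a PCODE is duplicated on either side
toy_joined = pd.merge(
    toy_adm, toy_result, how="left", on="ADM2_PCODE", validate="one_to_one"
)

# Regions with no pixel counts show up as NaN after the join
missing = toy_joined[toy_joined["percentage_completeness"].isna()]
print(missing["ADM2_PCODE"].tolist())
```

Checking for NaN rows after the merge is a quick way to confirm that every admin region received a value, which matches what the `.info()` output is used for in the cells that follow.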

In [88]:
# Create new dataframe that will store mdg_adm2
# with percentage completeness output
mdg_adm2_with_output = mdg_adm2

# Get only the columns we need to identify region
mdg_adm2_with_output = mdg_adm2_with_output[[
    'ADM2_EN',
    'ADM2_PCODE',
 ]]

In [89]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm2_with_output = pd.merge(mdg_adm2_with_output,adm2_df,how="left", on = "ADM2_PCODE")

In [90]:
mdg_adm2_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ADM2_EN                          119 non-null    object 
 1   ADM2_PCODE                       119 non-null    object 
 2   pixels_withbuilding_jan2020      119 non-null    int64  
 3   pixels_nobuilding_jan2020        119 non-null    int64  
 4   percentage_completeness_jan2020  119 non-null    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 5.6+ KB


In [91]:
# Overall completeness across all districts for Jan 2020 (a fraction, not a percentage)
mdg_adm2_with_output['pixels_withbuilding_jan2020'].sum() / (mdg_adm2_with_output['pixels_withbuilding_jan2020'].sum() + mdg_adm2_with_output['pixels_nobuilding_jan2020'].sum())

0.10889578597380667

The last step will be to join it to the previous output and save it to file.

In [92]:
# Left joining the new percent completeness values
# to the existing values using ADM2_PCODE as the join key
# For the second dataframe, we only keep wanted columns
district_output_jan2020 = pd.merge(
    district_output_july2021,
    mdg_adm2_with_output[['ADM2_PCODE',\
                          'pixels_withbuilding_jan2020',\
                          'pixels_nobuilding_jan2020',\
                          'percentage_completeness_jan2020'\
                         ]],
    how="left", 
    on = "ADM2_PCODE"
)

In [93]:
district_output_jan2020.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM2_PCODE                        119 non-null    object  
 1   ADM2_EN                           119 non-null    object  
 2   ADM2_TYPE                         119 non-null    object  
 3   pixels_withbuilding_july2021      119 non-null    int64   
 4   pixels_nobuilding_july2021        119 non-null    int64   
 5   percentage_completeness_july2021  119 non-null    float64 
 6   geometry                          119 non-null    geometry
 7   pixels_withbuilding_jan2020       119 non-null    int64   
 8   pixels_nobuilding_jan2020         119 non-null    int64   
 9   percentage_completeness_jan2020   119 non-null    float64 
dtypes: float64(2), geometry(1), int64(4), object(3)
memory usage: 10.2+ KB


In [94]:
district_output_jan2020.head(20)

Unnamed: 0,ADM2_PCODE,ADM2_EN,ADM2_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021,geometry,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001A,1er Arrondissement,District,3216,990,76.462197,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",2472,1734,58.773181
1,MG11101002A,2e Arrondissement,District,1519,3911,27.974217,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1439,3991,26.500921
2,MG11101003A,3e Arrondissement,District,1456,1742,45.528455,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1424,1774,44.52783
3,MG11101004A,4e Arrondissement,District,2789,2304,54.761437,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",1646,3447,32.318869
4,MG11101005A,5e Arrondissement,District,3037,6104,33.223936,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",2153,6988,23.553222
5,MG11101006A,6e Arrondissement,District,636,2423,20.791108,"POLYGON ((47.48436 -18.83989, 47.48491 -18.840...",561,2498,18.339327
6,MG11102,Antananarivo Avaradrano,District,6692,27525,19.55753,"POLYGON ((47.61521 -18.71709, 47.61702 -18.717...",4015,30201,11.734276
7,MG11103,Ambohidratrimo,District,4332,31492,12.092452,"POLYGON ((47.49982 -18.43546, 47.50490 -18.439...",3797,32027,10.59904
8,MG11104,Ankazobe,District,1486,15437,8.780949,"POLYGON ((46.74249 -17.71321, 46.74325 -17.713...",645,16278,3.811381
9,MG11106,Manjakandriana,District,1296,24520,5.020143,"POLYGON ((47.72437 -18.47394, 47.72465 -18.477...",648,25168,2.510071


In [96]:
# Save to .gpkg file
filename = "../data/mapthegap-mdg-adm2-2021-07-12.gpkg"
district_output_jan2020.to_file(filename, driver="GPKG")

#### Level 3 Boundaries (communes)

In [95]:
# fill_value=0 replaces NaN with 0: a region with no mapped pixels would
# otherwise get NaN, which would break the percent completeness calculation
adm3_df = pd.pivot_table(
    mdg_pixels_all_jan2020[["ADM3_PCODE", "status", "index"]],
    index=["ADM3_PCODE"], columns=["status"], aggfunc="count", fill_value=0,
)

In [97]:
adm3_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM3_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001,2472,1734
MG11101002,1439,3991
MG11101003,1424,1774
MG11101004,1646,3447
MG11101005,2153,6988


The resulting dataframe has MultiIndex columns, which we will flatten in the next code blocks.

In [98]:
adm3_df.columns = adm3_df.columns.get_level_values(1)
adm3_df = adm3_df.reset_index().reset_index()

In [99]:
# Dataframe now has a regular index!
adm3_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1579 entries, 0 to 1578
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   index       1579 non-null   int64 
 1   ADM3_PCODE  1579 non-null   object
 2   mapped      1579 non-null   int64 
 3   unmapped    1579 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 49.5+ KB


In [100]:
# Drop the index column 
adm3_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm3_df.rename(columns = {
    'mapped':'pixels_withbuilding_jan2020',
    'unmapped':'pixels_nobuilding_jan2020'
    },
    inplace=True
)

# Adding a column for percent completeness
adm3_df['percentage_completeness_jan2020'] = (adm3_df['pixels_withbuilding_jan2020']/(adm3_df['pixels_withbuilding_jan2020'] + adm3_df['pixels_nobuilding_jan2020'])) * 100

In [101]:
adm3_df.head()

status,ADM3_PCODE,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001,2472,1734,58.773181
1,MG11101002,1439,3991,26.500921
2,MG11101003,1424,1774,44.52783
3,MG11101004,1646,3447,32.318869
4,MG11101005,2153,6988,23.553222


We've successfully calculated the percentage completeness for level 3! Next, we will join the data with the admin boundaries information.

In [102]:
# Create new dataframe that will store mdg_adm3
# with percentage completeness output
mdg_adm3_with_output = mdg_adm3

# Get only the columns we need to identify region
mdg_adm3_with_output = mdg_adm3_with_output[[
    'ADM3_EN',
    'ADM3_PCODE',
 ]]

In [103]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm3_with_output = pd.merge(mdg_adm3_with_output,adm3_df,how="left", on = "ADM3_PCODE")

In [104]:
mdg_adm3_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ADM3_EN                          1579 non-null   object 
 1   ADM3_PCODE                       1579 non-null   object 
 2   pixels_withbuilding_jan2020      1579 non-null   int64  
 3   pixels_nobuilding_jan2020        1579 non-null   int64  
 4   percentage_completeness_jan2020  1579 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 74.0+ KB


The last step will be to join it to the previous output and save it to file.

In [106]:
# Left joining the new percent completeness values
# to the existing values using ADM3_PCODE as the join key
# For the second dataframe, we only keep wanted columns
commune_output_jan2020 = pd.merge(
    commune_output_july2021,
    mdg_adm3_with_output[['ADM3_PCODE',\
                          'pixels_withbuilding_jan2020',\
                          'pixels_nobuilding_jan2020',\
                          'percentage_completeness_jan2020'\
                         ]],
    how="left", 
    on = "ADM3_PCODE"
)

In [107]:
commune_output_jan2020.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM3_PCODE                        1579 non-null   object  
 1   ADM3_EN                           1579 non-null   object  
 2   ADM3_TYPE                         1579 non-null   object  
 3   pixels_withbuilding_july2021      1579 non-null   int64   
 4   pixels_nobuilding_july2021        1579 non-null   int64   
 5   percentage_completeness_july2021  1579 non-null   float64 
 6   geometry                          1579 non-null   geometry
 7   pixels_withbuilding_jan2020       1579 non-null   int64   
 8   pixels_nobuilding_jan2020         1579 non-null   int64   
 9   percentage_completeness_jan2020   1579 non-null   float64 
dtypes: float64(2), geometry(1), int64(4), object(3)
memory usage: 135.7+ KB


In [108]:
commune_output_jan2020.head(20)

Unnamed: 0,ADM3_PCODE,ADM3_EN,ADM3_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021,geometry,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001,1er Arrondissement,Commune,3216,990,76.462197,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",2472,1734,58.773181
1,MG11101002,2e Arrondissement,Commune,1519,3911,27.974217,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1439,3991,26.500921
2,MG11101003,3e Arrondissement,Commune,1456,1742,45.528455,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1424,1774,44.52783
3,MG11101004,4e Arrondissement,Commune,2789,2304,54.761437,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",1646,3447,32.318869
4,MG11101005,5e Arrondissement,Commune,3037,6104,33.223936,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",2153,6988,23.553222
5,MG11101006,6e Arrondissement,Commune,636,2423,20.791108,"POLYGON ((47.48436 -18.83989, 47.48491 -18.840...",561,2498,18.339327
6,MG11102010,Alasora,Commune,481,3040,13.660892,"POLYGON ((47.58420 -18.93248, 47.58459 -18.932...",461,3060,13.092871
7,MG11102039,Ankadikely Ilafy,Commune,4759,1267,78.974444,"POLYGON ((47.60350 -18.83852, 47.60155 -18.841...",2839,3186,47.120332
8,MG11102050,Ambohimanambola,Commune,13,1351,0.953079,"POLYGON ((47.61793 -18.91379, 47.61867 -18.919...",13,1351,0.953079
9,MG11102079,Sabotsy Namehana,Commune,252,3818,6.191646,"POLYGON ((47.54650 -18.78823, 47.54925 -18.791...",52,4018,1.277641


In [109]:
# Save to .gpkg file
filename = "../data/mapthegap-mdg-adm3-2021-07-12.gpkg"
commune_output_jan2020.to_file(filename, driver="GPKG")

# Finish!
By this point, you should have the percent completeness for the whole country and per admin region, stored in the following files:

1. `mdg_pixels_all(_jan2020).csv` and `mdg_pixels_all(_jan2020).gpkg`, which contain the mapped/unmapped pixels, labelled by admin boundary PCODEs
2. `mapthegap-mdg-adm<2/3>-2021-07-12.gpkg`, which contains percent completeness values per administrative region, with values for both Jan 2020 and Jul 2021
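
Because the second file holds values for both dates side by side, the change between them can be derived directly. The sketch below uses a toy table with the same completeness column names as the combined output. The rows and the `completeness_change` column name are hypothetical, and this comparison is an illustration rather than a step of the original workflow.

```python
import pandas as pd

# Hypothetical rows using the column names from the combined output
toy_output = pd.DataFrame(
    {
        "ADM2_PCODE": ["A", "B"],
        "percentage_completeness_jan2020": [10.0, 50.0],
        "percentage_completeness_july2021": [25.0, 50.0],
    }
)

# Change in completeness between the two dates, in percentage points
toy_output["completeness_change"] = (
    toy_output["percentage_completeness_july2021"]
    - toy_output["percentage_completeness_jan2020"]
)
print(toy_output.sort_values("completeness_change", ascending=False))
```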

To convert these outputs into tilesets for plotting, refer to `2_CreateTileset.ipynb`.