# Reshaping Data

This notebook reshapes the output from the notebook `1_Methodology.ipynb` to generate the final output per level 2/3/4 boundary.

## Setup

Importing the relevant packages and loading all the relevant datasets.

The relevant datasets are as follows:
1. `phl_pixels_all.csv`, which is generated by running `1_Methodology.ipynb`, as a dataframe.
2. [Admin boundaries](https://data.humdata.org/dataset/philippines-administrative-levels-0-to-3), to which we will join the percent completeness data
3. The previous output for each admin boundary level, to which we'll join our new output from this notebook

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

import os
import wget

### Loading mapped and unmapped PH pixels

In [43]:
phl_pixels_all = pd.read_csv("../data/phl_pixels_all.csv")

In [44]:
phl_pixels_all.head()

Unnamed: 0,index,ADM3_PCODE,RURBAN,ADM4_PCODE_NAME,ADM2_PCODE,status
0,1,PH020902000,R,020902010_Santa Rosa (Kaynatuan),PH020900000,mapped
1,3,PH020902000,R,020902010_Santa Rosa (Kaynatuan),PH020900000,mapped
2,4,PH020902000,R,020902010_Santa Rosa (Kaynatuan),PH020900000,mapped
3,5,PH020902000,R,020902010_Santa Rosa (Kaynatuan),PH020900000,mapped
4,8,PH020902000,R,020902010_Santa Rosa (Kaynatuan),PH020900000,mapped


### Loading admin boundaries

In [45]:
phl_adm2 = gpd.read_file(
    "../download_data/phl_adm_all/phl_admbnda_adm2_psa_namria_20200529.shp"
)

In [46]:
phl_adm3 = gpd.read_file(
    "../download_data/phl_adm_all/phl_admbnda_adm3_psa_namria_20200529.shp"
)

In [47]:
phl_adm4 = gpd.read_file(
    "../download_data/phl_adm_2015_level4_barangay.gpkg/phl_adm_2015_level4_barangay.gpkg"
)

### Loading previous output

In [48]:
province_output = pd.read_csv('../download_data/mapthegap-phl-adm2-2021-05-29.csv')

In [49]:
citymuni_output = pd.read_csv('../download_data/mapthegap-phl-adm3-2021-05-29.csv')

In [50]:
brgy_output = pd.read_csv('../download_data/mapthegap-phl-adm4-2021-05-29.csv')

## Reshaping

In the following code snippets, we use a pivot table to count the mapped and unmapped pixels in each administrative region, then calculate the percent completeness from those counts.

We repeat the same steps for ADM2 through ADM4, using `ADM2_PCODE`, `ADM3_PCODE`, and `ADM4_PCODE_NAME` as indices.
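
To make the mechanics concrete before running them on the full dataset, here is a minimal sketch on a made-up five-pixel table. The region codes `A` and `B` and the column name `pct` are illustrative assumptions, not values from the real data.

```python
import pandas as pd

# Toy stand-in for phl_pixels_all: one row per pixel (values are made up).
pixels = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "ADM2_PCODE": ["A", "A", "A", "B", "B"],
    "status": ["mapped", "mapped", "unmapped", "unmapped", "unmapped"],
})

# Count pixels per region and status. Region B has no mapped pixels,
# so fill_value=0 is what turns its missing count into 0.
counts = pd.pivot_table(
    pixels[["ADM2_PCODE", "status", "index"]],
    index=["ADM2_PCODE"],
    columns=["status"],
    aggfunc="count",
    fill_value=0,
)

# Flatten the two-level column header down to mapped / unmapped.
counts.columns = counts.columns.get_level_values(1)

# Percent completeness = mapped / (mapped + unmapped) * 100
counts["pct"] = counts["mapped"] / (counts["mapped"] + counts["unmapped"]) * 100
print(counts)
```

Region `A` has two of three pixels mapped, while region `B` has none, which is the case `fill_value=0` is there to handle.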

### Level 2 Boundaries (provinces)

In [51]:
# fill_value=0 replaces NaN counts with 0.
# Without it, a region with no mapped (or no unmapped) pixels gets NaN,
# which would propagate into the percent completeness calculation
adm2_df = pd.pivot_table(phl_pixels_all[["ADM2_PCODE","status", "index"]], index = ["ADM2_PCODE"], columns = ["status"], aggfunc="count", fill_value=0)

In [52]:
adm2_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM2_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
PH012800000,45646,18479
PH012900000,54459,23735
PH013300000,32348,42918
PH015500000,17768,238488
PH020900000,998,187


The resulting dataframe has MultiIndex columns (`index` over `mapped`/`unmapped`), which we flatten in the next code blocks.

In [53]:
adm2_df.columns = adm2_df.columns.get_level_values(1)
adm2_df = adm2_df.reset_index().reset_index()

In [54]:
# Dataframe now has a regular index!
adm2_df.head()

status,index,ADM2_PCODE,mapped,unmapped
0,0,PH012800000,45646,18479
1,1,PH012900000,54459,23735
2,2,PH013300000,32348,42918
3,3,PH015500000,17768,238488
4,4,PH020900000,998,187


In [55]:
# Drop the index column 
adm2_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm2_df.rename(columns = {
    'mapped':'pixels_withbuilding_june2021',
    'unmapped':'pixels_nobuilding_june2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm2_df['percentage_completeness_june2021'] = (adm2_df['pixels_withbuilding_june2021']/(adm2_df['pixels_withbuilding_june2021'] + adm2_df['pixels_nobuilding_june2021'])) * 100

In [56]:
adm2_df.head()

status,ADM2_PCODE,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,PH012800000,45646,18479,71.182846
1,PH012900000,54459,23735,69.646009
2,PH013300000,32348,42918,42.978237
3,PH015500000,17768,238488,6.933691
4,PH020900000,998,187,84.219409


We've successfully calculated the percentage completeness for level 2! Next, we will join the data with the admin boundaries information.
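
Before relying on a left join like this, it can help to check that the key is unique on both sides and to see which boundaries found no match. The sketch below uses made-up codes and names; `validate` and `indicator` are standard `pd.merge` parameters, but the notebook itself does not use them.

```python
import pandas as pd

# Hypothetical stand-ins for phl_adm2 (boundaries) and adm2_df (pixel counts).
boundaries = pd.DataFrame({
    "ADM2_PCODE": ["PH01", "PH02", "PH03"],
    "ADM2_EN": ["Alpha", "Beta", "Gamma"],
})
completeness = pd.DataFrame({
    "ADM2_PCODE": ["PH01", "PH02"],
    "percentage_completeness": [71.2, 69.6],
})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column recording where each row came from.
joined = pd.merge(
    boundaries,
    completeness,
    how="left",
    on="ADM2_PCODE",
    validate="one_to_one",
    indicator=True,
)

# Boundaries with no pixel data show up as "left_only".
unmatched = joined.loc[joined["_merge"] == "left_only", "ADM2_PCODE"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every boundary received a completeness value.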

In [57]:
# Create new dataframe that will store phl_adm2
# with percentage completeness output
phl_adm2_with_output = phl_adm2

# Get only the columns we need to identify region
phl_adm2_with_output = phl_adm2_with_output[[
    'ADM2_EN',
    'ADM2_PCODE',
    'ADM2_REF',
    'ADM2ALT1EN',
    'ADM2ALT2EN'
 ]]

In [58]:
# Left joining percentage completeness values
# to their respective regions
phl_adm2_with_output = pd.merge(phl_adm2_with_output,adm2_df,how="left", on = "ADM2_PCODE")

In [59]:
phl_adm2_with_output.head()

Unnamed: 0,ADM2_EN,ADM2_PCODE,ADM2_REF,ADM2ALT1EN,ADM2ALT2EN,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,Abra,PH140100000,,,,13862,5080,73.18129
1,Agusan del Norte,PH160200000,,,,15586,24304,39.072449
2,Agusan del Sur,PH160300000,,,,3549,37551,8.635036
3,Aklan,PH060400000,,,,21777,24894,46.660667
4,Albay,PH050500000,,,,32784,25916,55.850085


The last step will be to join it to the previous output and save it to file.

In [60]:
# Left joining the new percent completeness values
# to the existing values using ADM2_PCODE as index
# For the second dataframe, we only keep wanted columns
province_output_june2021 = pd.merge(
    province_output,
    phl_adm2_with_output[['ADM2_PCODE',\
                          'pixels_withbuilding_june2021',\
                          'pixels_nobuilding_june2021',\
                          'percentage_completeness_june2021'\
                         ]],
    how="left", 
    on = "ADM2_PCODE"
)

In [61]:
# Scroll through the columns to see how percentage completeness increased over time!
province_output_june2021.head()

Unnamed: 0,ADM2_EN,ADM2_PCODE,ADM2_REF,ADM2ALT1EN,ADM2ALT2EN,pixels_withbuilding_june2020,pixels_nobuilding_june2020,percentage_completeness_june2020,pixels_withbuilding_may2021,pixels_nobuilding_may2021,percentage_completeness_may2021,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,Abra,PH140100000,,,,13790,5152,72.801183,13862,5080,73.18129,13862,5080,73.18129
1,Agusan del Norte,PH160200000,,,,4202,35688,10.533968,15586,24304,39.072449,15586,24304,39.072449
2,Agusan del Sur,PH160300000,,,,2953,38147,7.184915,3422,37678,8.326034,3549,37551,8.635036
3,Aklan,PH060400000,,,,14693,31978,31.482077,21519,25152,46.107861,21777,24894,46.660667
4,Albay,PH050500000,,,,27168,31532,46.282794,32756,25944,55.802385,32784,25916,55.850085


In [62]:
# Save to .csv file
filename = "../data/mapthegap-phl-adm2-2021-06-28.csv"
province_output_june2021.to_csv(filename)
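
Note that `to_csv` writes the row index as an unnamed first column by default, and reading the file back turns it into a column called `Unnamed: 0`. If that column is not wanted downstream, `index=False` omits it. A small sketch with an in-memory buffer and made-up values, rather than the real output file:

```python
import io
import pandas as pd

df = pd.DataFrame({"ADM2_PCODE": ["PH01", "PH02"], "pct": [71.2, 69.6]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

Whether to keep the index is a choice about the output format; keeping it matches what the cell above does.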

Mostly the same steps are followed for the level 3 and 4 boundaries.
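
Because each level repeats the same count, rename, and percentage steps, they could be wrapped in one helper. The following is a sketch, not the notebook's code: `completeness_by_region` and its parameters are hypothetical names, and it uses `groupby(...).size()` instead of `pivot_table`, which gives the same counts as long as the counted column has no missing values.

```python
import pandas as pd

def completeness_by_region(pixels, code_col, suffix):
    """Count mapped/unmapped pixels per region and add percent completeness."""
    counts = (
        pixels.groupby([code_col, "status"])
        .size()
        .unstack(fill_value=0)
        # Guarantee both columns exist even if one status never occurs.
        .reindex(columns=["mapped", "unmapped"], fill_value=0)
        .reset_index()
    )
    counts.columns.name = None
    mapped = f"pixels_withbuilding_{suffix}"
    unmapped = f"pixels_nobuilding_{suffix}"
    counts = counts.rename(columns={"mapped": mapped, "unmapped": unmapped})
    counts[f"percentage_completeness_{suffix}"] = (
        counts[mapped] / (counts[mapped] + counts[unmapped]) * 100
    )
    return counts

# Made-up pixels: region A is half mapped, region B is unmapped.
pixels = pd.DataFrame({
    "ADM2_PCODE": ["A", "A", "B"],
    "ADM3_PCODE": ["A1", "A2", "B1"],
    "status": ["mapped", "unmapped", "unmapped"],
})
adm2 = completeness_by_region(pixels, "ADM2_PCODE", "june2021")
adm3 = completeness_by_region(pixels, "ADM3_PCODE", "june2021")
print(adm2)
```

The same call with a different `code_col` covers each admin level, so only the join to the boundary files would remain level-specific.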

### Level 3 (cities and municipalities)

In [63]:
# Pivot table of mapped and unmapped pixels
# fill_value=0 replaces NaN counts with 0.
# Without it, a region with no mapped (or no unmapped) pixels gets NaN,
# which would propagate into the percent completeness calculation
adm3_df = pd.pivot_table(phl_pixels_all[["ADM3_PCODE","status", "index"]], index = ["ADM3_PCODE"], columns = ["status"], aggfunc="count", fill_value=0)

# Fix dataframe index
adm3_df.columns = adm3_df.columns.get_level_values(1)
adm3_df = adm3_df.reset_index().reset_index()

In [64]:
# Dataframe now has a regular index!
adm3_df.head()

status,index,ADM3_PCODE,mapped,unmapped
0,0,PH012801000,184,38
1,1,PH012802000,2965,883
2,2,PH012803000,2284,928
3,3,PH012804000,1573,334
4,4,PH012805000,3800,2193


In [65]:
# Drop the index column 
adm3_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm3_df.rename(columns = {
    'mapped':'pixels_withbuilding_june2021',
    'unmapped':'pixels_nobuilding_june2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm3_df['percentage_completeness_june2021'] = (adm3_df['pixels_withbuilding_june2021']/(adm3_df['pixels_withbuilding_june2021'] + adm3_df['pixels_nobuilding_june2021'])) * 100

In [66]:
adm3_df.head()

status,ADM3_PCODE,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,PH012801000,184,38,82.882883
1,PH012802000,2965,883,77.053015
2,PH012803000,2284,928,71.108344
3,PH012804000,1573,334,82.485579
4,PH012805000,3800,2193,63.407309


In [67]:
# Create new dataframe that will store phl_adm3
# with percentage completeness output
phl_adm3_with_output = phl_adm3

# Get only the columns we need to identify region
phl_adm3_with_output = phl_adm3_with_output[[
    'ADM3_EN',
    'ADM3_PCODE',
    'ADM3_REF',
    'ADM3ALT1EN',
    'ADM3ALT2EN'
 ]]

In [68]:
# Left joining percentage completeness values
# to their respective regions
phl_adm3_with_output = pd.merge(phl_adm3_with_output,adm3_df,how="left", on = "ADM3_PCODE")

In [69]:
phl_adm3_with_output.head()

Unnamed: 0,ADM3_EN,ADM3_PCODE,ADM3_REF,ADM3ALT1EN,ADM3ALT2EN,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,Aborlan,PH175301000,,,,652.0,3541.0,15.549726
1,Abra de Ilog,PH175101000,,,,1316.0,728.0,64.383562
2,Abucay,PH030801000,,,,1993.0,649.0,75.435276
3,Abulug,PH021501000,,,,3405.0,918.0,78.764747
4,Abuyog,PH083701000,,,,1473.0,988.0,59.853718
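
The pixel counts above are displayed as floats (`652.0`) where the level 2 counts were integers. That is consistent with at least one boundary having no matching pixel rows, since pandas upcasts an integer column to float when a left join introduces NaN. A sketch of the effect and two possible ways to handle it, using made-up codes:

```python
import pandas as pd

boundaries = pd.DataFrame({"ADM3_PCODE": ["X1", "X2", "X3"]})
counts = pd.DataFrame({"ADM3_PCODE": ["X1", "X2"], "mapped": [652, 1316]})

# X3 has no match, so its count is NaN and the column becomes float64.
joined = boundaries.merge(counts, how="left", on="ADM3_PCODE")
print(joined["mapped"].dtype)

# Option 1: treat unmatched boundaries as having 0 pixels.
# Whether 0 is the right value here is a judgment call.
filled = joined["mapped"].fillna(0).astype("int64")

# Option 2: keep the value missing, using pandas' nullable integer dtype.
nullable = joined["mapped"].astype("Int64")
print(nullable)
```

Option 2 keeps the distinction between "no pixels counted" and "zero pixels", at the cost of a less common dtype.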


The last step will be to join it to the previous output and save it to file.

In [70]:
# Left joining the new percent completeness values
# to the existing values using ADM3_PCODE as index
# For the second dataframe, we only keep wanted columns
citymuni_output_june2021 = pd.merge(
    citymuni_output,
    phl_adm3_with_output[['ADM3_PCODE',\
                          'pixels_withbuilding_june2021',\
                          'pixels_nobuilding_june2021',\
                          'percentage_completeness_june2021'\
                         ]],
    how="left", 
    on = "ADM3_PCODE"
)

In [71]:
# Scroll through the columns to see how percentage completeness increased over time!
citymuni_output_june2021.head()

Unnamed: 0,ADM3_EN,ADM3_PCODE,ADM3_REF,ADM3ALT1EN,ADM3ALT2EN,pixels_withbuilding_june2020,pixels_nobuilding_june2020,percentage_completeness_june2020,pixels_withbuilding_may2021,pixels_nobuilding_may2021,percentage_completeness_may2021,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,Aborlan,PH175301000,,,,594,3599,14.166468,652,3541,15.549726,652.0,3541.0,15.549726
1,Abra de Ilog,PH175101000,,,,1315,729,64.334638,1316,728,64.383562,1316.0,728.0,64.383562
2,Abucay,PH030801000,,,,1996,646,75.548827,1993,649,75.435276,1993.0,649.0,75.435276
3,Abulug,PH021501000,,,,3396,927,78.556558,3405,918,78.764747,3405.0,918.0,78.764747
4,Abuyog,PH083701000,,,,1456,1005,59.162942,1473,988,59.853718,1473.0,988.0,59.853718


In [72]:
# Save to .csv file
filename = "../data/mapthegap-phl-adm3-2021-06-28.csv"
citymuni_output_june2021.to_csv(filename)

### Level 4 (barangays)

Note: Instead of a PCODE, this section uses `ADM4_PCODE_NAME` (barangay code and name joined by an underscore) as the join key.
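
A key built from a code and a name only matches if both parts are formatted identically on each side of the join. The sketch below shows one way such a key could be built when the codes have been read as integers and lost their leading zero. The 9-digit width is an assumption taken from the sample keys in this notebook, and the table is a made-up stand-in.

```python
import pandas as pd

# Hypothetical barangay table where codes were read as integers,
# which drops the leading zero of codes such as 012801001.
brgy = pd.DataFrame({
    "Bgy_Code": [12801001, 118206010],
    "Bgy_Name": ["Adams (Pob.)", "Sawangan"],
})

# Restore the assumed 9-digit width, then build the composite key.
code = brgy["Bgy_Code"].astype(str).str.zfill(9)
brgy["ADM4_PCODE_NAME"] = code + "_" + brgy["Bgy_Name"]
print(brgy["ADM4_PCODE_NAME"].tolist())
```

If the codes are already stored as zero-padded strings, the padding step is a no-op and the plain concatenation is enough.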

In [73]:
# Pivot table of mapped and unmapped pixels
# fill_value=0 replaces NaN counts with 0.
# Without it, a region with no mapped (or no unmapped) pixels gets NaN,
# which would propagate into the percent completeness calculation
adm4_df = pd.pivot_table(phl_pixels_all[["ADM4_PCODE_NAME","status", "index"]], index = ["ADM4_PCODE_NAME"], columns = ["status"], aggfunc="count", fill_value=0)

# Fix dataframe index
adm4_df.columns = adm4_df.columns.get_level_values(1)
adm4_df = adm4_df.reset_index().reset_index()

In [74]:
# Dataframe now has a regular index!
adm4_df.head()

status,index,ADM4_PCODE_NAME,mapped,unmapped
0,0,012801001_Adams (Pob.),184,38
1,1,012802001_Bani,89,52
2,2,012802002_Buyon,143,54
3,3,012802003_Cabaruan,151,56
4,4,012802004_Cabulalaan,80,21


In [75]:
# Drop the index column 
adm4_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm4_df.rename(columns = {
    'mapped':'pixels_withbuilding_june2021',
    'unmapped':'pixels_nobuilding_june2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm4_df['percentage_completeness_june2021'] = (adm4_df['pixels_withbuilding_june2021']/(adm4_df['pixels_withbuilding_june2021'] + adm4_df['pixels_nobuilding_june2021'])) * 100

In [76]:
adm4_df.head()

status,ADM4_PCODE_NAME,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,012801001_Adams (Pob.),184,38,82.882883
1,012802001_Bani,89,52,63.120567
2,012802002_Buyon,143,54,72.588832
3,012802003_Cabaruan,151,56,72.94686
4,012802004_Cabulalaan,80,21,79.207921


In [77]:
# Create new dataframe that will store phl_adm4
# with percentage completeness output
phl_adm4_with_output = phl_adm4

# Get only the columns we need to identify region.
# .copy() makes this an independent dataframe, so adding a column
# below does not trigger a SettingWithCopyWarning
phl_adm4_with_output = phl_adm4_with_output[[
    'Reg_Code',
    'Reg_Name',
    'Pro_Code',
    'Bgy_Code',
    'Bgy_Name',
    'RURBAN'
 ]].copy()

# Create new column indicating both the pcode and name of the barangay
phl_adm4_with_output["ADM4_PCODE_NAME"] = phl_adm4_with_output["Bgy_Code"] + "_" + phl_adm4_with_output["Bgy_Name"]


In [78]:
# Left joining percentage completeness values
# to their respective regions
phl_adm4_with_output = pd.merge(phl_adm4_with_output,adm4_df,how="left", on = "ADM4_PCODE_NAME")

In [79]:
phl_adm4_with_output.head()

Unnamed: 0,Reg_Code,Reg_Name,Pro_Code,Bgy_Code,Bgy_Name,RURBAN,ADM4_PCODE_NAME,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,110000000,REGION XI (DAVAO REGION),118200000,118206010,Sawangan,R,118206010_Sawangan,6.0,68.0,8.108108
1,180000000,NEGROS ISLAND REGION (NIR),184500000,184501048,Felisa,U,184501048_Felisa,3.0,271.0,1.094891
2,120000000,REGION XII (SOCCSKSARGEN),126300000,126311020,Simsiman,R,126311020_Simsiman,147.0,59.0,71.359223
3,150000000,AUTONOMOUS REGION IN MUSLIM MINDANAO (ARMM),157000000,157001001,Balimbing Proper,R,157001001_Balimbing Proper,5.0,88.0,5.376344
4,150000000,AUTONOMOUS REGION IN MUSLIM MINDANAO (ARMM),157000000,157001002,Batu-batu (Pob.),R,157001002_Batu-batu (Pob.),25.0,168.0,12.953368


The last step will be to join it to the previous output and save it to file.

In [80]:
# Left joining the new percent completeness values
# to the existing values using ADM4_PCODE_NAME as index
# For the second dataframe, we only keep wanted columns
brgy_output_june2021 = pd.merge(
    brgy_output,
    phl_adm4_with_output[['ADM4_PCODE_NAME',\
                          'pixels_withbuilding_june2021',\
                          'pixels_nobuilding_june2021',\
                          'percentage_completeness_june2021'\
                         ]],
    how="left", 
    on = "ADM4_PCODE_NAME"
)

In [81]:
# Scroll through the columns to see how percentage completeness increased over time!
brgy_output_june2021.head()

Unnamed: 0,Reg_Code,Reg_Name,Pro_Code,Pro_Name,Mun_Code,Mun_Name,Bgy_Code,Bgy_Name,RURBAN,ADM4_PCODE_NAME,pixels_withbuilding_june2020,pixels_nobuilding_june2020,percentage_completeness_june2020,ADM4_PCODE,pixels_withbuilding_may2021,pixels_nobuilding_may2021,percentage_completeness_may2021,pixels_withbuilding_june2021,pixels_nobuilding_june2021,percentage_completeness_june2021
0,110000000,REGION XI (DAVAO REGION),118200000,COMPOSTELA VALLEY,118206000.0,MAWAB,118206010.0,Sawangan,R,118206010_Sawangan,3,71,4.054054,118206010.0,6,68,8.108108,6.0,68.0,8.108108
1,180000000,NEGROS ISLAND REGION (NIR),184500000,NEGROS OCCIDENTAL,184501000.0,BACOLOD CITY (Capital),184501048.0,Felisa,U,184501048_Felisa,2,272,0.729927,184501048.0,3,271,1.094891,3.0,271.0,1.094891
2,120000000,REGION XII (SOCCSKSARGEN),126300000,SOUTH COTABATO,126311000.0,NORALA,126311020.0,Simsiman,R,126311020_Simsiman,116,90,56.31068,126311020.0,147,59,71.359223,147.0,59.0,71.359223
3,150000000,AUTONOMOUS REGION IN MUSLIM MINDANAO (ARMM),157000000,TAWI-TAWI,157001000.0,PANGLIMA SUGALA (BALIMBING),157001001.0,Balimbing Proper,R,157001001_Balimbing Proper,0,93,0.0,157001001.0,5,88,5.376344,5.0,88.0,5.376344
4,150000000,AUTONOMOUS REGION IN MUSLIM MINDANAO (ARMM),157000000,TAWI-TAWI,157001000.0,PANGLIMA SUGALA (BALIMBING),157001002.0,Batu-batu (Pob.),R,157001002_Batu-batu (Pob.),11,182,5.699482,157001002.0,25,168,12.953368,25.0,168.0,12.953368


In [82]:
# Save to .csv file
filename = "../data/mapthegap-phl-adm4-2021-06-28.csv"
brgy_output_june2021.to_csv(filename)

## Madagascar

### Loading mapped and unmapped MDG pixels

In [3]:
mdg_pixels_all = pd.read_csv("../data/mdg_pixels_all.csv")

In [4]:
mdg_pixels_all.head()

Unnamed: 0,index,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE,status
0,1,MG71,MG71713,MG71713170,MG71713170002,mapped
1,5,MG71,MG71713,MG71713170,MG71713170002,mapped
2,6,MG71,MG71713,MG71713170,MG71713170001,mapped
3,7,MG71,MG71713,MG71713170,MG71713170001,mapped
4,9,MG71,MG71713,MG71713170,MG71713170001,mapped


### Loading admin boundaries

In [21]:
mdg_adm2 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm2_BNGRC_OCHA_20181031.shp"
)

In [37]:
mdg_adm2.head()

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,PROV_CODE,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
0,MG,Madagascar,MG11,Analamanga,Region,MG11101001A,1er Arrondissement,District,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891..."
1,MG,Madagascar,MG11,Analamanga,Region,MG11101002A,2e Arrondissement,District,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911..."
2,MG,Madagascar,MG11,Analamanga,Region,MG11101003A,3e Arrondissement,District,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879..."
3,MG,Madagascar,MG11,Analamanga,Region,MG11101004A,4e Arrondissement,District,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910..."
4,MG,Madagascar,MG11,Analamanga,Region,MG11101005A,5e Arrondissement,District,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854..."


In [6]:
mdg_adm3 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm3_BNGRC_OCHA_20181031.shp"
)

In [59]:
mdg_adm3.head()

Unnamed: 0,ADM0_PCODE,ADM0_EN,ADM1_PCODE,ADM1_EN,ADM1_TYPE,ADM2_PCODE,ADM2_EN,ADM2_TYPE,ADM3_PCODE,ADM3_EN,ADM3_TYPE,PROV_CODE_,OLD_PROVIN,PROV_TYPE,NOTES,SOURCE,geometry
0,MG,Madagascar,MG11,Analamanga,Region,MG11101001A,1er Arrondissement,District,MG11101001,1er Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891..."
1,MG,Madagascar,MG11,Analamanga,Region,MG11101002A,2e Arrondissement,District,MG11101002,2e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911..."
2,MG,Madagascar,MG11,Analamanga,Region,MG11101003A,3e Arrondissement,District,MG11101003,3e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879..."
3,MG,Madagascar,MG11,Analamanga,Region,MG11101004A,4e Arrondissement,District,MG11101004,4e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910..."
4,MG,Madagascar,MG11,Analamanga,Region,MG11101005A,5e Arrondissement,District,MG11101005,5e Arrondissement,Commune,1,Antananarivo,Old Provinces/Faritany dissolved in 2007,Previous district name is Antananarivo Renivoh...,Note that Communes (admin 3) have become the D...,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854..."


### Reshaping

In the following code snippets, we use a pivot table to count the mapped and unmapped pixels in each administrative region, then calculate the percent completeness from those counts.

We repeat the same steps for ADM2 and ADM3, using `ADM2_PCODE` and `ADM3_PCODE` as indices.

#### Level 2 Boundaries (districts)

In [22]:
# fill_value=0 replaces NaN counts with 0.
# Without it, a region with no mapped (or no unmapped) pixels gets NaN,
# which would propagate into the percent completeness calculation
adm2_df = pd.pivot_table(mdg_pixels_all[["ADM2_PCODE","status", "index"]], index = ["ADM2_PCODE"], columns = ["status"], aggfunc="count", fill_value=0)

In [23]:
adm2_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM2_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001A,3216,990
MG11101002A,1519,3911
MG11101003A,1456,1742
MG11101004A,2789,2304
MG11101005A,3037,6104


The resulting dataframe has MultiIndex columns (`index` over `mapped`/`unmapped`), which we flatten in the next code blocks.

In [24]:
adm2_df.columns = adm2_df.columns.get_level_values(1)
adm2_df = adm2_df.reset_index().reset_index()

In [25]:
# Dataframe now has a regular index!
adm2_df.head()

status,index,ADM2_PCODE,mapped,unmapped
0,0,MG11101001A,3216,990
1,1,MG11101002A,1519,3911
2,2,MG11101003A,1456,1742
3,3,MG11101004A,2789,2304
4,4,MG11101005A,3037,6104


In [26]:
# Drop the index column 
adm2_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm2_df.rename(columns = {
    'mapped':'pixels_withbuilding_july2021',
    'unmapped':'pixels_nobuilding_july2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm2_df['percentage_completeness_july2021'] = (adm2_df['pixels_withbuilding_july2021']/(adm2_df['pixels_withbuilding_july2021'] + adm2_df['pixels_nobuilding_july2021'])) * 100

In [27]:
adm2_df.head()

status,ADM2_PCODE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001A,3216,990,76.462197
1,MG11101002A,1519,3911,27.974217
2,MG11101003A,1456,1742,45.528455
3,MG11101004A,2789,2304,54.761437
4,MG11101005A,3037,6104,33.223936


We've successfully calculated the percentage completeness for level 2! Next, we will join the data with the admin boundaries information.

In [30]:
# Create new dataframe that will store mdg_adm2
# with percentage completeness output
mdg_adm2_with_output = mdg_adm2

# Get only the columns we need to identify region
mdg_adm2_with_output = mdg_adm2_with_output[[
    'ADM2_EN',
    'ADM2_PCODE',
 ]]

In [31]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm2_with_output = pd.merge(mdg_adm2_with_output,adm2_df,how="left", on = "ADM2_PCODE")

In [42]:
mdg_adm2_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 5 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ADM2_EN                           119 non-null    object 
 1   ADM2_PCODE                        119 non-null    object 
 2   pixels_withbuilding_july2021      119 non-null    int64  
 3   pixels_nobuilding_july2021        119 non-null    int64  
 4   percentage_completeness_july2021  119 non-null    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 5.6+ KB


In [35]:
mdg_adm2_with_output['pixels_withbuilding_july2021'].sum() / (mdg_adm2_with_output['pixels_withbuilding_july2021'].sum() + mdg_adm2_with_output['pixels_nobuilding_july2021'].sum())

0.17325448678328068
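
The value above is a fraction (about 17.3%) pooled over all districts: total mapped pixels divided by total pixels. It generally differs from the unweighted mean of the per-district percentages, because districts contain different numbers of pixels. A sketch using only the five districts shown in the `head()` output earlier, so the numbers will not match the national figure:

```python
import pandas as pd

# The five districts from the head() output above.
sample = pd.DataFrame({
    "mapped": [3216, 1519, 1456, 2789, 3037],
    "unmapped": [990, 3911, 1742, 2304, 6104],
})
total = sample["mapped"] + sample["unmapped"]

# Pooled completeness: every pixel counts equally.
pooled_pct = sample["mapped"].sum() / total.sum() * 100

# Unweighted mean: every district counts equally, regardless of size.
mean_pct = (sample["mapped"] / total * 100).mean()

print(round(pooled_pct, 2), round(mean_pct, 2))
```

The pooled figure is pulled toward the largest district in the sample, which is also the least complete one after the 2e Arrondissement.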

The last step is to join the result to the district geometries and save it to file.

In [49]:
# Left joining the new percent completeness values
# to the district boundaries (with geometry) using ADM2_PCODE as key
# For the second dataframe, we only keep wanted columns
district_output_july2021 = pd.merge(
    mdg_adm2[['ADM2_PCODE','ADM2_EN','ADM2_TYPE','geometry']],
    mdg_adm2_with_output[['ADM2_PCODE',\
                          'pixels_withbuilding_july2021',\
                          'pixels_nobuilding_july2021',\
                          'percentage_completeness_july2021'\
                         ]],
    how="left", 
    on = "ADM2_PCODE"
)

In [50]:
district_output_july2021.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM2_PCODE                        119 non-null    object  
 1   ADM2_EN                           119 non-null    object  
 2   ADM2_TYPE                         119 non-null    object  
 3   geometry                          119 non-null    geometry
 4   pixels_withbuilding_july2021      119 non-null    int64   
 5   pixels_nobuilding_july2021        119 non-null    int64   
 6   percentage_completeness_july2021  119 non-null    float64 
dtypes: float64(1), geometry(1), int64(2), object(3)
memory usage: 7.4+ KB


In [51]:
district_output_july2021.head()

Unnamed: 0,ADM2_PCODE,ADM2_EN,ADM2_TYPE,geometry,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001A,1er Arrondissement,District,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",3216,990,76.462197
1,MG11101002A,2e Arrondissement,District,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1519,3911,27.974217
2,MG11101003A,3e Arrondissement,District,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1456,1742,45.528455
3,MG11101004A,4e Arrondissement,District,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",2789,2304,54.761437
4,MG11101005A,5e Arrondissement,District,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",3037,6104,33.223936


In [52]:
# Save to .gpkg file
filename = "../data/mapthegap-mdg-adm2-2021-07-12.gpkg"
district_output_july2021.to_file(filename, driver="GPKG")

Mostly the same steps are followed for the level 3 boundaries.

#### Level 3 (Commune)

In [55]:
# Pivot table of mapped and unmapped pixels
# fill_value=0 replaces NaN counts with 0.
# Without it, a region with no mapped (or no unmapped) pixels gets NaN,
# which would propagate into the percent completeness calculation
adm3_df = pd.pivot_table(mdg_pixels_all[["ADM3_PCODE","status", "index"]], index = ["ADM3_PCODE"], columns = ["status"], aggfunc="count", fill_value=0)

# Fix dataframe index
adm3_df.columns = adm3_df.columns.get_level_values(1)
adm3_df = adm3_df.reset_index().reset_index()

In [56]:
# Dataframe now has a regular index!
adm3_df.head()

status,index,ADM3_PCODE,mapped,unmapped
0,0,MG11101001,3216,990
1,1,MG11101002,1519,3911
2,2,MG11101003,1456,1742
3,3,MG11101004,2789,2304
4,4,MG11101005,3037,6104


In [57]:
# Drop the index column 
adm3_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm3_df.rename(columns = {
    'mapped':'pixels_withbuilding_july2021',
    'unmapped':'pixels_nobuilding_july2021'
    },
    inplace=True
)

# Adding a column for percent completeness
adm3_df['percentage_completeness_july2021'] = (adm3_df['pixels_withbuilding_july2021']/(adm3_df['pixels_withbuilding_july2021'] + adm3_df['pixels_nobuilding_july2021'])) * 100

In [58]:
adm3_df.head()

status,ADM3_PCODE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001,3216,990,76.462197
1,MG11101002,1519,3911,27.974217
2,MG11101003,1456,1742,45.528455
3,MG11101004,2789,2304,54.761437
4,MG11101005,3037,6104,33.223936


In [72]:
# Create new dataframe that will store mdg_adm3
# with percentage completeness output
mdg_adm3_with_output = mdg_adm3

# Get only the columns we need to identify region
mdg_adm3_with_output = mdg_adm3_with_output[[
    'ADM3_EN',
    'ADM3_PCODE',
    'ADM2_TYPE'
 ]]

In [75]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm3_with_output = pd.merge(mdg_adm3_with_output,adm3_df,how="left", on = "ADM3_PCODE")

In [76]:
mdg_adm3_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ADM3_EN                           1579 non-null   object 
 1   ADM3_PCODE                        1579 non-null   object 
 2   ADM2_TYPE                         1579 non-null   object 
 3   pixels_withbuilding_july2021      1579 non-null   int64  
 4   pixels_nobuilding_july2021        1579 non-null   int64  
 5   percentage_completeness_july2021  1579 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 86.4+ KB


In [77]:
mdg_adm3_with_output.head()

Unnamed: 0,ADM3_EN,ADM3_PCODE,ADM2_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,1er Arrondissement,MG11101001,District,3216,990,76.462197
1,2e Arrondissement,MG11101002,District,1519,3911,27.974217
2,3e Arrondissement,MG11101003,District,1456,1742,45.528455
3,4e Arrondissement,MG11101004,District,2789,2304,54.761437
4,5e Arrondissement,MG11101005,District,3037,6104,33.223936


The last step is to join the result to the commune geometries and save it to file.

In [79]:
# Left joining the new percent completeness values
# to the commune boundaries (with geometry) using ADM3_PCODE as key
# For the second dataframe, we only keep wanted columns
commune_output_july2021 = pd.merge(
    mdg_adm3[['ADM3_PCODE','ADM3_EN','ADM3_TYPE','geometry']],
    mdg_adm3_with_output[['ADM3_PCODE',\
                          'pixels_withbuilding_july2021',\
                          'pixels_nobuilding_july2021',\
                          'percentage_completeness_july2021'\
                         ]],
    how="left", 
    on = "ADM3_PCODE"
)

In [80]:
# Preview the commune-level output (only the July 2021 period so far)
commune_output_july2021.head()

Unnamed: 0,ADM3_PCODE,ADM3_EN,ADM3_TYPE,geometry,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021
0,MG11101001,1er Arrondissement,Commune,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",3216,990,76.462197
1,MG11101002,2e Arrondissement,Commune,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1519,3911,27.974217
2,MG11101003,3e Arrondissement,Commune,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1456,1742,45.528455
3,MG11101004,4e Arrondissement,Commune,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",2789,2304,54.761437
4,MG11101005,5e Arrondissement,Commune,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",3037,6104,33.223936


In [84]:
# Save to .gpkg file
filename = "../data/mapthegap-mdg-adm3-2021-07-12.gpkg"
commune_output_july2021.to_file(filename, driver="GPKG")

In [85]:
! gsutil cp ~/osm-completeness/data/mapthegap-mdg-adm3-2021-07-12.gpkg gs://tm-ardie/2021-07-13-osm-completeness-madagascar/

Copying file:///home/jupyter/osm-completeness/data/mapthegap-mdg-adm3-2021-07-12.gpkg [Content-Type=application/octet-stream]...
- [1 files][ 23.9 MiB/ 23.9 MiB]                                                
Operation completed over 1 objects/23.9 MiB.                                     


# 2020
Redoing the analysis with the January 2020 pixels, then joining the results to the July 2021 output above.

### Loading mapped and unmapped MDG pixels

In [95]:
mdg_pixels_all_jan2020 = gpd.read_file(
    "../data/mdg_pixels_all_jan2020.gpkg",
    driver='GPKG'
)

In [96]:
mdg_pixels_all_jan2020.head()

Unnamed: 0,index,ADM1_PCODE,ADM2_PCODE,ADM3_PCODE,ADM4_PCODE,status,geometry
0,1,MG71,MG71713,MG71713170,MG71713170002,mapped,POINT (49.27167 -11.95611)
1,5,MG71,MG71713,MG71713170,MG71713170002,mapped,POINT (49.25278 -11.99111)
2,6,MG71,MG71713,MG71713170,MG71713170001,mapped,POINT (49.22861 -12.04611)
3,7,MG71,MG71713,MG71713170,MG71713170001,mapped,POINT (49.22861 -12.04639)
4,9,MG71,MG71713,MG71713170,MG71713170001,mapped,POINT (49.23278 -12.05417)


### Loading previous output

In [89]:
district_output_july2021 = gpd.read_file(
    '../data/mapthegap-mdg-adm2-2021-07-12.gpkg',
    driver='GPKG'
)
district_output_july2021.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM2_PCODE                        119 non-null    object  
 1   ADM2_EN                           119 non-null    object  
 2   ADM2_TYPE                         119 non-null    object  
 3   pixels_withbuilding_july2021      119 non-null    int64   
 4   pixels_nobuilding_july2021        119 non-null    int64   
 5   percentage_completeness_july2021  119 non-null    float64 
 6   geometry                          119 non-null    geometry
dtypes: float64(1), geometry(1), int64(2), object(3)
memory usage: 6.6+ KB


In [87]:
commune_output_july2021 = gpd.read_file(
    "../data/mapthegap-mdg-adm3-2021-07-12.gpkg",
    driver="GPKG"
)
commune_output_july2021.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1579 entries, 0 to 1578
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM3_PCODE                        1579 non-null   object  
 1   ADM3_EN                           1579 non-null   object  
 2   ADM3_TYPE                         1579 non-null   object  
 3   pixels_withbuilding_july2021      1579 non-null   int64   
 4   pixels_nobuilding_july2021        1579 non-null   int64   
 5   percentage_completeness_july2021  1579 non-null   float64 
 6   geometry                          1579 non-null   geometry
dtypes: float64(1), geometry(1), int64(2), object(3)
memory usage: 86.5+ KB


### Loading admin boundaries

In [21]:
mdg_adm2 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm2_BNGRC_OCHA_20181031.shp"
)

In [6]:
mdg_adm3 = gpd.read_file(
    "../download_data/mdg_adm_all/mdg_admbnda_adm3_BNGRC_OCHA_20181031.shp"
)

### Reshaping

In the following code snippets, we use a pivot table to count the mapped and unmapped pixels in each administrative region, then calculate the percent completeness from those counts.

We repeat the same steps for ADM2 and ADM3, using `ADM2_PCODE` and `ADM3_PCODE` as the respective indices.
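
As a worked example of the arithmetic, here is a minimal sketch of the percent completeness calculation for a single region. The helper name `percent_completeness` is hypothetical (it is not used elsewhere in this notebook), and returning NaN for a region with no pixels at all is an assumption about how that edge case should be treated.

```python
import math

def percent_completeness(mapped: int, unmapped: int) -> float:
    """Share of pixels with a mapped building, as a percentage (0-100)."""
    total = mapped + unmapped
    if total == 0:
        # Assumption: a region with no pixels at all has undefined completeness
        return math.nan
    return mapped / total * 100

# District MG11101001A in the Jan 2020 counts: 2472 mapped, 1734 unmapped pixels
print(round(percent_completeness(2472, 1734), 6))  # 58.773181
```

The notebook cells do the same computation column-wise on the whole dataframe rather than one region at a time.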

#### Level 2 Boundaries (districts)

In [97]:
# fill_value=0 replaces NaN counts with 0: a region with no mapped
# (or no unmapped) pixels would otherwise get NaN, which would
# propagate into the percent completeness calculation
adm2_df = pd.pivot_table(
    mdg_pixels_all_jan2020[["ADM2_PCODE", "status", "index"]],
    index=["ADM2_PCODE"], columns=["status"],
    aggfunc="count", fill_value=0,
)

In [98]:
adm2_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM2_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001A,2472,1734
MG11101002A,1439,3991
MG11101003A,1424,1774
MG11101004A,1646,3447
MG11101005A,2153,6988


The resulting dataframe has MultiIndex columns (`index` over `mapped`/`unmapped`), which we flatten in the next code blocks.
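
To make the column structure concrete, the following sketch runs the same kind of pivot on a tiny made-up pixel table and then flattens the result. The region codes `A` and `B` and the variable names `pixels`, `counts` and `flat` are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

# Toy pixel table (made-up values) with the same three columns used above
pixels = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "ADM2_PCODE": ["A", "A", "A", "B", "B"],
    "status": ["mapped", "unmapped", "mapped", "unmapped", "unmapped"],
})

counts = pd.pivot_table(
    pixels[["ADM2_PCODE", "status", "index"]],
    index=["ADM2_PCODE"],
    columns=["status"],
    aggfunc="count",
    fill_value=0,  # region "B" has no mapped pixels, so its count becomes 0
)
# Columns are two-level tuples: ("index", "mapped") and ("index", "unmapped")
print(counts.columns.nlevels)

# Keep only the second level (the status labels) ...
counts.columns = counts.columns.get_level_values(1)
# ... and turn ADM2_PCODE from the row index back into a regular column
flat = counts.reset_index()
flat.columns.name = None
print(list(flat.columns))
```

A single `reset_index()` is enough here; calling it twice, as the notebook cell does, only adds an extra positional `index` column that is dropped afterwards.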

In [99]:
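# Keep only the status level ("mapped"/"unmapped") of the column MultiIndex
adm2_df.columns = adm2_df.columns.get_level_values(1)
# 1st reset_index: ADM2_PCODE becomes a column; 2nd: adds a redundant "index" column
adm2_df = adm2_df.reset_index().reset_index()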

In [100]:
# Dataframe now has a regular index!
adm2_df.head()

status,index,ADM2_PCODE,mapped,unmapped
0,0,MG11101001A,2472,1734
1,1,MG11101002A,1439,3991
2,2,MG11101003A,1424,1774
3,3,MG11101004A,1646,3447
4,4,MG11101005A,2153,6988


In [101]:
# Drop the index column 
adm2_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm2_df.rename(columns = {
    'mapped':'pixels_withbuilding_jan2020',
    'unmapped':'pixels_nobuilding_jan2020'
    },
    inplace=True
)

# Adding a column for percent completeness
adm2_df['percentage_completeness_jan2020'] = (
    adm2_df['pixels_withbuilding_jan2020']
    / (adm2_df['pixels_withbuilding_jan2020'] + adm2_df['pixels_nobuilding_jan2020'])
) * 100

In [102]:
adm2_df.head()

status,ADM2_PCODE,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001A,2472,1734,58.773181
1,MG11101002A,1439,3991,26.500921
2,MG11101003A,1424,1774,44.52783
3,MG11101004A,1646,3447,32.318869
4,MG11101005A,2153,6988,23.553222


We've successfully calculated the percentage completeness for level 2! Next, we will join the data with the admin boundaries information.
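
Because this is a left join on the boundary table, any region without pixel counts would end up with NaN values rather than being dropped. The sketch below, on made-up data, shows one way to surface such rows using the `validate` and `indicator` options of `pd.merge`. The frames `boundaries` and `stats` and their contents are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the boundary table and the pivoted counts
boundaries = pd.DataFrame({
    "ADM2_PCODE": ["A", "B", "C"],
    "ADM2_EN": ["Alpha", "Beta", "Gamma"],
})
stats = pd.DataFrame({
    "ADM2_PCODE": ["A", "B"],
    "pixels_withbuilding_jan2020": [2, 0],
    "pixels_nobuilding_jan2020": [1, 2],
})

joined = pd.merge(
    boundaries,
    stats,
    how="left",
    on="ADM2_PCODE",
    validate="one_to_one",  # raise MergeError if a code is duplicated on either side
    indicator=True,         # add a "_merge" column recording where each row matched
)

# Boundary rows with no pixel counts show up as "left_only" (their counts are NaN)
unmatched = joined.loc[joined["_merge"] == "left_only", "ADM2_PCODE"].tolist()
print(unmatched)  # ['C']
```

In this notebook the `.info()` output after the join reports no null values, so every district found a match; the check matters mostly when the pixel data and the boundary file come from different vintages.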

In [103]:
# Create a new dataframe that will store the mdg_adm2 attributes
# with the percentage completeness output
mdg_adm2_with_output = mdg_adm2

# Get only the columns we need to identify region
mdg_adm2_with_output = mdg_adm2_with_output[[
    'ADM2_EN',
    'ADM2_PCODE',
 ]]

In [104]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm2_with_output = pd.merge(mdg_adm2_with_output, adm2_df, how="left", on="ADM2_PCODE")

In [105]:
mdg_adm2_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ADM2_EN                          119 non-null    object 
 1   ADM2_PCODE                       119 non-null    object 
 2   pixels_withbuilding_jan2020      119 non-null    int64  
 3   pixels_nobuilding_jan2020        119 non-null    int64  
 4   percentage_completeness_jan2020  119 non-null    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 5.6+ KB


In [106]:
# Sanity check: overall share of mapped pixels across all districts (a fraction, not a percentage)
mdg_adm2_with_output['pixels_withbuilding_jan2020'].sum() / (mdg_adm2_with_output['pixels_withbuilding_jan2020'].sum() + mdg_adm2_with_output['pixels_nobuilding_jan2020'].sum())

0.10889578597380667
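
The value above is a pooled fraction (about 10.9% of all pixels are mapped), which is generally not the same as the average of the per-district percentages. The following sketch illustrates the difference using the counts of the first two districts from the table above; the variable names are illustrative.

```python
# (mapped, unmapped) pixel counts for districts MG11101001A and MG11101002A
districts = [(2472, 1734), (1439, 3991)]

mapped_total = sum(m for m, _ in districts)
pixel_total = sum(m + u for m, u in districts)

# Pooled completeness: every pixel counts equally
pooled_pct = mapped_total / pixel_total * 100

# Unweighted mean of per-district percentages: every district counts equally
mean_pct = sum(m / (m + u) * 100 for m, u in districts) / len(districts)

print(round(pooled_pct, 2), round(mean_pct, 2))  # 40.59 42.64
```

The pooled figure is the one to quote for country-level completeness, since districts differ widely in their pixel counts.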

The last step will be to join it to the previous output and save it to file.

In [107]:
# Left joining the new percent completeness values
# to the existing values using ADM2_PCODE as the join key
# For the second dataframe, we only keep the wanted columns
district_output_jan2020 = pd.merge(
    district_output_july2021,
    mdg_adm2_with_output[[
        'ADM2_PCODE',
        'pixels_withbuilding_jan2020',
        'pixels_nobuilding_jan2020',
        'percentage_completeness_jan2020',
    ]],
    how="left",
    on="ADM2_PCODE"
)

In [108]:
district_output_jan2020.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM2_PCODE                        119 non-null    object  
 1   ADM2_EN                           119 non-null    object  
 2   ADM2_TYPE                         119 non-null    object  
 3   pixels_withbuilding_july2021      119 non-null    int64   
 4   pixels_nobuilding_july2021        119 non-null    int64   
 5   percentage_completeness_july2021  119 non-null    float64 
 6   geometry                          119 non-null    geometry
 7   pixels_withbuilding_jan2020       119 non-null    int64   
 8   pixels_nobuilding_jan2020         119 non-null    int64   
 9   percentage_completeness_jan2020   119 non-null    float64 
dtypes: float64(2), geometry(1), int64(4), object(3)
memory usage: 10.2+ KB


In [128]:
district_output_jan2020.head(20)

Unnamed: 0,ADM2_PCODE,ADM2_EN,ADM2_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021,geometry,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001A,1er Arrondissement,District,3216,990,76.462197,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",2472,1734,58.773181
1,MG11101002A,2e Arrondissement,District,1519,3911,27.974217,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1439,3991,26.500921
2,MG11101003A,3e Arrondissement,District,1456,1742,45.528455,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1424,1774,44.52783
3,MG11101004A,4e Arrondissement,District,2789,2304,54.761437,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",1646,3447,32.318869
4,MG11101005A,5e Arrondissement,District,3037,6104,33.223936,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",2153,6988,23.553222
5,MG11101006A,6e Arrondissement,District,636,2423,20.791108,"POLYGON ((47.48436 -18.83989, 47.48491 -18.840...",561,2498,18.339327
6,MG11102,Antananarivo Avaradrano,District,6692,27525,19.55753,"POLYGON ((47.61521 -18.71709, 47.61702 -18.717...",4015,30201,11.734276
7,MG11103,Ambohidratrimo,District,4332,31492,12.092452,"POLYGON ((47.49982 -18.43546, 47.50490 -18.439...",3797,32027,10.59904
8,MG11104,Ankazobe,District,1486,15437,8.780949,"POLYGON ((46.74249 -17.71321, 46.74325 -17.713...",645,16278,3.811381
9,MG11106,Manjakandriana,District,1296,24520,5.020143,"POLYGON ((47.72437 -18.47394, 47.72465 -18.477...",648,25168,2.510071


In [132]:
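# Save to .gpkg file
# Note: this is the same filename as the July 2021 output loaded above,
# so that file is overwritten with the jan2020 columns added
filename = "../data/mapthegap-mdg-adm2-2021-07-12.gpkg"
district_output_jan2020.to_file(filename, driver="GPKG")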

In [133]:
! gsutil cp ~/osm-completeness/data/mapthegap-mdg-adm2-2021-07-12.gpkg gs://tm-ardie/2021-07-13-osm-completeness-madagascar/

Copying file:///home/jupyter/osm-completeness/data/mapthegap-mdg-adm2-2021-07-12.gpkg [Content-Type=application/octet-stream]...
/ [1 files][ 10.1 MiB/ 10.1 MiB]                                                
Operation completed over 1 objects/10.1 MiB.                                     


#### Level 3 Boundaries (communes)

In [112]:
# fill_value=0 replaces NaN counts with 0: a region with no mapped
# (or no unmapped) pixels would otherwise get NaN, which would
# propagate into the percent completeness calculation
adm3_df = pd.pivot_table(
    mdg_pixels_all_jan2020[["ADM3_PCODE", "status", "index"]],
    index=["ADM3_PCODE"], columns=["status"],
    aggfunc="count", fill_value=0,
)

In [113]:
adm3_df.head()

Unnamed: 0_level_0,index,index
status,mapped,unmapped
ADM3_PCODE,Unnamed: 1_level_2,Unnamed: 2_level_2
MG11101001,2472,1734
MG11101002,1439,3991
MG11101003,1424,1774
MG11101004,1646,3447
MG11101005,2153,6988


The resulting dataframe has MultiIndex columns (`index` over `mapped`/`unmapped`), which we flatten in the next code blocks.

In [114]:
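# Keep only the status level ("mapped"/"unmapped") of the column MultiIndex
adm3_df.columns = adm3_df.columns.get_level_values(1)
# 1st reset_index: ADM3_PCODE becomes a column; 2nd: adds a redundant "index" column
adm3_df = adm3_df.reset_index().reset_index()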

In [116]:
# Dataframe now has a regular index!
adm3_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1579 entries, 0 to 1578
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   index       1579 non-null   int64 
 1   ADM3_PCODE  1579 non-null   object
 2   mapped      1579 non-null   int64 
 3   unmapped    1579 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 49.5+ KB


In [117]:
# Drop the index column 
adm3_df.drop(["index"],axis = 1,inplace=True)

# Renaming columns to fit previous convention
adm3_df.rename(columns = {
    'mapped':'pixels_withbuilding_jan2020',
    'unmapped':'pixels_nobuilding_jan2020'
    },
    inplace=True
)

# Adding a column for percent completeness
adm3_df['percentage_completeness_jan2020'] = (
    adm3_df['pixels_withbuilding_jan2020']
    / (adm3_df['pixels_withbuilding_jan2020'] + adm3_df['pixels_nobuilding_jan2020'])
) * 100

In [118]:
adm3_df.head()

status,ADM3_PCODE,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001,2472,1734,58.773181
1,MG11101002,1439,3991,26.500921
2,MG11101003,1424,1774,44.52783
3,MG11101004,1646,3447,32.318869
4,MG11101005,2153,6988,23.553222


We've successfully calculated the percentage completeness for level 3! Next, we will join the data with the admin boundaries information.

In [119]:
# Create a new dataframe that will store the mdg_adm3 attributes
# with the percentage completeness output
mdg_adm3_with_output = mdg_adm3

# Get only the columns we need to identify region
mdg_adm3_with_output = mdg_adm3_with_output[[
    'ADM3_EN',
    'ADM3_PCODE',
 ]]

In [120]:
# Left joining percentage completeness values
# to their respective regions
mdg_adm3_with_output = pd.merge(mdg_adm3_with_output, adm3_df, how="left", on="ADM3_PCODE")

In [122]:
mdg_adm3_with_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ADM3_EN                          1579 non-null   object 
 1   ADM3_PCODE                       1579 non-null   object 
 2   pixels_withbuilding_jan2020      1579 non-null   int64  
 3   pixels_nobuilding_jan2020        1579 non-null   int64  
 4   percentage_completeness_jan2020  1579 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 74.0+ KB


In [123]:
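# Sanity check: overall share of mapped pixels across all communes (should match the district-level value)
mdg_adm3_with_output['pixels_withbuilding_jan2020'].sum() / (mdg_adm3_with_output['pixels_withbuilding_jan2020'].sum() + mdg_adm3_with_output['pixels_nobuilding_jan2020'].sum())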

0.10889578597380667

The last step will be to join it to the previous output and save it to file.

In [124]:
# Left joining the new percent completeness values
# to the existing values using ADM3_PCODE as the join key
# For the second dataframe, we only keep the wanted columns
commune_output_jan2020 = pd.merge(
    commune_output_july2021,
    mdg_adm3_with_output[[
        'ADM3_PCODE',
        'pixels_withbuilding_jan2020',
        'pixels_nobuilding_jan2020',
        'percentage_completeness_jan2020',
    ]],
    how="left",
    on="ADM3_PCODE"
)

In [125]:
commune_output_jan2020.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1579 entries, 0 to 1578
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ADM3_PCODE                        1579 non-null   object  
 1   ADM3_EN                           1579 non-null   object  
 2   ADM3_TYPE                         1579 non-null   object  
 3   pixels_withbuilding_july2021      1579 non-null   int64   
 4   pixels_nobuilding_july2021        1579 non-null   int64   
 5   percentage_completeness_july2021  1579 non-null   float64 
 6   geometry                          1579 non-null   geometry
 7   pixels_withbuilding_jan2020       1579 non-null   int64   
 8   pixels_nobuilding_jan2020         1579 non-null   int64   
 9   percentage_completeness_jan2020   1579 non-null   float64 
dtypes: float64(2), geometry(1), int64(4), object(3)
memory usage: 135.7+ KB


In [129]:
commune_output_jan2020.head(20)

Unnamed: 0,ADM3_PCODE,ADM3_EN,ADM3_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021,geometry,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020
0,MG11101001,1er Arrondissement,Commune,3216,990,76.462197,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891...",2472,1734,58.773181
1,MG11101002,2e Arrondissement,Commune,1519,3911,27.974217,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911...",1439,3991,26.500921
2,MG11101003,3e Arrondissement,Commune,1456,1742,45.528455,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879...",1424,1774,44.52783
3,MG11101004,4e Arrondissement,Commune,2789,2304,54.761437,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910...",1646,3447,32.318869
4,MG11101005,5e Arrondissement,Commune,3037,6104,33.223936,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854...",2153,6988,23.553222
5,MG11101006,6e Arrondissement,Commune,636,2423,20.791108,"POLYGON ((47.48436 -18.83989, 47.48491 -18.840...",561,2498,18.339327
6,MG11102010,Alasora,Commune,481,3040,13.660892,"POLYGON ((47.58420 -18.93248, 47.58459 -18.932...",461,3060,13.092871
7,MG11102039,Ankadikely Ilafy,Commune,4759,1267,78.974444,"POLYGON ((47.60350 -18.83852, 47.60155 -18.841...",2839,3186,47.120332
8,MG11102050,Ambohimanambola,Commune,13,1351,0.953079,"POLYGON ((47.61793 -18.91379, 47.61867 -18.919...",13,1351,0.953079
9,MG11102079,Sabotsy Namehana,Commune,252,3818,6.191646,"POLYGON ((47.54650 -18.78823, 47.54925 -18.791...",52,4018,1.277641
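
With both periods side by side, the change in completeness can be read off directly. The sketch below computes it for three rows copied from the table above. The column name `completeness_change_pp` is a hypothetical addition and is not part of the saved output.

```python
import pandas as pd

# Three rows copied from the commune table above
communes = pd.DataFrame({
    "ADM3_PCODE": ["MG11101001", "MG11102039", "MG11102050"],
    "percentage_completeness_july2021": [76.462197, 78.974444, 0.953079],
    "percentage_completeness_jan2020": [58.773181, 47.120332, 0.953079],
})

# Hypothetical column: change in completeness, in percentage points
communes["completeness_change_pp"] = (
    communes["percentage_completeness_july2021"]
    - communes["percentage_completeness_jan2020"]
)

# Largest gains first
ranked = communes.sort_values("completeness_change_pp", ascending=False)
print(ranked[["ADM3_PCODE", "completeness_change_pp"]])
```

Of these three communes, Ankadikely Ilafy (`MG11102039`) gained the most, roughly 31.9 percentage points, while Ambohimanambola (`MG11102050`) did not change.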


In [130]:
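# Save to .gpkg file
# Note: this is the same filename as the July 2021 output loaded above,
# so that file is overwritten with the jan2020 columns added
filename = "../data/mapthegap-mdg-adm3-2021-07-12.gpkg"
commune_output_jan2020.to_file(filename, driver="GPKG")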

In [131]:
! gsutil cp ~/osm-completeness/data/mapthegap-mdg-adm3-2021-07-12.gpkg gs://tm-ardie/2021-07-13-osm-completeness-madagascar/

Copying file:///home/jupyter/osm-completeness/data/mapthegap-mdg-adm3-2021-07-12.gpkg [Content-Type=application/octet-stream]...
- [1 files][ 23.9 MiB/ 23.9 MiB]                                                
Operation completed over 1 objects/23.9 MiB.                                     


In [135]:
# Read the saved file back to check that it holds both the july2021 and jan2020 columns
test_gdf = gpd.read_file("../data/mapthegap-mdg-adm2-2021-07-12.gpkg",driver='GPKG')
test_gdf.head()

Unnamed: 0,ADM2_PCODE,ADM2_EN,ADM2_TYPE,pixels_withbuilding_july2021,pixels_nobuilding_july2021,percentage_completeness_july2021,pixels_withbuilding_jan2020,pixels_nobuilding_jan2020,percentage_completeness_jan2020,geometry
0,MG11101001A,1er Arrondissement,District,3216,990,76.462197,2472,1734,58.773181,"POLYGON ((47.50556 -18.89146, 47.50563 -18.891..."
1,MG11101002A,2e Arrondissement,District,1519,3911,27.974217,1439,3991,26.500921,"POLYGON ((47.55842 -18.91178, 47.55857 -18.911..."
2,MG11101003A,3e Arrondissement,District,1456,1742,45.528455,1424,1774,44.52783,"POLYGON ((47.51365 -18.87834, 47.51775 -18.879..."
3,MG11101004A,4e Arrondissement,District,2789,2304,54.761437,1646,3447,32.318869,"POLYGON ((47.50262 -18.91043, 47.50261 -18.910..."
4,MG11101005A,5e Arrondissement,District,3037,6104,33.223936,2153,6988,23.553222,"POLYGON ((47.53500 -18.85464, 47.53518 -18.854..."
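
Rather than checking the columns by eye, the read-back test could be automated. This is a minimal sketch under the assumption that the expected columns follow the naming convention used above. The helper `missing_columns` is hypothetical, and it is demonstrated on an empty stand-in frame rather than on the real file.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, level: str, periods=("july2021", "jan2020")) -> list:
    """Return the expected output columns that are absent from df."""
    expected = [f"{level}_PCODE", f"{level}_EN", f"{level}_TYPE"]
    for period in periods:
        expected += [
            f"pixels_withbuilding_{period}",
            f"pixels_nobuilding_{period}",
            f"percentage_completeness_{period}",
        ]
    return [col for col in expected if col not in df.columns]

# Stand-in for test_gdf: an empty frame with only the July 2021 columns
july_only = pd.DataFrame(columns=[
    "ADM2_PCODE", "ADM2_EN", "ADM2_TYPE",
    "pixels_withbuilding_july2021",
    "pixels_nobuilding_july2021",
    "percentage_completeness_july2021",
])
print(missing_columns(july_only, "ADM2"))
```

On the real data, `assert not missing_columns(test_gdf, "ADM2")` would fail loudly if the file on disk were still the July-only version.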
