The population data comes from the Humanitarian Data Exchange hub (https://data.humdata.org/), a repository of humanitarian data from multiple sources. Most of it is produced by Meta, which developed a model to estimate population at a very fine spatial resolution (described at https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps). For countries not supported by the Meta model, we integrate data from Kontur (https://www.kontur.io/), a provider of free humanitarian geospatial data. The data is split into one file per country, which we used to our advantage: it lets us track every quadkey of interest and label it with its country.

In [1]:
import pandas as pd
import os

cart = 'C:\\Users\\Luca\\Downloads\\temp'

We now illustrate how to read a Meta dataset for a sample country, in this case Angola. The dataset has three columns: longitude, latitude, and population. The population column is named after the country's ISO-3 code, followed by 'general' and the year of the estimate (here, ago_general_2020). Since all of this information is also stored in the file name, it is easy to derive the population column's name programmatically and rename it to the more conventional 'population'.

In [2]:
ago = pd.read_csv(f'{cart}\\ago_general_2020.csv')
ago

Unnamed: 0,longitude,latitude,ago_general_2020
0,11.751806,-3.999861,7.292872
1,13.914028,-3.999861,11.271312
2,13.916806,-3.999861,11.271312
3,13.917639,-3.999861,11.271312
4,13.917917,-3.999861,11.271312
...,...,...,...
5687940,22.232361,-18.999028,3.144540
5687941,23.769583,-18.999306,3.126035
5687942,23.770139,-18.999306,3.126035
5687943,23.770417,-18.999306,3.126035


In [3]:
file = 'ago_general_2020.csv'
country = file[0:3]
anno = file[12:16]
nome_var = country + '_general_' + anno

ago = ago.rename(columns = {nome_var: 'population', 'latitude': 'lat', 'longitude': 'lon'})
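
The fixed slices file[0:3] and file[12:16] rely on the exact '<iso3>_general_<year>.csv' layout. As a minimal sketch, assuming that naming pattern holds for every Meta file, the same information could be extracted with a regular expression that fails loudly on unexpected names. The helper name population_column is hypothetical and not part of the original notebook.

```python
import re

# Assumed file-name pattern: '<iso3>_general_<year>.csv', e.g. 'ago_general_2020.csv'
META_NAME = re.compile(r'^(?P<iso3>[a-z]{3})_general_(?P<year>\d{4})\.csv$')


def population_column(file_name):
    """Return (iso3, year, population column name) for a Meta file name."""
    match = META_NAME.match(file_name)
    if match is None:
        # Fail early instead of silently slicing the wrong characters
        raise ValueError(f'unexpected file name: {file_name}')
    iso3, year = match['iso3'], match['year']
    return iso3, year, f'{iso3}_general_{year}'


print(population_column('ago_general_2020.csv'))
```

The third element of the returned tuple is the key to pass to rename(), exactly like nome_var above.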

In [4]:
ago.describe()

Unnamed: 0,lon,lat,population
count,5687945.0,5687945.0,5687945.0
mean,17.63886,-8.936835,12.78684
std,3.833931,4.499281,13.14782
min,11.23764,-18.99958,0.0
25%,14.78986,-12.54819,5.726582
50%,16.15236,-7.005972,9.23722
75%,21.33736,-5.227361,14.78771
max,24.99958,-3.999861,179.8801


Assigning each point to its quadkey 14 tile (that is, a tile at zoom level 14) is straightforward with pyquadkey2's quadkey.from_geo() function. Angola has 99,565 distinct quadkeys.

In [5]:
from pyquadkey2 import quadkey
ago['quadkey'] = ago.apply(lambda x: str(quadkey.from_geo((x['lat'], x['lon']), 14)), axis=1)
len(ago.quadkey.unique())

99565
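
To show what a quadkey encodes, here is a minimal pure-Python sketch of the standard Web Mercator tile scheme (the one documented for the Bing Maps Tile System). It is an illustration under that assumption, not pyquadkey2's actual implementation, and the two libraries may round differently for points lying exactly on a tile edge. The function names are hypothetical.

```python
import math


def tile_to_quadkey(tile_x, tile_y, level):
    """Interleave the bits of the tile column and row into base-4 digits."""
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return ''.join(digits)


def latlon_to_quadkey(lat, lon, level=14):
    """Quadkey of the tile containing (lat, lon) at the given zoom level."""
    # Web Mercator is only defined up to about +/-85.05 degrees of latitude
    lat = max(min(lat, 85.05112878), -85.05112878)
    lon = max(min(lon, 180.0), -180.0)
    n = 1 << level  # number of tiles per side at this level
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    return tile_to_quadkey(tile_x, tile_y, level)


# Two neighbouring points taken from the Angola table above
print(latlon_to_quadkey(-3.999861, 13.916806))
print(latlon_to_quadkey(-3.999861, 13.917917))
```

Each extra digit splits a tile into four, so a quadkey at level 14 always has 14 digits, and its first 10 digits are the quadkey of the level-10 tile that contains it.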

The 'population' value represents the estimated population density at each point. We estimate the value of a quadkey 14 tile as the mean of all the points that fall within that tile.

In [6]:
ago = ago.groupby('quadkey', as_index = False).population.mean()
ago.describe()

Unnamed: 0,population
count,99565.0
mean,10.431596
std,6.949121
min,0.0
25%,5.88245
50%,9.140633
75%,13.760512
max,95.530343
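
To make the aggregation concrete, here is a minimal plain-Python sketch of what groupby('quadkey').population.mean() computes. The quadkeys and values below are made up for illustration, and the helper name mean_by_tile is hypothetical.

```python
from collections import defaultdict


def mean_by_tile(rows):
    """Average the values of all points that share the same quadkey."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for qk, value in rows:
        sums[qk] += value
        counts[qk] += 1
    return {qk: sums[qk] / counts[qk] for qk in sums}


# Made-up (quadkey, population) pairs: two points in one tile, one in another
toy = [
    ('30000000000001', 10.0),
    ('30000000000001', 14.0),
    ('30000000000002', 3.0),
]
print(mean_by_tile(toy))
```

The mean is unweighted: every point in a tile counts equally, regardless of how many points the tile contains.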


The same steps are automated in a loop for the remaining countries.

In [None]:
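# Assumption: `path` and `lista` are not defined in this excerpt. Here the Meta
# CSVs are assumed to sit in `cart`, with names like '<iso3>_general_<year>.csv'.
path = cart
lista = [f for f in os.listdir(path) if '_general_' in f]

for file in lista:
    df = pd.read_csv(path + '\\' + file)

    # the population column name is encoded in the file name
    country = file[0:3]
    anno = file[12:16]
    nome_var = country + '_general_' + anno

    # harmonise the column names used by the different source files
    diz = {nome_var: 'Population', 'population_2020': 'Population', 'latitude': 'LATNUM', 'longitude': 'LONGNUM', 'Lat': 'LATNUM', 'Lon': 'LONGNUM'}

    df = df.rename(columns = diz)

    print(f'getting quadkey for {country}...')
    df['quadkey'] = df.apply(lambda x: str(quadkey.from_geo((x['LATNUM'], x['LONGNUM']), 14)), axis=1)

    # note: this averages every numeric column (coordinates included) per tile
    print(f'grouping quadkeys for {country}...')
    df_group = df.groupby('quadkey', as_index = False).mean()

    print(f'exporting {country} to .csv...')
    nome = path + '\\' + country + '_pop_quad.csv'
    df_group.to_csv(nome, index = False)
    print(f'{country} dataset finished')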

Below we illustrate how to read a Kontur dataset for a sample country, in this case Denmark. Whereas Meta pinpoints each value with a pair of coordinates, Kontur uses a hexagonal grid, so each value refers to an area rather than a point.
Luckily, the hexagonal grid, named H3, has a Python package that returns the centre coordinates of each hexagon through its h3_to_geo() function. This makes it easy to assign each value to a quadkey 14 tile by combining h3_to_geo() and quadkey.from_geo().

In [26]:
import geopandas as gpd

den = gpd.read_file("C:\\Users\\Luca\\Downloads\\temp\\kontur_population_DK_20231101.gpkg")

In [27]:
den

Unnamed: 0,h3,population,geometry
0,881f35bb63fffff,1.0,"POLYGON ((923924.606 7559317.455, 923707.754 7..."
1,881f35bb5dfffff,4.0,"POLYGON ((926715.764 7563979.651, 926498.606 7..."
2,881f35bb5bfffff,23.0,"POLYGON ((929275.976 7564909.693, 929058.599 7..."
3,881f35bb59fffff,38.0,"POLYGON ((927786.353 7565098.193, 927569.090 7..."
4,881f35bb57fffff,6.0,"POLYGON ((928623.941 7562484.495, 928406.661 7..."
...,...,...,...
67262,8809926c85fffff,4.0,"POLYGON ((907873.466 7649182.867, 907656.217 7..."
67263,8809926c83fffff,3.0,"POLYGON ((908525.410 7651618.515, 908308.063 7..."
67264,8809926c81fffff,1.0,"POLYGON ((907449.670 7650491.725, 907232.429 7..."
67265,8809926c27fffff,2.0,"POLYGON ((903770.712 7639624.610, 903553.965 7..."


In [28]:
den.population.describe()

count    67267.000000
mean        87.888623
std        602.893352
min          1.000000
25%          7.000000
50%         15.000000
75%         36.000000
max      28072.000000
Name: population, dtype: float64

The Kontur dataset expresses population as the number of people in each hexagon, whereas we want people per square kilometre. To obtain the population density we therefore need both the population and the area of each hexagon. Fortunately, the h3 package comes in handy again: its built-in cell_area() function takes the H3 string identifier as input and, with unit='km^2', returns the area of the corresponding cell in square kilometres. The population density is then simply the ratio between the population and the area.

In [29]:
import time
import h3
start = time.time()
den['area'] = den.apply(lambda x: h3.cell_area(x['h3'], unit='km^2'), axis = 1)
end = time.time()
print(f'elapsed time: {end-start}s')

elapsed time: 0.915780782699585s


In [47]:
den['pop_dens'] = den.apply(lambda x: x['population'] / x['area'], axis = 1)
den.pop_dens.describe()

count    67267.000000
mean       147.668990
std       1000.822609
min          1.607817
25%         11.588974
50%         25.705845
75%         61.575524
max      46416.735189
Name: pop_dens, dtype: float64
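
As a quick sanity check, the relation density = population / area can be inverted using only the summary statistics printed above. The sketch assumes that the minimum density comes from a hexagon holding the minimum population of one person, which is reasonable as long as hexagon areas within the country differ by much less than a factor of two. The helper name population_density is hypothetical.

```python
def population_density(population, area_km2):
    """People per square kilometre for one hexagon."""
    if area_km2 <= 0:
        raise ValueError('area must be positive')
    return population / area_km2


min_population = 1.0    # people, from den.population.describe()
min_density = 1.607817  # people per km^2, from den.pop_dens.describe()

# density = population / area  =>  area = population / density
implied_area_km2 = min_population / min_density
print(f'implied hexagon area: {implied_area_km2:.3f} km^2')

# Feeding the implied area back in recovers the reported density
print(population_density(min_population, implied_area_km2))
```

The implied area is a little over 0.6 square kilometres per hexagon, which explains why the densities are roughly 1.6 times the raw population counts.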

It's time to assign the population density values to the quadkey 14 tiles. Combining the two functions mentioned above, h3_to_geo() and quadkey.from_geo(), we identify all the tiles in Denmark and then use the mean of the hexagons whose centres fall inside a tile as an estimate of the tile's value. There are 24,109 distinct quadkeys in Denmark.

In [48]:
from pyquadkey2 import quadkey
den['quadkey'] = den.apply(lambda x: str(quadkey.from_geo(h3.h3_to_geo(x['h3']), 14)), axis = 1)
den = den.groupby('quadkey', as_index = False).pop_dens.mean()
len(den)

24109

In [35]:
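# maps the ISO-2 codes in Kontur's file names to the lowercase 3-letter codes used in the output file names
# (note: Denmark's ISO 3166-1 alpha-3 code is 'dnk'; 'den' is kept here to match the outputs shown below)
iso2_to3 = {'AT': 'aut', 'BE': 'bel', 'BI': 'bdi', 'CH': 'che', 'CM': 'cmr', 'CZ': 'cze', 'DE': 'deu', 'DK': 'den',
            'ES': 'esp', 'FI': 'fin', 'FR': 'fra', 'GR': 'grc', 'IN': 'ind', 'IT': 'ita', 'MM': 'mmr', 'NL': 'nld',
            'PK': 'pak', 'PT': 'prt', 'SE': 'swe', 'SZ': 'swz'}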
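
Kontur file names such as kontur_population_DK_20231101.gpkg carry the two-letter country code in a fixed position. As a minimal sketch, assuming every file follows that 'kontur_population_<ISO2>_<date>.gpkg' pattern, the code can be extracted with a regular expression and checked against the mapping before any heavy processing starts. The helper names and the reduced mapping below are hypothetical, for illustration only.

```python
import re

# Assumed pattern: 'kontur_population_<ISO2>_<YYYYMMDD>.gpkg'
KONTUR_NAME = re.compile(r'^kontur_population_(?P<iso2>[A-Z]{2})_(?P<date>\d{8})\.gpkg$')


def kontur_iso2(file_name):
    """Return the two-letter country code embedded in a Kontur file name."""
    match = KONTUR_NAME.match(file_name)
    if match is None:
        raise ValueError(f'unexpected file name: {file_name}')
    return match['iso2']


def missing_codes(file_names, mapping):
    """List the country codes found in file names but absent from the mapping."""
    codes = [kontur_iso2(name) for name in file_names]
    return sorted({code for code in codes if code not in mapping})


# Reduced mapping and file list, for illustration only
demo_mapping = {'AT': 'aut', 'IT': 'ita'}
demo_files = [
    'kontur_population_AT_20231101.gpkg',
    'kontur_population_GB_20231101.gpkg',
]
print(kontur_iso2('kontur_population_DK_20231101.gpkg'))
print(missing_codes(demo_files, demo_mapping))
```

Running a check like missing_codes() first surfaces any country lacking a dictionary entry, instead of hitting a KeyError partway through a long loop.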

We can now iterate over all the other countries.

In [42]:
cart = 'C:\\Users\\Luca\\Downloads\\temp'
sav = "C:\\Users\\Luca\\Downloads\\RWI\\pop_dens"

for file in os.listdir(cart):
    # skip anything that is not a Kontur GeoPackage
    if not file.endswith('.gpkg'):
        continue

    # file names look like 'kontur_population_DK_20231101.gpkg'
    country = iso2_to3[file[-16:-14]]

    df = gpd.read_file(f'{cart}\\{file}')
    print(df.shape)
    # hexagon area in km^2, then people per km^2
    df['area'] = df.apply(lambda x: h3.cell_area(x['h3'], unit='km^2'), axis = 1)
    df['pop_dens'] = df.apply(lambda x: x['population'] / x['area'], axis = 1)

    print(f'{country}: {df.pop_dens.min(), df.pop_dens.mean(), df.pop_dens.max()}')

    # tile containing each hexagon's centre, then mean density per tile
    df['quadkey'] = df.apply(lambda x: str(quadkey.from_geo(h3.h3_to_geo(x['h3']), 14)), axis = 1)
    df = df.groupby('quadkey', as_index = False).pop_dens.mean()
    print(f'{len(df)} unique quadkeys')
    df.to_csv(f'{sav}\\{country}_pop_quad.csv', index = False)
    print('-'*24)

(86915, 3)
aut: (1.3368188009138215, 142.38129697386236, 11732.329906647901)
29398 unique quadkeys
------------------------
(40991, 3)
bel: (1.5144828642426102, 452.0052608610119, 33111.70124367208)
12684 unique quadkeys
------------------------
(27425, 3)
bdi: (1.165287170228583, 570.5466244415105, 45995.57410169283)
4291 unique quadkeys
------------------------
(41959, 3)
che: (1.3738461960656514, 301.4360284057041, 20535.589233428007)
13193 unique quadkeys
------------------------
(107455, 3)
cmr: (1.2876812531273332, 395.8360657595759, 40920.43850694487)
31451 unique quadkeys
------------------------
(80168, 3)
cze: (1.3810865179537741, 187.34734752643266, 11668.423193623279)
30266 unique quadkeys
------------------------
(363550, 3)
deu: (1.3765224166689387, 346.4506141655887, 29838.79029955791)
137340 unique quadkeys
------------------------
(67267, 3)
den: (1.6078168953564669, 147.66899041616549, 46416.73518910256)
24109 unique quadkeys
------------------------
(286702, 3)
esp: 

Whoops, looks like we forgot to add Great Britain! Luckily, fixing that only takes a few minutes.

In [52]:
import geopandas as gpd
import h3
from pyquadkey2 import quadkey

cart = 'C:\\Users\\Luca\\Downloads\\temp'
sav = "C:\\Users\\Luca\\Downloads\\RWI\\pop_dens"

df = gpd.read_file(f'{cart}\\kontur_population_GB_20231101.gpkg')
print(df.shape)
df['area'] = df.apply(lambda x: h3.cell_area(x['h3'], unit='km^2'), axis = 1)
df['pop_dens'] = df.apply(lambda x: x['population'] / x['area'], axis = 1)

print(f'gbr: {df.pop_dens.min(), df.pop_dens.mean(), df.pop_dens.max()}')

df['quadkey'] = df.apply(lambda x: str(quadkey.from_geo(h3.h3_to_geo(x['h3']), 14)), axis = 1)
df = df.groupby('quadkey', as_index = False).pop_dens.mean()
print(f'{len(df)} unique quadkeys')
df.to_csv(f'{sav}\\gbr_pop_quad.csv', index = False)

(224522, 3)
gbr: (1.381416194904864, 457.96517651696405, 26184.621513304402)
95738 unique quadkeys


We now use the datasets to extract a list of all the quadkey 14 tiles we will use in the study.

In [None]:
import os
import pandas as pd

sav = "C:\\Users\\Luca\\Downloads\\RWI\\pop_dens"
list_quads = []

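for file in os.listdir(sav):
    # read quadkeys as strings: they are digit strings and may start with '0'
    temp = pd.read_csv(f'{sav}\\{file}', dtype={'quadkey': str})
    # the exported column is named 'quadkey'
    qs = list(temp.quadkey.unique())

    list_quads += qs

# drop duplicates across countries
list_quads = list(set(list_quads))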

with open('quad_paesi2.txt', 'w') as file:
    file.write(','.join(list_quads))
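
Quadkeys are digit strings that may start with 0, so they should be handled as text from end to end. The sketch below, which uses made-up quadkeys and hypothetical helper names, writes a comma-separated list in the same format as quad_paesi2.txt to a temporary file, reads it back, and shows what happens if a quadkey is converted to an integer along the way.

```python
import os
import tempfile


def write_quadkeys(path, quadkeys):
    """Write the unique quadkeys to a single comma-separated line."""
    with open(path, 'w') as fh:
        fh.write(','.join(sorted(set(quadkeys))))


def read_quadkeys(path):
    """Read the quadkeys back as a list of strings."""
    with open(path) as fh:
        text = fh.read().strip()
    return text.split(',') if text else []


# Made-up level-14 quadkeys, one of them duplicated
sample = ['03131313111232', '12020333222101', '03131313111232']

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'quad_demo.txt')
    write_quadkeys(target, sample)
    restored = read_quadkeys(target)

print(restored)
# Converting to an integer silently drops the leading zero
print(str(int('03131313111232')))
```

Sorting before writing is a design choice of this sketch: it makes the output file reproducible between runs, which a plain set does not guarantee.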