This notebook illustrate the harvesting process for the Precipitation variable. The data is derived from the GPM IMERG product, a model estimating precipitation data at a 30-minute rate using satellite imagery combined with ground calibrations. It is property of NASA / JAXA. The data can be found on NASA's API GES DISC https://disc.gsfc.nasa.gov/.

We decided to estimate the precipitation in quadkey 14 tiles as an the yearly-averaged precipitation from 2023, the last full year with available data. We do so in order to not have an estimation biased by the period of retrieval of data. We used the monthly-averaged datasets to retrieve the data.
Below is illustrated the reading process of the precipitation monthly global data for January, stored in NetCDF4 format.

In [1]:
import netCDF4 as nc
from netCDF4 import Dataset
import matplotlib.pyplot as plt
import time
import numpy as np
import numpy.ma as ma
import pandas as pd
%matplotlib inline

# Percorso al file NetCDF
file_path = "C:\\Users\\Luca\\Downloads\\precipitation\\3B-MO.MS.MRG.3IMERG.20230101-S000000-E235959.01.V07B.HDF5.nc4"

In [2]:
f = Dataset(file_path, 'r')

The precipitation variable is stored similar to .tif files: its location is stored in its position in the dataset. Contrary to the .tif, though, it reads column-for-column and from the south-most point (starting from the west-most longitude, it associates to all the 1800 latitude values, from -90 to 90 degree, then starts over to the next longitude).

In [3]:
for item in f.dimensions:
    print(f.dimensions[item].name, f.dimensions[item].size)

time 1
lon 3600
lat 1800


In [4]:
vars = f.variables.keys()

for item in vars:
    print(item, f[item].dimensions, f[item].shape)

precipitation ('time', 'lon', 'lat') (1, 3600, 1800)


In [5]:
data = f.variables['precipitation'][:]
data.shape

(1, 3600, 1800)

In [6]:
lat_dim = 1800
lon_dim = 3600

In [7]:
x = np.round(np.arange(-180, 180, 0.1), 1)
y =  np.round(np.arange(-90, 90, 0.1), 1)
len(x), len(y), len(set(x)), len(set(y))

(3600, 1800, 3600, 1800)

We use the precipitation value's index to determine its latitude and longitude; its important to know that some values at the north and south-most latitudes are missing and thus masked in the array.

In [8]:
df = []
start = time.time()
for lon_index in range(lon_dim):
    for lat_index in range(lat_dim):
        value = data[0, lon_index, lat_index]
        if ma.is_masked(value):
            value = np.nan
        df.append({'lat': y[lat_index], 'lon': x[lon_index], 'precipitation_01': value})

df = pd.DataFrame(df)
end = time.time()
print(f'operation finished in {end-start}s')
df

operation finished in 62.1261100769043s


Unnamed: 0,lat,lon,precipitation_01
0,-90.0,-180.0,
1,-89.9,-180.0,
2,-89.8,-180.0,
3,-89.7,-180.0,
4,-89.6,-180.0,
...,...,...,...
6479995,89.5,179.9,
6479996,89.6,179.9,
6479997,89.7,179.9,
6479998,89.8,179.9,


In [9]:
df[df.precipitation_01 == df.precipitation_01.max()]

Unnamed: 0,lat,lon,precipitation_01
5498204,10.4,125.4,1.906


In [10]:
df.describe()

Unnamed: 0,lat,lon,precipitation_01
count,6480000.0,6480000.0,6438567.0
mean,-0.05,-0.05,0.08737933
std,51.96152,103.9231,0.1249898
min,-90.0,-180.0,0.0
25%,-45.025,-90.025,0.004
50%,-0.05,-0.05,0.039
75%,44.925,89.925,0.118
max,89.9,179.9,1.906


We want to map each point to a quadkey 14 tile via pyquadkey2's quadkey.from_geo() function, but the function doesn't work for latitudes higher than 85 or lower than -85; dropping them from the dataset has no impact on the information we have since they have all missing values.

In [11]:
df = df[(df.lat >= -85) & (df.lat <= 85)]
len(df)

6123600

In [12]:
f.close()

In [13]:
cart = "C:\\Users\\Luca\\Downloads\\RWI\\precipitation"
df.to_csv(f'{cart}\\precipitation_01.csv', index = False)

We automate the process for all the other datasets relating to the other months.

In [15]:
import os
repo = "C:\\Users\\Luca\\Downloads\\precipitation"
cart = "C:\\Users\\Luca\\Downloads\\RWI\\precipitation"

for file in os.listdir(repo)[1:]:
    f = Dataset(f'{repo}\\{file}', 'r')
    mese = file[-16:-14]

    print(f'reading {mese} file...')
    for item in f.dimensions:
        print(f.dimensions[item].name, f.dimensions[item].size)

    vars = f.variables.keys()

    for item in vars:
        print(item, f[item].dimensions, f[item].shape)

    data = f.variables['precipitation'][:]

    print('creating dataframe...')
    nome_prec = f'precipitation_{mese}'
    df = []
    start = time.time()
    for lon_index in range(lon_dim):
        for lat_index in range(lat_dim):
            value = data[0, lon_index, lat_index]
            if ma.is_masked(value):
                value = np.nan
            df.append({'lat': y[lat_index], 'lon': x[lon_index], nome_prec: value})
    
    df = pd.DataFrame(df)
    end = time.time()
    print(f'operation finished in {end-start}s')
    df

    f.close()
    print('restricting borders...')
    df = df.loc[(df.lat >= -85) & (df.lat <= 85), :]
    print(df.shape)

    print(f'missing values:')
    print(df.isnull().sum())

    print(f'min/mean/median/max: {df[nome_prec].min(), df[nome_prec].mean(), df[nome_prec].median(), df[nome_prec].max()}')

    print('exporting dataframe...')
    df.to_csv(f'{cart}\\precipitation_{mese}.csv', index = False)
    print(f'iteration {mese} finished')
    print('-'*40)

reading 02 file...
time 1
lon 3600
lat 1800
precipitation ('time', 'lon', 'lat') (1, 3600, 1800)
creating dataframe...
operation finished in 35.86643576622009s
restricting borders...
(6123600, 3)
missing values:
lat                 0
lon                 0
precipitation_02    0
dtype: int64
min/mean/median/max: (0.0, 0.095133258875561, 0.0430000014603138, 2.180999994277954)
exporting dataframe...
iteration 02 finished
----------------------------------------
reading 03 file...
time 1
lon 3600
lat 1800
precipitation ('time', 'lon', 'lat') (1, 3600, 1800)
creating dataframe...
operation finished in 28.357120037078857s
restricting borders...
(6123600, 3)
missing values:
lat                 0
lon                 0
precipitation_03    0
dtype: int64
min/mean/median/max: (0.0, 0.09273991349912154, 0.05700000375509262, 2.0190000534057617)
exporting dataframe...
iteration 03 finished
----------------------------------------
reading 04 file...
time 1
lon 3600
lat 1800
precipitation ('time', 'lon

We create the final dataset containing all the precipitation values for the 12 months of 2023.

In [25]:
for file in os.listdir(cart)[:-1]:
    temp = pd.read_csv(f'{cart}\\{file}')

    df = df.merge(temp, on = ['lon', 'lat'])
    print(df.shape)

(6123600, 4)
(6123600, 5)
(6123600, 6)
(6123600, 7)
(6123600, 8)
(6123600, 9)
(6123600, 10)
(6123600, 11)
(6123600, 12)
(6123600, 13)
(6123600, 14)


In [26]:
df

Unnamed: 0,lat,lon,precipitation_12,precipitation_01,precipitation_02,precipitation_03,precipitation_04,precipitation_05,precipitation_06,precipitation_07,precipitation_08,precipitation_09,precipitation_10,precipitation_11
0,-85.0,-180.0,0.000,0.0,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
1,-84.9,-180.0,0.000,0.0,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
2,-84.8,-180.0,0.000,0.0,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
3,-84.7,-180.0,0.000,0.0,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
4,-84.6,-180.0,0.000,0.0,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6123595,84.6,179.9,0.001,0.0,0.0,0.002,0.005,0.012,0.021,0.052,0.045,0.077,0.005,0.002
6123596,84.7,179.9,0.001,0.0,0.0,0.002,0.004,0.012,0.020,0.051,0.050,0.073,0.004,0.001
6123597,84.8,179.9,0.001,0.0,0.0,0.002,0.003,0.011,0.016,0.052,0.045,0.069,0.004,0.002
6123598,84.9,179.9,0.001,0.0,0.0,0.002,0.002,0.010,0.015,0.052,0.043,0.068,0.003,0.002


In [35]:
df[df.precipitation_06 == df.precipitation_06.max()]

Unnamed: 0,lat,lon,precipitation_12,precipitation_01,precipitation_02,precipitation_03,precipitation_04,precipitation_05,precipitation_06,precipitation_07,precipitation_08,precipitation_09,precipitation_10,precipitation_11
4185515,20.5,66.0,0.0,0.0,0.0,0.041,0.027,0.0,2.958,0.12,0.009,0.105,0.0,0.0


We estimate the yearly precipitation as an average of the monthly precipitations.

In [36]:
df['precipitation_year'] = df[list(df.columns)[2:]].mean(axis = 1)
df.precipitation_year.describe()

count    6.123600e+06
mean     9.697821e-02
std      9.227351e-02
min      0.000000e+00
25%      2.658333e-02
50%      7.591667e-02
75%      1.396667e-01
max      1.147583e+00
Name: precipitation_year, dtype: float64

In [37]:
df[['lat', 'lon', 'precipitation_year']].to_csv(f'{cart}\\precipitation_year2023.csv', index = False)

We now transpose to quadkey 14 tiles. Important notice is that the resolution of the dataset is far lower than required, resulting into lots of tiles we would use in the study missing. We treat the estimation of missing values in a separate notebook called 'Precipitation missing values treatment'.

In [2]:
import pandas as pd

df = pd.read_csv("C:\\Users\\Luca\\Downloads\\RWI\\precipitation\\precipitation_year2023.csv")
df

Unnamed: 0,lat,lon,precipitation_year
0,-85.0,-180.0,0.000000
1,-84.9,-180.0,0.000000
2,-84.8,-180.0,0.000000
3,-84.7,-180.0,0.000000
4,-84.6,-180.0,0.000000
...,...,...,...
6123595,84.6,179.9,0.018500
6123596,84.7,179.9,0.018167
6123597,84.8,179.9,0.017083
6123598,84.9,179.9,0.016500


We change the precipitation measure from mm/hr to mm/day

In [3]:
df.precipitation_year = df.precipitation_year * 24
df.describe()

Unnamed: 0,lat,lon,precipitation_year
count,6123600.0,6123600.0,6123600.0
mean,-1.152538e-16,-0.05,2.327477
std,49.10364,103.9231,2.214564
min,-85.0,-180.0,0.0
25%,-42.5,-90.025,0.638
50%,-0.0,-0.05,1.822
75%,42.5,89.925,3.352
max,85.0,179.9,27.542


In [4]:
from pyquadkey2 import quadkey
import time
start = time.time()
df['quadkey'] = df.apply(lambda x: str(quadkey.from_geo((x['lat'], x['lon']), 14)), axis = 1)
end = time.time()
print(f'elapsed time {end-start}s')

elapsed time 229.61092233657837s


In [5]:
df[['quadkey', 'precipitation_year']].to_csv("C:\\Users\\Luca\\Downloads\\RWI\\precipitation\\precipitation_quadkey.csv", index = False)

In [14]:
len(df.quadkey.unique())

6123600

In [1]:
import pandas as pd

df = pd.read_csv("C:\\Users\\Luca\\Downloads\\RWI\\precipitation\\precipitation_quadkey.csv")
df

Unnamed: 0,quadkey,precipitation_year
0,22222222200202,0.000
1,22222220220002,0.000
2,22222202222220,0.000
3,22222202002202,0.000
4,22222200022200,0.000
...,...,...
6123595,11111133311033,0.444
6123596,11111131331031,0.436
6123597,11111131111013,0.410
6123598,11111113113231,0.396


We keep the quadkey tiles that are identified in the dataset.

In [6]:
with open('quad_paesi2.txt', 'r') as file:
    list_quads = file.read()
    list_quads = list_quads.split(',')

In [7]:
df = df.astype({'quadkey': 'str'})

In [10]:
cart = "C:\\Users\\Luca\\Downloads\\RWI\\precipitation"
pronti = df[df.quadkey.isin(list_quads)]
pronti.to_csv(f'{cart}\\precipitation_utili14.csv', index = False)
len(pronti)

130865

We imput the missing precipitations values by the value assigned to its parent quadkey 13 tile. We construct a dataset of missing tiles, assign to each tile its parent tile and the parent tile's value. Notice that there are still a lot of missing values. We will repeat the same estimation up to the quadkey 11 tile.

In [11]:
missing = list(set(list_quads) - set(pronti.quadkey.unique()))
len(missing)

3708354

In [12]:
quads_pres = list(df.quadkey.unique())
quads_pres13 = [q[:-1] for q in quads_pres]
quads_pres13 = list(set(quads_pres13))
len(quads_pres13)

6123600

In [17]:
missing13 = [m[:-1] for m in missing]
len(missing13), len(set(missing13))

(3708354, 1278136)

In [21]:
missing13 = list(set(missing13))
len(missing13)

1278136

In [25]:
len(set(missing13) - set(quads_pres13))

1109223

In [28]:
df['quadkey13'] = quads_pres13
pronti = df[(df.quadkey13.isin(missing13))]
pronti

Unnamed: 0,lat,lon,precipitation_year,quadkey,quadkey13
118,-73.2,-180.0,0.396,22002202222022,1202221113311
139,-71.1,-180.0,1.356,22002002000022,3001223021313
168,-68.2,-180.0,2.566,22000022000222,1222301033122
183,-66.7,-180.0,2.936,22000000020202,1221012011321
260,-59.0,-180.0,3.468,20220200020000,1231212321222
...,...,...,...,...,...
6123322,57.3,179.9,2.234,13113331111031,1222112100131
6123361,61.2,179.9,1.090,13113111311231,1222022021023
6123397,64.8,179.9,0.798,13111131333213,3001220210302
6123533,78.4,179.9,0.382,11311131331211,1202120321330


In [30]:
df13 = []

for q in missing:
    df13.append({'quadkey': q, 'quadkey13': q[:-1]})

df13 = pd.DataFrame(df13)

In [32]:
df13 = df13[df13.quadkey13.isin(list(pronti.quadkey13.unique()))]
df13

Unnamed: 0,quadkey,quadkey13
3,12233032223313,1223303222331
36,12312231323002,1231223132300
47,12223031202223,1222303120222
50,12022001011110,1202200101111
54,12233010221332,1223301022133
...,...,...
3708328,12313231213101,1231323121310
3708335,12312232132331,1231223213233
3708338,30031002020333,3003100202033
3708344,12312211021323,1231221102132


In [35]:
df13 = df13.merge(pronti[['quadkey13', 'precipitation_year']], on = 'quadkey13', how = 'left')
df13

Unnamed: 0,quadkey,quadkey13,precipitation_year
0,12233032223313,1223303222331,7.040
1,12312231323002,1231223132300,0.746
2,12223031202223,1222303120222,0.778
3,12022001011110,1202200101111,0.572
4,12233010221332,1223301022133,0.172
...,...,...,...
392239,12313231213101,1231323121310,2.870
392240,12312232132331,1231223213233,3.346
392241,30031002020333,3003100202033,3.102
392242,12312211021323,1231221102132,5.150


In [38]:
df13.to_csv(f'{cart}\\precipitation_utili13.csv', index = False)

Construction of the approsimation of missing quadkey 13 tile using their parent 12 tile's value.

In [39]:
quads_pres12 = [q[:-1] for q in quads_pres13]
quads_pres12 = list(set(quads_pres12))
len(quads_pres12)

6123600

In [41]:
miss = list(set(missing13) - set(pronti.quadkey13.unique()))
len(miss)

1109223

In [42]:
missing12 = [m[:-1] for m in missing13]
missing12 = list(set(missing12))
len(missing12)

404386

In [43]:
df['quadkey12'] = quads_pres12
pronti = df[(df.quadkey12.isin(missing12))]
pronti

Unnamed: 0,lat,lon,precipitation_year,quadkey,quadkey13,quadkey12
15,-83.5,-180.0,0.004,22220200220222,0233011102323,300120111331
106,-74.4,-180.0,0.172,22020000222222,3001103221112,210010331100
132,-71.8,-180.0,1.028,22002020200222,3100311322312,123102302010
138,-71.2,-180.0,1.304,22002002020002,1333322112303,132200333133
150,-70.0,-180.0,1.982,22000220202202,2320332002222,300011201200
...,...,...,...,...,...,...
6123490,74.1,179.9,0.528,11313333331031,2000301300303,300112233113
6123509,76.0,179.9,0.422,11313131311033,0013200202020,122202112200
6123545,79.6,179.9,0.368,11133331131231,1320133001001,300012032133
6123557,80.8,179.9,0.396,11133113131211,2032232303000,122221303303


In [45]:
df12 = []

for q in missing:
    df12.append({'quadkey': q, 'quadkey12': q[:-2]})

df12 = pd.DataFrame(df12)

In [46]:
df12 = df12[df12.quadkey12.isin(list(pronti.quadkey12.unique()))]
df12

Unnamed: 0,quadkey,quadkey12
1,12021022132111,120210221321
3,12233032223313,122330322233
11,12303131112231,123031311122
14,13230333320203,132303333202
18,30001030322303,300010303223
...,...,...
3708344,12312211021323,123122110213
3708345,12310030021123,123100300211
3708350,21001101111122,210011011111
3708352,30012330211323,300123302113


In [47]:
df12 = df12.merge(pronti[['quadkey12', 'precipitation_year']], on = 'quadkey12', how = 'left')
df12

Unnamed: 0,quadkey,quadkey12,precipitation_year
0,12021022132111,120210221321,1.060000
1,12233032223313,122330322233,0.422000
2,12303131112231,123031311122,2.030000
3,13230333320203,132303333202,5.784000
4,30001030322303,300010303223,2.508000
...,...,...,...
1953494,12312211021323,123122110213,0.216000
1953495,12310030021123,123100300211,0.926000
1953496,21001101111122,210011011111,0.644000
1953497,30012330211323,300123302113,4.432000


In [51]:
df12[['quadkey', 'precipitation_year']].to_csv(f'{cart}\\precipitation_utili12.csv', index = False)

Construction of the approssimation of missing quadkey 12 tiles by their parent 11 tile's value.

In [52]:
quads_pres11 = [q[:-1] for q in quads_pres12]
quads_pres11 = list(set(quads_pres11))
len(quads_pres11)

2772992

In [54]:
miss2 = list(set(missing12) - set(pronti.quadkey12.unique()))
len(miss2)

179599

In [55]:
missing11 = [m[:-1] for m in missing12]
missing11 = list(set(missing11))
len(missing11)

122051

In [56]:
df['quadkey11'] = df.apply(lambda x: x['quadkey12'][:-1], axis = 1)
pronti = df[(df.quadkey11.isin(missing11))]
pronti

Unnamed: 0,lat,lon,precipitation_year,quadkey,quadkey13,quadkey12,quadkey11
15,-83.5,-180.0,0.004,22220200220222,0233011102323,300120111331,30012011133
23,-82.7,-180.0,0.000,22220000002000,0232311010311,210031100230,21003110023
106,-74.4,-180.0,0.172,22020000222222,3001103221112,210010331100,21001033110
107,-74.3,-180.0,0.174,22020000202220,3112021001013,300101010202,30010101020
132,-71.8,-180.0,1.028,22002020200222,3100311322312,123102302010,12310230201
...,...,...,...,...,...,...,...
6123490,74.1,179.9,0.528,11313333331031,2000301300303,300112233113,30011223311
6123509,76.0,179.9,0.422,11313131311033,0013200202020,122202112200,12220211220
6123545,79.6,179.9,0.368,11133331131231,1320133001001,300012032133,30001203213
6123557,80.8,179.9,0.396,11133113131211,2032232303000,122221303303,12222130330


In [59]:
pronti = pronti.groupby('quadkey11', as_index = False)['precipitation_year'].mean()
pronti

Unnamed: 0,quadkey11,precipitation_year
0,10223032121,1.1590
1,10223032130,1.5800
2,10223032131,4.5210
3,10223032212,1.2280
4,10223032213,2.6300
...,...,...
100832,31011230010,0.2060
100833,31011230011,1.9970
100834,31011230012,2.0070
100835,31011230020,1.7095


In [57]:
df11 = []

for q in missing:
    df11.append({'quadkey': q, 'quadkey11': q[:-3]})

df11 = pd.DataFrame(df11)

In [60]:
df11 = df11[df11.quadkey11.isin(list(pronti.quadkey11.unique()))]
df11

Unnamed: 0,quadkey,quadkey11
0,12200123321232,12200123321
1,12021022132111,12021022132
2,12200300230233,12200300230
3,12233032223313,12233032223
9,13202212300331,13202212300
...,...,...
3708349,12200210101123,12200210101
3708350,21001101111122,21001101111
3708351,21003102011232,21003102011
3708352,30012330211323,30012330211


In [61]:
df11 = df11.merge(pronti[['quadkey11', 'precipitation_year']], on = 'quadkey11', how = 'left')
df11

Unnamed: 0,quadkey,quadkey11,precipitation_year
0,12200123321232,12200123321,2.4260
1,12021022132111,12021022132,1.0600
2,12200300230233,12200300230,1.7520
3,12233032223313,12233032223,2.3570
4,13202212300331,13202212300,2.5720
...,...,...,...
3003459,12200210101123,12200210101,1.6320
3003460,21001101111122,21001101111,0.5570
3003461,21003102011232,21003102011,1.6780
3003462,30012330211323,30012330211,3.3760


In [62]:
df11[['quadkey', 'precipitation_year']].to_csv(f'{cart}\\precipitation_utili11.csv', index = False)

In [63]:
len(set(missing11) - set(pronti.quadkey11.unique()))

21214