To protect respondents' privacy, the DHS survey randomly displaces each cluster's true location by around 2 km in any direction for urban areas and around 5 km in any direction for rural areas.
As a result, a cluster's true location may not lie within the zoom-14 quadkey tile that its reported coordinates fall in, and our calculations must account for this.
This notebook walks through the spatial join defined in the original study: for urban areas, every covariate is the average over the 2x2 grid of tiles closest to the reported cluster centroid, whereas for rural areas the grid is 4x4.
Here we only build a dataset of clusters together with the list of quadkeys whose values each cluster needs.
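
For readers unfamiliar with quadkeys, the following is a minimal, library-free sketch of how a latitude/longitude pair maps to a zoom-14 tile and its quadkey, assuming the standard Web Mercator (Bing Maps style) tile scheme. The helper names `latlon_to_tile`, `tile_to_quadkey` and `tile_width_km` are illustrative and are not part of pyquadkey2; the notebook itself relies on pyquadkey2 for these conversions.

```python
import math

def latlon_to_tile(lat, lon, zoom):
    """Return the (column, row) of the Web Mercator tile containing the point."""
    n = 2 ** zoom
    col = int((lon + 180.0) / 360.0 * n)
    sin_lat = math.sin(math.radians(lat))
    row = int((0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * n)
    # clamp to the valid range in case the point sits on the map edge
    return min(max(col, 0), n - 1), min(max(row, 0), n - 1)

def tile_to_quadkey(col, row, zoom):
    """Interleave the bits of column and row, most significant bit first."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if col & mask else 0) + (2 if row & mask else 0)
        digits.append(str(digit))
    return "".join(digits)

def tile_width_km(lat, zoom):
    """Approximate east-west extent of one tile at the given latitude."""
    equator_km = 40075.017  # Earth's equatorial circumference
    return equator_km * math.cos(math.radians(lat)) / 2 ** zoom

lat, lon = 40.7104396765, 19.9466514633  # first cluster in the dataset (AL1)
col, row = latlon_to_tile(lat, lon, 14)
print(col, row, tile_to_quadkey(col, row, 14))
print(round(tile_width_km(lat, 14), 2), "km")
```

The last line prints the approximate tile width, which can be compared directly with the displacement radii mentioned above.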

In [1]:
import pandas as pd

train = pd.read_csv('WI_per_clusters.csv')
train

Unnamed: 0,country_cluster,wealth index,country,DHSCLUST,URBAN_RURA,LATNUM,LONGNUM,ALT
0,AL1,0.07209,AL,1,U,40.710440,19.946651,33.8
1,AL10,0.05673,AL,10,U,40.703271,19.976494,94.9
2,AL100,-0.06155,AL,100,R,41.645007,19.925487,276.5
3,AL101,-0.06966,AL,101,R,41.664977,20.045771,250.1
4,AL102,-0.08978,AL,102,R,41.529707,20.064258,245.4
...,...,...,...,...,...,...,...,...
68179,ZW95,0.79165,ZW,95,U,-18.341095,29.890007,1151.0
68180,ZW96,1.03499,ZW,96,U,-21.357135,30.645874,504.0
68181,ZW97,0.88336,ZW,97,U,-19.452031,29.773865,1412.0
68182,ZW98,0.93091,ZW,98,U,-20.131983,28.513421,1388.0


In [2]:
# value_counts() sorts by frequency, so rural ('R') comes first here
r, u = train.URBAN_RURA.value_counts()
(u/len(train), r/len(train))

(0.3727560718057022, 0.6272439281942978)

In [3]:
train.country.value_counts()

country
IA    30170
CO     4866
EG     1836
KE     1691
NG     1389
PH     1247
MR     1200
HN     1148
PE     1132
BO     1000
JO      970
GU      858
MW      850
ZA      746
AL      715
UG      696
MD      650
TZ      628
AO      625
TD      624
GH      618
MZ      617
SL      576
PK      561
BJ      555
BU      554
NM      550
ZM      545
CI      539
CD      536
BF      514
RW      500
NP      476
NI      476
TL      455
HT      450
MM      441
CM      430
GN      401
ZW      400
MB      400
LS      399
GA      390
TJ      366
ML      345
TG      330
LB      325
GY      325
KY      316
AM      313
ET      305
GM      280
SZ      275
KM      252
SN      214
DR      114
Name: count, dtype: int64

In [4]:
# keep only the columns needed for the spatial join
tenere = ['country_cluster', 'URBAN_RURA', 'LATNUM', 'LONGNUM']
train = train[tenere]

Below we walk through how the closest 2x2 grid is defined for the first urban cluster in the dataset.
We use pyquadkey2's nearby() function to list all the tiles adjacent to the cluster's reported tile. Then we use the to_geo() function to retrieve the centroid coordinates of each nearby tile.
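
To make the two steps concrete, here is a small sketch of what "nearby tiles" and "tile centroid" mean in plain tile arithmetic, assuming the standard Web Mercator tile scheme. The functions `neighbours` and `tile_center` are hypothetical stand-ins written for illustration; they are not pyquadkey2 calls, and they ignore wrap-around at the antimeridian and the poles.

```python
import math

def tile_center(col, row, zoom):
    """Latitude/longitude of the tile's centre in Web Mercator."""
    n = 2 ** zoom
    lon = (col + 0.5) / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * (row + 0.5) / n))))
    return lat, lon

def neighbours(col, row, radius=1):
    """All tiles within `radius` steps of (col, row), the tile itself excluded."""
    return [(col + dc, row + dr)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if (dc, dr) != (0, 0)]

col, row, zoom = 9099, 6160, 14  # tile of the first urban cluster
for c, r in neighbours(col, row):
    print((c, r), tile_center(c, r, zoom))
```

With `radius=1` this yields the 8 tiles surrounding the cluster's tile; `radius=2` yields 24.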

In [79]:
from pyquadkey2 import quadkey
x = train.loc[0, 'LONGNUM']
y = train.loc[0, 'LATNUM']

quad1 = quadkey.from_geo((y,x),14)
plausibles = quad1.nearby()

print(quad1, plausibles)

12201110021011 ['12201110021102', '12201110021012', '12201110021011', '12201110021100', '12201110021010', '12201110003233', '12201110003322', '12201110021013', '12201110003232']


In [80]:
from pyquadkey2.quadkey import TileAnchor

coord = {}

for qu in plausibles:
    ye, xe = quadkey.from_str(qu).to_geo(anchor = TileAnchor.ANCHOR_CENTER)
    if qu != str(quad1):
        coord[qu] = (ye,xe)

In [81]:
y,x

(40.7104396765, 19.9466514633)

In [82]:
coord

{'12201110021102': (40.688969037624, 19.962158203125),
 '12201110021012': (40.688969037624, 19.918212890625),
 '12201110021100': (40.705627938205, 19.962158203125),
 '12201110021010': (40.705627938205, 19.918212890625),
 '12201110003233': (40.722282672831, 19.940185546875),
 '12201110003322': (40.722282672831, 19.962158203125),
 '12201110021013': (40.688969037624, 19.940185546875),
 '12201110003232': (40.722282672831, 19.918212890625)}

Note that we cannot simply take the three tiles nearest to the cluster's reported location, because they can form a 'T' shape rather than the square we want. Instead, we check whether the point is closer to its northern or its southern neighbour, and likewise for east versus west. These two tiles are the two nearest ones. The third-nearest tile, however, usually lies on the opposite side of the nearest one, which would not give a square, so the remaining tile we take is the diagonal one, shifted by one tile both vertically and horizontally. For example, if the two nearest tiles are to the west and to the north of the cluster's tile, we add the tile that is one step north and one step west, which guarantees a 2x2 square.
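
The rule described above can also be sketched without any library by looking at the fractional position of the point inside its tile, assuming the standard Web Mercator tile scheme in which rows grow southwards. `fractional_tile` and `square_2x2` are hypothetical helpers; this is an independent illustration, not the notebook's implementation, so its tile choices may differ from the outputs recorded in the cells below.

```python
import math

def fractional_tile(lat, lon, zoom):
    """Tile coordinates of the point as floats (column, row)."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    sin_lat = math.sin(math.radians(lat))
    y = (0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * n
    return x, y

def square_2x2(lat, lon, zoom=14):
    """Own tile plus the horizontal, vertical and diagonal neighbour nearest to the point."""
    x, y = fractional_tile(lat, lon, zoom)
    col, row = int(x), int(y)
    # east half of the tile -> eastern neighbour, otherwise western
    dc = 1 if x - col >= 0.5 else -1
    # rows grow southwards: lower half -> southern neighbour, otherwise northern
    dr = 1 if y - row >= 0.5 else -1
    return [(col, row), (col + dc, row), (col, row + dr), (col + dc, row + dr)]

print(square_2x2(40.7104396765, 19.9466514633))
```

Because the fourth tile is always the diagonal one, the result is a square by construction, whatever the point's position.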

In [83]:
# work out whether to move 'right' or 'left' and 'up' or 'down'
def calc(c1, c2):
    # return (absolute offset, sign): the sign is +1 when c1 >= c2, else -1
    direction = 1 if c1 >= c2 else -1
    return (abs(c1 - c2), direction)

sigmax = []
sigmay = []
point = (y,x)
for qu, points in coord.items():
    sigmax.append(calc(points[1], point[1]))
    sigmay.append(calc(points[0], point[0]))

sigmax.sort(key=lambda x: x[0])
sigmay.sort(key = lambda x: x[0])

In [84]:
sigmax[0], sigmay[0]

((0.006465916425000273, -1), (0.00481173829500392, -1))

In [86]:
q = quad1.to_tile()
q

((9099, 6160), 14)

In [87]:
quadkey.from_tile((9098,6159), 14)

12201110003232

By using pyquadkey2's to_tile() function, we can now establish the three other tiles and return them as a list.
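
As a rough illustration of what the tile representation gives us, the sketch below decodes a quadkey into (column, row) indices and encodes a shifted tile back, assuming the standard quadkey bit interleaving. `quadkey_to_tile` and `tile_to_quadkey` are hypothetical pure-Python helpers that mirror, but do not call, pyquadkey2's to_tile() and from_tile().

```python
def quadkey_to_tile(key):
    """Decode a quadkey string into ((column, row), zoom)."""
    col = row = 0
    for ch in key:
        digit = int(ch)
        col = (col << 1) | (digit & 1)   # low bit of each digit belongs to the column
        row = (row << 1) | (digit >> 1)  # high bit of each digit belongs to the row
    return (col, row), len(key)

def tile_to_quadkey(col, row, zoom):
    """Encode (column, row) at the given zoom back into a quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digits.append(str((1 if col & mask else 0) + (2 if row & mask else 0)))
    return "".join(digits)

(col, row), zoom = quadkey_to_tile("12201110021011")
print((col, row), zoom)
# one column to the west and one row to the north
print(tile_to_quadkey(col - 1, row - 1, zoom))
```

Moving between neighbouring tiles then reduces to adding or subtracting 1 from the column or row index.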

In [8]:
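def trova_vicini(quad, sopra, destra):
    # to_tile() returns ((column, row), zoom): columns grow eastwards, rows grow southwards
    (col, row), zoom = quad.to_tile()
    col_n = col + sopra    # the first offset shifts the column
    row_n = row + destra   # the second offset shifts the row
    lista = [quadkey.from_tile((col_n, row), zoom), quadkey.from_tile((col, row_n), zoom), quadkey.from_tile((col_n, row_n), zoom)]
    return [str(l) for l in lista]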

In [90]:
trova_vicini(quad1, sigmax[0][1], sigmay[0][1])

['12201110021010', '12201110003233', '12201110003232']

We wrap the process in a function that returns the 4 quadkeys as a single comma-separated string.

In [91]:
def near(lat, lon, zoom:int = 14):
    # from_geo() expects (latitude, longitude), which is the order used by every call below
    point = (lat, lon)
    quad1 = quadkey.from_geo(point, zoom)
    plausibles = quad1.nearby()

    coord = {}
    for qu in plausibles:
        ye, xe = quadkey.from_str(qu).to_geo(anchor = TileAnchor.ANCHOR_CENTER)
        if qu != str(quad1):
            coord[qu] = (ye,xe)
    sigmax = []
    sigmay = []
        
    for qu, points in coord.items():
        sigmax.append(calc(points[1], point[1]))
        sigmay.append(calc(points[0], point[0]))
    
    sigmax.sort(key=lambda x: x[0])
    sigmay.sort(key = lambda x: x[0])

    sopra = sigmay[0][1]
    destra = sigmax[0][1]

    vicini = trova_vicini(quad1, sopra, destra)

    l = [str(quad1)]
    l += vicini
    
    return ','.join(l)

In [92]:
near(y,x)

'12201110021011,12201110021010,12201110003233,12201110003232'

In [93]:
import time
start = time.time()
# latitude first, longitude second, as in the other calls to near()
yk = train.loc[1, 'LATNUM']
xk = train.loc[1, 'LONGNUM']
print(near(yk, xk))
end = time.time()
print(f'elapsed time: {end-start}s')

12201110021101,12201110021110,12201110021103,12201110021112
elapsed time: 0.0s


We apply the function to all the urban areas.

In [94]:
import time
start = time.time()
# take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
urban = train.loc[train.URBAN_RURA == 'U', :].copy()
print(len(urban))
urban['nearests'] = urban.apply(lambda x: near(x['LATNUM'], x['LONGNUM']), axis = 1)
end = time.time()
print(f'elapsed time: {end - start}')

25416
elapsed time: 5.8580591678619385


In [95]:
urban

Unnamed: 0,country_cluster,URBAN_RURA,LATNUM,LONGNUM,nearests
0,AL1,U,40.710440,19.946651,"12201110021011,12201110021010,12201110003233,1..."
1,AL10,U,40.703271,19.976494,"12201110021101,12201110021110,12201110021103,1..."
12,AL11,U,40.689990,19.975658,"12201110021103,12201110021102,12201110021121,1..."
15,AL112,U,41.323535,19.468450,"12023323312132,12023323312123,12023323312310,1..."
16,AL113,U,41.326573,19.432670,"12023323312122,12023323312033,12023323312300,1..."
...,...,...,...,...,...
68179,ZW95,U,-18.341095,29.890007,"30012303030002,30012303021113,30012303030020,3..."
68180,ZW96,U,-21.357135,30.645874,"30012323310032,30012323310023,30012323310030,3..."
68181,ZW97,U,-19.452031,29.773865,"30012321001231,30012321001320,30012321001233,3..."
68182,ZW98,U,-20.131983,28.513421,"30012320210223,30012320210232,30012320210221,3..."


In [96]:
temp = urban[['country_cluster', 'LONGNUM', 'LATNUM', 'nearests']]
print(len(temp))
# reassign instead of dropping in place on a slice, which raises SettingWithCopyWarning
temp = temp.drop_duplicates()
print(len(temp))
temp.to_csv("C:\\Users\\Luca\\Downloads\\RWI\\spatial_join_urban.csv", index = False)

25416
25416


Rural areas need a wider set of candidate tiles. We get it by calling nearby() with a larger radius (n = 2), which returns the 5x5 block around the cluster's tile, that is, the 24 tiles at most 2 tiles away plus the tile itself, and then we repeat the same steps. The 'direction' problem is now how to extend the 3x3 grid centred on the cluster's tile (the tile and its 8 immediate neighbours) into a 4x4 grid. We decide this from the sorted longitude (latitude) offsets: once the cluster's own tile is excluded, 4 tiles share the smallest offset and 5 share the second smallest, so the entry at index 9 of the sorted list is the first one with the third distinct offset, and we use its sign to choose the side on which the grid is extended.
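
The extension of the 3x3 grid can be sketched in the same library-free way as a choice between two column ranges and two row ranges, assuming the standard Web Mercator tile scheme with rows growing southwards. `fractional_tile` and `square_4x4` are hypothetical helpers that extend the grid towards the half of the tile in which the point falls; this is an independent illustration rather than the notebook's implementation, so its tiles may differ from the outputs recorded in the cells below.

```python
import math

def fractional_tile(lat, lon, zoom):
    """Tile coordinates of the point as floats (column, row)."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    sin_lat = math.sin(math.radians(lat))
    y = (0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * n
    return x, y

def square_4x4(lat, lon, zoom=14):
    """3x3 block around the point's tile, plus one extra column and row on the point's side."""
    x, y = fractional_tile(lat, lon, zoom)
    col, row = int(x), int(y)
    # offsets -1..1 form the 3x3 block; the fourth column goes east if the
    # point is in the east half of its tile, otherwise west
    col_offsets = range(-1, 3) if x - col >= 0.5 else range(-2, 2)
    # same idea for rows: lower half of the tile -> extend southwards
    row_offsets = range(-1, 3) if y - row >= 0.5 else range(-2, 2)
    return [(col + dc, row + dr) for dr in row_offsets for dc in col_offsets]

tiles = square_4x4(41.645007, 19.925487)  # first rural cluster (AL100)
print(len(tiles), tiles[0], tiles[-1])
```

The result always contains 16 distinct tiles forming a contiguous square that includes the cluster's own tile.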

In [9]:
rural = train.loc[train.URBAN_RURA == 'R', :].reset_index()
rural

Unnamed: 0,index,country_cluster,URBAN_RURA,LATNUM,LONGNUM
0,2,AL100,R,41.645007,19.925487
1,3,AL101,R,41.664977,20.045771
2,4,AL102,R,41.529707,20.064258
3,5,AL103,R,41.463585,20.091683
4,6,AL104,R,41.537868,20.127118
...,...,...,...,...,...
42763,68172,ZW89,R,-18.251363,31.264714
42764,68174,ZW90,R,-17.252286,30.914214
42765,68176,ZW92,R,-17.935135,31.118093
42766,68177,ZW93,R,-17.664349,31.168450


In [11]:
from pyquadkey2 import quadkey

y = rural.loc[0, 'LATNUM']
x = rural.loc[0, 'LONGNUM']

quad1 = quadkey.from_geo((y,x), 14)
plausibles = quad1.nearby(n = 2)
quad1, plausibles

(12023332021232,
 ['12023332021223',
  '12023332021320',
  '12023332023102',
  '12023332021202',
  '12023332023000',
  '12023332021212',
  '12023332023010',
  '12023332021233',
  '12023332021221',
  '12023332021302',
  '12023332023003',
  '12023332023100',
  '12023332021232',
  '12023332021222',
  '12023332021231',
  '12023332023013',
  '12023332021203',
  '12023332023001',
  '12023332021322',
  '12023332021220',
  '12023332021213',
  '12023332023002',
  '12023332021230',
  '12023332023011',
  '12023332023012'])

In [13]:
from pyquadkey2.quadkey import TileAnchor
coord = {}

for qu in plausibles:
    ye, xe = quadkey.from_str(qu).to_geo(anchor = TileAnchor.ANCHOR_CENTER)
    if qu != str(quad1):
        coord[qu] = (ye,xe)
coord

{'12023332021223': (41.648288312595, 19.896240234375),
 '12023332021320': (41.664705030092, 19.962158203125),
 '12023332023102': (41.615442324681, 19.962158203125),
 '12023332021202': (41.681117562906, 19.874267578125),
 '12023332023000': (41.631867410697, 19.874267578125),
 '12023332021212': (41.681117562906, 19.918212890625),
 '12023332023010': (41.631867410697, 19.918212890625),
 '12023332021233': (41.648288312595, 19.940185546875),
 '12023332021221': (41.664705030092, 19.896240234375),
 '12023332021302': (41.681117562906, 19.962158203125),
 '12023332023003': (41.615442324681, 19.896240234375),
 '12023332023100': (41.631867410697, 19.962158203125),
 '12023332021222': (41.648288312595, 19.874267578125),
 '12023332021231': (41.664705030092, 19.940185546875),
 '12023332023013': (41.615442324681, 19.940185546875),
 '12023332021203': (41.681117562906, 19.896240234375),
 '12023332023001': (41.631867410697, 19.896240234375),
 '12023332021322': (41.648288312595, 19.962158203125),
 '12023332

In [15]:
# now decide in which direction there are two tiles instead of one
def calc(c1, c2):
    # return (absolute offset, sign): the sign is +1 when c1 >= c2, else -1
    direction = 1 if c1 >= c2 else -1
    return (abs(c1 - c2), direction)


sigmax = []
sigmay = []
point = (y,x)
for qu, points in coord.items():
    sigmax.append(calc(points[1], point[1]))
    sigmay.append(calc(points[0], point[0]))

sigmax.sort(key=lambda x: x[0])
sigmay.sort(key = lambda x: x[0])

sigmax[:10], sigmay[:10]

([(0.007274473275000304, -1),
  (0.007274473275000304, -1),
  (0.007274473275000304, -1),
  (0.007274473275000304, -1),
  (0.014698182974999696, 1),
  (0.014698182974999696, 1),
  (0.014698182974999696, 1),
  (0.014698182974999696, 1),
  (0.014698182974999696, 1),
  (0.029247129525000304, -1)],
 [(0.0032815961950021233, 1),
  (0.0032815961950021233, 1),
  (0.0032815961950021233, 1),
  (0.0032815961950021233, 1),
  (0.013139305702999593, -1),
  (0.013139305702999593, -1),
  (0.013139305702999593, -1),
  (0.013139305702999593, -1),
  (0.013139305702999593, -1),
  (0.01969831369200392, 1)])

In [16]:
destra = sigmax[9][1]
sopra = sigmay[9][1]

sopra,destra

(1, -1)

In [17]:
def trova_vicini_rur(quad, sopra, destra):
    # to_tile() returns ((column, row), zoom): columns grow eastwards, rows grow southwards
    (col, row), zoom = quad.to_tile()

    # the tile itself, its two immediate neighbours, and one extra tile two steps away;
    # sopra offsets the column index and destra the row index
    cols = [col, col + 1, col - 1, col + 2*sopra]
    rows = [row, row + 1, row - 1, row + 2*destra]

    # all 16 combinations, in the same order as the original meshgrid version:
    # rows vary slowest, columns fastest
    return [str(quadkey.from_tile((c, r), zoom)) for r in rows for c in cols]

In [18]:
trova_vicini_rur(quad1, sopra, destra)

['12023332021232',
 '12023332021233',
 '12023332021223',
 '12023332021322',
 '12023332023010',
 '12023332023011',
 '12023332023001',
 '12023332023100',
 '12023332021230',
 '12023332021231',
 '12023332021221',
 '12023332021320',
 '12023332021212',
 '12023332021213',
 '12023332021203',
 '12023332021302']

The rural join is just a slightly more complicated version of the urban one.

In [19]:
def near_rur(lat, lon, zoom:int = 14):
    # from_geo() expects (latitude, longitude), which is the order used by every call below
    point = (lat, lon)
    quad1 = quadkey.from_geo(point, zoom)
    plausibles = quad1.nearby(n = 2)

    coord = {}
    for qu in plausibles:
        ye, xe = quadkey.from_str(qu).to_geo(anchor = TileAnchor.ANCHOR_CENTER)
        if qu != str(quad1):
            coord[qu] = (ye,xe)
    sigmax = []
    sigmay = []
        
    for qu, points in coord.items():
        sigmax.append(calc(points[1], point[1]))
        sigmay.append(calc(points[0], point[0]))
    
    sigmax.sort(key=lambda x: x[0])
    sigmay.sort(key = lambda x: x[0])

    sopra = sigmay[9][1]
    destra = sigmax[9][1]

    vicini = trova_vicini_rur(quad1, sopra, destra)
    
    return ','.join(vicini)

In [20]:
near_rur(y,x)

'12023332021232,12023332021233,12023332021223,12023332021322,12023332023010,12023332023011,12023332023001,12023332023100,12023332021230,12023332021231,12023332021221,12023332021320,12023332021212,12023332021213,12023332021203,12023332021302'

In [22]:
import time
start = time.time()
print(len(rural))
rural['nearests'] = rural.apply(lambda x: near_rur(x['LATNUM'], x['LONGNUM']), axis = 1)
end = time.time()
print(f'elapsed time: {end - start}')

42768
elapsed time: 31.46090316772461


In [24]:
rural.drop('index', axis = 1, inplace = True)

In [25]:
rural

Unnamed: 0,country_cluster,URBAN_RURA,LATNUM,LONGNUM,nearests
0,AL100,R,41.645007,19.925487,"12023332021232,12023332021233,12023332021223,1..."
1,AL101,R,41.664977,20.045771,"12023332030220,12023332030221,12023332021331,1..."
2,AL102,R,41.529707,20.064258,"12023332032221,12023332032230,12023332032220,1..."
3,AL103,R,41.463585,20.091683,"12023332210030,12023332210031,12023332210021,1..."
4,AL104,R,41.537868,20.127118,"12023332032320,12023332032321,12023332032231,1..."
...,...,...,...,...,...
42763,ZW89,R,-18.251363,31.264714,"30012312003312,30012312003313,30012312003303,3..."
42764,ZW90,R,-17.252286,30.914214,"30012301133312,30012301133313,30012301133303,3..."
42765,ZW92,R,-17.935135,31.118093,"30012310223202,30012310223203,30012310222313,3..."
42766,ZW93,R,-17.664349,31.168450,"30012310221010,30012310221011,30012310221001,3..."


In [26]:
temp = rural[['country_cluster', 'LONGNUM', 'LATNUM', 'nearests']]
print(len(temp))
# reassign instead of dropping in place on a slice, which raises SettingWithCopyWarning
temp = temp.drop_duplicates()
print(len(temp))
temp.to_csv("C:\\Users\\Luca\\Downloads\\RWI\\spatial_join_rural.csv", index = False)

42768
42768


We concatenate the datasets back together and export the dataset as one.

In [125]:
total = pd.concat([urban, rural], ignore_index = True)
total

Unnamed: 0,country_cluster,URBAN_RURA,LATNUM,LONGNUM,nearests
0,AL1,U,40.710440,19.946651,"12201110021011,12201110021010,12201110003233,1..."
1,AL10,U,40.703271,19.976494,"12201110021101,12201110021110,12201110021103,1..."
2,AL11,U,40.689990,19.975658,"12201110021103,12201110021102,12201110021121,1..."
3,AL112,U,41.323535,19.468450,"12023323312132,12023323312123,12023323312310,1..."
4,AL113,U,41.326573,19.432670,"12023323312122,12023323312033,12023323312300,1..."
...,...,...,...,...,...
68179,ZW89,R,-18.251363,31.264714,"30012312003312,30012312003303,30012312003310,3..."
68180,ZW90,R,-17.252286,30.914214,"30012301133312,30012301133303,30012301133310,3..."
68181,ZW92,R,-17.935135,31.118093,"30012310223202,30012310223203,30012310223220,3..."
68182,ZW93,R,-17.664349,31.168450,"30012310221010,30012310221011,30012310203232,3..."


In [127]:
len(total.country_cluster.unique())

68184

In [128]:
total.to_csv("C:\\Users\\Luca\\Downloads\\RWI\\spatial_join.csv", index = False)
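
The exported file also gives us the list of all quadkeys we need values for. A minimal sketch of extracting it, assuming the column layout shown above and using only the standard library: the first sample row copies the quadkeys computed earlier for AL1, while the second row (`XX1`) is hypothetical and only there to show de-duplication. `unique_quadkeys` is an illustrative helper name.

```python
import csv
import io

# in-memory stand-in for spatial_join.csv
sample = io.StringIO(
    'country_cluster,URBAN_RURA,LATNUM,LONGNUM,nearests\n'
    'AL1,U,40.710440,19.946651,"12201110021011,12201110021010,12201110003233,12201110003232"\n'
    # hypothetical second cluster sharing two tiles with the first
    'XX1,U,40.72,19.93,"12201110003233,12201110003232,12201110003231,12201110003230"\n'
)

def unique_quadkeys(handle):
    """Collect every distinct quadkey listed in the 'nearests' column."""
    keys = set()
    for record in csv.DictReader(handle):
        keys.update(record['nearests'].split(','))
    return sorted(keys)

needed = unique_quadkeys(sample)
print(len(needed), needed)
```

To run it on the real export, pass an open file handle for the CSV instead of the in-memory sample.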