This notebook covers the computation of the average RWI value for each DHS cluster. Because the survey data are confidential, the raw data may not be shown to anyone whom the notebook's owner has not explicitly authorized beforehand. For this reason, the notebook shows only the script used to obtain the results, without outputs.

The data, comprising both survey data and location data, are available on request at https://www.dhsprogram.com/.
The first part of the dataset construction, which joins the survey data with the location data (the two come as separate datasets), can be done in R; the script to do so is provided directly on the DHS site.
The output dataset will have the variables hhid (household identifier; used for joining the datasets, but not used in the study), DHSCLUST (the DHS cluster of the household taking part in the survey, which we will use as the statistical unit of the training data), wealth index (our RWI), URBAN_RURA (dichotomous variable indicating whether the area is urban or rural, which we will use for the spatial join), LATNUM and LONGNUM (coordinates, which we will also use for the spatial join), country, plus minor variables that could be of interest, such as the wealth index quintile.

In [None]:
import pandas as pd
import os

We first show how a single country's dataset is incorporated into the main dataset, using Zimbabwe as the example.

In [None]:
path = "C:\\Users\\Luca\\Downloads\\WI + Geo"

os.listdir(path)

Missing values are coded as 9999 in the dataset. The RWI values are also stored a million times larger than they should be, so we rescale them.

In [None]:
# Pick the last file returned by os.listdir (the listing order is not guaranteed)
nome = os.listdir(path)[-1]
zw = pd.read_csv(os.path.join(path, nome), na_values = 9999)

In [None]:
cols = zw.columns

In [None]:
zw['wealth index'] = zw['wealth index'] / 1000000
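Since the real outputs cannot be shown, the two cleaning steps above can be illustrated on invented data. The sketch below is only an assumption-laden toy: the column names follow this notebook, but the values and the in-memory CSV are made up.

```python
import io
import pandas as pd

# Hypothetical two-row extract: column names follow the notebook, values are invented.
raw = io.StringIO(
    "hhid,DHSCLUST,wealth index,ALT_GPS\n"
    "1,1,-250000,9999\n"
    "2,1,1500000,1480\n"
)

# na_values=9999 turns every cell equal to 9999 into NaN, in every column.
toy = pd.read_csv(raw, na_values=9999)

# Bring the wealth index back to its intended scale.
toy["wealth index"] = toy["wealth index"] / 1_000_000

print(toy)
```

Note that a scalar `na_values` applies to all columns, so a legitimate 9999 elsewhere (for example in an identifier) would also become missing. If that is a concern, `na_values` also accepts a dict mapping column names to their missing-value codes.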

In [None]:
zw.sample()

Initially, we included the altitude variable, which is available as either a GPS or a DEM measurement, and used the GPS measurement by default, falling back to DEM where it was missing. We did not use it in the actual study, though, so feel free to leave it out of the dataset during the R manipulation and to remove the code that handles it from this notebook.

In [None]:
zw['ALT'] = zw.ALT_GPS.fillna(zw.ALT_DEM)

In [None]:
zw.drop(['ALT_GPS', 'ALT_DEM'], axis = 1, inplace = True)

In [None]:
df = pd.DataFrame(columns = cols)
df.drop(['ALT_GPS', 'ALT_DEM'], axis = 1, inplace = True)

Importing each dataset, cleaning it, and appending it to the main dataset is automated.

In [None]:
for file in os.listdir(path):
    country = file[0:2]
    temp = pd.read_csv(os.path.join(path, file), na_values = 9999)

    print(f'{country}: {temp.shape}')
    
    temp['wealth index'] = temp['wealth index'] / 1000000

    temp['ALT'] = temp.ALT_GPS.fillna(temp.ALT_DEM)
    temp.drop(['ALT_GPS', 'ALT_DEM'], axis = 1, inplace = True)

    df = pd.concat([df, temp], ignore_index = True)
    print(f'{country} appendage finished')
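As a possible variant of the loop above, the cleaned frames can be collected in a list and concatenated once at the end, which avoids recopying the growing main dataset on every iteration. This is a minimal sketch, not the code used in the study: `load_country_files` is a hypothetical helper, and the two toy files written to a temporary folder contain invented values.

```python
import os
import tempfile
import pandas as pd

def load_country_files(folder):
    """Read every CSV in `folder`, clean it, and return one combined DataFrame."""
    frames = []
    for file in sorted(os.listdir(folder)):
        temp = pd.read_csv(os.path.join(folder, file), na_values=9999)
        temp["wealth index"] = temp["wealth index"] / 1_000_000
        # Prefer the GPS altitude, fall back to DEM where GPS is missing.
        temp["ALT"] = temp["ALT_GPS"].fillna(temp["ALT_DEM"])
        frames.append(temp.drop(columns=["ALT_GPS", "ALT_DEM"]))
    # A single concat at the end instead of one per file.
    return pd.concat(frames, ignore_index=True)

# Toy demonstration with invented values (hypothetical file names).
with tempfile.TemporaryDirectory() as folder:
    for code, wi in [("ZW", 1200000), ("KE", -300000)]:
        pd.DataFrame({
            "country": [code], "DHSCLUST": [1], "wealth index": [wi],
            "ALT_GPS": [9999], "ALT_DEM": [1100],
        }).to_csv(os.path.join(folder, f"{code}_toy.csv"), index=False)
    combined = load_country_files(folder)

print(combined)
```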

The dataset consists of 1,713,316 households, divided into 68,184 clusters. Because the DHS cluster number is only unique within a country, we build a unique identifier from the country code and the cluster number, then group by this identifier and take the average of the RWI values within each cluster.

In [None]:
len(df)

In [None]:
def create_mix_column(df, key_1_col, key_2_col):
    df['country_cluster'] = df[key_1_col] + df[key_2_col].astype(str)
    return df

In [None]:
df_fin = create_mix_column(df, 'country', 'DHSCLUST')

df_fin.sample(5)

In [None]:
len(df_fin.country_cluster.unique())

We create the dataset of RWI averages and then join it to a dataset containing only the cluster-level information (dropping the household id variable).

In [None]:
medie_clust = df_fin.groupby('country_cluster', as_index = False)['wealth index'].mean()

In [None]:
medie_clust.sample(5)
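The role of the composite identifier in the averaging step can be seen on a small invented example. The sketch below uses made-up values and compares grouping by the cluster number alone with grouping by the country-plus-cluster key.

```python
import pandas as pd

# Invented households: cluster number 1 exists in both countries.
toy = pd.DataFrame({
    "country": ["ZW", "ZW", "KE"],
    "DHSCLUST": [1, 1, 1],
    "wealth index": [0.5, 1.5, -1.0],
})

# Same construction as create_mix_column: country code + cluster number.
toy["country_cluster"] = toy["country"] + toy["DHSCLUST"].astype(str)

# Grouping by the bare cluster number mixes households from both countries.
by_number = toy.groupby("DHSCLUST")["wealth index"].mean()

# Grouping by the composite key keeps the two clusters apart.
by_key = toy.groupby("country_cluster")["wealth index"].mean()

print(by_number)
print(by_key)
```

With the bare cluster number the three households collapse into a single group, while the composite key yields one average per country and cluster.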

In [None]:
tenere = ['country', 'DHSCLUST', 'URBAN_RURA', 'LATNUM', 'LONGNUM', 'ALT', 'country_cluster']
# Keep one row per distinct combination; chaining avoids modifying a slice of df_fin in place
df_loc = df_fin[tenere].drop_duplicates()
len(df_loc)

In [None]:
medie_clust = medie_clust.merge(df_loc, on = 'country_cluster')
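`drop_duplicates` only removes rows that are identical in every kept column, so a cluster whose households carry slightly different coordinates or altitudes would survive as several rows and be duplicated by the merge. One way to check for this, sketched here on invented data, is to look for repeated keys and to ask `merge` to enforce a one-to-one relationship through its `validate` parameter. Whether such duplicates actually occur in the DHS data is not something this notebook establishes.

```python
import pandas as pd

# Invented cluster averages and location rows; "ZW1" appears twice with different latitudes.
means = pd.DataFrame({
    "country_cluster": ["ZW1", "KE1"],
    "wealth index": [1.0, -1.0],
})
locations = pd.DataFrame({
    "country_cluster": ["ZW1", "ZW1", "KE1"],
    "LATNUM": [-17.8, -17.9, -1.3],
    "LONGNUM": [31.0, 31.0, 36.8],
})

# List every location row whose key is not unique.
dupes = locations[locations.duplicated("country_cluster", keep=False)]
print(dupes)

# validate="one_to_one" makes merge raise if either side has repeated keys.
try:
    means.merge(locations, on="country_cluster", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

print(merge_ok)
```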

In [None]:
medie_clust['wealth index'] = medie_clust['wealth index'].round(5)
medie_clust

In [None]:
medie_clust.to_csv('WI_per_clusters.csv', index = False)