# Demographic and Health Survey (DHS) Data Preparation

Download the Philippine National DHS dataset from the [official website](https://www.dhsprogram.com/what-we-do/survey/survey-display-510.cfm), copy the archive into the data directory, and unzip it there. The unzipped DHS folder must contain the following files, which are the paths the code below reads:
- `PHHR71DT/PHHR71FL.DTA`
- `PHHR71DT/PHHR71FL.DO`
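
Before running the rest of the notebook, it can help to confirm that the unzipped files are where the code expects them. The following is a minimal sketch, assuming the `../data/` directory and the `PHHR71DT` paths used in the File locations cell; the helper name `missing_dhs_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_dhs_files(data_dir, required):
    """Return the required relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in required if not (root / rel).is_file()]


# Paths assumed from the File locations cell of this notebook.
required = ['PHHR71DT/PHHR71FL.DTA', 'PHHR71DT/PHHR71FL.DO']
missing = missing_dhs_files('../data/', required)

if missing:
    print('Missing DHS files:', missing)
else:
    print('All DHS files found.')
```

An empty list means both the Stata data file and its `.DO` dictionary are in place.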

## Imports

In [15]:
import pandas as pd

## File locations

In [16]:
data_dir = '../data/'
dhs_zip = data_dir + ''
dhs_file = dhs_zip + 'PHHR71DT/PHHR71FL.DTA'
dhs_dict_file = dhs_zip + 'PHHR71DT/PHHR71FL.DO'
print(dhs_dict_file)

../data/PHHR71DT/PHHR71FL.DO


In [17]:
!ls ../data/PHHR71DT/PHHR71FL.DO


../data/PHHR71DT/PHHR71FL.DO


## Helper Function

In [18]:
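def get_dhs_dict(dhs_dict_file):
    """Map DHS variable codes to their labels, parsed from the Stata .DO file."""
    dhs_dict = dict()
    with open(dhs_dict_file, 'r', errors='replace') as file:
        # Iterate over every line of the file, including the first.
        for line in file:
            # Relevant lines look like: label variable <code> "<label>"
            if 'label variable' in line:
                tokens = line.split()
                code = tokens[2]
                colname = ' '.join(x.strip('"') for x in tokens[3:])
                dhs_dict[code] = colname
    return dhs_dict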
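
To see what the helper extracts, here is a small self-contained sketch of the same parsing logic applied to a few in-memory lines instead of a file. The sample lines are made up for illustration and only mimic the `label variable <code> "<label>"` shape the helper looks for; `parse_label_lines` is a hypothetical name.

```python
def parse_label_lines(lines):
    """Apply the same token-based parsing as get_dhs_dict to an iterable of lines."""
    labels = {}
    for line in lines:
        if 'label variable' in line:
            tokens = line.split()
            # tokens[2] is the variable code; the rest is the quoted label.
            labels[tokens[2]] = ' '.join(t.strip('"') for t in tokens[3:])
    return labels


# Illustrative lines only; the real .DO file is much longer.
sample = [
    'label variable hv001    "Cluster number"',
    'label variable hv206    "Has electricity"',
    'label define HV206 0 "No" 1 "Yes"',  # not a variable label, so it is ignored
]

labels = parse_label_lines(sample)
print(labels)
```

Because the label is rebuilt from whitespace-separated tokens, any run of spaces inside a label collapses to a single space.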

## Load DHS Dataset

In [19]:
dhs = pd.read_stata(dhs_file, convert_categoricals=False)
dhs_dict = get_dhs_dict(dhs_dict_file)
dhs = dhs.rename(columns=dhs_dict).dropna(axis=1)
print('Data Dimensions: {}'.format(dhs.shape))

Data Dimensions: (27496, 339)
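
The load step renames coded columns to their labels and then calls `dropna(axis=1)`, which drops every column that contains at least one missing value. A toy sketch of that behaviour follows; the frame, the `hv999` code, and the label mapping are invented for illustration and are not taken from the DHS file.

```python
import numpy as np
import pandas as pd

# Toy frame with coded column names; values are illustrative only.
raw = pd.DataFrame({
    'hv001': [1, 1, 2],
    'hv206': [1, 0, 1],
    'hv999': [np.nan, 3.0, np.nan],  # hypothetical column with missing values
})

# Hypothetical code-to-label mapping, shaped like the output of get_dhs_dict.
toy_labels = {'hv001': 'Cluster number', 'hv206': 'Has electricity'}

# Codes without a label keep their original name; columns with any NaN are dropped.
renamed = raw.rename(columns=toy_labels).dropna(axis=1)
print(list(renamed.columns))
```

Only fully populated columns survive, so a column with even a single missing household record is removed from the working dataset.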


## Aggregate Columns

In [20]:
data = dhs[[
    'Cluster number',
    'Wealth index factor score combined (5 decimals)',
    'Education completed in single years',
    'Has electricity'
]].groupby('Cluster number').mean()

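# 996 is the DHS code for a water source on the premises, so treat it as 0 minutes.
# Apply the replacement to the water column only, so that cluster number 996 is left intact.
water_col = 'Time to get to water source (minutes)'
data[water_col] = (
    dhs[water_col].replace(996, 0)
    .groupby(dhs['Cluster number']).median()
)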

# Use a flat list of names; a nested list would create a MultiIndex header.
data.columns = [
    'Wealth Index',
    'Education completed (years)',
    'Access to electricity',
    'Access to water (minutes)'
]

print('Data Dimensions: {}'.format(data.shape))
data.head(10)

Data Dimensions: (1249, 4)


| Cluster number | Wealth Index | Education completed (years) | Access to electricity | Access to water (minutes) |
|---|---|---|---|---|
| 1 | -31881.608696 | 9.391304 | 0.913043 | 0.0 |
| 2 | -2855.375 | 9.708333 | 0.958333 | 0.0 |
| 3 | -57647.047619 | 8.428571 | 0.857143 | 0.0 |
| 4 | -54952.666667 | 6.714286 | 0.809524 | 0.0 |
| 5 | -77819.16 | 8.24 | 0.92 | 0.0 |
| 6 | -80701.695652 | 8.086957 | 0.869565 | 10.0 |
| 7 | -62490.538462 | 7.5 | 0.807692 | 0.0 |
| 8 | -80889.666667 | 4.958333 | 0.958333 | 0.0 |
| 9 | -77994.52 | 6.64 | 0.8 | 0.0 |
| 10 | -135511.961538 | 5.153846 | 0.346154 | 25.0 |
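
The aggregation can be checked by hand on a tiny example. The sketch below uses made-up household rows (not DHS data) to show the per-cluster mean for a binary indicator and the per-cluster median for water travel time after recoding 996 to 0.

```python
import pandas as pd

water_col = 'Time to get to water source (minutes)'

# Made-up household rows for two clusters; values are illustrative only.
toy = pd.DataFrame({
    'Cluster number': [1, 1, 2, 2, 2],
    'Has electricity': [1, 0, 1, 1, 1],
    water_col: [996, 10, 5, 996, 30],
})

# Recode 996 to 0 minutes in the water column only.
toy[water_col] = toy[water_col].replace(996, 0)

grouped = toy.groupby('Cluster number')
summary = pd.DataFrame({
    'Access to electricity': grouped['Has electricity'].mean(),
    'Access to water (minutes)': grouped[water_col].median(),
})
print(summary)
```

Cluster 1 has water times of 0 and 10 minutes after recoding, giving a median of 5; cluster 2 has 5, 0, and 30, which also gives a median of 5. The mean of the electricity flag is the share of households in the cluster that have electricity.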


## Save Processed DHS File

In [21]:
data.to_csv(data_dir+'dhs_indicators.csv')