# <center>Predictive Modeling of Under-5 Mortality Determinants in Kenya Using KDHS 2022 Data</center>

---

## Business Understanding
Under-5 mortality remains a major public health challenge in Kenya. Although progress has been made, child deaths before the age of five still contribute significantly to preventable mortality. In 2023, sub-Saharan Africa recorded the highest under-5 mortality rate at **74 deaths per 1000 live births** (WHO, 2024).  

The key problem this project seeks to address is:  
**What demographic, socioeconomic, and environmental factors most influence under-5 mortality in Kenya, and can we predict individual or community-level risk using machine learning?**

This topic is highly relevant because reducing under-5 mortality is central to Kenya’s progress toward **Sustainable Development Goal (SDG) 3 – Good Health and Well-being**. Specifically, **SDG target 3.2** aims to end preventable deaths of newborns and children under 5 years by 2030 (WHO, n.d.).  

The project sits at the intersection of **public health and data science**, with a target audience that includes:  
- Policymakers  
- Public health agencies  
- NGOs and implementing partners  
- Academic researchers  

If successful, this work could:  
- Provide actionable insights on high-risk populations  
- Support targeted interventions (e.g., immunization, nutrition, maternal health services)  
- Guide equitable resource allocation  
- Ultimately reduce preventable child deaths  

While previous research has used DHS data with **logistic regression and survival analysis**, this project will extend the literature by applying **predictive supervised machine learning** (e.g., Random Forest, Gradient Boosting), emphasizing both **accuracy** and **interpretability** to inform health policy.

---

## Data Understanding
The dataset will be the **2022 Kenya Demographic and Health Survey (KDHS)**, obtained through The DHS Program upon approval. It is a nationally representative household survey covering fertility, maternal and child health, mortality, socioeconomic conditions, and health service utilization.  

**Target variable:** Under-5 mortality (death of a child before reaching the age of five).  

**Features of interest include:**  
- **Maternal characteristics**: age, education, parity, antenatal care visits  
- **Child characteristics**: sex of child, birth interval, birth order, place of delivery, immunization status  
- **Household characteristics**: wealth index, access to water and sanitation, household size  
- **Geographic/Environmental factors**: region, rural vs. urban residence, environmental exposures  
- **Healthcare utilization**: vaccination coverage, access to healthcare services  

Previous DHS-based studies have primarily been descriptive or inferential. This project will build upon them by creating a **predictive framework** that identifies patterns of risk in under-5 mortality, making results more actionable for public health interventions.

---

## Data Preparation
The DHS datasets are provided in **CSV, Stata, SPSS, and SAS formats** (`.csv`, `.dta`, `.sav`, `.sas7bdat`) and will be imported into **Python**. The data include a mix of **categorical** and **numerical** variables.  

**Preprocessing steps will include:**  
- Handling missing values (e.g., imputation strategies)  
- Recoding categorical variables 
- Deriving new features (e.g., age groups, birth intervals, maternal age at first birth)  
- Balancing outcome classes (since child survival is far more common than child death)  
- Normalizing or standardizing numeric inputs where appropriate  
- Applying DHS sample weights to account for complex survey design  
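
As a minimal sketch of the weighting step: the label for `V005` in this dataset reads "Women's individual sample weight (6 decimals)", so the stored value has six implied decimal places and needs to be divided by 1,000,000 before use. The toy values and the `sample_weight` column name below are illustrative assumptions, not part of the KDHS file.

```python
import pandas as pd

# Toy records standing in for the KDHS child file (values are illustrative).
toy = pd.DataFrame({"V005": [1296049, 850000, 2100500]})

# V005 is stored with six implied decimal places, so rescale it.
toy["sample_weight"] = toy["V005"] / 1_000_000

print(toy)
```

The rescaled column can then be passed wherever a model or summary accepts per-row weights.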

**Challenges anticipated:**  
- Complex survey design (stratification, clustering, weighting)  
- Class imbalance (rare event prediction problem)  
- Variable coding complexity (DHS uses numeric codes that require careful recoding)  
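
One common way to address the class imbalance noted above, without resampling, is to weight each class inversely to its frequency. The sketch below uses a toy outcome vector (an assumption for illustration) and the same "balanced" heuristic that scikit-learn documents for `class_weight="balanced"`: `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np

# Toy outcome vector: 1 = died before age five, 0 = survived (illustrative only).
y = np.array([0] * 95 + [1] * 5)

# "Balanced" heuristic: n_samples / (n_classes * count_per_class).
counts = np.bincount(y)
class_weights = len(y) / (len(counts) * counts)

print(dict(enumerate(class_weights)))
```

Whether class weighting, resampling, or threshold tuning works best for this dataset is an empirical question that the modeling phase would need to settle.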

**Data visualisation plans:**  
- Descriptive statistics (means, medians, proportions by outcome)  
- Mortality rates by region, wealth quintile, maternal education, and other key factors  
- Distribution plots comparing children who survived vs. those who died before age five  
- Correlation heatmaps for potential multicollinearity among predictors  
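
The grouped mortality summaries planned above could be computed along the following lines. This is a sketch on toy data: `died_u5`, `sample_weight`, and `wealth_quintile` are hypothetical derived columns, and the result is a simple weighted proportion of deaths among listed children, not an official DHS life-table mortality rate.

```python
import pandas as pd

# Toy data: `died_u5` and `sample_weight` are hypothetical derived columns.
toy = pd.DataFrame({
    "wealth_quintile": [1, 1, 1, 5, 5, 5],
    "died_u5":         [1, 0, 0, 0, 0, 1],
    "sample_weight":   [1.0, 1.0, 2.0, 0.5, 1.5, 0.5],
})

# Weighted proportion of deaths per group: sum(w * y) / sum(w).
weighted_deaths = (toy["died_u5"] * toy["sample_weight"]).groupby(toy["wealth_quintile"]).sum()
total_weight = toy.groupby("wealth_quintile")["sample_weight"].sum()
rate_per_1000 = 1000 * weighted_deaths / total_weight

print(rate_per_1000)
```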


## 1: Importing Libraries
We begin by importing the essential Python libraries for working with the DHS dataset:

- **pyreadstat** → allows us to read the SAS (`.sas7bdat`), Stata (`.dta`), and SPSS (`.sav`) files provided by the DHS Program.  
- **pandas** → the primary library for data manipulation and analysis.  
- **numpy** → useful for numerical operations and handling arrays.

These libraries form the foundation of our data exploration and preparation workflow.


In [1]:
import pyreadstat
import pandas as pd
import numpy as np

## 2: Loading the Dataset
The 2022 Kenya DHS file used here (`KEKR8CFL.SAS7BDAT`) is in SAS format (`.sas7bdat`).  
We will use **pyreadstat** to load it into Python, which returns two objects:  

- **df** → the data frame containing child-level survey records.  
- **meta** → metadata describing variable labels, value labels, and other survey design details.  

Once loaded, we display the first five rows to verify the data structure.

In [2]:
df, meta = pyreadstat.read_sas7bdat('KEKR8CFL.SAS7BDAT')
df.head()

Unnamed: 0,CASEID,BIDX,V000,V001,V002,V003,V004,V005,V006,V007,...,S621B,S626Q,S626R,S626S,S626T,S631A,S631B,S631C,S631L,S631M
0,1 4 2,1.0,KE8,1.0,4.0,2.0,1.0,1296049.0,4.0,2022.0,...,,,,,,,,,,
1,1 13 2,1.0,KE8,1.0,13.0,2.0,1.0,1296049.0,4.0,2022.0,...,,,,,,,,,,
2,1 26 2,1.0,KE8,1.0,26.0,2.0,1.0,1296049.0,4.0,2022.0,...,,,,,,,,,,
3,1 42 1,1.0,KE8,1.0,42.0,1.0,1.0,1296049.0,4.0,2022.0,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1 55 2,1.0,KE8,1.0,55.0,2.0,1.0,1296049.0,4.0,2022.0,...,,,,,,,,,,


The table above displays the first five rows of the dataset.  
Each row corresponds to a child record from the DHS survey, while the columns represent demographic, maternal, household, and health-related variables.  

Key observations at this stage:  
- Variable names are abbreviated (e.g., `V001`, `B5`), consistent with DHS coding conventions.  
- The metadata (`meta`) will help map these codes to human-readable labels for interpretation.  
- We can already see both categorical (coded numerically) and continuous variables, which will require preprocessing before modeling.  


## 3: Understanding Variable Names and Labels
The DHS dataset uses abbreviated variable names (e.g., `V001`, `B5`, `HW1`) that are not immediately intuitive.  
Fortunately, the metadata object (`meta`) contains descriptive labels for each variable.  

In this step, we print out the **first 1,312 variables and their corresponding labels** to better understand the dataset’s structure and content.  
This will help us identify which variables are relevant for under-5 mortality analysis.


In [3]:
# Variable names + labels
for var, label in zip(df.columns[:1312], meta.column_labels[:1312]):
    print(f"{var}: {label}")


CASEID: Case Identification
BIDX: Birth column number
V000: Country code and phase
V001: Cluster number
V002: Household number
V003: Respondent's line number
V004: Ultimate area unit
V005: Women's individual sample weight (6 decimals)
V006: Month of interview
V007: Year of interview
V008: Date of interview (CMC)
V008A: Date of interview Century Day Code (CDC)
V009: Respondent's month of birth
V010: Respondent's year of birth
V011: Date of birth (CMC)
V012: Respondent's current age
V013: Age in 5-year groups
V014: Completeness of age information
V015: Result of individual interview
V016: Day of interview
V017: CMC start of calendar
V018: Row of month of interview
V019: Length of calendar
V019A: Number of calendar columns
V020: Ever-married sample
V021: Primary sampling unit
V022: Sample strata for sampling errors
V023: Stratification used in sample design
V024: Region
V025: Type of place of residence
V026: NA - De facto place of residence
V027: Number of visits
V028: Interviewer identif

The output above displays the variable codes with their corresponding descriptive labels.  
This mapping is essential for navigating the DHS dataset because the codes alone are often cryptic.  

From the labels, we can already identify some key variables:  
- `B5` → *Child is alive* (critical for constructing the under-5 mortality outcome).  
- `B7` → *Age at death (months, imputed)* (used to confirm whether death occurred before age 5).  
- `V106` → *Highest educational level* (maternal education).  
- `V190` → *Wealth index combined* (household socioeconomic status).  
- `V025` → *Type of place of residence* (urban vs. rural).  

These variables will form the basis for feature selection and preprocessing in the next steps.  
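
Scanning more than a thousand printed lines is slow, so a small lookup dictionary can help when checking individual codes. The sketch below uses short stand-in lists in place of `df.columns` and `meta.column_labels`, and `describe` is a hypothetical helper introduced here for illustration.

```python
# Stand-ins for `df.columns` and `meta.column_labels` (entries copied from the printed output).
column_names = ["CASEID", "V005", "V024", "V025"]
column_labels = [
    "Case Identification",
    "Women's individual sample weight (6 decimals)",
    "Region",
    "Type of place of residence",
]

# Build a code -> label dictionary for quick lookups.
labels = dict(zip(column_names, column_labels))

def describe(code):
    """Return 'CODE: label', tolerating lower-case input such as 'v025'."""
    key = code.upper()
    return f"{key}: {labels.get(key, 'label not found')}"

print(describe("v025"))
```

Upper-casing the key keeps the lookup consistent with the upper-case column names in this file.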


## 4: Viewing the MAP File (Data Dictionary)
In addition to the dataset and metadata, DHS also provides a **MAP file** (`.MAP`) that contains the full variable descriptions.  
This file serves as a **data dictionary**, detailing how each variable is coded and labeled.  

Here, we load and display the contents of the MAP file to better understand the structure of the dataset and ensure accurate variable interpretation.

In [4]:
# Load and print the MAP file, which holds the full variable descriptions
with open('KEKR8CFL.MAP', 'r', encoding="ISO-8859-1") as file:
    map_text = file.read()

print(map_text)

                                                   KEKR8CFL                                                   
                                                KEKR8CFL_DICT                                                 

                                             KEKR8CFL.DCF                                                     
                                    Last Modified:  8/12/2024  1:03:00 PM                                     

--------------------------------------------------------------------------------------------------------------
Level Name                    Level Label                                                  Type            Rec
  Record Name                   Record Label                                              Value  Req  Max  Len
--------------------------------------------------------------------------------------------------------------
HOUSEHOLD                     HOUSEHOLD                                                                       

The output above displays the full contents of the **MAP file**.  
It provides detailed descriptions of the variables, including:  
- Variable codes (e.g., `B5`, `V190`)  
- Full labels and definitions  
- Value labels (e.g., 0 = "No", 1 = "Yes")  
- Information on categorical encodings  

This file acts as a reference guide and ensures we select and recode variables correctly when preparing features and the target variable (under-5 mortality).  


## 5: Creating a Subset of Base Variables
From the DHS dataset, we extract a subset of variables that are most relevant to **under-5 mortality**.  

These variables cover:  
- **Child characteristics** (sex, birth order, birth weight, survival status, age at death, birth intervals)  
- **Maternal characteristics** (age, education, antenatal care, place of delivery)  
- **Household characteristics** (region, residence type, wealth index, cooking fuel, religion, ethnicity, household size)  

This subset allows us to reduce complexity while keeping the most important explanatory factors.


In [5]:
# Create a subset of base variables for under-5 mortality
base_vars = ['CASEID','V012','V013','V024','V025','V106','V130','V131','V136','V190','V161','V206','V207',
          'BORD','B4','B5','B7','B11','B12','B20',
          'M13','M14','M15','M18','M19','M2A','M2B','M2G', 'M2H']

'''
'CASEID' = Unique Case Identifier - object
'V012' = Respondent's current age - continuous/float64
'V013' = Age in 5-year groups - categorical - ordinal
'V024' = Region - categorical - nominal
'V025' = Type of place of residence - categorical - nominal
'V106' = Highest educational level - categorical - ordinal
'V130' = Religion - categorical - nominal
'V131' = Ethnicity - categorical - nominal
'V136' = Number of household members (listed) - discrete/int64
'V190' = Wealth index combined - categorical - ordinal
'V161' = Type of cooking fuel (smoke exposure, indoor air pollution) - categorical - nominal
'V206' =  Sons who have died - discrete/int64
'V207' =  Daughters who have died - discrete/int64
'BORD' = Birth order number - discrete/int64
'B4' = Sex of child - categorical - nominal
'B5' = Child is alive - categorical - nominal
'B7' = Age at death (months, imputed) - discrete/int64
'B11' = Preceding birth interval (months) - discrete/int64
'B12' = Succeeding birth interval (months) - discrete/int64
'B20' = Duration of pregnancy in months - discrete/int64
'M13' = Timing of 1st antenatal check (months) - discrete/int64
'M14' = Number of antenatal visits during pregnancy - discrete/int64
'M15' = Place of delivery - categorical - nominal
'M18' = Size of child at birth - categorical - ordinal
'M19' = Birth weight in kilograms (3 decimals) - continuous/float64
'M2A' = Prenatal: doctor - categorical - nominal (binary)
'M2B' = Prenatal: nurse/midwife/clinical officer - categorical - nominal (binary)
'M2G' = Prenatal: traditional birth attendant - categorical - nominal (binary)
'M2H' = Prenatal: Community health worker/field worker - categorical - nominal (binary)
'''

We now have a focused set of variables that will form the **analytical dataset**.  

Key outcome-related variables:  
- **B5 (Child is alive):** Indicates whether the child is alive or dead.  
- **B7 (Age at death in months):** For children who died, this specifies when.  

Together, these allow us to construct the **under-5 mortality variable** (death before reaching 60 months of age).  

Other variables serve as **potential risk factors** (maternal education, antenatal visits, household wealth, place of delivery, etc.), helping us explore determinants of child survival.
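
A minimal sketch of how the outcome described above might be derived from `B5` and `B7`. The coding `1 = alive, 0 = dead` for `B5` is an assumption that should be checked against the MAP file, and `under5_death` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy records; the B5 coding (1 = alive, 0 = dead) is an assumption to verify in the MAP file.
toy = pd.DataFrame({
    "B5": [1.0, 0.0, 0.0, 1.0],
    "B7": [np.nan, 3.0, 61.0, np.nan],  # age at death in months; missing for living children
})

# Flag deaths that occurred before 60 completed months.
toy["under5_death"] = ((toy["B5"] == 0) & (toy["B7"] < 60)).astype(int)

print(toy)
```

Note that children who were alive at interview but younger than 60 months have not yet completed their exposure period, so a simple binary flag like this treats them as survivors. How to handle that censoring is a design decision for the modeling phase.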


## 6: Expanding Predictors for Under-5 Mortality
While the base variables capture maternal, child, and household characteristics, under-5 mortality is also strongly influenced by broader determinants.  

We therefore expand our predictor set to include:  
- **Nutrition** (child feeding, breastfeeding, growth indicators)  
- **Household environment (WASH)** (water source, sanitation, cooking fuel, housing materials)  
- **Vaccination and healthcare access** (immunization status, antenatal/postnatal care, skilled attendance, distance to facilities)  
- **Disease exposure** (episodes of diarrhea, fever, cough, A


In [6]:
# Expand predictors/ columns to account for under-5 mortality rates
# Consider Nutrition, Household Environment, Vaccination and Health care access, Disease Exposure

labels = dict(zip(meta.column_names, meta.column_labels))

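def find_by_keyword(keywords):
    """Return variables whose label contains any keyword (case-insensitive substring match)."""
    keywords = [k.lower() for k in keywords]
    matches = {
        name: label
        for name, label in labels.items()
        if isinstance(label, str) and any(kw in label.lower() for kw in keywords)
    }
    return pd.Series(matches)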

# Keywords to search for 
print("Vaccination-related variables:")
print(find_by_keyword(["vaccin", "immun", "measles", "bcg", "dpt", "polio", "Pneumococcal", "Rota", "pentavalent"]))

print("\nNutrition-related variables:")
print(find_by_keyword(["weight", "height", "zscore", "stunt", "wast", "breast", "breastfeed", "complement"]))

print("\nRecent illness variables:")
print(find_by_keyword(["diarr", "fever", "cough", "ari", "malaria"]))

print("\nWASH / household environment:")
print(find_by_keyword(["water", "toilet", "latrine", "sanitation", "fuel", "floor", "roof", "crowd"]))

print("\nHealthcare access:")
print(find_by_keyword(["tetanus", "postnatal", "post natal", "post-natal", "attend", "skilled", "facility", "distance", "time to"]))



Vaccination-related variables:
V418                          Entries in immunization roster
H1A        Has health card and or other vaccination document
H2                                              Received BCG
H2D                                                  BCG day
H2M                                                BCG month
                                 ...                        
H67M                                NA - Pentavalent 4 month
H67Y                                 NA - Pentavalent 4 year
H69              Place where most vaccinations were received
S1102BD    Services or information provided by community ...
S528A                                   Yellow fever vaccine
Length: 94, dtype: object

Nutrition-related variables:
V005          Women's individual sample weight (6 decimals)
V3A08G                      Reason not using: breastfeeding
V404                                Currently breastfeeding
V407            NA - Number of times breastfed during night
V4

The output lists all variables in the dataset whose labels match the given keywords.  

This helps us:  
- **Screen for relevant predictors** that can explain child survival outcomes.  
- **Group variables into thematic categories** (e.g., nutrition, WASH, vaccination, healthcare).  
- **Decide which variables to include** in the modeling phase for analyzing determinants of under-5 mortality.  

This approach ensures we do not overlook important contextual factors that drive under-5 survival.
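
Plain substring matching can return false positives: the nutrition search above picks up `V005` (a sample weight) through the keyword "weight", and a short keyword such as "ari" also matches "malaria". The sketch below shows one possible refinement that matches keywords only at the start of a word and allows explicit exclusions. `find_by_prefix` and the toy `labels` dictionary are hypothetical and exist only for this illustration.

```python
import re

# Toy label dictionary; entries are copied from labels shown in this notebook.
labels = {
    "V005": "Women's individual sample weight (6 decimals)",
    "M49A": "During pregnancy took: SP/fansidar for malaria",
    "HW2": "Child's weight in kilograms (1 decimal)",
    "S621A": "In contact with someone with cough or TB",
}

def find_by_prefix(keywords, exclude=()):
    """Match keywords only at the start of a word; drop labels containing any `exclude` term."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + ")", re.IGNORECASE)
    return {
        code: label
        for code, label in labels.items()
        if pattern.search(label) and not any(term.lower() in label.lower() for term in exclude)
    }

print(find_by_prefix(["ari"]))                                # "malaria" no longer matches
print(find_by_prefix(["weight"], exclude=["sample weight"]))  # keeps HW2, drops V005
```

Stricter matching trades recall for precision, so the results are still worth reviewing by hand against the MAP file.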


## 7: Expanding the Variable Subset for Under-5 Mortality Analysis  

In addition to the `base_vars` defined earlier, we now create a complementary subset `extra_vars`.  
This subset captures predictors beyond maternal/household characteristics, structured around five thematic domains:  

1. **Vaccination variables** – immunization coverage (BCG, DPT, polio, measles, yellow fever, IPV).  
2. **Nutrition variables** – breastfeeding, birthweight, child anthropometry, feeding practices.


In [7]:
# create another subset of base variables for child mortality
extra_vars = [
    #Vaccination Variables
    'H1A', 'H2', 'H4',
    'H6', 'H8', 'H0',
    'H9', 'H9A', 'H51', 'H52', 'H53', 'H54','H55', 'H56', 'H57','H60',
    'H58','H10', 'H69', 'S528A',

    #Nutrition Variables
    'V404', 'M4', 'M5',
    'M19A',
    'M34', 'M55', 'HW1',
    'HW2', 'HW3',
    'HW70', 'HW71', 'HW72',

    #Illness / Exposure Variables
    'M49A', 'S621A',

    #WASH / Household Environment Variables
    'V113',
    'V127', 'V465',

    #Healthcare Access Variables
    'V394', 'V417',
    'V467D', 'V483A', 'V483B',
    'M1', 'M1A',
    'M3A', 'M3B', 'M3G', 'M3H',
    'S446'
]

''' 
#Vaccination Variables
"H1A" = Has health card and or other vaccination document - categorical - nominal
"H2" = Received BCG - categorical - nominal
"H4" = Received POLIO 1 #live attenuated vaccine - categorical - nominal
"H6" = Received POLIO 2 - categorical - nominal
"H8" = Received POLIO 3 - categorical - nominal
"H0" = Received POLIO 0 - categorical - nominal
"H9" = Received MEASLES 1 - categorical - nominal
"H9A" = Received MEASLES 2 - categorical - nominal
"H60" = Received inactivated polio (IPV) - categorical - nominal
"H10" = Ever had vaccination - categorical - nominal (binary)
"H51" = Received Pentavalent 1 - categorical - nominal
"H52" = Received Pentavalent 2 - categorical - nominal
"H53" = Received Pentavalent 3 - categorical - nominal
"H54" = Received Pneumococcal 1 - categorical - nominal
"H55" = Received Pneumococcal 2 - categorical - nominal
"H56" = Received Pneumococcal 3 - categorical - nominal
"H57" = Received Rotavirus 1 - categorical - nominal
"H58" = Received Rotavirus 2 - categorical - nominal
"H69" = Place where most vaccinations were received - categorical - nominal
"S528A" = Yellow fever vaccine - categorical - nominal (binary)

#Nutrition Variables
"V404" = Currently breastfeeding - categorical - nominal (binary)
"M4" = Duration of breastfeeding - categorical (mixed numeric + special codes): define a clean recoding scheme later for M4 (e.g., splitting into M4_duration and M4_status)
"M5" = Months of breastfeeding - categorical (mixed numeric + special codes): requires recoding like M4 above
"M19A" = Weight at birth/recall - categorical - ordinal
"M34" = When child put to breast - categorical (mixed numeric + special codes); recode into a clean numeric variable:
    - Create a single variable for time in hours (convert days → hours).
    - Normalize the units (hours vs. days) before analysis.
    - Preserve special codes (199, 299, 999) as separate flags.
    - Note that 100 (“within 1 hour”) is a special code, not a duration per se; when recoding to hours, decide whether to treat it as 0.5 hr, 1 hr, or keep it separate.
"M55" = Given child anything other than breast milk - categorical - nominal (binary)
"HW1" = Child's age in months
"HW2" = Child's weight in kilograms (1 decimal) - continuous/float64
"HW3" = Child's height in centimeters (1 decimal) - continuous/float64
"HW70" = Height/Age standard deviation (new WHO) - mixed numeric (continuous, scaled by 100) + special codes (9996–9999)
"HW71" = Weight/Age standard deviation (new WHO) - mixed numeric (continuous, scaled by 100) + special codes (9996–9999)
"HW72" = Weight/Height standard deviation (new WHO) - mixed numeric (continuous, scaled by 100) + special codes (9996–9999)

#Illness / Exposure Variables
"M49A" = During pregnancy took: SP/fansidar for malaria - categorical - nominal (binary)
"S621A" = In contact with someone with cough or TB - categorical - nominal (binary)

#WASH / Household Environment Variables
"V113" = Source of drinking water - categorical - nominal
"V127" = Main floor material - categorical - nominal
"V465" = Disposal of youngest child's stools when not u - categorical - nominal

#Healthcare Access Variables
"V394" = Visited health facility last 12 months - categorical - nominal (binary)
"V417" = Entries in pregnancy and postnatal care roster - discrete/int64
"V467D" = Getting medical help for self: distance to health facility - categorical - nominal
"V483A" = Minutes to nearest healthcare facility - discrete/int64
"V483B" = Mode of transportation to nearest healthcare facility - categorical - nominal
"M1" = Number of tetanus injections before birth - discrete/int64
"M1A" = Number of tetanus injections before pregnancy - discrete/int64
"M3A" = Assistance: doctor - categorical - nominal (binary)
"M3B" = Assistance: nurse/midwife/clinical officer - categorical - nominal (binary)
"M3G" = Assistance: traditional birth attendant - categorical - nominal (binary)
"M3H" = Assistance: Relative/friend - categorical - nominal (binary)
"S446" = Respondent treated with respect at facility - categorical - nominal 
'''



The `extra_vars` list extends the scope of predictors by capturing broader determinants of child survival.  

Together with the `base_vars`, this will allow us to:  
- Construct a **comprehensive analytical dataset** that blends maternal, child, household, and contextual factors.  
- Explicitly test the impact of immunization, nutrition, WASH, and healthcare access on under-5 mortality.  
- Design models that can inform **policy-relevant interventions** by highlighting modifiable determinants.  

Next, we will merge `base_vars` and `extra_vars` into a unified list of predictors, and define the **target variable** for under-5 mortality (based on `B5` = alive/dead and `B7` = age at death in months).
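
The recoding notes for the anthropometric z-scores and for `M34` could be sketched as follows. Several details are assumptions to confirm against the MAP file: that `M34` codes of the form 1xx denote hours and 2xx denote days, that 0 means "immediately", and that 0.5 hours is an acceptable placeholder for "within 1 hour". The names `haz`, `breast_hours`, and `m34_to_hours` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values; the real columns would come from `df_subset`.
toy = pd.DataFrame({
    "HW70": [-150.0, 85.0, 9998.0, np.nan],
    "M34":  [0.0, 100.0, 103.0, 202.0],
})

# Z-scores: codes 9996-9999 are special; everything else is scaled by 100.
toy["haz"] = toy["HW70"].where(toy["HW70"] < 9996) / 100

def m34_to_hours(code):
    """Convert an M34 code to hours (assumed scheme: 1xx = hours, 2xx = days)."""
    if pd.isna(code) or code in (199, 299, 999):
        return np.nan                  # special codes stay off the numeric scale
    if code in (0, 100):
        return 0.5                     # placeholder for 'immediately / within 1 hour'
    if 100 < code < 199:
        return float(code - 100)       # hours
    if 200 < code < 299:
        return float(code - 200) * 24  # days converted to hours
    return np.nan

toy["breast_hours"] = toy["M34"].map(m34_to_hours)
print(toy)
```

Special codes become missing values here. If they need to be preserved as flags, separate indicator columns can be derived from the raw code before this conversion.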


## 8: Creating a Unified DataFrame for Under-5 Mortality Analysis  

Now that we have identified both the **base variables** (directly tied to child survival) and the **extra explanatory variables** (covering nutrition, WASH, healthcare access, vaccination, and illness exposure),  
we create a unified working dataset (`df_subset`).  

This dataset will serve as the foundation for descriptive exploration, cleaning, and modeling, ensuring we capture both direct and indirect determinants of **under-5 mortality**.  


In [8]:
#create a unified dataframe
df_subset = df[base_vars + extra_vars].copy()

In [9]:
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19530 entries, 0 to 19529
Data columns (total 78 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   CASEID  19530 non-null  object 
 1   V012    19530 non-null  float64
 2   V013    19530 non-null  float64
 3   V024    19530 non-null  float64
 4   V025    19530 non-null  float64
 5   V106    19530 non-null  float64
 6   V130    19530 non-null  float64
 7   V131    19530 non-null  float64
 8   V136    19530 non-null  float64
 9   V190    19530 non-null  float64
 10  V161    19530 non-null  float64
 11  V206    19530 non-null  float64
 12  V207    19530 non-null  float64
 13  BORD    19530 non-null  float64
 14  B4      19530 non-null  float64
 15  B5      19530 non-null  float64
 16  B7      694 non-null    float64
 17  B11     14305 non-null  float64
 18  B12     4850 non-null   float64
 19  B20     19530 non-null  float64
 20  M13     10035 non-null  float64
 21  M14     10412 non-null  float64
 22

The `.info()` output confirms the structure of our subsetted DataFrame.  
It shows:  
- **Total number of rows**: corresponding to the number of observations (children).  
- **Total number of columns**: equal to `base_vars + extra_vars`.  
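- **Data types**: in the rows shown, every numeric variable, including the categorical codes, is stored as `float64`; only `CASEID` is an `object`.  
- **Missing values**: visible wherever the non-null count is below the total number of rows (e.g., `B7` has only 694 non-null entries because age at death is recorded only for children who died), which will guide our data cleaning strategy.  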

This validates that our merged dataset is ready for recoding, handling special codes, and analysis.  
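
Because coded and count variables arrive as `float64`, one possible clean-up is to cast them to pandas' nullable integer or category dtypes before modeling. The sketch below uses toy columns; which variables should receive which dtype in the real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy columns mirroring the float64 storage reported by `.info()`.
toy = pd.DataFrame({
    "B5":  [1.0, 0.0, 1.0],       # categorical code stored as float
    "B11": [24.0, np.nan, 36.0],  # count with a missing value
})

# Nullable integer keeps missing values while dropping the spurious decimals.
toy["B11"] = toy["B11"].astype("Int64")

# Category dtype marks coded variables as non-numeric for later encoding.
toy["B5"] = toy["B5"].astype("Int64").astype("category")

print(toy.dtypes)
```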


## 9: Renaming Columns for Readability  

To make the dataset more interpretable, we rename the variables using their **DHS labels** rather than cryptic codes (e.g., `H1A → Has health card`).  

This ensures that subsequent analysis is easier to follow and aligns variables directly with **under-5 mortality determinants** such as:  
- **Vaccination coverage** (e.g., measles, polio, BCG).  
- **Nutrition and feeding practices** (e.g., breastfeeding, anthropometric indices).  
- **Household environment (WASH)** (e.g., drinking water, sanitation).  
- **Healthcare access** (e.g., antenatal visits, tetanus injections, facility access).  
- **Child survival outcomes** (e.g., age at death, alive status).  


In [10]:
#rename the columns for easier understanding
df_subset = df_subset.rename(columns = {
    'CASEID': 'CASEID',
    'V012': 'Respondents current age',
    'V013': 'Age in 5-year groups',
    'V024': 'Region',
    'V025': 'Type of place of residence',
    'V106': 'Highest educational level',
    'V115': 'Time to get to water source',  # not in df_subset; rename() silently ignores missing keys
    'V130': 'Religion',
    'V131': 'Ethnicity',
    'V136': 'Number of household members (listed)',
    'V190': 'Wealth index combined',
    'V161': 'Type of cooking fuel (smoke exposure, indoor air pollution)',
    'V206': 'Sons who have died',
    'V207': 'Daughters who have died',
    'BORD': 'Birth order number',
    'B4': 'Sex of child',
    'B5': 'Child is alive',
    'B7': 'Age at death (months, imputed)', 
    'B11': 'Preceding birth interval (months)',
    'B12': 'Succeeding birth interval (months)',
    'B20': 'Duration of pregnancy in months',
    'M13': 'Timing of 1st antenatal check (months)',
    'M14': 'Number of antenatal visits during pregnancy',
    'M15': 'Place of delivery',
    'M18': 'Size of child at birth',
    'M19': 'Birth weight in kilograms (3 decimals)',
    'M2A':  'Prenatal: doctor',
    'M2B':  'Prenatal: nurse/midwife/clinical officer',
    'M2G': 'Prenatal: traditional birth attendant',
    'M2H': 'Prenatal: Community health worker/field worker',
    'H1A': 'Has health card and or other vaccination document',
    'H2': 'Received BCG',
    'H4': 'Received POLIO 1',
    'H6': 'Received POLIO 2',
    'H8': 'Received POLIO 3',
    'H0': 'Received POLIO 0',
    'H9': 'Received MEASLES 1',
    'H9A': 'Received MEASLES 2',
    'H60': 'Received inactivated polio (IPV)',
    'H10': 'Ever had vaccination',
    'H51': 'Received Pentavalent 1',
    'H52': 'Received Pentavalent 2',
    'H53': 'Received Pentavalent 3',
    'H54': 'Received Pneumococcal 1',
    'H55': 'Received Pneumococcal 2',
    'H56': 'Received Pneumococcal 3',
    'H57': 'Received Rotavirus 1',
    'H58': 'Received Rotavirus 2',
    'H69': 'Place where most vaccinations were received',
    'S528A': 'Yellow fever vaccine',
    'V404': 'Currently breastfeeding',
    'M4': 'Duration of breastfeeding',
    'M5': 'Months of breastfeeding',
    'M19A': 'Weight at birth/recall',
    'M34': 'When child put to breast',
    'M55': 'Given child anything other than breast milk',
    'HW1': 'Childs age in months',
    'HW2': 'Childs weight in kilograms (1 decimal)',
    'HW3': 'Childs height in centimeters (1 decimal)',
    'HW70': 'Height/Age standard deviation (new WHO)',
    'HW71': 'Weight/Age standard deviation (new WHO)',
    'HW72': 'Weight/Height standard deviation (new WHO)',
    'M49A': 'During pregnancy took: SP/fansidar for malaria',
    'S621A': 'In contact with someone with cough or TB',
    'V113': 'Source of drinking water',
    'V127': 'Main floor material',
    'V465': 'Disposal of youngest childs stools when not u',
    'V394': 'Visited health facility last 12 months',
    'V417': 'Entries in pregnancy and postnatal care roster',
    'V467D': 'Getting medical help for self: distance to health facility',
    'V483A': 'Minutes to nearest healthcare facility',
    'V483B': 'Mode of transportation to nearest healthcare facility',
    'M1': 'Number of tetanus injections before birth',
    'M1A': 'Number of tetanus injections before pregnancy',
    'M3A': 'Assistance: doctor',
    'M3B': 'Assistance: nurse/midwife/clinical officer',
    'M3H': 'Assistance: Relative/friend',
    'M3G': 'Assistance: traditional birth attendant',
    'S446': 'Respondent treated with respect at facility'
})

In [11]:
#confirm renaming
df_subset.head()

Unnamed: 0,CASEID,Respondents current age,Age in 5-year groups,Region,Type of place of residence,Highest educational level,Religion,Ethnicity,Number of household members (listed),Wealth index combined,...,Getting medical help for self: distance to health facility,Minutes to nearest healthcare facility,Mode of transportation to nearest healthcare facility,Number of tetanus injections before birth,Number of tetanus injections before pregnancy,Assistance: doctor,Assistance: nurse/midwife/clinical officer,Assistance: traditional birth attendant,Assistance: Relative/friend,Respondent treated with respect at facility
0,1 4 2,34.0,4.0,1.0,1.0,0.0,7.0,11.0,6.0,4.0,...,1.0,30.0,12.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0
1,1 13 2,39.0,5.0,1.0,1.0,2.0,1.0,3.0,8.0,5.0,...,2.0,5.0,24.0,,,,,,,
2,1 26 2,28.0,3.0,1.0,1.0,2.0,3.0,3.0,5.0,4.0,...,2.0,5.0,24.0,,,,,,,
3,1 42 1,30.0,4.0,1.0,1.0,2.0,4.0,3.0,3.0,5.0,...,,,,,,,,,,
4,1 55 2,34.0,4.0,1.0,1.0,2.0,2.0,3.0,4.0,5.0,...,,,,1.0,2.0,1.0,1.0,0.0,0.0,


The columns have now been renamed with descriptive labels.  

- This makes the dataset **human-readable** and ready for recoding.  
- For example:  
  - `H9` → **Received MEASLES 1**  
  - `V404` → **Currently breastfeeding**  
  - `V113` → **Source of drinking water**  
- Key outcome variables like **Child is alive** and **Age at death (months, imputed)** are now easy to identify.  

This step improves transparency and minimizes errors during cleaning, exploratory analysis, and modeling.  


## 10: Data Quality Check — Missingness and Problematic Columns  

Before running analyses, it’s important to evaluate the dataset for missingness and structural issues.  
We will write a helper function to:  
- Calculate the percentage of missing values per variable.  
- Flag variables with **100% missing** or **high missingness (>50%)**.  

Further structural checks worth running alongside it (not covered by the function below):  
- **Constant columns** (no variation).  
- Missing values in **key outcome variables**.  
- **Mixed data types**, which may indicate data corruption or coding inconsistencies.  


In [12]:
#write a function to check missingness and flag problematic columns
def validate_columns(df):
    total_rows = len(df)
    report = (
        df.isnull()
        .sum()
        .reset_index()
        .rename(columns={"index": "variable", 0: "missing_count"})
    )
    report["pct_missing"] = (report["missing_count"] / total_rows) * 100

    # flag empty or mostly empty columns
    report["flag"] = report["pct_missing"].apply(
        lambda x: "100% missing" if x == 100 else ("High missingness" if x > 50 else "")
    )

    return report.sort_values(by="pct_missing", ascending=False).reset_index(drop=True)


In [13]:
#Validate - apply the function
report = validate_columns(df_subset)
print(report.head(50))  # see the top 50

                                             variable  missing_count  \
0                      Age at death (months, imputed)          18836   
1                                Yellow fever vaccine          16686   
2                                Ever had vaccination          16176   
3         Respondent treated with respect at facility          15125   
4                  Succeeding birth interval (months)          14680   
5         Given child anything other than breast milk          14211   
6      During pregnancy took: SP/fansidar for malaria          14070   
7                            When child put to breast          13571   
8       Disposal of youngest childs stools when not u          13544   
9                              Weight at birth/recall          13382   
10                             Size of child at birth          13382   
11             Birth weight in kilograms (3 decimals)          13382   
12      Number of tetanus injections before pregnancy          1

The output table shows the **50 variables with the most missingness**.  

Pay attention to the `flag` column, which `validate_columns()` fills with one of three values:
- `100% missing`: variable is completely empty and can be safely dropped.  
- `High missingness`: more than half the data is missing, so decide whether to impute, recode, or drop.  
- *(empty)*: at most 50% missing, so no flag is raised.  

Constant columns (no analytical value, drop directly) and mixed data types (inspect and harmonize, e.g., string vs numeric coding) are not flagged by this function and need a separate check.  

Next step: we will use this report to **filter, recode, and clean** the dataset before any modeling.  
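
The checks for constant columns and mixed data types mentioned in this step could be sketched as below. This is a minimal illustration rather than part of the original notebook: `flag_structural_issues` and the toy `demo` frame are hypothetical names, and "mixed types" is assumed to mean more than one Python type among a column's non-null values.

```python
import pandas as pd

def flag_structural_issues(df):
    """Flag constant columns and columns holding more than one Python type."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        rows.append({
            "variable": col,
            # zero or one distinct non-null value means no variation
            "constant": non_null.nunique() <= 1,
            # more than one Python type among the values (e.g. int and str)
            "mixed_types": non_null.map(type).nunique() > 1,
        })
    return pd.DataFrame(rows)

# Toy data (hypothetical): "a" is constant, "b" mixes int and str, "c" is clean.
demo = pd.DataFrame({
    "a": [1.0, 1.0, None],
    "b": [1, "x", 2],
    "c": [1.0, 2.0, 3.0],
})
issues = flag_structural_issues(demo)
print(issues)
```

The result can be merged with the missingness report on `variable` if a single table is preferred.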


## 11: Drop Columns with Excessive Missingness

From the validation report, we identified several variables that are either **completely empty**, have **extremely high missingness**, or are otherwise unsuitable for analysis in their raw form.  
We define a helper function `safe_drop()` that ensures:
- It drops only existing columns (ignores typos or non-existent names).  
- It handles duplicates in the list automatically.  

We then apply it to drop the problematic columns identified in our review.  


In [14]:
#define function for dropping columns with too many missing values
def safe_drop(df, cols_to_drop):
    """
    Drop columns from a DataFrame safely.
    - Ignores duplicates in the list.
    - Ignores columns that don't exist in df.
    """
    # Ensure unique names
    cols_to_drop = list(set(cols_to_drop))
    
    # Keep only those actually in the dataframe
    cols_to_drop = [col for col in cols_to_drop if col in df.columns]
    
    # Drop them
    return df.drop(columns=cols_to_drop)

# Usage:
cols = [
    'Respondent treated with respect at facility',
    'Disposal of youngest childs stools when not u',
    'During pregnancy took: SP/fansidar for malaria',
    'Weight at birth/recall',
    'Duration of breastfeeding',
    'Respondents current age'
]

df = safe_drop(df_subset, cols)


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19530 entries, 0 to 19529
Data columns (total 72 columns):
 #   Column                                                       Non-Null Count  Dtype  
---  ------                                                       --------------  -----  
 0   CASEID                                                       19530 non-null  object 
 1   Age in 5-year groups                                         19530 non-null  float64
 2   Region                                                       19530 non-null  float64
 3   Type of place of residence                                   19530 non-null  float64
 4   Highest educational level                                    19530 non-null  float64
 5   Religion                                                     19530 non-null  float64
 6   Ethnicity                                                    19530 non-null  float64
 7   Number of household members (listed)                         19530 non-null 

## DataFrame Structure Summary

The dataset contains **19,530 observations (rows)** and **72 variables (columns)**, as reported by `df.info()` above.  

- **Data Types:**  
  - `object` → 1 column (`CASEID`, the respondent identifier).  
  - `float64` → the remaining 71 columns (all numeric variables, though many represent categorical codes).  

- **Coverage of Variables:**  
  - **Complete variables (no missing values):** respondent demographics (age, region, residence, education, religion, ethnicity), household indicators (cooking fuel, floor material, drinking water), and identifiers (CASEID).  
  - **Moderate coverage (50–70% non-null):** antenatal care details, place of delivery, assistance at birth, and breastfeeding practices.  
  - **Low coverage (<20% non-null):** *Age at death (months, imputed)*, *Ever had vaccination*, *Yellow fever vaccine*, and *Succeeding birth interval*.  

- **Domain Coverage:**  
  1. **Demographics & Household** – age, residence type, wealth index, cooking fuel, floor material.  
  2. **Child Survival** – sex of child, alive/deceased status, age at death, number of deceased children.  
  3. **Maternal & Antenatal Care** – antenatal visits, tetanus injections, delivery place, birth attendants.  
  4. **Child Health & Nutrition** – breastfeeding duration, timing of first feed, anthropometry (weight, height, WHO Z-scores).  
  5. **Immunization** – BCG, DPT, Polio, Measles, IPV, Yellow Fever.  
  6. **Healthcare Access** – distance to facility, transport, facility visits.  
  7. **Disease Exposure** – contact with cough/TB cases.  

### Key Takeaway
The dataset is rich and multidimensional but exhibits **variable missingness across domains**.  
- **Core demographic and household data** are well captured.  
- **Mortality, vaccination, and nutrition variables** have patchy coverage and will require careful treatment (e.g., imputation, selective dropping) before statistical modeling on under-5 mortality can proceed.
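
The coverage tiers described above can be reproduced programmatically. The sketch below is one possible approach, assuming arbitrary cut-offs of 20% and 50% non-null; `coverage_tiers`, the tier labels, and the toy `demo` frame are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

def coverage_tiers(df):
    """Percent of non-null values per column, bucketed into coarse tiers."""
    coverage = df.notna().mean() * 100  # percent non-null per column
    tiers = pd.cut(
        coverage,
        bins=[-0.1, 20, 50, 99.999, 100],  # assumed cut-offs, adjust as needed
        labels=["low", "partial", "moderate", "complete"],
    )
    return pd.DataFrame({"pct_non_null": coverage.round(1), "tier": tiers})

# Toy data (hypothetical): 10 rows with full, 60%, and 10% coverage.
demo = pd.DataFrame({
    "region": range(10),
    "anc_visits": [4, 3, 8, 2, 5, 6] + [np.nan] * 4,
    "age_at_death": [2] + [np.nan] * 9,
})
summary = coverage_tiers(demo)
print(summary)
```

Grouping `summary` by `tier` gives a quick count of how many variables fall into each coverage band.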


## 12: Exporting Cleaned Subset  

After cleaning and reducing the dataset to the most relevant variables for under-5 mortality analysis,  
we now save the working subset to a CSV file. This ensures reproducibility, easier sharing,  
and compatibility with other analytical tools beyond Python.  


In [16]:
#Save to CSV
df.to_csv("u5mr_subset.csv", index=False)

print(f"Saved essential subset with {df.shape[1]} variables and {df.shape[0]} rows.")


Saved essential subset with 72 variables and 19530 rows.


## Data Understanding

In [17]:
df = pd.read_csv('u5mr_subset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19530 entries, 0 to 19529
Data columns (total 72 columns):
 #   Column                                                       Non-Null Count  Dtype  
---  ------                                                       --------------  -----  
 0   CASEID                                                       19530 non-null  object 
 1   Age in 5-year groups                                         19530 non-null  float64
 2   Region                                                       19530 non-null  float64
 3   Type of place of residence                                   19530 non-null  float64
 4   Highest educational level                                    19530 non-null  float64
 5   Religion                                                     19530 non-null  float64
 6   Ethnicity                                                    19530 non-null  float64
 7   Number of household members (listed)                         19530 non-null 

## Dataset Structure  

The cleaned dataset contains **19,530 records** (rows) and **72 variables** (columns).  
Each row represents an individual child or household record from the DHS survey, while columns capture demographic, health, and environmental characteristics.  

### Key Observations:  
- **Identifiers:** `CASEID` identifies the respondent linked to each record.  
- **Demographics:** Respondent’s age group, region, residence type, education, religion, and ethnicity are fully recorded (no missing values).  
- **Socioeconomic Factors:** Wealth index, household size, and cooking fuel type are available for all cases.  
- **Mortality Information:**  
  - `Child is alive` is complete for all records, allowing reliable categorization of survival status.  
  - `Age at death (months, imputed)` is available for **694 children**, representing deceased cases.  
  - Birth intervals (preceding/succeeding) are variably complete, with preceding interval data more widely available.  
- **Maternal and Child Health:** Variables on antenatal visits, place of delivery, breastfeeding, child’s weight/height, and immunization show **moderate levels of missingness**, which will need careful handling.  
- **Vaccination Data:** Coverage is strong for BCG, DPT3, Polio3, and Measles2 (over 11,000 records each), but weaker for Yellow fever and secondary vaccinations.  
- **Environmental Factors:** Source of drinking water and flooring material are complete across all records.  
- **Healthcare Access:** Distance to facility, transport, and recent facility visits are available for about half of the cases.  



In [18]:
# Shape of the dataset
print("Shape of dataset:", df.shape)

# Quick look at the first few rows
display(df.head())

# Summary info
df.info()

# Check missing values
missing = df.isnull().sum().sort_values(ascending=False)
print(missing.head(15))  # top 15 variables with missing data

# Check duplicates
print("Number of duplicate rows:", df.duplicated().sum())

# Check basic stats for numeric columns
display(df.describe())


Shape of dataset: (19530, 72)


Unnamed: 0,CASEID,Age in 5-year groups,Region,Type of place of residence,Highest educational level,Religion,Ethnicity,Number of household members (listed),Wealth index combined,"Type of cooking fuel (smoke exposure, indoor air pollution)",...,Entries in pregnancy and postnatal care roster,Getting medical help for self: distance to health facility,Minutes to nearest healthcare facility,Mode of transportation to nearest healthcare facility,Number of tetanus injections before birth,Number of tetanus injections before pregnancy,Assistance: doctor,Assistance: nurse/midwife/clinical officer,Assistance: traditional birth attendant,Assistance: Relative/friend
0,1 4 2,4.0,1.0,1.0,0.0,7.0,11.0,6.0,4.0,2.0,...,1.0,1.0,30.0,12.0,1.0,0.0,1.0,0.0,0.0,0.0
1,1 13 2,5.0,1.0,1.0,2.0,1.0,3.0,8.0,5.0,2.0,...,0.0,2.0,5.0,24.0,,,,,,
2,1 26 2,3.0,1.0,1.0,2.0,3.0,3.0,5.0,4.0,2.0,...,0.0,2.0,5.0,24.0,,,,,,
3,1 42 1,4.0,1.0,1.0,2.0,4.0,3.0,3.0,5.0,2.0,...,0.0,,,,,,,,,
4,1 55 2,4.0,1.0,1.0,2.0,2.0,3.0,4.0,5.0,2.0,...,2.0,,,,1.0,2.0,1.0,1.0,0.0,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19530 entries, 0 to 19529
Data columns (total 72 columns):
 #   Column                                                       Non-Null Count  Dtype  
---  ------                                                       --------------  -----  
 0   CASEID                                                       19530 non-null  object 
 1   Age in 5-year groups                                         19530 non-null  float64
 2   Region                                                       19530 non-null  float64
 3   Type of place of residence                                   19530 non-null  float64
 4   Highest educational level                                    19530 non-null  float64
 5   Religion                                                     19530 non-null  float64
 6   Ethnicity                                                    19530 non-null  float64
 7   Number of household members (listed)                         19530 non-null 

Unnamed: 0,Age in 5-year groups,Region,Type of place of residence,Highest educational level,Religion,Ethnicity,Number of household members (listed),Wealth index combined,"Type of cooking fuel (smoke exposure, indoor air pollution)",Sons who have died,...,Entries in pregnancy and postnatal care roster,Getting medical help for self: distance to health facility,Minutes to nearest healthcare facility,Mode of transportation to nearest healthcare facility,Number of tetanus injections before birth,Number of tetanus injections before pregnancy,Assistance: doctor,Assistance: nurse/midwife/clinical officer,Assistance: traditional birth attendant,Assistance: Relative/friend
count,19530.0,19530.0,19530.0,19530.0,19530.0,19530.0,19530.0,19530.0,19530.0,19530.0,...,19530.0,10267.0,10267.0,10267.0,10412.0,6805.0,11728.0,11728.0,11728.0,11728.0
mean,3.436252,23.497542,1.657655,1.324322,5.404813,20.560471,5.869688,2.637481,9.443625,0.100256,...,1.000614,1.682283,43.744619,20.783286,1.268248,1.772814,0.415331,0.585351,0.119628,0.060198
std,1.348587,13.707384,0.474507,0.972344,13.509171,33.306912,2.58898,1.449923,15.195492,0.34539,...,0.661517,0.465611,57.925473,5.195579,0.919682,1.859749,0.4928,0.492682,0.32454,0.237863
min,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,1.0,0.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,10.0,1.0,1.0,2.0,4.0,4.0,1.0,7.0,0.0,...,1.0,1.0,15.0,13.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,3.0,24.0,2.0,1.0,3.0,7.0,5.0,3.0,8.0,0.0,...,1.0,2.0,30.0,24.0,1.0,1.0,0.0,1.0,0.0,0.0
75%,4.0,35.0,2.0,2.0,7.0,11.0,7.0,4.0,8.0,0.0,...,1.0,2.0,60.0,24.0,2.0,3.0,1.0,1.0,0.0,0.0
max,7.0,47.0,2.0,3.0,96.0,96.0,24.0,5.0,97.0,5.0,...,5.0,2.0,600.0,96.0,8.0,8.0,1.0,1.0,1.0,1.0


## Data Cleaning
### 1. Data Types Handling

### Variable Type Assignment  

At this stage, we need to ensure each variable in the dataset is represented with the most appropriate data type.  
- **Identifiers** such as `CASEID` are stored as `object`.  
- **Continuous variables** (e.g., weight, height, z-scores) are explicitly coerced to `float64`.  
- **Discrete numeric variables** (e.g., household size, birth intervals, antenatal visits) are converted into nullable `Int64` for safe handling of missing values.  
- **Categorical variables** (nominal or ordinal, such as region, education level, sex of child, or vaccination status) are assigned the `category` type to optimize memory use and enable proper statistical summaries.  

This step is critical because accurate data types improve downstream analysis, descriptive statistics, and modeling.  


In [19]:
# Make a safe working copy
df = df.copy()

# Object ID
df['CASEID'] = df['CASEID'].astype('object')

# Continuous floats
# Note: 'Age in 5-year groups' is an ordinal group code (1-7), kept numeric here
float_vars = [
    'Age in 5-year groups',
    'Birth weight in kilograms (3 decimals)',
    'Childs weight in kilograms (1 decimal)',
    'Childs height in centimeters (1 decimal)',
    'Months of breastfeeding',
    'When child put to breast',
    'Height/Age standard deviation (new WHO)',
    'Weight/Age standard deviation (new WHO)',
    'Weight/Height standard deviation (new WHO)'
]
df[float_vars] = df[float_vars].apply(pd.to_numeric, errors="coerce").astype('float64')

# Discrete ints (nullable safe type)
int_vars = [
    'Number of household members (listed)',
    'Birth order number',
    'Age at death (months, imputed)',
    'Sons who have died',
    'Daughters who have died',
    'Preceding birth interval (months)',
    'Succeeding birth interval (months)',
    'Duration of pregnancy in months',
    'Timing of 1st antenatal check (months)',
    'Number of antenatal visits during pregnancy',
    'Entries in pregnancy and postnatal care roster',
    'Minutes to nearest healthcare facility',
    'Number of tetanus injections before birth',
    'Number of tetanus injections before pregnancy'
]
df[int_vars] = df[int_vars].apply(pd.to_numeric, errors="coerce").astype('Int64')

# Categorical (nominal/ordinal)
cat_vars = [
    'Region',
    'Type of place of residence',
    'Highest educational level',
    'Religion',
    'Ethnicity',
    'Wealth index combined',
    'Type of cooking fuel (smoke exposure, indoor air pollution)',
    'Sex of child',
    'Place of delivery',
    'Size of child at birth',
    'Prenatal: doctor',
    'Prenatal: nurse/midwife/clinical officer',
    'Prenatal: traditional birth attendant',
    'Prenatal: Community health worker/field worker',
    'Has health card and or other vaccination document',
    'Received BCG',
    'Received POLIO 0',
    'Received POLIO 1',
    'Received POLIO 2',
    'Received POLIO 3',
    'Received MEASLES 1',
    'Received MEASLES 2',
    'Received inactivated polio (IPV)',
    'Received Pentavalent 1',
    'Received Pentavalent 2',
    'Received Pentavalent 3',
    'Received Pneumococcal 1',
    'Received Pneumococcal 2',
    'Received Pneumococcal 3',
    'Received Rotavirus 1',
    'Received Rotavirus 2',
    'Place where most vaccinations were received',
    'Yellow fever vaccine',
    'Currently breastfeeding',
    'Given child anything other than breast milk',
    'In contact with someone with cough or TB',
    'Source of drinking water',
    'Main floor material',
    'Visited health facility last 12 months',
    'Getting medical help for self: distance to health facility',
    'Mode of transportation to nearest healthcare facility',
    'Assistance: doctor',
    'Assistance: nurse/midwife/clinical officer',
    'Assistance: Relative/friend',
    'Assistance: traditional birth attendant'
]
df[cat_vars] = df[cat_vars].astype('category')



The dataset now has a **clearer structure**, with variables consistently assigned to:  
- **Identifiers (`object`)**  
- **Continuous measures (`float64`)**  
- **Discrete counts (`Int64`)**  
- **Categorical descriptors (`category`)**  

This refined typing lays the groundwork for **exploratory data analysis (EDA)**, ensuring that numerical summaries, group comparisons, and visualizations will be reliable and meaningful.  
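
The choice of nullable `Int64` for discrete counts can be illustrated with a small sketch. It assumes a toy series standing in for one of the count variables; the variable names `raw`, `as_nullable`, and `plain_cast_ok` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a count variable with one missing value
raw = pd.Series([3.0, np.nan, 5.0], name="Number of antenatal visits during pregnancy")

# Nullable Int64 keeps the missing value as <NA> and stores counts as integers.
as_nullable = pd.to_numeric(raw, errors="coerce").astype("Int64")

# Plain NumPy int64 has no missing-value marker, so the same cast fails.
try:
    raw.astype("int64")
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

print(as_nullable.dtype, int(as_nullable.isna().sum()), plain_cast_ok)
```

After the typing cell has run, `df.dtypes.astype(str).value_counts()` gives a quick tally of how many columns ended up in each type.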


### 2. Handling Special-Code Columns (Mixed Numeric + Special Codes)

In [20]:
# Convert "When child put to breast" codes to hours after birth
def clean_breastfeeding_time(x):
    if pd.isna(x):
        return np.nan
    if x == 0:              # immediately
        return 0
    if x == 100:            # within first hour
        return 0.5          # or keep as 0 to mean <1 hour
    if 101 <= x <= 198:     # hours: 101 = 1 hour
        return x - 100
    if 201 <= x <= 298:     # days: 201 = 1 day = 24 hours
        return (x - 200) * 24
    return np.nan           # 199, 299, and any unexpected code

df['When child put to breast'] = df['When child put to breast'].apply(clean_breastfeeding_time)


The variable **“When child put to breast”** has been cleaned to standardize its values:  
- `0` = immediately after birth  
- `0.5` = within the first hour  
- `1–98` = hours after birth  
- `24, 48, …` = converted days (multiplied by 24)  
- Special codes (`199, 299`) → set to `NaN`  

This ensures the variable is now measured consistently in **hours** and ready for descriptive statistics or modeling.  
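
The same recoding rules can also be written in vectorised form, which tends to be faster than a row-wise `apply` on large frames. The sketch below is an alternative under the mapping listed above, not the notebook's implementation; `breastfeeding_time_hours` and the sample `codes` are hypothetical.

```python
import numpy as np
import pandas as pd

def breastfeeding_time_hours(codes):
    """Vectorised conversion of the raw codes to hours after birth."""
    s = pd.to_numeric(codes, errors="coerce")
    x = s.to_numpy(dtype=float)
    conditions = [
        x == 0,                    # immediately
        x == 100,                  # within the first hour
        (x >= 101) & (x <= 198),   # hours: 101 -> 1 hour
        (x >= 201) & (x <= 298),   # days: 201 -> 24 hours
    ]
    choices = [0.0, 0.5, x - 100, (x - 200) * 24]
    # Anything not matched (199, 299, missing, unexpected codes) becomes NaN
    hours = np.select(conditions, choices, default=np.nan)
    return pd.Series(hours, index=s.index, name=s.name)

codes = pd.Series([0, 100, 103, 202, 199, 299, np.nan])
hours = breastfeeding_time_hours(codes)
print(hours.tolist())
```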


In [21]:
def clean_months_breastfeeding(x):
    if pd.isna(x):
        return np.nan
    if 0 <= x <= 59:       # actual months
        return x
    if x == 93:            # ever breastfed, not currently
        return np.nan      # or 60 if you want to flag “>59”
    if x == 94:            # never breastfed
        return 0           # or np.nan if you prefer to handle separately
    return np.nan          # catch any unexpected value

df['Months of breastfeeding'] = df['Months of breastfeeding'].apply(clean_months_breastfeeding)


For **“Months of breastfeeding”**:  
- Valid responses between `0–59` are retained as months.  
- `93` (ever breastfed, not currently) → recoded as `NaN`  
- `94` (never breastfed) → recoded as `0`  
- Any unexpected codes → `NaN`  

This provides a continuous measure of breastfeeding duration, while appropriately handling outliers and special DHS codes.  
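
For comparison, the month recoding can be expressed with `Series.where` and `Series.mask` instead of a row-wise function. This is a sketch under the same rules as above (valid months kept, `94` set to `0`, everything else `NaN`); `months_breastfed` and the sample codes are hypothetical.

```python
import numpy as np
import pandas as pd

def months_breastfed(codes):
    """Keep 0-59 as months, map 94 (never breastfed) to 0, else NaN."""
    x = pd.to_numeric(codes, errors="coerce")
    months = x.where(x.between(0, 59))   # keep valid months, everything else -> NaN
    return months.mask(x == 94, 0)       # never breastfed -> 0 months

sample = pd.Series([12, 93, 94, 95, np.nan, 0])
cleaned = months_breastfed(sample)
print(cleaned.tolist())
```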


In [22]:
def clean_height_age_z(x):
    if pd.isna(x):
        return np.nan
    if -600 <= x <= 600:        # valid range, stored x100
        return x / 100.0
    return np.nan               # any unexpected code

df['Height/Age standard deviation (new WHO)'] = df['Height/Age standard deviation (new WHO)'].apply(clean_height_age_z)


In [23]:
def clean_who_z(x, lower=-600, upper=500):
    """Convert WHO z-score×100 to float and replace special codes with NaN."""
    if pd.isna(x):
        return np.nan
    if lower <= x <= upper:      # valid numeric value
        return x / 100.0
    return np.nan               # any other unexpected code


In [24]:
df['Weight/Age standard deviation (new WHO)'] = df['Weight/Age standard deviation (new WHO)'].apply(clean_who_z)
df['Weight/Height standard deviation (new WHO)'] = df['Weight/Height standard deviation (new WHO)'].apply(clean_who_z)


The WHO anthropometric z-scores were rescaled and cleaned:  
- **Height-for-Age Z (HAZ):** Stored in the survey file as z-score × 100; values within ±600 are divided by 100 to restore the standard scale.  
- **Weight-for-Age Z (WAZ) & Weight-for-Height Z (WHZ):** Rescaled from ×100 in the same way, keeping values between −600 and 500.  
- Out-of-range values and special codes are replaced with `NaN`.  

With these corrections, the z-scores are now expressed in their conventional format (e.g., `-2.0` = 2 SD below the median) and can be directly compared against WHO reference standards.  
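
The three z-score cleaners could be consolidated into one helper with per-indicator limits. The sketch below is an assumption-laden alternative: the limits in `Z_LIMITS` follow commonly cited WHO plausibility ranges (±6 for HAZ, −6 to +5 for WAZ, ±5 for WHZ), which differ slightly from the bounds used in the cells above, and `rescale_z`, `Z_LIMITS`, and the toy values (including `9998` as a stand-in special code) are illustrative only.

```python
import pandas as pd

# Assumed plausibility limits in z-score units; adjust to the analysis plan.
Z_LIMITS = {
    "Height/Age standard deviation (new WHO)": (-6.0, 6.0),
    "Weight/Age standard deviation (new WHO)": (-6.0, 5.0),
    "Weight/Height standard deviation (new WHO)": (-5.0, 5.0),
}

def rescale_z(series, lower, upper):
    """Divide stored z-score x100 by 100 and blank out implausible values."""
    z = pd.to_numeric(series, errors="coerce") / 100.0
    return z.where(z.between(lower, upper))

# Toy data (hypothetical), stored as z-score x100
demo = pd.DataFrame({
    "Height/Age standard deviation (new WHO)": [-215, 9998, 130],
    "Weight/Age standard deviation (new WHO)": [-120, 560, 45],
    "Weight/Height standard deviation (new WHO)": [-40, -580, 310],
})
for col, (lo, hi) in Z_LIMITS.items():
    demo[col] = rescale_z(demo[col], lo, hi)
print(demo)
```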


In [25]:
df.head()

Unnamed: 0,CASEID,Age in 5-year groups,Region,Type of place of residence,Highest educational level,Religion,Ethnicity,Number of household members (listed),Wealth index combined,"Type of cooking fuel (smoke exposure, indoor air pollution)",...,Entries in pregnancy and postnatal care roster,Getting medical help for self: distance to health facility,Minutes to nearest healthcare facility,Mode of transportation to nearest healthcare facility,Number of tetanus injections before birth,Number of tetanus injections before pregnancy,Assistance: doctor,Assistance: nurse/midwife/clinical officer,Assistance: traditional birth attendant,Assistance: Relative/friend
0,1 4 2,4.0,1.0,1.0,0.0,7.0,11.0,6,4.0,2.0,...,1,1.0,30.0,12.0,1.0,0.0,1.0,0.0,0.0,0.0
1,1 13 2,5.0,1.0,1.0,2.0,1.0,3.0,8,5.0,2.0,...,0,2.0,5.0,24.0,,,,,,
2,1 26 2,3.0,1.0,1.0,2.0,3.0,3.0,5,4.0,2.0,...,0,2.0,5.0,24.0,,,,,,
3,1 42 1,4.0,1.0,1.0,2.0,4.0,3.0,3,5.0,2.0,...,0,,,,,,,,,
4,1 55 2,4.0,1.0,1.0,2.0,2.0,3.0,4,5.0,2.0,...,2,,,,1.0,2.0,1.0,1.0,0.0,0.0


### 3. Create New columns

In [26]:
# --- Under-5 mortality (death before 60 months) ---
df['under5_mortality'] = 0
df.loc[
    (df['Child is alive'] == 0) & 
    (df['Age at death (months, imputed)'] < 60),
    'under5_mortality'
] = 1

# --- Infant mortality (death before 12 months, includes neonatal deaths) ---
df['infant_mortality'] = 0
df.loc[
    (df['Child is alive'] == 0) & 
    (df['Age at death (months, imputed)'] < 12),
    'infant_mortality'
] = 1

# --- Neonatal mortality (death before 1 month) ---
df['neonatal_mortality'] = 0
df.loc[
    (df['Child is alive'] == 0) & 
    (df['Age at death (months, imputed)'] < 1),
    'neonatal_mortality'
] = 1


Together, these three mortality indicators form a nested hierarchy often used in public health:  

- **Neonatal Mortality** ⟶ deaths occurring before 1 month.  
- **Infant Mortality** ⟶ deaths before 12 months, which includes neonatal deaths.  
- **Under-5 Mortality** ⟶ deaths before 60 months, which includes both infant and neonatal deaths.  

This structure allows for granular analysis across different stages of early childhood while maintaining comparability with standard **SDG (Sustainable Development Goal) indicators**.  
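
The nesting described above (neonatal within infant within under-5) can be verified with a quick consistency check. The sketch below builds the three flags on toy data using the nested definitions from the list; the column names `alive` and `age_at_death` are shortened, hypothetical stand-ins for the notebook's variables.

```python
import pandas as pd

# Toy data (hypothetical): one living child and four deaths at 0, 7, 30, 72 months
demo = pd.DataFrame({
    "alive": [1, 0, 0, 0, 0],
    "age_at_death": pd.array([pd.NA, 0, 7, 30, 72], dtype="Int64"),
})
died = demo["alive"] == 0
age = demo["age_at_death"]

# Missing ages compare as <NA>; treat them as "not in this window"
neonatal = (died & (age < 1)).fillna(False).astype(int)
infant = (died & (age < 12)).fillna(False).astype(int)
under5 = (died & (age < 60)).fillna(False).astype(int)

# Every neonatal death must also be an infant death, and so on
is_nested = bool((neonatal <= infant).all() and (infant <= under5).all())
print(is_nested)
```

Running the same two comparisons on the real flag columns is a cheap guard against definitions drifting apart.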


### Creating an Age-at-Death Classification  

To make analysis easier, we group all children into mutually exclusive mortality categories.  
This categorical variable summarizes survival outcomes into standard public health definitions:  

- **Neonatal:** Died before 1 month.  
- **Infant:** Died between 1–11 months.  
- **Child:** Died between 12–59 months.  
- **Alive or 5+:** Survived beyond 5 years or still alive at survey time.  

This structured classification makes it possible to conduct comparisons across regions, socioeconomic groups, or risk factors.  


In [27]:
#age-at-death classification
df['mortality_category'] = 'Alive or 5+'
df.loc[(df['Child is alive'] == 0) & (df['Age at death (months, imputed)'] < 1), 'mortality_category'] = 'Neonatal'
df.loc[(df['Child is alive'] == 0) & (df['Age at death (months, imputed)'] >= 1) & (df['Age at death (months, imputed)'] < 12), 'mortality_category'] = 'Infant'
df.loc[(df['Child is alive'] == 0) & (df['Age at death (months, imputed)'] >= 12) & (df['Age at death (months, imputed)'] < 60), 'mortality_category'] = 'Child'


The new variable `mortality_category` provides a **single outcome column** for stratification and visualization.  
Instead of analyzing three separate binary indicators, we now have one categorical variable that can be directly used in:  

- **Frequency tables** (e.g., mortality distribution).  
- **Cross-tabulations** with predictors (e.g., wealth, education, region).  
- **Modeling** as either a categorical outcome (multinomial) or recoded into binary outcomes depending on the research question.  
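
As an illustration of these uses, the sketch below derives the same four categories with `np.select` and cross-tabulates them against a predictor. It is a toy example: `mortality_category` as a function, the shortened column names, and the `wealth` values are hypothetical.

```python
import numpy as np
import pandas as pd

def mortality_category(alive, age_months):
    """Assign mutually exclusive categories from survival status and age at death."""
    died = (alive == 0).to_numpy()
    age = pd.Series(age_months).astype("float64").to_numpy()  # missing -> NaN
    conditions = [
        died & (age < 1),
        died & (age >= 1) & (age < 12),
        died & (age >= 12) & (age < 60),
    ]
    labels = ["Neonatal", "Infant", "Child"]
    out = np.select(conditions, labels, default="Alive or 5+")
    return pd.Series(out, index=alive.index)

# Toy data (hypothetical)
demo = pd.DataFrame({
    "alive": [1, 0, 0, 0, 0],
    "age_at_death": [np.nan, 0, 7, 30, 72],
    "wealth": [1, 1, 2, 2, 3],
})
demo["mortality_category"] = mortality_category(demo["alive"], demo["age_at_death"])

# Cross-tabulation of the outcome against a predictor
table = pd.crosstab(demo["wealth"], demo["mortality_category"])
print(table)
```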


### Creating a Binary Flag for Child Death History  

To capture whether a mother has ever experienced the loss of a child, we create the variable `child_death_history`.  
This binary indicator takes the value:  

- **1:** If the mother reported at least one son or daughter who has died.  
- **0:** If no child death was reported.  

This feature provides important contextual information, as maternal history of child mortality can be associated with socioeconomic, health, or environmental vulnerabilities.  


In [28]:
# Create a binary flag for child death history
df["child_death_history"] = (
    ((df["Sons who have died"] > 0) | (df["Daughters who have died"] > 0))
    .astype(int))


The new variable `child_death_history` simplifies analysis of maternal experiences with child loss.  
It can be used to:  

- **Stratify risk factors** (e.g., by wealth index, education, or region).  
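- **Assess family-level clustering of risk**, since families with prior child deaths may face higher subsequent risks.  
- **Model mortality likelihood** with this history as a potential predictor, keeping in mind that the mother's totals also count the index child's own death, so the flag can leak the outcome unless that death is excluded.  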

By consolidating information from both sons and daughters, we avoid redundancy and ensure a consistent definition of child death history.  
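
If either count ever contains missing values, comparing nullable integers yields `<NA>` and the cast to `int` fails. A defensive variant is sketched below; treating a missing count as zero is an assumption made for illustration, and the toy frame is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical) with missing counts in both columns
demo = pd.DataFrame({
    "Sons who have died": pd.array([0, 1, pd.NA, 0], dtype="Int64"),
    "Daughters who have died": pd.array([0, 0, 2, pd.NA], dtype="Int64"),
})

# Assumption: a missing count is treated as zero reported deaths
sons = demo["Sons who have died"].fillna(0)
daughters = demo["Daughters who have died"].fillna(0)

demo["child_death_history"] = ((sons + daughters) > 0).astype(int)
print(demo["child_death_history"].tolist())
```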


### Differentiating Between Skilled and Unskilled Medical Help
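To assess the quality of care a pregnant woman receives, we use the prenatal and intrapartum (delivery) variables to derive new skilled-versus-unskilled columns. A woman helped by a doctor, nurse, midwife, or clinical officer is coded as skilled (`1`). One helped only by a traditional birth attendant, a community health worker/field worker (prenatal), or a relative/friend (delivery) is coded as unskilled (`0`). Records where every provider variable is missing stay `NaN`, and records with none of these providers reported are labelled `'Unknown'`.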

In [29]:
def classify_prenatal(row):
    # Check if all relevant prenatal columns are missing
    if pd.isna(row['Prenatal: doctor']) and pd.isna(row['Prenatal: nurse/midwife/clinical officer']) and pd.isna(row['Prenatal: traditional birth attendant']) and pd.isna(row['Prenatal: Community health worker/field worker']):
        return np.nan
    elif row.get('Prenatal: doctor', 0) == 1 or row.get('Prenatal: nurse/midwife/clinical officer', 0) == 1:
        return 1 #skilled
    elif row.get('Prenatal: traditional birth attendant', 0) == 1 or row.get('Prenatal: Community health worker/field worker', 0) == 1:
        return 0 #unskilled
    else:
        return 'Unknown'

def classify_delivery(row):
    # Check if all relevant delivery columns are missing
    if pd.isna(row['Assistance: doctor']) and pd.isna(row['Assistance: nurse/midwife/clinical officer']) and pd.isna(row['Assistance: traditional birth attendant']) and pd.isna(row['Assistance: Relative/friend']):
        return np.nan
    elif row.get('Assistance: doctor', 0) == 1 or row.get('Assistance: nurse/midwife/clinical officer', 0) == 1:
        return 1 #skilled
    elif row.get('Assistance: traditional birth attendant', 0) == 1 or row.get('Assistance: Relative/friend', 0) == 1:
        return 0 #unskilled
    else:
        return 'Unknown'

# Apply to DataFrame
df['prenatal_help'] = df.apply(classify_prenatal, axis=1)
df['delivery_help'] = df.apply(classify_delivery, axis=1)
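
The same classification can be written once and reused for both stages. The sketch below is a vectorised alternative, not the notebook's code: `classify_help` is a hypothetical helper, it assumes the provider columns are coded 0/1, and it returns string labels instead of the mixed `1`/`0`/`'Unknown'` values so the resulting category has a single type.

```python
import numpy as np
import pandas as pd

def classify_help(df, skilled_cols, unskilled_cols):
    """Label each row Skilled / Unskilled / Unknown, or NaN if all inputs are missing."""
    providers = df[skilled_cols + unskilled_cols]
    all_missing = providers.isna().all(axis=1).to_numpy()
    skilled = (providers[skilled_cols] == 1).any(axis=1).to_numpy()
    unskilled = (providers[unskilled_cols] == 1).any(axis=1).to_numpy()
    # Skilled takes precedence when both kinds of provider were present
    labels = np.select([skilled, unskilled], ["Skilled", "Unskilled"], default="Unknown")
    return pd.Series(labels, index=df.index).mask(all_missing)

# Toy data (hypothetical), one row per case: skilled, unskilled, all missing, none reported
demo = pd.DataFrame({
    "Assistance: doctor": [1, 0, np.nan, 0],
    "Assistance: nurse/midwife/clinical officer": [0, 0, np.nan, 0],
    "Assistance: traditional birth attendant": [0, 1, np.nan, 0],
    "Assistance: Relative/friend": [0, 0, np.nan, 0],
})
delivery = classify_help(
    demo,
    ["Assistance: doctor", "Assistance: nurse/midwife/clinical officer"],
    ["Assistance: traditional birth attendant", "Assistance: Relative/friend"],
)
print(delivery.tolist())
```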


In [30]:
#convert the newly created columns into a categorical data type variable
cat_vars_labor = [
    'prenatal_help',
    'delivery_help'
]

df[cat_vars_labor] = df[cat_vars_labor].astype('category')

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19530 entries, 0 to 19529
Data columns (total 79 columns):
 #   Column                                                       Non-Null Count  Dtype   
---  ------                                                       --------------  -----   
 0   CASEID                                                       19530 non-null  object  
 1   Age in 5-year groups                                         19530 non-null  float64 
 2   Region                                                       19530 non-null  category
 3   Type of place of residence                                   19530 non-null  category
 4   Highest educational level                                    19530 non-null  category
 5   Religion                                                     19530 non-null  category
 6   Ethnicity                                                    19530 non-null  category
 7   Number of household members (listed)                         19530 

### Dropping Redundant or Non-Essential Columns  

Some variables in the dataset are either redundant (already captured by newly engineered features) or no longer necessary for subsequent analysis.  
To streamline the dataset and reduce noise, we safely drop the following columns:  

- **Ever had vaccination** – information now summarized through individual vaccine indicators.  
- **Child is alive** – replaced by mortality category and age-specific mortality flags.  
- **Sons who have died / Daughters who have died** – consolidated into the `child_death_history` binary variable.  
- **Child’s age in months** – not required once mortality windows are defined.  
- **Age at death (months, imputed)** – already embedded in derived mortality classifications.  
- **Prenatal: doctor** – falls under skilled labor (`prenatal_help`).  
- **Prenatal: nurse/midwife/clinical officer** – falls under skilled labor (`prenatal_help`).  
- **Prenatal: traditional birth attendant** – falls under unskilled labor (`prenatal_help`).  
- **Prenatal: Community health worker/field worker** – falls under unskilled labor (`prenatal_help`).  
- **Assistance: doctor** – falls under skilled labor (`delivery_help`).  
- **Assistance: nurse/midwife/clinical officer** – falls under skilled labor (`delivery_help`).  
- **Assistance: traditional birth attendant** – falls under unskilled labor (`delivery_help`).  
- **Assistance: Relative/friend** – falls under unskilled labor (`delivery_help`).  


We implement a custom `safe_drop()` function to ensure robustness:  
- Ignores duplicates in the list.  
- Ignores variables that are not present in the DataFrame.  


In [32]:
# define a helper that drops redundant columns, skipping duplicates and names not in df
def safe_drop(df, cols_to_drop):
    """
    Drop columns from a DataFrame safely.
    - Ignores duplicates in the list.
    - Ignores columns that don't exist in df.
    """
    # Ensure unique names
    cols_to_drop = list(set(cols_to_drop))
    
    # Keep only those actually in the dataframe
    cols_to_drop = [col for col in cols_to_drop if col in df.columns]
    
    # Drop them
    return df.drop(columns=cols_to_drop)

cols = [
    'Child is alive',
    'Sons who have died',
    'Daughters who have died',
    'Childs age in months',
    'Age at death (months, imputed)',
    'Prenatal: doctor',
    'Prenatal: nurse/midwife/clinical officer',
    'Prenatal: traditional birth attendant',
    'Prenatal: Community health worker/field worker',
    'Assistance: doctor',
    'Assistance: nurse/midwife/clinical officer',
    'Assistance: traditional birth attendant',
    'Assistance: Relative/friend']

df = safe_drop(df, cols)

The redundant variables have been removed from the DataFrame, while the derived features that replace them are preserved.  
Downstream analyses and models therefore operate on a leaner dataset, with less collinearity between raw and derived variables and lower computational overhead.  
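
Because `safe_drop()` silently skips names that are not in the DataFrame, a misspelled column name (for example, an apostrophe mismatch) would go unnoticed. The sketch below, on a toy DataFrame, shows a variant that also reports which requested names matched nothing. The helper name `drop_and_report` and the toy columns are hypothetical; the sketch relies on the `errors="ignore"` option of pandas' `DataFrame.drop`.

```python
import pandas as pd

def drop_and_report(df, cols_to_drop):
    """Drop the requested columns and also return the names that were not found.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    requested = list(dict.fromkeys(cols_to_drop))           # de-duplicate, keep order
    not_found = [c for c in requested if c not in df.columns]
    # errors="ignore" makes DataFrame.drop skip labels that do not exist
    return df.drop(columns=requested, errors="ignore"), not_found

# Toy data standing in for the survey DataFrame
toy = pd.DataFrame({"Child is alive": [1, 0], "Region": ["A", "B"]})
toy_clean, skipped = drop_and_report(
    toy, ["Child is alive", "Child is alive", "Childs age in months"]
)
print(list(toy_clean.columns))  # remaining columns
print(skipped)                  # requested names that matched no column
```

Checking the returned `skipped` list after each drop makes it easy to confirm that every intended column was actually removed.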


### Cleaning Birth Weight Values  

The variable **"Birth weight in kilograms (3 decimals)"** was originally stored with mixed units:  
- Most values are already in kilograms.  
- Some extreme values (greater than 10) are likely stored in grams and need to be rescaled.  

To standardize this variable, we apply a transformation:  
- If the value is greater than 10, divide it by 1000 to convert grams to kilograms.  
- Otherwise, retain the original value.  
This ensures all measurements are consistently expressed in kilograms.  


In [33]:
col = 'Birth weight in kilograms (3 decimals)'

df[col] = df[col].apply(
    lambda x: x / 1000 if pd.notnull(x) and x > 10 else x
)


The **birth weight** column has been successfully standardized to kilograms across all observations.  
This correction eliminates inconsistencies in units and prevents erroneous interpretations of infant size at birth in downstream analyses.  
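
One caveat: the `> 10` rule also rescales any placeholder code stored in this column. If the raw column still contains DHS-style codes such as 9996 or 9998 (an assumption; the recode documentation for this variable should be checked), dividing by 1000 turns them into 9.996 and 9.998, which then look like valid weights and are no longer caught by a later placeholder replacement. A minimal sketch on toy values that masks such codes before rescaling; the list `ASSUMED_CODES` and the function name are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed placeholder codes for this column; verify against the survey codebook.
ASSUMED_CODES = [9996, 9998, 9999]

def standardize_birth_weight(s, codes=ASSUMED_CODES):
    """Mask placeholder codes, then convert gram-scale values (> 10) to kilograms."""
    s = s.astype("float64")
    s = s.mask(s.isin(codes))             # placeholder codes -> NaN
    # NaN > 10 evaluates to False, so missing values stay missing
    return s.where(~(s > 10), s / 1000)

toy = pd.Series([3.2, 2850.0, 9998.0, np.nan])
result = standardize_birth_weight(toy)
print(result.tolist())
```

In this sketch the gram-scale value is rescaled, while the assumed placeholder code becomes `NaN` instead of a plausible-looking weight.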


### 4. Handling missing values

### Assessing Missing Data  

Before conducting further analysis, it is essential to examine the extent of missingness in the dataset.  
High proportions of missing values can bias results, reduce statistical power, or require imputation or exclusion.  

The following code computes:  
- The **count of missing values** per column,  
- The **percentage of missingness**, and  
- Ranks variables by missingness (highest first).  

We then display the **top 15 variables** with the largest proportion of missing data.  


In [34]:
# Calculate missing values per column
missing_data = (
    df.isnull().sum()  # count of missing values per column
    .to_frame("missing_count")  # convert to DataFrame
    .assign(missing_pct=lambda x: 100 * x["missing_count"] / len(df))  # calculate percentage
    .sort_values("missing_pct", ascending=False)  # sort descending by missing %
)

# Display the top 15 columns with the most missing values
missing_data.head(15)

| Variable | missing_count | missing_pct |
|---|---|---|
| Yellow fever vaccine | 16686 | 85.437788 |
| Ever had vaccination | 16176 | 82.826421 |
| Succeeding birth interval (months) | 14680 | 75.166411 |
| Given child anything other than breast milk | 14211 | 72.764977 |
| When child put to breast | 13571 | 69.487967 |
| Size of child at birth | 13382 | 68.520225 |
| Birth weight in kilograms (3 decimals) | 13382 | 68.520225 |
| Number of tetanus injections before pregnancy | 12725 | 65.15617 |
| Months of breastfeeding | 12410 | 63.543267 |
| Timing of 1st antenatal check (months) | 9495 | 48.617512 |


### Handling Special Missing Values  

Survey datasets like DHS often use **placeholder codes** (e.g., 99, 9999, 97, 998) to indicate missing or "not applicable" responses instead of true blanks.  
These values must be converted to proper `NaN` to avoid being misinterpreted as valid numeric entries during analysis.  

The function below standardizes missing values by:  
1. **Replacing placeholder codes** (e.g., 99, 9999, 997, 998) with `NaN` across `float64` and nullable `Int64` columns (plain `int64` columns are not scanned).  
2. Creating a clean foundation for further imputation or missingness flagging.  


In [35]:

def handle_missing_values(df):
    """
    Replace special placeholder codes (e.g. 99, 9999) with NaN in float64 and
    nullable Int64 columns. Imputation and missingness flags are handled in a
    later step.
    """

    # Placeholder codes used for missing / not-applicable responses
    special_missing_values = [
        99, 999, 9999, 99999, 999999, 9999999,
        98, 998, 9998, 99998,
        97, 997, 9997, 99997,
        96, 996, 9996, 99996
    ]
    # Only float64 and Int64 columns are scanned; plain int64 columns are left as-is
    for col in df.select_dtypes(include=["float64", "Int64"]).columns:
        df[col] = df[col].replace(special_missing_values, np.nan)

    return df


In [36]:
df = handle_missing_values(df)
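
A blanket replacement treats every 96 to 99 in a scanned column as missing, which could also remove legitimate values in variables whose valid range includes them (for example, an interval measured in months). A more conservative variant, sketched here on toy data, restricts each code list to the columns where it applies. The column names and code lists are illustrative assumptions, not taken from the KDHS codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column placeholder codes; fill in from the survey codebook.
codes_by_column = {
    "interval_months": [998, 999],   # 96-99 are treated as plausible values here
    "visits": [98, 99],
}

def replace_codes(df, codes_by_column):
    """Return a copy of df with column-specific placeholder codes set to NaN."""
    out = df.copy()
    for col, codes in codes_by_column.items():
        if col in out.columns:
            out[col] = out[col].astype("float64").replace(codes, np.nan)
    return out

toy = pd.DataFrame({
    "interval_months": [24.0, 98.0, 999.0],
    "visits": [4.0, 98.0, 2.0],
})
cleaned = replace_codes(toy, codes_by_column)
print(cleaned)
```

Here the value 98 survives in `interval_months` but is treated as missing in `visits`, which is the distinction a single global code list cannot make.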


### Imputation Strategy  

After standardizing special missing codes into `NaN`, we now handle missing data systematically:  

1. **Numeric columns**  
   - Columns with >20% missing values:  
     → Add a binary `_missing` flag (1 = missing, 0 = observed).  
   - Imputation: Replace missing values with the **median** (robust to outliers).  

2. **Categorical columns**  
   - Columns with >20% missing values:  
     → Add a binary `_missing` flag.  
     → Expand categories to include `"Missing"` if not already present.  
   - Imputation: Replace missing values with the **mode** (most common category).  
   - If a column has no valid mode (all missing):  
     → Fill with `"Unknown"`.  

This approach preserves potential **predictive power of missingness** while ensuring that all variables remain usable in downstream models.  


In [37]:
numeric_cols = df.select_dtypes(include=['float64','int64']).columns
categorical_cols = df.select_dtypes(include='category').columns

# --- Numeric columns: median impute + flag columns with >20% missing
for col in numeric_cols:
    if df[col].isna().mean() > 0.2:
        df[col + '_missing'] = df[col].isna().astype(int)   # flag
    df[col] = df[col].fillna(df[col].median())             # assign back!

# --- Categorical columns: mode impute + flag for >20% missing
for col in categorical_cols:
    if df[col].isna().mean() > 0.2:
        df[col + '_missing'] = df[col].isna().astype(int)   # flag
        # add "Missing" category so it's safe to fill with it if needed later
        if 'Missing' not in df[col].cat.categories:
            df[col] = df[col].cat.add_categories('Missing')
    mode_val = df[col].mode(dropna=True)
    if not mode_val.empty:
        df[col] = df[col].fillna(mode_val[0])
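    else:
        # no mode (column is entirely NaN): register 'Unknown' as a category, then fill
        if 'Unknown' not in df[col].cat.categories:
            df[col] = df[col].cat.add_categories('Unknown')
        df[col] = df[col].fillna('Unknown')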
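
The order of operations matters in this strategy: the `_missing` flag has to be computed before `fillna`, otherwise the record of which rows were missing is lost. A minimal sketch of the same rule as a reusable function on a toy DataFrame; the name `impute_with_flags` and its `threshold` and `fallback` parameters are illustrative. The sketch registers the fill label as a category before filling, since assigning an unregistered label to a categorical column raises an error.

```python
import numpy as np
import pandas as pd

def impute_with_flags(df, threshold=0.2, fallback="Unknown"):
    """Median/mode impute; add <col>_missing flags where the missing share exceeds threshold."""
    out = df.copy()
    for col in df.columns:
        s = out[col]
        is_cat = isinstance(s.dtype, pd.CategoricalDtype)
        if not (is_cat or pd.api.types.is_numeric_dtype(s)):
            continue                                      # leave other dtypes untouched
        if s.isna().mean() > threshold:
            out[col + "_missing"] = s.isna().astype(int)  # flag before filling
        if is_cat:
            mode = s.mode(dropna=True)
            fill = mode.iloc[0] if not mode.empty else fallback
            if fill not in s.cat.categories:
                s = s.cat.add_categories(fill)            # register the label first
            out[col] = s.fillna(fill)
        else:
            out[col] = s.fillna(s.median())
    return out

# Toy data: one numeric and one categorical column with gaps
toy = pd.DataFrame({
    "weight": [3.0, np.nan, 2.5, np.nan],
    "place": pd.Categorical(["home", None, "facility", "facility"]),
})
imputed = impute_with_flags(toy)
print(imputed)
```

Returning a copy rather than modifying the input keeps the raw frame available for comparing distributions before and after imputation.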

In [38]:
df.isnull().sum().sort_values(ascending=False)



CASEID                                                                0
Age in 5-year groups                                                  0
Region                                                                0
Type of place of residence                                            0
Highest educational level                                             0
                                                                     ..
Visited health facility last 12 months_missing                        0
Getting medical help for self: distance to health facility_missing    0
Mode of transportation to nearest healthcare facility_missing         0
prenatal_help_missing                                                 0
delivery_help_missing                                                 0
Length: 104, dtype: int64

### Results of Missing Value Handling  

- All original variables have now been **fully imputed** (no remaining `NaN`).  
- A total of 104 columns are present in the dataset, including:  
  - **Original variables** (numeric + categorical).  
  - **Missingness indicator columns** (e.g., `Mode of transportation to nearest healthcare facility_missing`),  
    which flag whether the original value was missing before imputation.  


### Save Cleaned Dataset  

The final cleaned dataset is now ready for downstream analysis and modeling.  
- All missing values have been handled (via imputation + missingness indicators).  
- Derived features have been created (e.g., `mortality_category`, `child_death_history`).  
- Irrelevant or redundant columns have been dropped.  

We save the resulting dataset as **`u5mr_clean.csv`**, which contains:  
- **104 variables** (original, engineered, and missingness indicators).  
- **19,530 rows** corresponding to individual survey records.  

This file serves as the **master dataset** for subsequent exploration and modeling steps.


In [39]:
#Save to CSV
df.to_csv("u5mr_clean.csv", index=False)

print(f"Saved essential subset with {df.shape[1]} variables and {df.shape[0]} rows.")

Saved essential subset with 104 variables and 19530 rows.
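
CSV does not store pandas dtypes, so `category` and nullable `Int64` columns come back as plain string/object and integer columns when the file is reloaded. A sketch of one way to restore them on load, using a small stand-in frame and a temporary file; recording the dtypes in a dictionary (`dtype_map`) is an illustrative choice, not a step from the notebook.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for the cleaned DataFrame (toy values)
toy = pd.DataFrame({
    "Region": pd.Categorical(["Nairobi", "Coast", "Nairobi"]),
    "births": pd.array([1, 2, 3], dtype="Int64"),
})

# Record the dtype of each column before writing
dtype_map = {
    col: ("category" if isinstance(dt, pd.CategoricalDtype) else str(dt))
    for col, dt in toy.dtypes.items()
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    toy.to_csv(path, index=False)
    plain = pd.read_csv(path)                      # dtypes inferred from the text
    restored = pd.read_csv(path, dtype=dtype_map)  # dtypes restored explicitly

print(plain.dtypes.to_dict())
print(restored.dtypes.to_dict())
```

The same idea would apply to `u5mr_clean.csv`: keeping a dtype mapping alongside the file lets later notebooks reload the data with the same column types.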


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19530 entries, 0 to 19529
Columns: 104 entries, CASEID to delivery_help_missing
dtypes: Int64(11), category(39), float64(10), int64(42), object(2)
memory usage: 10.6+ MB
