# ANEMIA LEVEL PREDICTION IN CHILDREN

# MODEL APPROACH: CRISP-DM

## COLLABORATORS:

1. Ruth Kitasi
2. Agatha Nyambati
3. Joseline Apiyo
4. Cecilia Ngunjiri
5. John Mbego
6. Leonard Koyio


<img src="anemic_child.jpg" width="800">


# 1. BUSINESS UNDERSTANDING

## 1.1 Introduction

Anemia is a major public health concern globally, and its effects are especially pronounced among children under five, where it poses risks to growth, cognitive development, and immune health.
Defined as a condition characterized by lower-than-normal hemoglobin (Hb) levels, anemia limits oxygen transport in the blood, which can lead to significant health complications. The World Health Organization estimates that 42% of children worldwide in this age group are anemic, predominantly due to iron deficiency, poor diet, and infections. Anemia in young children can lead to developmental delays, weakened immunity, and long-term health issues if left untreated. Socioeconomic factors, including access to quality food and healthcare, play a significant role in these high rates, with the most severe impacts observed in low- and middle-income countries.

The African continent bears a disproportionate burden of anemia among children, with an estimated 60-80% prevalence in children under five. Causes include malnutrition, high prevalence of infectious diseases (like malaria and hookworm), and poor maternal health. Socioeconomic factors such as poverty, lack of healthcare, and limited access to nutritious food exacerbate the condition. Sub-Saharan Africa, in particular, faces a high prevalence due to ongoing challenges with food security, healthcare infrastructure, and education around health practices.

Nigeria represents one of the highest anemia burdens in Africa, with an estimated 68% of children under five affected. Factors contributing to anemia in Nigerian children include limited access to nutritious foods, recurrent infections, and inadequate healthcare, especially in rural and low-income areas. Socioeconomic determinants, such as low household income, education levels, and lack of access to clean water, also contribute to anemia's prevalence among Nigerian children, impacting their physical and cognitive development. Addressing anemia in Nigeria is essential, given its long-term effects on individual health and its broader implications for national development.

## 1.2 Problem Statement


Despite the high prevalence of anemia among young children in Nigeria, there remains a significant gap in understanding the socioeconomic determinants that influence its severity. Existing studies have largely focused on nutritional and clinical factors, but the role of underlying social and economic variables—such as parental education, household income, access to healthcare, and living conditions—has not been thoroughly explored. These factors are critical, as they often dictate access to essential resources like nutritious foods, clean water, and preventive health services.

This project seeks to address this gap by developing a classification model to predict the severity of anemia in Nigerian children up to 56 months old. The model will analyze key socioeconomic indicators to determine which factors most significantly contribute to the risk and severity of anemia in this vulnerable population. By identifying these primary contributors, the model aims to empower healthcare professionals, public health organizations, and policymakers with actionable insights. The ultimate goal is to enable more targeted and effective interventions that address the socioeconomic roots of anemia.

*Citations*
1. https://iris.who.int/bitstream/handle/10665/85839/WHO_NMH_NHD_MNM_11.1_eng.pdf?sequence=22
2. https://cdn.who.int/media/docs/default-source/2021-dha-docs/ida_assessment_prevention_control.pdf

## 1.3 Formulating Hypotheses

- Inadequate access to healthcare facilities is associated with higher levels of anemia among children living in rural areas of Nigeria.

- Children exposed to secondhand smoke may experience detrimental health effects, including respiratory problems and compromised nutritional absorption, which may in turn raise their risk of anemia.

- The level of education and health awareness among parents is inversely related to the prevalence of anemia in their children.
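
A first look at hypotheses like these can be taken with a contingency table of anemia level against a candidate factor, together with a chi-square statistic. The sketch below is illustrative only: the toy rows are made up, and pairing `Type of place of residence` with `Anemia level.1` is an assumption for the example, not the analysis used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these rows are NOT from the NDHS dataset.
toy = pd.DataFrame({
    "Type of place of residence": [
        "Urban", "Urban", "Urban", "Urban",
        "Rural", "Rural", "Rural", "Rural",
    ],
    "Anemia level.1": [
        "Not anemic", "Not anemic", "Not anemic", "Mild",
        "Moderate", "Mild", "Moderate", "Not anemic",
    ],
})

# Observed counts of each anemia level within each residence type
observed = pd.crosstab(toy["Type of place of residence"], toy["Anemia level.1"])

# Row-wise proportions make the two groups directly comparable
proportions = pd.crosstab(
    toy["Type of place of residence"], toy["Anemia level.1"], normalize="index"
)

# Chi-square statistic computed by hand: sum((O - E)^2 / E)
obs = observed.to_numpy(dtype=float)
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(proportions)
print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
```

With only eight toy rows the expected counts are far too small for the test to be meaningful; on the real data, a p-value could be obtained with `scipy.stats.chi2_contingency(observed)`.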

## 1.4 Objectives

1. `Develop a classification model for Anemia severity in children`: Create a predictive model that classifies the severity of anemia in children under 56 months in Nigeria, distinguishing between mild, moderate, and severe cases. The model will be designed to handle various input factors, primarily socioeconomic variables, to enhance its practical application in identifying at-risk populations.

2. `Identify key socioeconomic determinants influencing Anemia severity`: Conduct in-depth analysis to pinpoint the specific socioeconomic factors that have the most significant impact on anemia severity among young Nigerian children.

3. `Evaluate the model’s predictive accuracy for effective Public Health Use`: Assess the model’s predictive performance using appropriate metrics, so that healthcare professionals have a reliable tool for anticipating anemia cases and determining the required level of intervention.

4. `Classify and Monitor Anemia Severity Level`: Use the model to classify anemia cases into severity levels (mild, moderate, severe) based on hemoglobin levels. By tracking these severity levels, the project aims to provide insights that can guide resource prioritization and enable healthcare providers to tailor interventions according to the specific needs of each severity group.

5. `Provide Actionable Insights for Targeted Interventions`: Use the model’s findings to produce actionable insights that healthcare organizations and policymakers can leverage to develop targeted anemia prevention and treatment programs.

6. `Support Long-Term Policy Development for Anemia Reduction`: Contribute to long-term health planning by providing data-backed recommendations for policy measures addressing the socioeconomic root causes of anemia.
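
The severity levels referred to in the objectives are derived from hemoglobin. A minimal sketch of such a mapping follows. It assumes the WHO cutoffs for children aged 6-59 months (not anemic at 11.0 g/dl or above, mild 10.0-10.9, moderate 7.0-9.9, severe below 7.0) and assumes hemoglobin is supplied in g/dl. The function name `classify_child_anemia` is hypothetical, and the dataset's own anemia labels may have been derived differently.

```python
import numpy as np
import pandas as pd


def classify_child_anemia(hb_g_dl: pd.Series) -> pd.Series:
    """Map hemoglobin (g/dl) to a severity label.

    Assumption: WHO cutoffs for children aged 6-59 months.
    """
    bins = [-np.inf, 7.0, 10.0, 11.0, np.inf]
    labels = ["Severe", "Moderate", "Mild", "Not anemic"]
    # right=False closes each bin on the left, e.g. [10.0, 11.0) is Mild
    return pd.cut(hb_g_dl, bins=bins, labels=labels, right=False)


# Made-up hemoglobin values, including the boundary cases and a missing value
hb = pd.Series([6.5, 9.9, 10.0, 10.9, 11.0, 11.4, np.nan])
severity = classify_child_anemia(hb)
print(severity)
```

Missing hemoglobin values stay missing in the output rather than being assigned a class.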

## 1.5 Data Limitations

1. `Data quality and completeness`: Incomplete or poorly collected data can hinder accurate analysis. Missing values, inconsistencies, or errors in the dataset may result in biased outcomes or misinterpretations. A study by the World Bank (2018) highlights that data collection in resource-limited settings often lacks standardization and can be subject to human error.

2. `Sample size and representativeness`: Limited sample sizes or non-representative samples can affect the generalizability of the findings. If the data collected is skewed towards specific demographics (e.g., urban versus rural populations), the results may not accurately reflect the broader population. This challenge is particularly noted in studies conducted in developing countries, where access to comprehensive datasets can be restricted.

3. `Socioeconomic data limitations`: Socioeconomic factors may be underreported or inaccurately recorded, especially in informal settlements or among marginalized populations. Factors such as income, education level, and access to healthcare are crucial for understanding anemia severity but may be inadequately captured in surveys.

4. `Temporal factors`: Anemia levels and their socioeconomic determinants may change over time due to policy changes, economic fluctuations, or health interventions. Data that is not longitudinal may fail to capture these dynamics, leading to outdated conclusions.

5. `Cultural and behavioral factors`: Anemia research often relies on self-reported data regarding dietary habits, health-seeking behaviors, and utilization of healthcare services. Cultural biases or reluctance to disclose certain behaviors can affect the reliability of this information.
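
The completeness limitation can be quantified directly. The sketch below reports the count and share of missing values per column. It runs on a small made-up frame rather than the real dataset, and the helper name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, most incomplete first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": (frame.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up rows for illustration only
toy = pd.DataFrame({
    "Wealth index combined": ["Richest", "Poorer", "Middle", "Richest"],
    "Anemia level": [np.nan, "Moderate", np.nan, np.nan],
    "Smokes cigarettes": ["No", "No", np.nan, "No"],
})

report = missing_summary(toy)
print(report)
```

On the real data the same helper would be called as `missing_summary(df)` once the dataset has been loaded.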

# 2. DATA UNDERSTANDING

Data understanding is a critical step in the data analysis process, providing a structured approach to ensure that the data utilized is not only relevant and reliable but also thoroughly comprehended. This foundational phase is essential for making informed decisions during subsequent stages of analysis and modeling.
Key areas of focus will include:

 - `Explore the source of the data` to identify where the data comes from. Understanding the source helps assess potential biases and the context in which the data was gathered.

 - `Import necessary libraries` to facilitate dataset loading.

 - `Inspect the dataset` using functions to view the first few rows and to generate an overview of data types, null values, data distribution, central tendencies, and variability, which are crucial for understanding the dataset's overall characteristics.

 - `Clearly define each column` in the dataset to understand what each variable represents.

## 2.1 Data Source Exploration

The data come from the 2018 Nigeria Demographic and Health Survey (NDHS) and were collected to answer research questions about the effect of mothers' age and other socioeconomic factors on anemia levels in children aged 0-59 months in Nigeria.
The NDHS is a comprehensive survey designed to provide essential data on the health and demographic status of Nigeria's population. It aims to inform policymakers and health planners by offering up-to-date and reliable data that reflects the country's demographic trends and health conditions.
The survey covered the 36 states of Nigeria as well as the Federal Capital Territory. The target population in this study is children aged 0-59 months and mothers aged 15-49 years.


## 2.2 Importing Libraries

Our first step in data loading and inspection is to import the libraries needed at this stage. Rather than importing everything at the top of the notebook, we import each library at the point where it is first used. This strategy helps maintain a clean code structure and makes it easier to recognize when each library is being applied.

In [5]:
# Import libraries
import pandas as pd
import numpy as np

In [6]:
# Load the dataset
df = pd.read_csv('anemia_dataset.csv')

## 2.3 Inspecting the dataset

In [7]:
# viewing the first few rows of the dataset
df.head()

| | Age in 5-year groups | Type of place of residence | Highest educational level | Wealth index combined | Births in last five years | Age of respondent at 1st birth | Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal) | Anemia level | Have mosquito bed net for sleeping (from household questionnaire) | Smokes cigarettes | Current marital status | Currently residing with husband/partner | When child put to breast | Had fever in last two weeks | Hemoglobin level adjusted for altitude (g/dl - 1 decimal) | Anemia level.1 | Taking iron pills, sprinkles or syrup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40-44 | Urban | Higher | Richest | 1 | 22 | NaN | NaN | Yes | No | Living with partner | Staying elsewhere | Immediately | No | NaN | NaN | Yes |
| 1 | 35-39 | Urban | Higher | Richest | 1 | 28 | NaN | NaN | Yes | No | Married | Living with her | Hours: 1 | No | NaN | NaN | No |
| 2 | 25-29 | Urban | Higher | Richest | 1 | 26 | NaN | NaN | No | No | Married | Living with her | Immediately | No | NaN | NaN | No |
| 3 | 25-29 | Urban | Secondary | Richest | 1 | 25 | 95.0 | Moderate | Yes | No | Married | Living with her | 105 | No | 114.0 | Not anemic | No |
| 4 | 20-24 | Urban | Secondary | Richest | 1 | 21 | NaN | NaN | Yes | No | No longer living together/separated | NaN | Immediately | No | NaN | NaN | No |


In [9]:
# checking for the number of rows and columns of the dataset

rows, columns = df.shape

print(f'The dataset has:\n {rows} number of rows and \n {columns} number of columns')

The dataset has:
 33924 number of rows and 
 17 number of columns


In [10]:
# Getting an overview of the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33924 entries, 0 to 33923
Data columns (total 17 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   Age in 5-year groups                                                   33924 non-null  object 
 1   Type of place of residence                                             33924 non-null  object 
 2   Highest educational level                                              33924 non-null  object 
 3   Wealth index combined                                                  33924 non-null  object 
 4   Births in last five years                                              33924 non-null  int64  
 5   Age of respondent at 1st birth                                         33924 non-null  int64  
 6   Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)  13136 non-null 

In [13]:
# Counting how many columns there are of each data type
df.dtypes.value_counts()

object     13
float64     2
int64       2
dtype: int64
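
Counting columns per dtype is often followed by separating them, since categorical and numeric columns tend to be cleaned and encoded differently. A minimal sketch with `select_dtypes` on a made-up frame is shown below; the variable names `categorical_cols` and `numeric_cols` are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the dataset
toy = pd.DataFrame({
    "Type of place of residence": ["Urban", "Rural"],
    "Births in last five years": [1, 2],
    "Age of respondent at 1st birth": [22, 19],
    "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)": [114.0, None],
})

# Numeric columns (int64 and float64) versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```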

In [None]:
# Generate descriptive statistics of the numeric columns

df.describe()

| | Births in last five years | Age of respondent at 1st birth | Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal) | Hemoglobin level adjusted for altitude (g/dl - 1 decimal) |
|---|---|---|---|---|
| count | 33924.0 | 33924.0 | 13136.0 | 10182.0 |
| mean | 1.823783 | 19.570776 | 114.367235 | 101.270183 |
| std | 0.70546 | 4.313172 | 15.915408 | 15.569583 |
| min | 1.0 | 12.0 | 20.0 | 29.0 |
| 25% | 1.0 | 16.0 | 105.0 | 92.0 |
| 50% | 2.0 | 19.0 | 115.0 | 103.0 |
| 75% | 2.0 | 22.0 | 125.0 | 112.0 |
| max | 6.0 | 48.0 | 218.0 | 170.0 |
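
By default `describe()` summarizes only numeric columns. Since most columns in this dataset are of `object` dtype, a comparable summary for them may be useful. The sketch below uses the `exclude` parameter on made-up values; the variable name `categorical_summary` is illustrative.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Wealth index combined": ["Richest", "Richest", "Poorer", "Middle"],
    "Smokes cigarettes": ["No", "No", "No", "Yes"],
    "Births in last five years": [1, 2, 1, 3],
})

# For non-numeric columns, describe() reports count, number of unique values,
# the most frequent value (top) and how often it occurs (freq)
categorical_summary = toy.describe(exclude="number")
print(categorical_summary)
```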


### Observations obtained

- Upon examining the initial rows of the dataset, it becomes evident that several columns contain missing values, necessitating appropriate imputation methods during the data cleaning process.

- The dataset comprises 33,924 rows and 17 columns, providing a substantial foundation for analysis.

- The dataset mixes `object` (13 columns), `int64` (2 columns), and `float64` (2 columns) data types, so it contains both categorical and numerical data. This diversity allows for different analytical approaches tailored to each data type.

- The statistical summary from the `describe()` method shows a mean age at first birth of about 19.6 years (median 19, minimum 12), which is relatively young, and a mean of about 1.8 births in the last five years. The two hemoglobin columns have means of about 114.4 and 101.3, with standard deviations of about 15.9 and 15.6, indicating substantial variability and varying degrees of anemia across the population. Both columns hold adjusted measurements (one for altitude and smoking, the other for altitude only), and both are missing for most rows (13,136 and 10,182 non-null values out of 33,924). This underscores the importance of further investigation into the socioeconomic and health-related factors that contribute to these health outcomes.
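
The hemoglobin columns are labelled `g/dl - 1 decimal` yet hold values such as 95.0 and 114.0, which suggests one implied decimal place. Assuming that reading is right, dividing by 10 yields g/dl. The sketch below shows the conversion and a quick look at the class balance of the anemia labels, on made-up values; the new column name `hb_g_dl` is hypothetical.

```python
import numpy as np
import pandas as pd

hb_col = "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)"

# Made-up rows for illustration only
toy = pd.DataFrame({
    hb_col: [114.0, 92.0, np.nan, 68.0],
    "Anemia level.1": ["Not anemic", "Moderate", np.nan, "Severe"],
})

# Assumption: stored values carry one implied decimal, so 114.0 means 11.4 g/dl
toy["hb_g_dl"] = toy[hb_col] / 10

# Share of each anemia level, counting missing labels as their own group
level_share = toy["Anemia level.1"].value_counts(normalize=True, dropna=False)

print(toy[["hb_g_dl", "Anemia level.1"]])
print(level_share)
```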



## 2.4 Data Dictionary

# 3. DATA CLEANING AND PREPARATION