## Title

### **Anemia Level Prediction in Children**

## Introduction

Anemia affects millions of children worldwide, impacting their health, growth, and cognitive development. Early detection is crucial for timely intervention, yet traditional diagnostic methods can be costly, time-consuming, and often inaccessible, especially in low-resource settings. The problem statement below describes this gap in more detail.

### Problem Statement

Childhood anemia is a critical public health issue ([World Health Organisation](https://www.who.int/data/gho/data/indicators/indicator-details/GHO/prevalence-of-anaemia-in-children-under-5-years-(-))), leading to adverse impacts on cognitive and physical development, educational attainment, and overall quality of life. Despite its prevalence, anemia in Nigerian children often goes undiagnosed and untreated due to limited healthcare access, high diagnostic costs, and a lack of awareness about the condition ([UNICEF, 2021](https://www.unicef.org/nigeria/reports/childhood-anemia-report)). This situation calls for innovative, accessible approaches to early diagnosis and intervention to reduce the burden of anemia and improve the health outcomes of vulnerable children across Nigeria. Our model aims to address this problem by combining machine learning, a feature-rich dataset, and data science expertise. The Business Understanding section below outlines how this is possible.

### Business Understanding 


This project addresses childhood anemia, a significant public health concern that adversely affects cognitive and physical development in children. By analyzing a dataset that captures key demographic, health, and socio-economic variables, we aim to identify and mitigate the causes of anemia in diverse communities. Respondents are mothers in age groups ranging from 15 to 49 years, living in both urban and rural areas. Educational levels span from no education to higher education, while the wealth index is classified from poorest to richest.

Key health indicators, including hemoglobin levels, iron supplementation status, and malaria exposure, are crucial for assessing anemia risk ([World Health Organization, 2017](https://www.who.int/publications/i/item/9789241513067); Nigerian Ministry of Health, 2018). Recorded hemoglobin values range from 20 to 218; the column label "(g/dl - 1 decimal)" suggests these are g/dL stored without the decimal point, which would correspond to 2.0 to 21.8 g/dL. Anemia is categorized as mild, moderate, severe, or not anemic. Respondents had between 1 and 6 births in the last five years, and the dataset records whether children had a fever in the last two weeks (yes, no, or don’t know) and whether they are taking iron supplements (yes, no, or don’t know). It also covers living conditions, such as the availability of mosquito bed nets and the smoking status of caregivers.

This project is crucial for developing targeted interventions tailored to high-risk groups, optimizing resource allocation. It will raise awareness about the importance of proper nutrition and iron intake, empowering families with the knowledge needed to prevent anemia. Community engagement will be a cornerstone of the initiative, fostering a collaborative approach to health and encouraging families to support each other in adopting healthier practices. Ultimately, the goal is to decrease anemia rates among children, leading to better health outcomes, improved school performance, and enhanced quality of life.

Key stakeholders include families and caregivers, healthcare providers (such as pediatricians and community health workers), local government and health departments, educational institutions, non-governmental organizations (NGOs), and research and academic institutions that focus on maternal and child health ([UNICEF, 2021](https://www.unicef.org/nigeria/reports/childhood-anemia-report); Nigerian Ministry of Health, 2018). By engaging these stakeholders, the project aims to create a comprehensive approach to combat childhood anemia, ensuring sustainable improvements in child health and nutrition.



## Objective 

This project leverages advanced data analytics and machine learning to predict anemia levels in children based on key health indicators. By employing predictive models, we aim to support healthcare providers with fast, reliable insights that help identify children at risk, enabling proactive management and treatment.

The following objectives together define what the project needs to achieve:

1. **Analyze Socioeconomic Factors**  
   Investigate the relationship between various socioeconomic factors—such as mother's age, education level, and household wealth index—and anemia levels in children aged 0-59 months.

2. **Examine Health Resource Impact**  
   Assess the impact of health resources, including the availability of mosquito nets and maternal health indicators (e.g., iron intake, smoking status), on the likelihood of anemia in children.

3. **Evaluate Demographic Indicators**  
   Determine the predictive power of demographic indicators like residence type and marital status for anemia levels.

4. **Formulate Targeted Interventions**  
   Develop data-driven recommendations for targeted health interventions to assist policymakers in addressing high-risk demographics and regions.
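
Objectives 1 to 3 all come down to comparing how anemia levels are distributed across groups. The sketch below shows one way this could be done with a row-normalised cross-tabulation. It assumes the dataset's `Wealth index combined` and `Anemia level` columns, uses a few made-up rows instead of the real file, and the helper name `anemia_share_by` is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Wealth index combined": ["Poorest", "Poorest", "Richest", "Richest", "Poorer"],
    "Anemia level": ["Moderate", "Mild", "Not anemic", "Mild", None],
})


def anemia_share_by(df: pd.DataFrame, factor: str,
                    target: str = "Anemia level") -> pd.DataFrame:
    """Share of each anemia level within each group of `factor` (rows sum to 1)."""
    # Rows without a recorded factor or anemia level cannot be compared.
    rows = df.dropna(subset=[factor, target])
    return pd.crosstab(rows[factor], rows[target], normalize="index")


shares = anemia_share_by(toy, "Wealth index combined")
print(shares.round(2))
```

The same helper can be pointed at `Highest educational level`, `Type of place of residence`, or `Current marital status` to cover the other factors named in the objectives.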

Once these objectives are met, the project offers the following **Key Benefits**:


- **Cost-Effective Screening**: Reduces the need for expensive lab tests through accessible data-driven predictions.
- **Rapid Decision-Making**: Provides near-instant insights, facilitating quicker diagnosis and intervention.
- **Scalable Solution**: Adaptable for use in various healthcare settings, from rural clinics to large hospitals.

### **A quick summary of How It Works**
Using an extensive [Anemia Dataset](./Anemia-Level-Prediction-in-Children\anemia_dataset.csv), our model assesses various demographic, health, and socio-economic metrics to predict anemia levels. The metrics are explained in the column description file. This predictive tool helps healthcare professionals allocate resources efficiently, prioritize high-risk cases, and improve patient outcomes.

### Data Understanding 

This section assesses variables such as hemoglobin levels, age, nutritional factors, and living conditions, and runs quality checks for missing values, outliers, and data inconsistencies. By ensuring our data is robust and aligned with the project's goals, we lay the groundwork for building effective models that can accurately predict anemia risk and support better health outcomes for children in at-risk populations.

By closely examining our dataset—which includes a variety of demographic, health, and socio-economic indicators—we aim to understand the characteristics and patterns that may signal anemia risk.

### Data collection 

Where is the data from?

### Data loading

In [5]:
import pandas as pd
df = pd.read_csv('../anemia_dataset.csv')
df.head(5)

| | Age in 5-year groups | Type of place of residence | Highest educational level | Wealth index combined | Births in last five years | Age of respondent at 1st birth | Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal) | Anemia level | Have mosquito bed net for sleeping (from household questionnaire) | Smokes cigarettes | Current marital status | Currently residing with husband/partner | When child put to breast | Had fever in last two weeks | Hemoglobin level adjusted for altitude (g/dl - 1 decimal) | Anemia level.1 | Taking iron pills, sprinkles or syrup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40-44 | Urban | Higher | Richest | 1 | 22 | NaN | NaN | Yes | No | Living with partner | Staying elsewhere | Immediately | No | NaN | NaN | Yes |
| 1 | 35-39 | Urban | Higher | Richest | 1 | 28 | NaN | NaN | Yes | No | Married | Living with her | Hours: 1 | No | NaN | NaN | No |
| 2 | 25-29 | Urban | Higher | Richest | 1 | 26 | NaN | NaN | No | No | Married | Living with her | Immediately | No | NaN | NaN | No |
| 3 | 25-29 | Urban | Secondary | Richest | 1 | 25 | 95.0 | Moderate | Yes | No | Married | Living with her | 105 | No | 114.0 | Not anemic | No |
| 4 | 20-24 | Urban | Secondary | Richest | 1 | 21 | NaN | NaN | Yes | No | No longer living together/separated | NaN | Immediately | No | NaN | NaN | No |


### Columns info

With the data loaded, we can see what it contains. Next, we look at what each column represents.

In [9]:
columns = df.columns
columns = pd.DataFrame(columns,columns=['column names'])
columns

| | column names |
|---|---|
| 0 | Age in 5-year groups |
| 1 | Type of place of residence |
| 2 | Highest educational level |
| 3 | Wealth index combined |
| 4 | Births in last five years |
| 5 | Age of respondent at 1st birth |
| 6 | Hemoglobin level adjusted for altitude and smo... |
| 7 | Anemia level |
| 8 | Have mosquito bed net for sleeping (from house... |
| 9 | Smokes cigarettes |



### Column Descriptions

| <span style="color:white">Column Name</span>                                               | <span style="color:white">Description</span>                                                                                  |
|-----------------------------------------------------------|----------------------------------------------------------------------------------------------|
| <span style="color:white">Age in 5-year groups</span>                                  | <span style="color:cyan">Age of the mother, categorized into 5-year intervals.</span>                                        |
| <span style="color:white">Type of place of residence</span>                            | <span style="color:cyan">Residential classification: Urban or Rural.</span>                                                  |
| <span style="color:white">Highest educational level</span>                             | <span style="color:cyan">Mother's highest educational attainment, e.g., "No Education", "Secondary".</span>                  |
| <span style="color:white">Wealth index combined</span>                                 | <span style="color:cyan">Economic status of the mother, ranked as "Poorest", "Poorer", etc.</span>                          |
| <span style="color:white">Births in last five years</span>                             | <span style="color:cyan">Count of live births by the mother in the past five years.</span>                                   |
| <span style="color:white">Age of respondent at 1st birth</span>                        | <span style="color:cyan">Mother's age at her first live birth.</span>                                                         |
| <span style="color:white">Hemoglobin level adjusted for altitude and smoking</span>    | <span style="color:cyan">Hemoglobin level in g/dL, adjusted for altitude and smoking status.</span>                         |
| <span style="color:white">Anemia level</span>                                          | <span style="color:cyan">Categorical anemia status of the mother (e.g., "Not anemic", "Mild", "Moderate", "Severe").</span>  |
| <span style="color:white">Have mosquito bed net for sleeping</span>                    | <span style="color:cyan">Presence of a mosquito net in the household (Yes/No).</span>                                       |
| <span style="color:white">Smokes cigarettes</span>                                     | <span style="color:cyan">Mother's smoking status (Yes/No).</span>                                                             |
| <span style="color:white">Current marital status</span>                                | <span style="color:cyan">Mother's marital status (e.g., "Married", "Single").</span>                                        |
| <span style="color:white">Currently residing with husband/partner</span>               | <span style="color:cyan">Whether the husband or partner lives with the mother (e.g., "Living with her", "Staying elsewhere").</span> |
| <span style="color:white">When child put to breast</span>                              | <span style="color:cyan">Timing of initial breastfeeding after birth (e.g., "Immediately", "Hours: 1").</span>              |
| <span style="color:white">Had fever in last two weeks</span>                           | <span style="color:cyan">Fever incidence in the child during the past two weeks (Yes/No/Don't know).</span>                 |
| <span style="color:white">Hemoglobin level adjusted for altitude (g/dl)</span>         | <span style="color:cyan">Hemoglobin level in g/dL, adjusted solely for altitude.</span>                                      |
| <span style="color:white">Anemia level.1</span>                                        | <span style="color:cyan">Alternative or duplicate anemia categorization.</span>                                               |
| <span style="color:white">Taking iron pills, sprinkles or syrup</span>                 | <span style="color:cyan">Iron supplementation status of the mother (Yes/No/Don't know).</span>                               |
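
The original column names are long and contain spaces, commas, and parentheses, which makes them awkward to type in later cells. A minimal sketch of one way to shorten them follows. The keys are the column names as they appear in the loaded data; the short names on the right and the helper `shorten_columns` are hypothetical choices, not part of the dataset.

```python
import pandas as pd

# Original column name -> hypothetical short name (an illustrative choice).
SHORT_NAMES = {
    "Age in 5-year groups": "age_group",
    "Type of place of residence": "residence",
    "Highest educational level": "education",
    "Wealth index combined": "wealth_index",
    "Births in last five years": "births_5y",
    "Age of respondent at 1st birth": "age_first_birth",
    "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)": "hb_alt_smoke",
    "Anemia level": "anemia_level",
    "Have mosquito bed net for sleeping (from household questionnaire)": "bed_net",
    "Smokes cigarettes": "smokes",
    "Current marital status": "marital_status",
    "Currently residing with husband/partner": "partner_residence",
    "When child put to breast": "breastfeeding_start",
    "Had fever in last two weeks": "fever_2w",
    "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)": "hb_alt",
    "Anemia level.1": "anemia_level_1",
    "Taking iron pills, sprinkles or syrup": "iron_supplement",
}


def shorten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with short column names; unknown columns are left as-is."""
    return df.rename(columns=SHORT_NAMES)


# Two-column toy frame to show the effect.
toy = pd.DataFrame({"Smokes cigarettes": ["No"], "Anemia level.1": ["Not anemic"]})
renamed = shorten_columns(toy)
print(list(renamed.columns))
```

The rest of this document keeps the original names so that the outputs match the cells shown.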



### Data Exploration

In [16]:
df.describe()

| | Births in last five years | Age of respondent at 1st birth | Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal) | Hemoglobin level adjusted for altitude (g/dl - 1 decimal) |
|---|---|---|---|---|
| count | 33924.0 | 33924.0 | 13136.0 | 10182.0 |
| mean | 1.823783 | 19.570776 | 114.367235 | 101.270183 |
| std | 0.70546 | 4.313172 | 15.915408 | 15.569583 |
| min | 1.0 | 12.0 | 20.0 | 29.0 |
| 25% | 1.0 | 16.0 | 105.0 | 92.0 |
| 50% | 2.0 | 19.0 | 115.0 | 103.0 |
| 75% | 2.0 | 22.0 | 125.0 | 112.0 |
| max | 6.0 | 48.0 | 218.0 | 170.0 |


#### Explanation of Statistics



##### 1. **Births in Last Five Years**
- **Count**: 33,924 — This indicates that there are 33,924 valid entries for this column.
- **Mean**: 1.82 — On average, mothers reported about 1.82 births in the last five years.
- **Standard Deviation (std)**: 0.71 — This reflects the variability in the number of births, showing that the number of births typically varies by about 0.71 from the mean.
- **Min**: 1 — The minimum number of births reported is 1.
- **25% (1st Quartile)**: 1 — 25% of the mothers reported 1 birth or fewer in the last five years.
- **50% (Median)**: 2 — The median number of births is 2, meaning half of the mothers reported 2 or fewer births.
- **75% (3rd Quartile)**: 2 — 75% of the mothers reported 2 births or fewer.
- **Max**: 6 — The maximum number of births reported is 6.

##### 2. **Age of Respondent at 1st Birth**
- **Count**: 33,924 — Valid entries for this column.
- **Mean**: 19.57 — The average age of mothers at their first birth is approximately 19.6 years.
- **Standard Deviation**: 4.31 — Indicates a moderate variability in the ages at first birth.
- **Min**: 12 — The youngest reported age at first birth is 12 years.
- **25%**: 16 — 25% of mothers had their first birth at age 16 or younger.
- **50%**: 19 — The median age is 19.
- **75%**: 22 — 75% of mothers had their first birth by age 22 or younger.
- **Max**: 48 — The oldest age reported for a first birth is 48 years.

##### 3. **Hemoglobin Level Adjusted for Altitude and Smoking (g/dl - 1 decimal)**
Values in this column appear to be stored without the decimal point, as the "(g/dl - 1 decimal)" label suggests, so a stored value of 114 would read as 11.4 g/dL.
- **Count**: 13,136. Only 13,136 valid entries, indicating missing values for the remaining records.
- **Mean**: 114.37. The average hemoglobin level, adjusted for altitude and smoking, is approximately 11.4 g/dL.
- **Standard Deviation**: 15.92. Levels typically vary by about 1.6 g/dL from the mean.
- **Min**: 20. The lowest reported level is about 2.0 g/dL, which indicates severe anemia.
- **25%**: 105. 25% of the values are 10.5 g/dL or lower.
- **50%**: 115. The median hemoglobin level is 11.5 g/dL.
- **75%**: 125. 75% of the values are 12.5 g/dL or lower.
- **Max**: 218. The highest reported level is about 21.8 g/dL, which is well above normal ranges.

##### 4. **Hemoglobin Level Adjusted for Altitude (g/dl - 1 decimal)**
As the "(g/dl - 1 decimal)" label suggests, these values also appear to be stored without the decimal point, so a stored value of 103 would read as 10.3 g/dL.
- **Count**: 10,182. Fewer valid entries than the previous column, so more values are missing.
- **Mean**: 101.27. The average hemoglobin level, adjusted solely for altitude, is approximately 10.1 g/dL.
- **Standard Deviation**: 15.57. Levels typically vary by about 1.6 g/dL from the mean.
- **Min**: 29. The lowest reported level is about 2.9 g/dL.
- **25%**: 92. 25% of the values are 9.2 g/dL or lower.
- **50%**: 103. The median hemoglobin level is 10.3 g/dL.
- **75%**: 112. 75% of the values are 11.2 g/dL or lower.
- **Max**: 170. The highest reported level is about 17.0 g/dL.
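
Both hemoglobin columns are labelled "(g/dl - 1 decimal)" yet hold values such as 95.0 and 114.0, so they look like g/dL stored without the decimal point. Assuming that reading is correct, the sketch below rescales them to plain g/dL. The helper name `to_g_per_dl` is hypothetical, and the two toy values are taken from row 3 of the `df.head(5)` output.

```python
import pandas as pd

HB_COLUMNS = [
    "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)",
    "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)",
]


def to_g_per_dl(df: pd.DataFrame, columns=HB_COLUMNS) -> pd.DataFrame:
    """Return a copy with the hemoglobin columns divided by 10.

    Assumption: stored values carry one implied decimal (114 means 11.4 g/dL).
    """
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col] / 10  # missing values stay missing
    return out


toy = pd.DataFrame({HB_COLUMNS[0]: [95.0, None], HB_COLUMNS[1]: [114.0, None]})
converted = to_g_per_dl(toy)
print(converted.iloc[0].tolist())
```

If the implied-decimal assumption turns out to be wrong for this file, the conversion should simply be skipped.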



In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33924 entries, 0 to 33923
Data columns (total 17 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   Age in 5-year groups                                                   33924 non-null  object 
 1   Type of place of residence                                             33924 non-null  object 
 2   Highest educational level                                              33924 non-null  object 
 3   Wealth index combined                                                  33924 non-null  object 
 4   Births in last five years                                              33924 non-null  int64  
 5   Age of respondent at 1st birth                                         33924 non-null  int64  
 6   Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)  13136 non-null 

*Data info explained*

- **Total Entries**: The DataFrame contains **33,924 entries**, indexed from 0 to 33,923.


- **Missing Values**:
   - Some columns have missing values. For example, "Hemoglobin level adjusted for altitude and smoking" has only **13,136 non-null** entries, meaning there are missing data points for **20,788** entries in that column.
   - "Currently residing with husband/partner" has **32,226** non-null entries, indicating missing values for **1,698** entries.
   - "When child put to breast" has **21,168** non-null entries, suggesting missing values for **12,756** entries.

- **Data Types**:
   - The DataFrame consists of:
     - **13 object columns**: Typically used for categorical data or text.
     - **2 integer columns**: Representing counts or whole numbers (e.g., births, age).
     - **2 float columns**: Used for continuous numerical values (e.g., hemoglobin levels), indicating decimal values.
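
The missing-value counts quoted above are the total number of entries minus the non-null count (for example, 33,924 - 13,136 = 20,788). A minimal sketch of how the same figures could be produced for every column at once is shown here on a small made-up frame; the helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Missing count and percentage per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows; on the real data this would be missing_summary(df).
toy = pd.DataFrame({
    "Births in last five years": [1, 1, 2, 1],
    "Anemia level": ["Moderate", None, None, None],
    "When child put to breast": ["Immediately", "Hours: 1", None, "105"],
})
summary = missing_summary(toy)
print(summary)
```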




After exploration, we classify the columns by their relevance as follows:

#### Target Variable
- **Anemia level**: Indicates anemia status (e.g., Not anemic, Mild, Moderate, Severe).
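
The four anemia categories have a natural order of severity, which many models can use if the labels are encoded as ordered integers. The following is a sketch under that assumption; the integer codes and the helper name `encode_anemia` are illustrative choices, not something fixed by the dataset.

```python
import pandas as pd

# Hypothetical ordinal codes, from least to most severe.
ANEMIA_CODES = {"Not anemic": 0, "Mild": 1, "Moderate": 2, "Severe": 3}


def encode_anemia(levels: pd.Series) -> pd.Series:
    """Map anemia labels to ordinal codes; missing or unknown labels become NaN."""
    return levels.map(ANEMIA_CODES)


toy = pd.Series(["Not anemic", "Moderate", "Severe", None])
encoded = encode_anemia(toy)
print(encoded.tolist())
```

Rows with a missing target cannot be used for supervised training, so they would need to be dropped or handled separately before fitting a model.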

#### Identifying Key Features
- **Type of place of residence**
- **Highest educational level**
- **Wealth index combined**
- **Current marital status**

#### Potential Predictors
- **Age in 5-year groups**
- **Births in last five years**
- **Age of respondent at 1st birth**
- **Hemoglobin level adjusted for altitude and smoking**
- **Have mosquito bed net for sleeping**
- **Smokes cigarettes**
- **Currently residing with husband/partner**
- **When child put to breast**
- **Had fever in last two weeks**
- **Taking iron pills, sprinkles or syrup**
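
The classification above can be turned into a feature matrix and a target vector. The sketch below assumes the full column names from the loaded data and drops rows whose target is missing; the helper name `split_features_target` is hypothetical and the toy rows are made up.

```python
import pandas as pd

TARGET = "Anemia level"

KEY_FEATURES = [
    "Type of place of residence",
    "Highest educational level",
    "Wealth index combined",
    "Current marital status",
]

POTENTIAL_PREDICTORS = [
    "Age in 5-year groups",
    "Births in last five years",
    "Age of respondent at 1st birth",
    "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)",
    "Have mosquito bed net for sleeping (from household questionnaire)",
    "Smokes cigarettes",
    "Currently residing with husband/partner",
    "When child put to breast",
    "Had fever in last two weeks",
    "Taking iron pills, sprinkles or syrup",
]


def split_features_target(df: pd.DataFrame):
    """Return (X, y), keeping only rows that have a recorded target."""
    wanted = KEY_FEATURES + POTENTIAL_PREDICTORS
    present = [col for col in wanted if col in df.columns]
    labelled = df.dropna(subset=[TARGET])
    return labelled[present], labelled[TARGET]


# Made-up rows; on the real data this would be split_features_target(df).
toy = pd.DataFrame({
    "Wealth index combined": ["Richest", "Richest", "Poorest"],
    "Smokes cigarettes": ["No", "No", "Yes"],
    "Anemia level": ["Moderate", None, "Mild"],
    "Anemia level.1": ["Not anemic", None, "Mild"],
})
X, y = split_features_target(toy)
print(X.shape, y.tolist())
```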


#### Assess Data Quality

That is, checking for outliers, duplicates, and data consistency.

In [22]:
## checking for duplicates 
duplicates = df.duplicated()
duplicates_sum = duplicates.sum()
print(f'total duplicates:{duplicates_sum}')
duplicates

total duplicates:4678


0        False
1        False
2        False
3        False
4        False
         ...  
33919    False
33920    False
33921    False
33922    False
33923    False
Length: 33924, dtype: bool
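
Once duplicates have been counted, one possible next step is to drop them while keeping the first occurrence. This is a sketch rather than a recommendation: the columns shown contain no respondent identifier, so two identical rows may describe two different people, and removing them is a judgment call. The helper name `drop_exact_duplicates` is hypothetical.

```python
import pandas as pd


def drop_exact_duplicates(df: pd.DataFrame):
    """Drop rows identical in every column; return (clean frame, rows removed)."""
    before = len(df)
    clean = df.drop_duplicates(keep="first").reset_index(drop=True)
    return clean, before - len(clean)


# Made-up rows; rows 0 and 2 are identical, including their missing value.
toy = pd.DataFrame({
    "Age in 5-year groups": ["25-29", "20-24", "25-29"],
    "Anemia level": [None, "Mild", None],
})
clean, removed = drop_exact_duplicates(toy)
print(f"removed {removed} duplicate row(s), {len(clean)} remain")
```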

outliers
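
One common way to flag outliers in a numeric column is the 1.5 x IQR rule, sketched below; it is an assumed choice here, not a method stated in this project. Using the quartiles from `df.describe()` for the first hemoglobin column (25% = 105, 75% = 125), the fences would be 75 and 155, so both the minimum of 20 and the maximum of 218 fall outside them. The helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)


# Made-up values that include the reported minimum (20) and maximum (218).
toy = pd.Series([105.0, 110.0, 115.0, 120.0, 125.0, 20.0, 218.0, None])
mask = iqr_outliers(toy)
print(toy[mask].tolist())
```

Flagged values are candidates for review rather than automatic removal, since very low hemoglobin can be a genuine severe case.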

consistency 
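
Two simple consistency checks are sketched here under stated assumptions: that a hemoglobin measurement and its anemia label should be present or missing together, and that anemia labels should be one of the four categories listed for the target variable. The helper names `missingness_mismatch` and `unexpected_labels` are hypothetical, and the toy rows are made up.

```python
import pandas as pd

HB = "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)"
LABEL = "Anemia level"
EXPECTED_LABELS = {"Not anemic", "Mild", "Moderate", "Severe"}


def missingness_mismatch(df: pd.DataFrame, value_col: str = HB,
                         label_col: str = LABEL) -> pd.DataFrame:
    """Rows where exactly one of the measurement and its label is missing."""
    return df[df[value_col].isna() ^ df[label_col].isna()]


def unexpected_labels(labels: pd.Series, expected=EXPECTED_LABELS) -> set:
    """Non-missing values that are not among the expected labels."""
    return set(labels.dropna().unique()) - set(expected)


toy = pd.DataFrame({
    HB: [95.0, None, 120.0, None],
    LABEL: ["Moderate", None, None, "mild "],
})
bad_rows = missingness_mismatch(toy)
odd_labels = unexpected_labels(toy[LABEL])
print(bad_rows.index.tolist(), odd_labels)
```

The same two helpers can be applied to the second hemoglobin column and `Anemia level.1` by passing those column names.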