# ANEMIA LEVEL PREDICTION IN CHILDREN

## COLLABORATORS:

1. Ruth Kitasi
2. Agatha Nyambati
3. Joseline Apiyo
4. Cecilia Ngunjiri
5. John Mbego
6. Leonard Koyio


# MODEL APPROACH : CRISP-DM


<img src="crisp-DM.png" width="800">

## TABLE OF CONTENTS

it will be placed here


<img src="anemic_child.jpg" width="800">


# 1. BUSINESS UNDERSTANDING

## 1.1 Introduction

Anemia is a major public health concern globally, and its effects are especially pronounced among children under five, where it poses risks to growth, cognitive development, and immune health.
Defined as a condition characterized by lower-than-normal hemoglobin (Hb) levels, anemia limits oxygen transport in the blood, which can lead to significant health complications. The World Health Organization estimates that 42% of children worldwide in this age group are anemic, predominantly due to iron deficiency, poor diet, and infections. If left untreated, anemia in young children can lead to developmental delays, weakened immunity, and long-term health issues. Socioeconomic factors, including access to quality food and healthcare, play a significant role in these high rates, with the most severe impacts observed in low- and middle-income countries.

The African continent bears a disproportionate burden of anemia among children, with an estimated 60-80% prevalence in children under five. Causes include malnutrition, high prevalence of infectious diseases (like malaria and hookworm), and poor maternal health. Socioeconomic factors such as poverty, lack of healthcare, and limited access to nutritious food exacerbate the condition. Sub-Saharan Africa, in particular, faces a high prevalence due to ongoing challenges with food security, healthcare infrastructure, and education around health practices.

Nigeria carries one of the highest anemia burdens in Africa, with an estimated 68% of children under five affected. Contributing factors include limited access to nutritious foods, recurrent infections, and inadequate healthcare, especially in rural and low-income areas. Socioeconomic determinants such as low household income, low education levels, and lack of access to clean water also contribute to anemia's prevalence among Nigerian children, impacting their physical and cognitive development. Addressing anemia in Nigeria is essential, given its long-term effects on individual health and its broader implications for national development.

Key stakeholders include families and caregivers, healthcare providers (such as pediatricians and community health workers), local government and health departments, educational institutions, non-governmental organizations (NGOs), and research and academic institutions that focus on maternal and child health ([UNICEF, 2021](https://www.unicef.org/nigeria/reports/childhood-anemia-report); Nigerian Ministry of Health, 2018). By engaging these stakeholders, the project aims to create a comprehensive approach to combating childhood anemia, ensuring sustainable improvements in child health and nutrition.

## 1.2 Problem Statement

Despite the high prevalence of anemia among young children in Nigeria, research on how socioeconomic factors influence its severity remains limited. This project seeks to address this gap by developing a classification model to predict anemia severity in Nigerian children up to 59 months old. The model will analyze key socioeconomic indicators to determine which factors contribute most to the risk and severity of anemia in this vulnerable population.

By identifying these primary contributors, the model aims to give healthcare professionals, public health organizations, and policymakers actionable insights. The ultimate goal is to enable more targeted and effective interventions that address the socioeconomic roots of anemia and reduce anemia rates among vulnerable populations in Nigeria.


## 1.3 Hypotheses

- Inadequate access to healthcare facilities is associated with higher levels of anemia among children living in rural areas of Nigeria.

- Children exposed to secondhand smoke may experience detrimental health effects, including respiratory problems and compromised nutritional absorption, which may be associated with higher anemia levels.

- The level of education and health awareness among parents is inversely related to the prevalence of anemia in their children.

## 1.4 Objectives

1. `Develop a classification model for Anemia severity in children`: Create a predictive model that classifies the severity of anemia in children up to 59 months old in Nigeria, distinguishing between mild, moderate, and severe cases. The model will be designed to handle various input factors, primarily socioeconomic variables, to enhance its practical application in identifying at-risk populations.

2. `Identify key socioeconomic determinants influencing Anemia severity`: Conduct in-depth analysis to pinpoint the specific socioeconomic factors that have the most significant impact on anemia severity among young Nigerian children.

3. `Evaluate the model’s predictive accuracy for effective Public Health Use`: Assess the model’s predictive performance using appropriate metrics, so that healthcare professionals have a reliable tool for anticipating anemia cases and determining the required intervention level.

4. `Classify and Monitor Anemia Severity Level`: Use the model to classify anemia cases into severity levels (mild, moderate, severe) based on hemoglobin levels. By tracking these severity levels, the project aims to provide insights that can guide resource prioritization and enable healthcare providers to tailor interventions to the specific needs of each severity group.

5. `Provide Actionable Insights for Targeted Interventions`: Use the model’s findings to produce actionable insights that healthcare organizations and policymakers can leverage to develop targeted anemia prevention and treatment programs.

6. `Support Long-Term Policy Development for Anemia Reduction`: Contribute to long-term health planning by providing data-backed recommendations for policy measures addressing the socioeconomic root causes of anemia.

# 2. DATA UNDERSTANDING

Data understanding is a critical step in the data analysis process, providing a structured approach to ensure that the data utilized is not only relevant and reliable but also thoroughly comprehended. This foundational phase is essential for making informed decisions during subsequent stages of analysis and modeling.
Key areas of focus will include:

 - `Exploring the source of the data` to identify where the data comes from. Understanding the source helps assess potential biases and the context in which the data was gathered.

 - `Loading and inspecting the dataset` using functions to view the first few rows and generate an overview of data types, null values, data distribution, central tendencies, and variability, which are crucial for understanding the dataset's overall characteristics.

 - `Clearly define each column` in the dataset to understand what each variable represents.

 - Giving `observations` obtained from data inspection

 - Listing possible `limitations of the dataset`

## 2.1 Data Source Exploration

The data come from the 2018 Nigeria Demographic and Health Survey (NDHS) and were collected to answer research questions about the effect of mothers' age and other socioeconomic factors on the anemia level of children aged 0-59 months in Nigeria.
The NDHS is a comprehensive survey designed to provide essential data on the health and demographic status of Nigeria's population. It aims to inform policymakers and health planners by offering up-to-date and reliable data that reflects the country's demographic trends and health conditions.
The survey covered the 36 states of Nigeria as well as the Federal Capital Territory. The target population in this study is children aged 0-59 months and mothers aged 15-49 years.


## 2.2 Loading and Inspecting dataset

In [437]:
# Import necessary libraries

#We choose to import libraries only when they are needed to track their usage and keep the code organized.

import pandas as pd
import numpy as np

In [438]:
# loading the dataset
df = pd.read_csv('anemia_dataset.csv')

In [439]:
# viewing the first few rows of the dataset
df.head()

Unnamed: 0,Age in 5-year groups,Type of place of residence,Highest educational level,Wealth index combined,Births in last five years,Age of respondent at 1st birth,Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal),Anemia level,Have mosquito bed net for sleeping (from household questionnaire),Smokes cigarettes,Current marital status,Currently residing with husband/partner,When child put to breast,Had fever in last two weeks,Hemoglobin level adjusted for altitude (g/dl - 1 decimal),Anemia level.1,"Taking iron pills, sprinkles or syrup"
0,40-44,Urban,Higher,Richest,1,22,,,Yes,No,Living with partner,Staying elsewhere,Immediately,No,,,Yes
1,35-39,Urban,Higher,Richest,1,28,,,Yes,No,Married,Living with her,Hours: 1,No,,,No
2,25-29,Urban,Higher,Richest,1,26,,,No,No,Married,Living with her,Immediately,No,,,No
3,25-29,Urban,Secondary,Richest,1,25,95.0,Moderate,Yes,No,Married,Living with her,105,No,114.0,Not anemic,No
4,20-24,Urban,Secondary,Richest,1,21,,,Yes,No,No longer living together/separated,,Immediately,No,,,No


In [440]:
# checking for the number of rows and columns of the dataset

rows, columns = df.shape

print(f'The dataset has:\n {rows} number of rows and \n {columns} number of columns')

The dataset has:
 33924 number of rows and 
 17 number of columns


In [441]:
#Generating an overview of the structure of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33924 entries, 0 to 33923
Data columns (total 17 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   Age in 5-year groups                                                   33924 non-null  object 
 1   Type of place of residence                                             33924 non-null  object 
 2   Highest educational level                                              33924 non-null  object 
 3   Wealth index combined                                                  33924 non-null  object 
 4   Births in last five years                                              33924 non-null  int64  
 5   Age of respondent at 1st birth                                         33924 non-null  int64  
 6   Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)  13136 non-null 

In [442]:
#Determining the number of datatypes in our dataset

df.dtypes.value_counts()

object     13
float64     2
int64       2
dtype: int64

In [443]:
# Generating  descriptive statistics of the numeric columns

df.describe() 

Unnamed: 0,Births in last five years,Age of respondent at 1st birth,Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal),Hemoglobin level adjusted for altitude (g/dl - 1 decimal)
count,33924.0,33924.0,13136.0,10182.0
mean,1.823783,19.570776,114.367235,101.270183
std,0.70546,4.313172,15.915408,15.569583
min,1.0,12.0,20.0,29.0
25%,1.0,16.0,105.0,92.0
50%,2.0,19.0,115.0,103.0
75%,2.0,22.0,125.0,112.0
max,6.0,48.0,218.0,170.0


In [444]:
#listing the names of the columns
df.columns

Index(['Age in 5-year groups', 'Type of place of residence',
       'Highest educational level', 'Wealth index combined',
       'Births in last five years', 'Age of respondent at 1st birth',
       'Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)',
       'Anemia level',
       'Have mosquito bed net for sleeping (from household questionnaire)',
       'Smokes cigarettes', 'Current marital status',
       'Currently residing with husband/partner', 'When child put to breast',
       'Had fever in last two weeks',
       'Hemoglobin level adjusted for altitude (g/dl - 1 decimal)',
       'Anemia level.1', 'Taking iron pills, sprinkles or syrup'],
      dtype='object')

In [445]:
#Describing the column names

| Column Name | Description |
|-------------|-------------|
| Age in 5-year groups | Age of the mother, categorized into 5-year intervals (e.g., "25-29"). |
| Type of place of residence | Residential classification: Urban or Rural. |
| Highest educational level | Mother's highest educational attainment, e.g., "No education", "Secondary", "Higher". |
| Wealth index combined | Economic status of the mother, ranked from "Poorest" to "Richest". |
| Births in last five years | Count of live births by the mother in the past five years. |
| Age of respondent at 1st birth | Mother's age at her first live birth. |
| Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal) | Hemoglobin level adjusted for altitude and smoking status. Values appear to be stored in tenths of g/dL (e.g., 95.0 corresponds to 9.5 g/dL). |
| Anemia level | Anemia category ("Not anemic", "Mild", "Moderate", "Severe") that accompanies the altitude- and smoking-adjusted hemoglobin column; both have 13,136 non-missing values. |
| Have mosquito bed net for sleeping (from household questionnaire) | Presence of a mosquito net in the household (Yes/No). |
| Smokes cigarettes | Mother's smoking status (Yes/No). |
| Current marital status | Mother's marital status (e.g., "Married", "Living with partner"). |
| Currently residing with husband/partner | Co-habitation status with spouse or partner (e.g., "Living with her", "Staying elsewhere"). |
| When child put to breast | Timing of initial breastfeeding (e.g., "Immediately", "Hours: 1"). |
| Had fever in last two weeks | Fever incidence in the child during the past two weeks (Yes/No). |
| Hemoglobin level adjusted for altitude (g/dl - 1 decimal) | Hemoglobin level adjusted solely for altitude, stored on the same scale (e.g., 114.0 corresponds to 11.4 g/dL). |
| Anemia level.1 | Anemia category that accompanies the altitude-only hemoglobin column; both have 10,182 non-missing values. |
| Taking iron pills, sprinkles or syrup | Iron supplementation status of the child (Yes/No). |


## 2.3 Observations

- Upon examining the initial rows of the dataset, it becomes evident that several columns contain missing values, necessitating appropriate imputation methods during the data cleaning process.

- The dataset comprises 33,924 rows and 17 columns, providing a substantial foundation for analysis.

- The data types within the dataset include a mix of objects, int64, and float64, indicating a variety of categorical and numerical data. This diversity allows for different analytical approaches tailored to each data type.

- The statistical summary from the `describe()` method shows a mean age at first birth of about 19.6 years and a mean of about 1.8 births in the last five years. Both hemoglobin columns vary widely on the dataset's scale (standard deviations of roughly 16, minima of 20 and 29, and maxima of 218 and 170), which points to varying degrees of anemia across the population and to some extreme values that may need checking. This underscores the importance of further investigation into the socioeconomic and health-related factors that contribute to these health outcomes.



## 2.4 Data Limitation

1. `Data quality and completeness`: Incomplete or poorly collected data can hinder accurate analysis. Missing values, inconsistencies, or errors in the dataset may result in biased outcomes or misinterpretations. A study by the World Bank (2018) highlights that data collection in resource-limited settings often lacks standardization and can be subject to human error.

2. `Sample size and representativeness`: Limited sample sizes or non-representative samples can affect the generalizability of the findings. If the data collected is skewed towards specific demographics (e.g., urban versus rural populations), the results may not accurately reflect the broader population. This challenge is particularly noted in studies conducted in developing countries, where access to comprehensive datasets can be restricted.

3. `Socioeconomic data limitations`: Socioeconomic factors may be underreported or inaccurately recorded, especially in informal settlements or among marginalized populations. Factors such as income, education level, and access to healthcare are crucial for understanding anemia severity but may be inadequately captured in surveys.

4. `Temporal factors`: Anemia levels and their socioeconomic determinants may change over time due to policy changes, economic fluctuations, or health interventions. Data that is not longitudinal may fail to capture these dynamics, leading to outdated conclusions.

5. `Cultural and behavioral factors`: Anemia research often relies on self-reported data regarding dietary habits, health-seeking behaviors, and utilization of healthcare services. Cultural biases or reluctance to disclose certain behaviors can affect the reliability of this information.

# 3. DATA PREPARATION

## 3.1 Data Cleaning

Data cleaning involves identifying and correcting inaccuracies or inconsistencies in data to improve its quality and reliability for analysis and modeling. We will first separate the numerical columns from the categorical ones and clean these two datasets separately, as each type requires a different cleaning approach. The cleaning techniques applied include the following:

- Checking for missing values.
- Imputing missing values.
- Dropping unnecessary columns and rows.
- Renaming columns.
- Encoding categorical columns.
- Changing datatypes for consistency.

Once both datasets have been cleaned, we will concatenate them for further analysis.

## Handling numerical columns

In [446]:
# Separate numerical columns (int and float types)
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

numeric_cols.sample(n=5)


Unnamed: 0,Births in last five years,Age of respondent at 1st birth,Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal),Hemoglobin level adjusted for altitude (g/dl - 1 decimal)
12181,2,17,,
7660,2,19,126.0,99.0
19415,2,18,,
575,2,38,,
17207,1,17,,


In [447]:
# checking for missing values
# .values.any() returns True if at least one cell in the DataFrame is missing
print(numeric_cols.isnull().values.any())

True


Checking for missing values has returned True, which confirms that the `numeric_cols` dataset has missing values. We will now count the missing values in each column and sort the results in descending order.

In [448]:
# Counting the missing values in each column, sorted in descending order

numeric_cols.isnull().sum().sort_values(ascending=False)


Hemoglobin level adjusted for altitude (g/dl - 1 decimal)                23742
Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)    20788
Age of respondent at 1st birth                                               0
Births in last five years                                                    0
dtype: int64
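
The counts above can also be expressed as percentages, which makes columns of different sizes easier to compare. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy column names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, sorted descending."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_percent": percent})
    return summary.sort_values("missing_count", ascending=False)


# Toy frame standing in for numeric_cols (values are made up)
toy = pd.DataFrame({
    "births": [1, 2, 2, 1],
    "hemo_alt": [114.0, np.nan, np.nan, 102.0],
    "hemo_alt_smoke": [95.0, np.nan, 120.0, 110.0],
})

summary = missing_summary(toy)
print(summary)
```

Applied to the counts shown above, 23,742 and 20,788 missing values out of 33,924 rows correspond to roughly 70% and 61% of the rows respectively.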

After a closer look, we will drop the column `Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)`, since the smoking information it accounts for is already available in the `Smokes cigarettes` column of `categorical_cols`.

In [449]:
# Dropping the column 'Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)' from numeric_cols.
# Reassigning the result (instead of inplace=True on a selection of df) avoids pandas' SettingWithCopyWarning.

numeric_cols = numeric_cols.drop(columns=['Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)'])


For `Hemoglobin level adjusted for altitude (g/dl - 1 decimal)`, we will drop the rows with missing values. The number of such rows (23,742) equals the number of rows with missing values in `Anemia level.1`, so dropping them keeps the numerical and categorical subsets the same length.

In [450]:
# Dropping rows with missing values in the numeric_cols dataset
numeric_cols = numeric_cols.dropna()

# checking to confirm that the numeric_cols dataset has no missing values
numeric_cols.isnull().sum()

Births in last five years                                    0
Age of respondent at 1st birth                               0
Hemoglobin level adjusted for altitude (g/dl - 1 decimal)    0
dtype: int64

In [451]:
#Renaming columns to improve clarity

numeric_cols.rename(columns={
    'Births in last five years': 'Births_Last_5Y',
    'Age of respondent at 1st birth': 'Age_First_Birth',
    'Hemoglobin level adjusted for altitude (g/dl - 1 decimal)': 'Hemo_Alt'
}, inplace=True)

# Display the updated DataFrame
numeric_cols.head()

Unnamed: 0,Births_Last_5Y,Age_First_Birth,Hemo_Alt
3,1,25,114.0
5,1,30,119.0
6,2,32,102.0
9,1,19,113.0
12,1,24,109.0


In [452]:
# converting the 'Hemo_Alt' column to int
numeric_cols['Hemo_Alt'] = numeric_cols['Hemo_Alt'].astype(int)

In [453]:
# previewing the first rows of the cleaned numeric_cols dataset
numeric_cols.head()

Unnamed: 0,Births_Last_5Y,Age_First_Birth,Hemo_Alt
3,1,25,114
5,1,30,119
6,2,32,102
9,1,19,113
12,1,24,109
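
The `Hemo_Alt` values appear to be stored in tenths of g/dL (114 would be 11.4 g/dL). Under that assumption, the relationship between hemoglobin and the anemia categories can be sketched with the WHO cut-offs for children aged 6-59 months (not anemic at 11.0 g/dL or above, mild 10.0-10.9, moderate 7.0-9.9, severe below 7.0). The function name `classify_child_anemia` is hypothetical, and whether the dataset's labels were derived with exactly these cut-offs is an assumption.

```python
import pandas as pd


def classify_child_anemia(hemo_tenths: int) -> str:
    """Map hemoglobin in tenths of g/dL (e.g. 114 = 11.4 g/dL) to a severity label.

    Cut-offs are the assumed WHO thresholds for children aged 6-59 months.
    """
    if hemo_tenths >= 110:
        return "Not anemic"
    if hemo_tenths >= 100:
        return "Mild"
    if hemo_tenths >= 70:
        return "Moderate"
    return "Severe"


# The first five values mirror the Hemo_Alt preview above; the last two are made up
sample = pd.Series([114, 119, 102, 113, 109, 72, 65], name="Hemo_Alt")
labels = sample.map(classify_child_anemia)
print(pd.DataFrame({"Hemo_Alt": sample, "label": labels}))
```

If this mapping holds for the full dataset, `Hemo_Alt` largely determines the anemia label on its own, which is worth keeping in mind when deciding whether to use it as a model feature.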


## Insights drawn after numerical columns cleaning

- All columns are fully populated, with no missing values, guaranteeing that analyses conducted on this DataFrame will leverage the entirety of the dataset.
- The total number of rows has been significantly reduced from 33,924 to 10,182.
- With only three integer columns, the DataFrame is compact and easy to handle in later computations and data transformations.

## Handling categorical columns

In [454]:
# Separate categorical columns (object and category types)
categorical_cols = df.select_dtypes(include=['object', 'category'])
categorical_cols.head()

Unnamed: 0,Age in 5-year groups,Type of place of residence,Highest educational level,Wealth index combined,Anemia level,Have mosquito bed net for sleeping (from household questionnaire),Smokes cigarettes,Current marital status,Currently residing with husband/partner,When child put to breast,Had fever in last two weeks,Anemia level.1,"Taking iron pills, sprinkles or syrup"
0,40-44,Urban,Higher,Richest,,Yes,No,Living with partner,Staying elsewhere,Immediately,No,,Yes
1,35-39,Urban,Higher,Richest,,Yes,No,Married,Living with her,Hours: 1,No,,No
2,25-29,Urban,Higher,Richest,,No,No,Married,Living with her,Immediately,No,,No
3,25-29,Urban,Secondary,Richest,Moderate,Yes,No,Married,Living with her,105,No,Not anemic,No
4,20-24,Urban,Secondary,Richest,,Yes,No,No longer living together/separated,,Immediately,No,,No


In [455]:
# checking for missing values
# .values.any() returns True if at least one cell in the DataFrame is missing
print(categorical_cols.isnull().values.any())

True


Checking for missing values has returned True, which confirms that the `categorical_cols` dataset has missing values. We will now count the missing values in each column and sort the results in descending order.

In [456]:
# Counting the missing values in each categorical column, sorted in descending order
categorical_missing_counts = categorical_cols.isnull().sum().sort_values(ascending=False)

# Display the sorted result
categorical_missing_counts

Anemia level.1                                                       23742
Anemia level                                                         20788
When child put to breast                                             12756
Taking iron pills, sprinkles or syrup                                 3211
Had fever in last two weeks                                           3211
Currently residing with husband/partner                               1698
Current marital status                                                   0
Smokes cigarettes                                                        0
Have mosquito bed net for sleeping (from household questionnaire)        0
Wealth index combined                                                    0
Highest educational level                                                0
Type of place of residence                                               0
Age in 5-year groups                                                     0
dtype: int64

We will remove the columns `When child put to breast` and `Anemia level` from our `categorical_cols` DataFrame. The `When child put to breast` column mixes formats (for example "Immediately", "Hours: 1" and "105") and has 12,756 missing values, while the `Anemia level` column provides information similar to that found in `Anemia level.1`, despite not being an exact match.

In [457]:
# Dropping unnecessary columns safely
categorical_cols = categorical_cols.loc[:, ~categorical_cols.columns.isin(['When child put to breast', 'Anemia level'])]


In [458]:
categorical_cols.isnull().sum()

Age in 5-year groups                                                     0
Type of place of residence                                               0
Highest educational level                                                0
Wealth index combined                                                    0
Have mosquito bed net for sleeping (from household questionnaire)        0
Smokes cigarettes                                                        0
Current marital status                                                   0
Currently residing with husband/partner                               1698
Had fever in last two weeks                                           3211
Anemia level.1                                                       23742
Taking iron pills, sprinkles or syrup                                 3211
dtype: int64

In [459]:
# Defining the columns to fill with 'unknown'
columns_to_fill = [
    'Taking iron pills, sprinkles or syrup',
    'Had fever in last two weeks',
    'Currently residing with husband/partner'
]

# Filling empty cells with 'unknown'
categorical_cols[columns_to_fill] = categorical_cols[columns_to_fill].fillna('unknown')

In [460]:
# Displaying the missing-value counts after filling the empty cells of the selected columns with 'unknown'
categorical_cols.isnull().sum()

Age in 5-year groups                                                     0
Type of place of residence                                               0
Highest educational level                                                0
Wealth index combined                                                    0
Have mosquito bed net for sleeping (from household questionnaire)        0
Smokes cigarettes                                                        0
Current marital status                                                   0
Currently residing with husband/partner                                  0
Had fever in last two weeks                                              0
Anemia level.1                                                       23742
Taking iron pills, sprinkles or syrup                                    0
dtype: int64

After imputation, we notice that the number of missing values in `Anemia level.1` is equal to the number of missing values in `Hemoglobin level adjusted for altitude (g/dl - 1 decimal)`. We will drop the rows with missing values so that the categorical subset keeps the same number of rows as the numerical one.
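
Equal missing-value counts do not by themselves guarantee that the same rows are missing in both columns. A quick check, sketched here on a made-up toy frame with shortened, hypothetical column names, is to compare the two null masks row by row before dropping anything.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; column names are shortened for illustration
toy = pd.DataFrame({
    "hemo_alt": [114.0, np.nan, 102.0, np.nan],
    "anemia_level_1": ["Not anemic", np.nan, "Mild", np.nan],
})

hemo_missing = toy["hemo_alt"].isnull()
label_missing = toy["anemia_level_1"].isnull()

# True only if every row is either missing in both columns or present in both
same_rows_missing = bool((hemo_missing == label_missing).all())
print(same_rows_missing)
```

On the real data, the same comparison could be run on `df` with the full column names; a result of `True` would confirm that dropping rows separately in the two subsets removes exactly the same rows.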

In [461]:
# dropping the rows with missing values (only 'Anemia level.1' still has any)
categorical_cols = categorical_cols.dropna()


# checking to confirm that the categorical_cols dataset has no missing values
categorical_cols.isnull().sum()

Age in 5-year groups                                                 0
Type of place of residence                                           0
Highest educational level                                            0
Wealth index combined                                                0
Have mosquito bed net for sleeping (from household questionnaire)    0
Smokes cigarettes                                                    0
Current marital status                                               0
Currently residing with husband/partner                              0
Had fever in last two weeks                                          0
Anemia level.1                                                       0
Taking iron pills, sprinkles or syrup                                0
dtype: int64

In [462]:
categorical_cols.columns

Index(['Age in 5-year groups', 'Type of place of residence',
       'Highest educational level', 'Wealth index combined',
       'Have mosquito bed net for sleeping (from household questionnaire)',
       'Smokes cigarettes', 'Current marital status',
       'Currently residing with husband/partner',
       'Had fever in last two weeks', 'Anemia level.1',
       'Taking iron pills, sprinkles or syrup'],
      dtype='object')

In [463]:
#renaming columns to improve clarity
categorical_cols.rename(columns={'Have mosquito bed net for sleeping (from household questionnaire)':'Have mosquito net',
                                 'Currently residing with husband/partner':'Residing with spouse','Type of place of residence':'Area_Type',
                                 'Taking iron pills, sprinkles or syrup':'Taking_meds','Anemia level.1':'Anemia level','Age in 5-year groups':'Age_group'}, inplace=True)

# Display the updated DataFrame
categorical_cols.head()

Unnamed: 0,Age_group,Area_Type,Highest educational level,Wealth index combined,Have mosquito net,Smokes cigarettes,Current marital status,Residing with spouse,Had fever in last two weeks,Anemia level,Taking_meds
3,25-29,Urban,Secondary,Richest,Yes,No,Married,Living with her,No,Not anemic,No
5,30-34,Urban,Higher,Richest,Yes,No,Married,Living with her,No,Not anemic,No
6,35-39,Urban,Secondary,Richest,Yes,No,Married,Living with her,No,Mild,Yes
9,20-24,Urban,Secondary,Richest,Yes,No,Married,Living with her,No,Not anemic,Yes
12,25-29,Urban,Higher,Richest,Yes,No,Married,Living with her,No,Mild,No


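## Encoding categorical columns

Encoding categorical columns is the process of transforming categorical (non-numeric) data into a numerical format so that it can be used effectively in machine learning models and other analytical processes that require numerical inputs. Since our categorical columns contain a meaningful order, we will apply `ordinal encoding`. Note that `OrdinalEncoder` with its default settings assigns codes according to the sorted (alphabetical) order of the category labels, so the resulting integers match the intended ranking only when an explicit category order is supplied.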

### Ordinal Encoding

In [464]:
# Importing necessary library
from sklearn.preprocessing import OrdinalEncoder

# Initializing OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

# Applying OrdinalEncoder to categorical columns
categorical_cols_encoded = ordinal_encoder.fit_transform(categorical_cols)

# Converting the result back to a DataFrame with the same column names
categorical_cols_encoded = pd.DataFrame(categorical_cols_encoded, columns=categorical_cols.columns)

# Displaying the first few rows of the encoded DataFrame
categorical_cols_encoded.head()



Unnamed: 0,Age_group,Area_Type,Highest educational level,Wealth index combined,Have mosquito net,Smokes cigarettes,Current marital status,Residing with spouse,Had fever in last two weeks,Anemia level,Taking_meds
0,2.0,1.0,3.0,4.0,1.0,0.0,2.0,0.0,1.0,2.0,1.0
1,3.0,1.0,0.0,4.0,1.0,0.0,2.0,0.0,1.0,2.0,1.0
2,4.0,1.0,3.0,4.0,1.0,0.0,2.0,0.0,1.0,0.0,2.0
3,1.0,1.0,3.0,4.0,1.0,0.0,2.0,0.0,1.0,2.0,2.0
4,2.0,1.0,0.0,4.0,1.0,0.0,2.0,0.0,1.0,0.0,1.0


In [465]:
# converting the encoded values in categorical_cols_encoded to int type

categorical_cols_encoded = categorical_cols_encoded.astype(int)
categorical_cols_encoded.head()

Unnamed: 0,Age_group,Area_Type,Highest educational level,Wealth index combined,Have mosquito net,Smokes cigarettes,Current marital status,Residing with spouse,Had fever in last two weeks,Anemia level,Taking_meds
0,2,1,3,4,1,0,2,0,1,2,1
1,3,1,0,4,1,0,2,0,1,2,1
2,4,1,3,4,1,0,2,0,1,0,2
3,1,1,3,4,1,0,2,0,1,2,2
4,2,1,0,4,1,0,2,0,1,0,1


Now that both the numerical and the categorical columns have been cleaned, we will concatenate the two datasets for further analysis.

## Concatenate the Numerical and Categorical Columns

In [466]:
# Checking the number of rows and columns of the two datasets before concatenating
print(numeric_cols.shape)
print(categorical_cols_encoded.shape)


(10182, 3)
(10182, 11)


In [467]:
# resetting the index of both numeric_cols and categorical_cols_encoded so that rows are paired by position
numeric_cols.reset_index(drop=True, inplace=True)
categorical_cols_encoded.reset_index(drop=True, inplace=True)

In [468]:
# concatenating the two datasets
anemia_df = pd.concat([numeric_cols, categorical_cols_encoded], axis=1)

# displaying the shape of the combined dataset to check that its row count matches numeric_cols and categorical_cols_encoded
anemia_df.shape


(10182, 14)
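
The index reset matters because `pd.concat` with `axis=1` aligns rows on index labels, not on position. The following sketch, using small made-up frames, illustrates what can happen when one frame keeps its original row labels and the other has a fresh 0..n-1 index.

```python
import pandas as pd

# Toy frame that kept the original row labels after dropna()
numeric_part = pd.DataFrame({"Hemo_Alt": [114, 119, 102]}, index=[3, 5, 6])
# A freshly built frame (as after encoding) gets a new 0..n-1 index
encoded_part = pd.DataFrame({"Anemia level": [2, 2, 0]})

# Without resetting, concat aligns on labels and pads the mismatches with NaN
misaligned = pd.concat([numeric_part, encoded_part], axis=1)

# Resetting the index first pairs the rows by position
aligned = pd.concat([numeric_part.reset_index(drop=True), encoded_part], axis=1)

print(misaligned.shape, aligned.shape)
print(int(misaligned.isnull().sum().sum()), int(aligned.isnull().sum().sum()))
```

Pairing by position is only valid when both frames were filtered down to the same rows in the same order, which is what the matching shapes and missing-value counts above are meant to support.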

In [None]:
# confirming that all columns are of an integer type
anemia_df.dtypes

Births_Last_5Y                 int64
Age_First_Birth                int64
Hemo_Alt                       int32
Age_group                      int32
Area_Type                      int32
Highest educational level      int32
Wealth index combined          int32
Have mosquito net              int32
Smokes cigarettes              int32
Current marital status         int32
Residing with spouse           int32
Had fever in last two weeks    int32
Anemia level                   int32
Taking_meds                    int32
dtype: object

In [469]:
#displaying the combined dataset
anemia_df.sample(n=5)

Unnamed: 0,Births_Last_5Y,Age_First_Birth,Hemo_Alt,Age_group,Area_Type,Highest educational level,Wealth index combined,Have mosquito net,Smokes cigarettes,Current marital status,Residing with spouse,Had fever in last two weeks,Anemia level,Taking_meds
1803,1,16,72,0,1,1,0,1,0,2,1,1,1,1
3370,3,18,121,2,1,1,0,1,0,2,0,2,2,1
1305,2,18,97,3,0,1,1,1,0,2,0,2,1,1
3609,2,18,99,5,1,1,4,1,0,2,0,1,1,1
3437,2,17,112,2,0,1,2,0,0,2,0,1,2,1


## 3.2 Exploratory Data Analysis (EDA)

Now that our dataset is cleaned up, we will proceed to `Exploratory Data Analysis`, which involves examining and visualizing the dataset to summarize its main characteristics, often with the help of statistical graphics.