# Forecasting Future Healthcare Expenditure Trends in Eastern Africa

## Overview

Nyika Analytika embarks on a comprehensive exploration of healthcare financing and expenditure in Eastern Africa, delving into the unique challenges and opportunities prevalent in the region. The countries within Eastern Africa, including Djibouti, Eritrea, Ethiopia, Kenya, and others, grapple with multifaceted healthcare obstacles shaped by diverse economic, political, and infrastructural landscapes. A primary concern lies in the socioeconomic inequalities leading to disparities in healthcare access and affordability. Of particular note is the region's high reliance on out-of-pocket (OOP) payments, a factor that not only hinders medical service utilization but also poses a risk of pushing individuals into poverty. Nyika Analytika recognizes the urgency of addressing these issues, especially in light of the disproportionately high out-of-pocket health expenses in Africa compared to other continents.

Governments in Eastern Africa contribute less than 30% of total health expenditure, in stark contrast to high-income countries, where government contributions hover around 80%. The economic challenges exacerbated by the COVID-19 pandemic have further strained the healthcare financing system, necessitating innovative strategies to fortify health systems and ensure equitable access to healthcare for all. Nyika Analytika seeks to unravel the complex dynamics of healthcare financing by investigating the percentage of GDP allocated to health expenditure, the fraction attributed to out-of-pocket expenses, and the interplay between public and private spending. Through analysis of World Bank data and relevant health system indicators, the project aims to construct a predictive model that forecasts future healthcare expenditure trends, providing insights for policy decisions and fostering collaboration between stakeholders.

As the project unfolds, Nyika Analytika anticipates not only revealing key trends, challenges, and opportunities in healthcare financing but also offering data-driven recommendations to government finance ministries, healthcare startups, and international donors and NGOs. The success metric is defined by achieving a low root mean square error, ensuring the accuracy and reliability of the predictive model. The stakeholders involved, including government finance ministries, healthcare startups, and international donors, play pivotal roles in shaping the future of healthcare financing in Eastern Africa. Nyika Analytika's commitment is anchored in providing actionable insights that pave the way for improved healthcare accessibility, affordability, and sustainability in the region.

## Business Understanding

The project initiated by Nyika Analytika centers around a profound understanding of the healthcare financing and expenditure landscape in Eastern Africa. This region, encompassing nations such as Djibouti, Eritrea, Ethiopia, Kenya, and others, grapples with unique challenges emanating from diverse economic, political, and infrastructural contexts. Several critical aspects drive the urgency and significance of this investigation:

1. **Socioeconomic Inequalities and Out-of-Pocket Payments:**
   - Eastern Africa faces a pressing issue of socioeconomic disparities influencing healthcare access and affordability. The high dependence on out-of-pocket (OOP) payments poses a tangible threat, potentially deterring individuals from seeking essential medical services and, alarmingly, pushing them into poverty. Comparative analyses reveal that OOP health expenses in Africa significantly surpass those in other continents, underscoring the need for targeted interventions.

2. **Low Government Health Spending and COVID-19 Impact:**
   - Government health spending in these nations remains notably low, with government sources contributing less than 30% of total health expenditure in low-income countries. The recent economic challenges heightened by the COVID-19 pandemic have exacerbated the strain on healthcare financing systems. This necessitates a thorough examination of the current economic priorities and the potential impact of external factors on the allocation of healthcare budgets.

3. **Crucial Role of Healthcare Financing Understanding:**
   - A nuanced comprehension of healthcare financing is pivotal as it serves as a barometer for a country's economic priorities. The proportion of GDP allocated to health is indicative of the nation's emphasis on healthcare, reflecting its commitment to the well-being of its population. High personal healthcare costs can act as deterrents, leading to untreated illnesses and potential public health crises. Additionally, a significant reliance on external sources for health expenditure raises sustainability concerns, making it imperative to scrutinize the balance between government and private healthcare spending.

4. **Need for Innovative Financing Strategies:**
   - The challenges underscored by the project emphasize the immediate need for innovative financing strategies. As the COVID-19 pandemic exacerbates existing vulnerabilities, there is an urgent call to strengthen health systems. Analyzing historical data becomes a critical component in identifying patterns, trends, and potential areas of improvement. This analysis can form the foundation for developing predictive models that accurately forecast future healthcare expenditure, aiding stakeholders in strategic decision-making.

Nyika Analytika recognizes the complexity of the healthcare financing landscape in Eastern Africa and aims to dissect it comprehensively. By addressing the socioeconomic disparities, low government spending, and the impact of external factors, the project seeks to provide actionable insights that pave the way for improved healthcare accessibility and sustainability in the region.

## Problem Statement

In the face of complex healthcare challenges within Eastern Africa, Nyika Analytika identifies a critical need to address the prevailing issues in healthcare financing and expenditure. The overarching problem stems from the multifaceted nature of the region's healthcare landscape, marked by socioeconomic inequalities, a high reliance on out-of-pocket (OOP) payments, and notably low government health spending. The recent economic strain exacerbated by the COVID-19 pandemic further intensifies the urgency to strengthen healthcare financing systems. Therefore, the central problem can be defined as follows:

**Primary Problem:** The healthcare financing system in Eastern Africa is confronted with profound challenges rooted in socioeconomic disparities, wherein individuals face barriers to accessing essential medical services due to the high dependency on out-of-pocket payments. The inadequate allocation of government funds, especially in the wake of the economic challenges triggered by the COVID-19 pandemic, poses a significant threat to the region's healthcare sustainability. This predicament necessitates a comprehensive exploration of historical data to understand the intricacies of healthcare financing and expenditure. The ultimate aim is to construct a predictive model that can accurately forecast future healthcare expenditure trends, offering actionable insights to stakeholders.

**Sub-problems:**
1. *Socioeconomic Disparities and Out-of-Pocket Expenditure:* 
   - Individuals in Eastern Africa encounter challenges in accessing healthcare due to socioeconomic inequalities, compounded by a substantial reliance on out-of-pocket payments. This results in delayed or forgone medical treatments, posing risks to public health.

2. *Low Government Health Spending and COVID-19 Impact:*
   - Governments in the region contribute less than 30% of total health expenditure, which leaves the healthcare financing system fragile. The recent economic strain caused by the COVID-19 pandemic exacerbates this issue, demanding an examination of its impact on healthcare budgets and of the need for innovative financing strategies.

3. *Lack of Predictive Models for Future Expenditure:*
   - There exists a gap in predictive modeling for healthcare expenditure trends in Eastern Africa. The absence of such models hinders the ability of stakeholders, including government finance ministries, healthcare startups, and international donors, to make informed decisions and implement effective strategies for sustained healthcare accessibility and affordability.

The convergence of these issues necessitates a focused effort to analyze historical data, identify patterns, and develop a predictive model that not only elucidates the current state of healthcare financing but also acts as a strategic tool for shaping the future of healthcare in Eastern Africa. Nyika Analytika endeavors to address this problem by providing a data-driven framework for stakeholders to enhance healthcare financing and ensure equitable access to medical services.

## Objectives

**Main Objective:**
The primary objective of this project undertaken by Nyika Analytika is to conduct a comprehensive analysis of healthcare financing and expenditure trends in Eastern Africa. The overarching goal is to construct a predictive model that accurately forecasts future healthcare expenditure percentage changes. This main objective encapsulates the broader aim of understanding, predicting, and contributing to the improvement of the healthcare financing landscape in the region.

**Specific Objectives:**

1. **Explore World Bank Data:**
   - Analyze World Bank data to identify and gather relevant health system indicators specific to Eastern Africa. This involves a meticulous examination of datasets to extract key variables that play a crucial role in understanding the dynamics of healthcare financing within the region.

2. **Distinguish Between Public and Private Spending:**
   - Conduct a detailed analysis of healthcare expenditure, with a focus on distinguishing between public and private spending. By dissecting the contributions from both sectors, the project aims to unravel the nuances of their roles, implications, and trends in shaping the healthcare financing landscape.

3. **Identify Correlations Between Healthcare Financing Indicators:**
   - Explore and analyze correlations between various healthcare financing indicators. This involves identifying patterns, relationships, and dependencies among different indicators, offering insights into the interconnected nature of elements influencing healthcare expenditure in Eastern Africa.

4. **Offer Recommendations for Governments and Stakeholders:**
   - Based on the analysis, provide actionable recommendations for government finance ministries and other stakeholders. These recommendations aim to highlight areas of improvement, suggest potential collaborations, and propose strategies to enhance healthcare financing, ultimately fostering equitable access to healthcare in the region.

**Success Metric:**
The success of this project will be measured by achieving a low root mean square error (RMSE) in the predictive model. A low RMSE indicates the accuracy and reliability of the model in forecasting healthcare expenditure percentage changes. This metric serves as a quantitative benchmark to ensure the efficacy of the predictive tool developed through this analysis.
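
As a minimal sketch of the success metric, the snippet below computes RMSE with NumPy. The function name `rmse` and the two lists of percentage changes are hypothetical, illustrative values only; they are not results from this project.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical year-over-year percentage changes (illustrative values only)
actual_changes = [2.0, -1.0, 3.0, 0.5]
predicted_changes = [1.5, -0.5, 2.0, 1.0]

score = rmse(actual_changes, predicted_changes)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, so a low value suggests the forecasts are rarely far off.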

**Stakeholders:**
The key stakeholders in this endeavor include:
- **Government Finance Ministries:** Utilizing insights to allocate healthcare budgets based on past trends and accurate future predictions, ensuring healthcare remains accessible and affordable.
- **Healthcare Startups:** Leveraging information to launch innovative financing solutions or health tech services in regions with the highest private expenditure.
- **International Donors and NGOs:** Allocating aid efficiently and effectively, targeting regions with the most significant healthcare needs based on the findings of the analysis.

## Data Understanding

Nyika Analytika's project on healthcare financing and expenditure in Eastern Africa places a significant emphasis on comprehending the intricacies of health development indicators. The focus revolves around health system indicators, and key metrics have been identified to shed light on the current state of healthcare financing in the region. Here are the selected indicators along with their descriptions:

- **Current Health Expenditure (% of GDP):**
   - Definition: Represents the total health expenditure as a percentage of the Gross Domestic Product (GDP).
   - Significance: Indicates the priority given to health in the national budget and reflects the nation's commitment to healthcare.

- **External Health Expenditure (% of Current Health Expenditure):**
   - Definition: Indicates the percentage of health expenditure that comes from external sources, such as foreign aid or grants.
   - Significance: Highlights the extent of dependency on foreign aid, raising sustainability concerns and influencing policy decisions.

- **Domestic General Government Health Expenditure (% of Current Health Expenditure):**
   - Definition: Indicates the percentage of health expenditure that is funded by the domestic government.
   - Significance: Offers insights into the government's financial commitment to healthcare and its role in supporting the health system.

- **Domestic General Government Health Expenditure (% of GDP):**
   - Definition: Represents the government health expenditure as a percentage of GDP.
   - Significance: Provides a macroeconomic perspective on the allocation of resources to healthcare within the overall national economic framework.

- **Domestic General Government Health Expenditure (% of General Government Expenditure):**
   - Definition: Measures health expenditure as a percentage of the total government expenditure.
   - Significance: Illustrates the relative importance of healthcare in the broader context of government spending priorities.

- **Out-of-Pocket Expenditure (% of Current Health Expenditure):**
   - Definition: Measures the percentage of health expenditure that is paid directly by individuals, without insurance or reimbursement.
   - Significance: Highlights the burden on individuals and potential barriers to healthcare access, impacting public health outcomes.

- **Domestic Private Health Expenditure (% of Current Health Expenditure):**
   - Definition: Measures the percentage of health expenditure that is funded by private domestic sources, such as private insurance or businesses.
   - Significance: Provides insights into the role of the private sector in healthcare financing, influencing policy decisions and collaborations.

The selected indicators offer a comprehensive view of the healthcare financing landscape in Eastern Africa. Nyika Analytika aims to leverage these data points to uncover patterns, relationships, and trends, ultimately contributing to the development of a predictive model for future healthcare expenditure. The understanding of these indicators forms the foundation for meaningful insights and recommendations for stakeholders involved in shaping the healthcare future of the region.
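
The indicators above are arithmetically related, which gives a quick consistency check. The sketch below uses the Djibouti 2010 values that appear in the notebook output in the Data Preparation section. It assumes that the government, private, and external shares together partition current health expenditure; that assumption is ours and is not stated in this notebook, and the variable names are illustrative.

```python
# Djibouti, 2010, values taken from the notebook's own output
che_pct_gdp = 3.061504            # Current health expenditure (% of GDP)
gov_share_of_che = 60.679157      # Domestic general government (% of CHE)
private_share_of_che = 29.537765  # Domestic private (% of CHE)
gov_pct_gdp_reported = 1.857695   # Domestic general government (% of GDP)

# Government spending as % of GDP should equal CHE (% of GDP) times the
# government share of CHE
gov_pct_gdp_derived = che_pct_gdp * gov_share_of_che / 100

# ASSUMPTION: the three funding sources sum to 100% of CHE, so the external
# share is whatever remains after government and private shares
implied_external_share = 100 - gov_share_of_che - private_share_of_che

print(f"Derived government health expenditure (% of GDP): {gov_pct_gdp_derived:.6f}")
print(f"Reported government health expenditure (% of GDP): {gov_pct_gdp_reported:.6f}")
print(f"Implied external share (% of CHE): {implied_external_share:.6f}")
```

Note that out-of-pocket expenditure is a component of domestic private expenditure, so it should not be added to the three shares a second time.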

### Importing Relevant Libraries

In [1]:
# Importing fundamental libraries
import numpy as np
import pandas as pd
import os
import pickle
import re
import warnings

# Setting up display options
warnings.filterwarnings("ignore")
%matplotlib inline

# Importing necessary libraries for Data Visualization
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

# Importing necessary libraries for Machine Learning and Time Series Analysis
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.api import SimpleExpSmoothing
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from tensorflow import keras

# Importing necessary libraries for Neural Network (Keras)
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

### Preliminary Analysis
A preliminary examination of the datasets helps in understanding the nature, structure, and quality of the data. This involves evaluating the variables, identifying any missing or anomalous values, and checking that the data is suitable for modeling.

Let's initiate this by loading and previewing the datasets:

In [2]:
# Loading dataframes
wdi_df = pd.read_csv("data/WDIData.csv")
series_df = pd.read_csv("data/WDISeries.csv")

# Previewing the datasets
(wdi_df.head(), series_df.head())

(                  Country Name Country Code   
 0  Africa Eastern and Southern          AFE  \
 1  Africa Eastern and Southern          AFE   
 2  Africa Eastern and Southern          AFE   
 3  Africa Eastern and Southern          AFE   
 4  Africa Eastern and Southern          AFE   
 
                                       Indicator Name     Indicator Code  1960   
 0  Access to clean fuels and technologies for coo...     EG.CFT.ACCS.ZS   NaN  \
 1  Access to clean fuels and technologies for coo...  EG.CFT.ACCS.RU.ZS   NaN   
 2  Access to clean fuels and technologies for coo...  EG.CFT.ACCS.UR.ZS   NaN   
 3            Access to electricity (% of population)     EG.ELC.ACCS.ZS   NaN   
 4  Access to electricity, rural (% of rural popul...  EG.ELC.ACCS.RU.ZS   NaN   
 
    1961  1962  1963  1964  1965  ...       2014       2015       2016   
 0   NaN   NaN   NaN   NaN   NaN  ...  17.392349  17.892005  18.359993  \
 1   NaN   NaN   NaN   NaN   NaN  ...   6.720331   7.015917   7.2813

### Initial Insights

The datasets contain information related to various indicators across different countries, focusing on aspects such as access to clean fuels, electricity, agricultural production, fertilizer consumption, land use, and more. Here are some initial insights:

- **Structure of the Data:**
   - The dataset appears to be structured with multiple columns, including `Country Name`, `Country Code`, `Indicator Name`, `Indicator Code`, and years from 1960 to 2022.

- **Indicator Categories:**
   - Indicators are categorized into different topics or series, such as `Environment: Agricultural production` and `Environment: Land use`. Each category has specific indicators related to the topic.

- **Availability of Historical Data:**
   - The dataset covers a substantial time range from 1960 to 2022, providing a historical perspective on the evolution of various indicators over the years.

- **Numerical Values and NaNs:**
   - The numerical values in the dataset represent percentages, quantities, or other relevant measures for each indicator. However, there are many NaN (Not a Number) entries, indicating missing or unavailable data for certain years.

- **Multiple Sources and Licenses:**
   - The data seems to be compiled from various sources, as indicated by the `Source` column. Additionally, there is information about the license type for the data (e.g., "CC BY-4.0," which stands for Creative Commons Attribution 4.0 International License).

- **Development Relevance and Methodology:**
   - Some columns provide information on the development relevance of the indicators, statistical concepts, and methodologies used in data collection.

- **Aggregation Methods:**
   - The `Aggregation method` column suggests how the data is aggregated, with methods like `Sum` or `Weighted average` being used.

- **Additional Notes and Web Links:**
   - There are columns providing additional notes, general comments, and links to related sources or web links.

**Potential Next Steps:**
   - Explore specific indicators or topics of interest in more detail.
   - Clean and preprocess the data by handling NaN values and ensuring consistency.
   - Visualize trends and patterns in the data to gain deeper insights.
   - Consider feature engineering or extraction for specific analyses.

Further analysis would require a more in-depth exploration of specific indicators, their trends, and potential correlations.
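
One way to quantify the missing values noted above is to compute the share of NaN entries per year column. This is a minimal sketch on a toy frame: `toy` and its values are hypothetical stand-ins with the same wide layout as `wdi_df`, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wdi_df (hypothetical values, same wide layout)
toy = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Indicator Name": ["X", "Y", "X", "Y"],
    "1960": [np.nan, np.nan, np.nan, 1.0],
    "2010": [1.0, np.nan, 2.0, 3.0],
    "2020": [1.5, 2.5, 2.0, 3.5],
})

# Year columns are the ones whose names consist only of digits
year_cols = [c for c in toy.columns if c.isdigit()]

# Fraction of missing entries per year, worst years first
missing_share = toy[year_cols].isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real dataframe, a summary like this would help decide which years have enough coverage to be worth modeling.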

## Data Preparation

In [3]:
# Dropping unnecessary columns
wdi_df.drop('Unnamed: 67', axis=1, inplace=True)

The dataframe is reshaped from wide to long format with the Pandas `melt` function, so that each row holds one country, one indicator, and one year. This makes the data easier to filter and analyze.

In [4]:
# Highlighting the column names
id_vars = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code']

# Using pd.melt to make the years into one column
wdi_melt = pd.melt(wdi_df, 
                   id_vars=id_vars, 
                   var_name='Years', 
                   value_name='Score')

# Loading first five rows
wdi_melt.head()

| | Country Name | Country Code | Indicator Name | Indicator Code | Years | Score |
|---|---|---|---|---|---|---|
| 0 | Africa Eastern and Southern | AFE | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.ZS | 1960 | NaN |
| 1 | Africa Eastern and Southern | AFE | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.RU.ZS | 1960 | NaN |
| 2 | Africa Eastern and Southern | AFE | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.UR.ZS | 1960 | NaN |
| 3 | Africa Eastern and Southern | AFE | Access to electricity (% of population) | EG.ELC.ACCS.ZS | 1960 | NaN |
| 4 | Africa Eastern and Southern | AFE | Access to electricity, rural (% of rural popul... | EG.ELC.ACCS.RU.ZS | 1960 | NaN |


Let's focus solely on health development indicators specific to Eastern African countries.

In [5]:
# Isolating Eastern Africa countries
eastern_countries = [
"Djibouti",
"Eritrea",
"Ethiopia",
"Kenya",
"Rwanda",
"Sudan",
"Tanzania",
"Uganda"]

# Creating new dataframe focusing on Eastern Africa countries
eastern_df = wdi_melt[wdi_melt['Country Name'].isin(eastern_countries)]

# Loading the new dataframe
eastern_df.head()

| | Country Name | Country Code | Indicator Name | Indicator Code | Years | Score |
|---|---|---|---|---|---|---|
| 153058 | Djibouti | DJI | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.ZS | 1960 | NaN |
| 153059 | Djibouti | DJI | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.RU.ZS | 1960 | NaN |
| 153060 | Djibouti | DJI | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.UR.ZS | 1960 | NaN |
| 153061 | Djibouti | DJI | Access to electricity (% of population) | EG.ELC.ACCS.ZS | 1960 | NaN |
| 153062 | Djibouti | DJI | Access to electricity, rural (% of rural popul... | EG.ELC.ACCS.RU.ZS | 1960 | NaN |


In [6]:
eastern_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 748944 entries, 153058 to 24881583
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Country Name    748944 non-null  object 
 1   Country Code    748944 non-null  object 
 2   Indicator Name  748944 non-null  object 
 3   Indicator Code  748944 non-null  object 
 4   Years           748944 non-null  object 
 5   Score           280395 non-null  float64
dtypes: float64(1), object(5)
memory usage: 40.0+ MB


- The code defines a list named `eastern_countries` containing the names of Eastern African countries.

- A new dataframe, `eastern_df`, is created by filtering the original dataframe (`wdi_melt`) to include only rows where the 'Country Name' is in the list of Eastern African countries.

- The resulting `eastern_df` dataframe still contains every World Bank development indicator (not only health indicators) for the specified Eastern African countries; the health-specific selection happens in a later step.

- The dataframe has 748,944 entries and consists of six columns: 'Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Years', and 'Score'.

- The 'Score' column has 280,395 non-null entries, indicating the presence of missing values.

- The 'Years' column is likely intended to represent the years, but its data type is currently listed as an object. It might need conversion to a numeric data type for proper analysis.

- The 'Score' column is in float64 data type, suitable for numerical computations.

- Further analysis and cleaning may be required to handle missing values and ensure the data is ready for exploration and interpretation.
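
The numeric conversion suggested above for `Years` can be sketched as follows. This is an illustrative alternative on a toy frame, not the approach the notebook itself takes in the next cells; `toy_long` and its values are hypothetical.

```python
import pandas as pd

# Toy long-format frame with 'Years' stored as strings, as in eastern_df
toy_long = pd.DataFrame({
    "Country Name": ["Kenya"] * 4,
    "Years": ["2009", "2010", "2020", "2021"],
    "Score": [1.0, 2.0, 3.0, 4.0],
})

# Convert the year strings to integers
toy_long["Years"] = pd.to_numeric(toy_long["Years"]).astype(int)

# Series.between is inclusive on both ends by default
window = toy_long[toy_long["Years"].between(2010, 2020)]
print(window)
```

With integer years, range filters no longer depend on every label being a four-digit string.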

In [7]:
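# Keeping only 2010-2020 (inclusive). 'Years' holds four-digit strings,
# so lexicographic string comparison orders them the same way as numbers.
eastern_df = eastern_df[(eastern_df['Years'] >= '2010') & (eastern_df['Years'] <= '2020')]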

- The analysis is restricted to development indicators from Eastern African countries.

- The selection covers 2010 through 2020 inclusive, that is, 11 annual observations per indicator.

- The chosen dataset comprises a total of 8 countries from Eastern Africa.

- Narrowing the scope to a specific region and timeframe is intended to yield more targeted insights into development trends in these countries over this period.

In [8]:
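# Specifying the indicators: every indicator whose name contains 'expenditure'
# (case-sensitive; this also picks up education and other non-health series)
is_expenditure = eastern_df['Indicator Name'].str.contains('expenditure', regex=False)
expenditure = eastern_df.loc[is_expenditure, 'Indicator Name'].unique()
expenditure_df = eastern_df[is_expenditure]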

# Printing overall information of the dataset
expenditure_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7480 entries, 19916894 to 24090766
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    7480 non-null   object 
 1   Country Code    7480 non-null   object 
 2   Indicator Name  7480 non-null   object 
 3   Indicator Code  7480 non-null   object 
 4   Years           7480 non-null   object 
 5   Score           5042 non-null   float64
dtypes: float64(1), object(5)
memory usage: 409.1+ KB


In [9]:
# Merging expenditure_df and series_df
merged_df = pd.merge(expenditure_df, 
                     series_df,
                     left_on="Indicator Code",
                     right_on="Series Code",
                     how='left')

# Printing first five rows
merged_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name_x,Indicator Code,Years,Score,Series Code,Topic,Indicator Name_y,Short definition,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,Djibouti,DJI,Adjusted savings: education expenditure (% of ...,NY.ADJ.AEDU.GN.ZS,2010,7.802883,NY.ADJ.AEDU.GN.ZS,Economic Policy & Debt: National accounts: Adj...,Adjusted savings: education expenditure (% of ...,,...,,,World Bank staff estimates using data from the...,,,,,,CC BY-4.0,
1,Djibouti,DJI,Adjusted savings: education expenditure (curre...,NY.ADJ.AEDU.CD,2010,89418720.0,NY.ADJ.AEDU.CD,Economic Policy & Debt: National accounts: Adj...,Adjusted savings: education expenditure (curre...,,...,,,World Bank staff estimates using data from the...,,,,,,CC BY-4.0,
2,Djibouti,DJI,"Current education expenditure, primary (% of t...",SE.XPD.CPRM.ZS,2010,,SE.XPD.CPRM.ZS,Education: Inputs,"Current education expenditure, primary (% of t...",,...,,,UNESCO Institute for Statistics (UIS). UIS.Sta...,"Current expenditure, primary is calculated by ...",,,,,CC BY-4.0,
3,Djibouti,DJI,"Current education expenditure, secondary (% of...",SE.XPD.CSEC.ZS,2010,,SE.XPD.CSEC.ZS,Education: Inputs,"Current education expenditure, secondary (% of...",,...,,,UNESCO Institute for Statistics (UIS). UIS.Sta...,"Current expenditure, secondary is calculated b...",,,,,CC BY-4.0,
4,Djibouti,DJI,"Current education expenditure, tertiary (% of ...",SE.XPD.CTER.ZS,2010,,SE.XPD.CTER.ZS,Education: Inputs,"Current education expenditure, tertiary (% of ...",,...,,,UNESCO Institute for Statistics (UIS). UIS.Sta...,"Current expenditure, tertiary is calculated by...",,,,,CC BY-4.0,


In [10]:
# Identifying necessary columns in the merged dataset
necessary_columns = ['Country Name', 'Indicator Name_x', 'Topic', 'Years', 'Score']

# Dataframe with necessary columns
merged_df = merged_df[necessary_columns]

# Printing overall info of the dataset.
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7480 entries, 0 to 7479
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country Name      7480 non-null   object 
 1   Indicator Name_x  7480 non-null   object 
 2   Topic             7480 non-null   object 
 3   Years             7480 non-null   object 
 4   Score             5042 non-null   float64
dtypes: float64(1), object(4)
memory usage: 292.3+ KB


- In the World Bank series metadata, indicators on human resources and on expenditure in the healthcare sector are grouped together under the topic "Health: Health systems".

- These indicators were then separated, with the analysis focusing exclusively on the expenditure side of health systems.

- This choice follows from the business understanding established earlier: the analysis targets the financial components of health systems, in line with the project's objectives.

In [11]:
# Isolating necessary indicators
indicators = ['Current health expenditure (% of GDP)',
'Domestic general government health expenditure (% of current health expenditure)',
'Domestic general government health expenditure (% of GDP)',
'Domestic general government health expenditure (% of general government expenditure)',
'Domestic private health expenditure (% of current health expenditure)',
'External health expenditure (% of current health expenditure)',
'Out-of-pocket expenditure (% of current health expenditure)']

# Dataframe with necessary columns
health_df = merged_df[merged_df['Indicator Name_x'].isin(indicators)]

# Printing first five rows
health_df.head()

| | Country Name | Indicator Name_x | Topic | Years | Score |
|---|---|---|---|---|---|
| 6 | Djibouti | Current health expenditure (% of GDP) | Health: Health systems | 2010 | 3.061504 |
| 11 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 2010 | 60.679157 |
| 12 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 2010 | 1.857695 |
| 13 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 2010 | 6.970487 |
| 16 | Djibouti | Domestic private health expenditure (% of curr... | Health: Health systems | 2010 | 29.537765 |


In [12]:
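# Working on an explicit copy so the assignment below does not act on a slice of merged_df
health_df = health_df.copy()

# Converting years column to datetime object (four-digit year strings)
health_df['Years'] = pd.to_datetime(health_df['Years'], format='%Y')

# Making years column index of dataframe
health_df.set_index('Years', inplace=True)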

# Printing last five rows
health_df.tail()

| Years | Country Name | Indicator Name_x | Topic | Score |
|---|---|---|---|---|
| 2020-01-01 | Uganda | Domestic general government health expenditure... | Health: Health systems | 0.672564 |
| 2020-01-01 | Uganda | Domestic general government health expenditure... | Health: Health systems | 3.13883 |
| 2020-01-01 | Uganda | Domestic private health expenditure (% of curr... | Health: Health systems | 41.942467 |
| 2020-01-01 | Uganda | External health expenditure (% of current heal... | Health: Health systems | 41.082459 |
| 2020-01-01 | Uganda | Out-of-pocket expenditure (% of current health... | Health: Health systems | 37.445602 |


In [13]:
# Renaming columns
health_df.rename(columns={
    'Country Name': 'Country',
    'Indicator Name_x': 'Indicator Name',
    'Score' : 'Percentage'
}, inplace=True)

# Printing the first five rows
health_df.head()

| Years | Country | Indicator Name | Topic | Percentage |
|---|---|---|---|---|
| 2010-01-01 | Djibouti | Current health expenditure (% of GDP) | Health: Health systems | 3.061504 |
| 2010-01-01 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 60.679157 |
| 2010-01-01 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 1.857695 |
| 2010-01-01 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 6.970487 |
| 2010-01-01 | Djibouti | Domestic private health expenditure (% of curr... | Health: Health systems | 29.537765 |


In [14]:
health_df[health_df.duplicated()]

| Years | Country | Indicator Name | Topic | Percentage |
|---|---|---|---|---|
| 2011-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2012-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2013-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2014-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2015-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2016-01-01 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 4.06951 |
| 2016-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2017-01-01 | Djibouti | Domestic general government health expenditure... | Health: Health systems | 4.06951 |
| 2017-01-01 | Eritrea | Domestic general government health expenditure... | Health: Health systems | 2.354107 |
| 2017-01-01 | Rwanda | Domestic general government health expenditure... | Health: Health systems | 8.882771 |


- `DataFrame.duplicated()` compares column values only and ignores the index, so it does not take `Years` into account here. The flagged rows therefore share the same country, indicator, and percentage but belong to different years.

- Because each flagged row is a distinct annual observation rather than a repeated record, the duplicates were not dropped.

- Retaining them keeps every country-indicator series complete across the full period, which matters for the time series analysis that follows.
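
The flagged rows above share a country, indicator, and value but fall in different years. The following minimal sketch, on a toy frame with hypothetical names and values, shows how `duplicated()` behaves when the year lives in the index and how moving it back into a column changes the result.

```python
import pandas as pd

# Toy frame: same country, indicator, and value reported in three years
toy = pd.DataFrame(
    {
        "Country": ["Eritrea", "Eritrea", "Eritrea"],
        "Indicator Name": ["Indicator A"] * 3,
        "Percentage": [2.354107, 2.354107, 2.354107],
    },
    index=pd.to_datetime(["2010", "2011", "2012"], format="%Y"),
)
toy.index.name = "Years"

# duplicated() looks at columns only, so later years are flagged
flagged_without_index = toy.duplicated().sum()

# With 'Years' restored as a column, every row is unique
flagged_with_index = toy.reset_index().duplicated().sum()

print(flagged_without_index, flagged_with_index)
```

Checking `reset_index().duplicated()` on the real dataframe would be one way to confirm that no true duplicate records exist.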

In [15]:
health_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 616 entries, 2010-01-01 to 2020-01-01
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         616 non-null    object 
 1   Indicator Name  616 non-null    object 
 2   Topic           616 non-null    object 
 3   Percentage      616 non-null    float64
dtypes: float64(1), object(3)
memory usage: 24.1+ KB


- After preprocessing, the resulting dataframe has 616 rows (8 countries × 7 indicators × 11 years) and 4 columns, with no missing values in the selected indicators and the flagged duplicates retained.

- Column names were modified for clarity, and the `Years` column was converted to datetime format and set as the index to facilitate the subsequent time series analysis.

- The stage is now set for the subsequent section, which delves into Exploratory Data Analysis (EDA). This phase aims to unravel insights, patterns, and trends within the data through visualizations and statistical exploration.
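
Before exploring the data, it can help to reshape the long table into one column per country and indicator, which is the shape most time series tools expect. The sketch below is illustrative only: `toy_health` mimics the layout of `health_df` with hypothetical values, and the year-over-year percentage change is shown because that is the quantity the objectives aim to forecast.

```python
import pandas as pd

# Toy long-format frame shaped like health_df (hypothetical values)
idx = pd.to_datetime(["2010", "2010", "2011", "2011"], format="%Y")
toy_health = pd.DataFrame(
    {
        "Country": ["Kenya"] * 4,
        "Indicator Name": ["Indicator A", "Indicator B"] * 2,
        "Percentage": [4.0, 30.0, 4.2, 29.0],
    },
    index=pd.Index(idx, name="Years"),
)

# One row per year, one column per (country, indicator) pair
wide = toy_health.reset_index().pivot_table(
    index="Years", columns=["Country", "Indicator Name"], values="Percentage"
)

# Year-over-year percentage change; fill_method=None avoids silent forward-filling
yoy_change = wide.pct_change(fill_method=None) * 100

# Row count expected for the full dataframe: countries x indicators x years
expected_rows = 8 * 7 * 11
print(wide.shape, expected_rows)
```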

## Exploratory Data Analysis