## Purpose of this notebook

This notebook merges all of our data files (the processed input files and the raw output file) into one consolidated file for exploratory data analysis.
It then restricts the data to the years 2000 to 2019.

In [4]:
# Combining all the processed data into a single CSV file for analysis
import pandas as pd
import os

df = pd.DataFrame()

# Define the path to the processed data directory
processed_data_dir = "../data/processed"

# Walk the processed data directory, including its subdirectories,
# and merge every CSV file found on the shared keys
for root, dirs, files in os.walk(processed_data_dir):
    for filename in files:
        if filename.endswith(".csv"):
            file_path = os.path.join(root, filename)
            print(f"Processing file: {file_path}")
            
            temp_df = pd.read_csv(file_path)
            
            if df.empty:
                df = temp_df.copy()
            else:
                df = pd.merge(df, temp_df, on=["Country Name", "Year"], how="outer")
            
            print(f"Merged {filename}, resulting dataframe has {len(df)} rows and {len(df.columns)} columns.")
            
# Save the combined dataframe to a new CSV file
output_filepath = "../data/combined_data.csv"
df.to_csv(output_filepath, index=False)

Processing file: ../data/processed/mortality_rate/data.csv
Merged data.csv, resulting dataframe has 13251 rows and 3 columns.
Processing file: ../data/processed/slum_population/data.csv
Merged data.csv, resulting dataframe has 13290 rows and 4 columns.
Processing file: ../data/processed/gdp/data.csv
Merged data.csv, resulting dataframe has 13787 rows and 5 columns.
Processing file: ../data/processed/maternal_education/data.csv
Merged data.csv, resulting dataframe has 14919 rows and 6 columns.
Processing file: ../data/processed/population_density/data.csv
Merged data.csv, resulting dataframe has 16516 rows and 7 columns.
Processing file: ../data/processed/water_access/data.csv
Merged data.csv, resulting dataframe has 16527 rows and 8 columns.
Processing file: ../data/processed/income_classification/data.csv
Merged data.csv, resulting dataframe has 17841 rows and 9 columns.
Processing file: ../data/processed/health_expenditure/data.csv
Merged data.csv, resulting dataframe has 17841 rows 
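
The row count grows with each outer merge above, which is expected when files cover different country-year pairs, but it would also happen if a file repeated a key. A minimal sketch of one way to guard against that, assuming each file should hold at most one row per `Country Name` and `Year`; the helper name `merge_checked` and the toy frames are hypothetical and not part of the notebook:

```python
import pandas as pd

KEYS = ["Country Name", "Year"]  # same key columns as the merge above


def merge_checked(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge two indicator tables, failing fast on duplicate keys."""
    # validate="one_to_one" makes pandas raise pandas.errors.MergeError
    # if either side repeats a (Country Name, Year) pair.
    return pd.merge(left, right, on=KEYS, how="outer", validate="one_to_one")


# Toy data, made up for illustration (not taken from the real files)
mortality = pd.DataFrame({
    "Country Name": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "Mortality rate under-5": [10.0, 9.5, 20.0],
})
gdp = pd.DataFrame({
    "Country Name": ["A", "B", "B"],
    "Year": [2000, 2000, 2001],
    "GDP per capita": [1000.0, 2000.0, 2100.0],
})

merged = merge_checked(mortality, gdp)
print(merged.shape)  # (4, 4): the union of the four distinct country-year pairs
```

If a file did contain a repeated country-year pair, the call would raise instead of silently multiplying rows, which makes the per-file row counts easier to trust.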

In [5]:
df.head()

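|  | Country Name | Year | Mortality rate under-5 | Population living in slums percentage | GDP per capita | Secondary education, pupils female percentage | Population density (people per sq. km of land area) | People using safely managed drinking water services percentage | Income classification | Current health expenditure percentage |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1960 | 353.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Afghanistan | 1961 | 347.6 | NaN | NaN | NaN | 14.127046 | NaN | NaN | NaN |
| 2 | Afghanistan | 1962 | 342.3 | NaN | NaN | NaN | 14.418849 | NaN | NaN | NaN |
| 3 | Afghanistan | 1963 | 336.8 | NaN | NaN | NaN | 14.725614 | NaN | NaN | NaN |
| 4 | Afghanistan | 1964 | 331.7 | NaN | NaN | NaN | 15.047327 | NaN | NaN | NaN |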


In [6]:
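# Restrict data to years 2000-2019 (inclusive); .copy() gives an independent
# dataframe so later edits do not trigger SettingWithCopyWarning
final_df = df[(df['Year'] >= 2000) & (df['Year'] <= 2019)].copy()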
print(f"Data restricted to years 2000-2019, resulting dataframe has {len(final_df)} rows and {len(final_df.columns)} columns.")
final_df.to_csv("../data/final_data_2000_2019.csv", index=False)

Data restricted to years 2000-2019, resulting dataframe has 5987 rows and 10 columns.
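
The same inclusive year filter can also be written with `Series.between`, and a per-column missing share is a quick way to see how complete each indicator is inside the chosen window. A minimal sketch on made-up data; the helper name `restrict_years` and the toy frame are hypothetical:

```python
import pandas as pd


def restrict_years(df: pd.DataFrame, start: int = 2000, end: int = 2019) -> pd.DataFrame:
    """Keep rows whose Year lies in [start, end], both ends inclusive."""
    mask = df["Year"].between(start, end)  # inclusive on both sides by default
    return df.loc[mask].copy()


# Toy data, made up for illustration (not taken from the real files)
toy = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B"],
    "Year": [1999, 2000, 2019, 2020],
    "GDP per capita": [None, 1000.0, None, 2000.0],
})

subset = restrict_years(toy)

# Fraction of missing values per indicator column within the window
missing_share = subset.drop(columns=["Country Name", "Year"]).isna().mean()

print(len(subset))     # 2 rows: the years 2000 and 2019
print(missing_share)   # GDP per capita is missing in half of the kept rows
```

Running the same `isna().mean()` summary on `final_df` would show which indicators are sparse enough to need attention before analysis.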


In [7]:
final_df.head()

|  | Country Name | Year | Mortality rate under-5 | Population living in slums percentage | GDP per capita | Secondary education, pupils female percentage | Population density (people per sq. km of land area) | People using safely managed drinking water services percentage | Income classification | Current health expenditure percentage |
|---|---|---|---|---|---|---|---|---|---|---|
| 40 | Afghanistan | 2000 | 131.7 | NaN | 1617.826475 | NaN | 30.863847 | 11.093326 | Low-income countries | NaN |
| 41 | Afghanistan | 2001 | 127.4 | NaN | 1454.110782 | 0.0 | 31.099929 | 11.105221 | Low-income countries | NaN |
| 42 | Afghanistan | 2002 | 123.1 | NaN | 1774.308743 | NaN | 32.776961 | 12.007733 | Low-income countries | 9.443391 |
| 43 | Afghanistan | 2003 | 118.7 | NaN | 1815.9282 | 24.44685 | 34.854344 | 12.909922 | Low-income countries | 8.941258 |
| 44 | Afghanistan | 2004 | 114.2 | NaN | 1776.918207 | 16.27781 | 36.12323 | 13.818684 | Low-income countries | 9.808474 |
