# Exploratory Data Analysis (EDA)

## 1. Preliminary Steps

### Load Data
We import the required Python libraries and load the five core datasets used to analyze the relationship between health expenditure, lifestyle factors, and obesity:
1.  **Health Expenditure:** Government spending on health as a share of GDP.
2.  **Obesity Rates:** The share of the adult population defined as obese.
3.  **Internet Usage:** A proxy for sedentary "digital" lifestyles.
4.  **Urbanization:** The percentage of the population living in urban areas.
5.  **GDP Per Capita:** A measure of a country's economic output per person.

In [8]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Set the visual style for graphs
sns.set(style="whitegrid")

# Load the datasets
# Note: 'Health' uses commas (,), while others use semicolons (;)
df_health = pd.read_csv('public-healthcare-spending-share-gdp.csv', sep=',')
df_obesity = pd.read_csv('share-of-adults-defined-as-obese.csv', sep=';')
df_internet = pd.read_csv('share-of-individuals-using-the-internet.csv', sep=';')
df_urban = pd.read_csv('share-of-population-urban.csv', sep=';')
df_gdp = pd.read_csv('gdp-per-capita-worldbank.csv', sep=';')
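
The mixed delimiters noted in the cell above can also be detected rather than hard-coded. The following is a minimal sketch, not the method used in this notebook: it assumes `sep=None` with `engine="python"`, which asks pandas to sniff the delimiter from the file. The in-memory strings and the `Value` column are hypothetical stand-ins for the real CSV files. Sniffing can misfire on unusual headers, so an explicit `sep` remains the safer choice when the delimiter is known.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for a comma- and a semicolon-delimited file
comma_csv = io.StringIO(
    "Entity,Code,Year,Value\nAfghanistan,AFG,2002,4.3\nAlbania,ALB,2002,12.1\n"
)
semicolon_csv = io.StringIO(
    "Entity;Code;Year;Value\nAfghanistan;AFG;2002;0.08\nAlbania;ALB;2002;2.3\n"
)

# sep=None with the Python engine asks pandas to detect the delimiter
df_comma = pd.read_csv(comma_csv, sep=None, engine="python")
df_semicolon = pd.read_csv(semicolon_csv, sep=None, engine="python")

print(df_comma.columns.tolist())
print(df_semicolon.columns.tolist())
```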


### Data Cleaning and Merging

In this section, we preprocess the datasets to ensure they are ready for analysis.
1.  **Renaming Columns:** We standardize column names (e.g., shortening long World Bank descriptions to simple terms like `Health_Expenditure`) to make the code readable.
2.  **Data Type Conversion:** We convert the `Year` column to an integer in every file, dropping rows whose year cannot be parsed, so that the merge keys have matching types.
3.  **Merging:** We perform an **Inner Join** on `Entity` (Country) and `Year`. This creates a unified dataset containing only the records where we have complete data for all variables.

In [10]:
# 1. Standardize Column Names
df_health = df_health.rename(columns={'Domestic general government health expenditure (% of GDP)': 'Health_Expenditure'})
df_obesity.columns = ['Entity', 'Code', 'Year', 'Obesity_Rate']
df_internet.columns = ['Entity', 'Code', 'Year', 'Internet_Usage']
df_urban = df_urban.rename(columns={'Urban population (% of total population)': 'Urban_Rate'})
# GDP file often has extra columns; we select the first 4 and rename
# (.copy() avoids a SettingWithCopyWarning when 'Year' is modified below)
df_gdp = df_gdp.iloc[:, :4].copy()
df_gdp.columns = ['Entity', 'Code', 'Year', 'GDP_Per_Capita']

# 2. Clean 'Year' Column (Convert to numeric and drop bad rows)
for df in [df_health, df_obesity, df_internet, df_urban, df_gdp]:
    df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
    df.dropna(subset=['Year'], inplace=True)
    df['Year'] = df['Year'].astype(int)

# 3. Merge Datasets (Inner Join on Entity and Year)
df_merged = pd.merge(df_obesity[['Entity', 'Year', 'Obesity_Rate']], 
                     df_health[['Entity', 'Year', 'Health_Expenditure']], 
                     on=['Entity', 'Year'], how='inner')

df_merged = pd.merge(df_merged, df_internet[['Entity', 'Year', 'Internet_Usage']], 
                     on=['Entity', 'Year'], how='inner')

df_merged = pd.merge(df_merged, df_urban[['Entity', 'Year', 'Urban_Rate']], 
                     on=['Entity', 'Year'], how='inner')

df_merged = pd.merge(df_merged, df_gdp[['Entity', 'Year', 'GDP_Per_Capita']], 
                     on=['Entity', 'Year'], how='inner')
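
To illustrate what the inner join above keeps and drops, here is a small sketch on toy data. The frames and their values are hypothetical, and chaining the merges with `functools.reduce` and passing `validate="one_to_one"` are assumptions added for illustration, not part of the original cell. `validate` makes pandas raise an error if an (`Entity`, `Year`) pair is duplicated on either side, which would otherwise silently multiply rows.

```python
from functools import reduce
import pandas as pd

# Hypothetical toy frames; the values are illustrative only
obesity = pd.DataFrame(
    {"Entity": ["A", "A", "B"], "Year": [2000, 2001, 2000], "Obesity_Rate": [10.0, 10.5, 20.0]}
)
health = pd.DataFrame(
    {"Entity": ["A", "B", "B"], "Year": [2000, 2000, 2001], "Health_Expenditure": [3.0, 4.0, 4.2]}
)
internet = pd.DataFrame(
    {"Entity": ["A", "B"], "Year": [2000, 2000], "Internet_Usage": [5.0, 7.0]}
)

def inner_join(left, right):
    # validate raises MergeError if (Entity, Year) is duplicated on either side
    return pd.merge(left, right, on=["Entity", "Year"], how="inner", validate="one_to_one")

# Only the (Entity, Year) pairs present in every frame survive
toy_merged = reduce(inner_join, [obesity, health, internet])
print(toy_merged)
```

Only the pairs (A, 2000) and (B, 2000) appear in all three toy frames, so the result has two rows; (A, 2001) and (B, 2001) are dropped because at least one frame lacks them.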


## 2. Dataset Overview

Now that the data is merged, we inspect its structure to verify data quality.
- **Shape:** Reports the number of rows (observations) and columns (variables).
- **Head:** Displays the first few rows to visually confirm the merge.
- **Info:** Verifies data types and checks for any remaining null values.

In [9]:
# Check dimensions
print(f"Dataset Shape: {df_merged.shape}")

# Display first 5 rows
print("\nFirst 5 rows of the merged dataset:")
display(df_merged.head())

# Check data types
print("\nData Info:")
df_merged.info()

Dataset Shape: (4107, 7)

First 5 rows of the merged dataset:


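|   | Entity | Year | Obesity_Rate | Health_Expenditure | Internet_Usage | Urban_Rate | GDP_Per_Capita |
|---|--------|------|--------------|--------------------|----------------|------------|----------------|
| 0 | Afghanistan | 2002 | 4.33978 | 0.084181 | 0.004561 | 22.261 | 1774.3087 |
| 1 | Afghanistan | 2003 | 4.69862 | 0.650963 | 0.087891 | 22.353 | 1815.9282 |
| 2 | Afghanistan | 2004 | 5.08183 | 0.542926 | 0.105809 | 22.5 | 1776.9182 |
| 3 | Afghanistan | 2005 | 5.48633 | 0.529184 | 1.22415 | 22.703 | 1908.1147 |
| 4 | Afghanistan | 2006 | 5.91647 | 0.49784 | 2.10712 | 22.907 | 1929.7239 |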



Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4107 entries, 0 to 4106
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Entity              4107 non-null   object 
 1   Year                4107 non-null   int32  
 2   Obesity_Rate        4107 non-null   float64
 3   Health_Expenditure  4107 non-null   float64
 4   Internet_Usage      4107 non-null   float64
 5   Urban_Rate          4107 non-null   float64
 6   GDP_Per_Capita      4107 non-null   float64
dtypes: float64(5), int32(1), object(1)
memory usage: 208.7+ KB
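
The `info()` output confirms that no nulls remain. A complementary check, sketched here under stated assumptions, also counts duplicated (`Entity`, `Year`) keys, which `info()` does not reveal. The helper `panel_quality_report` and the sample frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def panel_quality_report(df, keys=("Entity", "Year")):
    """Return null counts per column, duplicated key rows, and entity count."""
    keys = list(keys)
    return {
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_keys": int(df.duplicated(subset=keys).sum()),
        "n_entities": int(df[keys[0]].nunique()),
    }

# Hypothetical sample with one missing value and one repeated (Entity, Year) pair
sample = pd.DataFrame(
    {
        "Entity": ["A", "A", "B", "B"],
        "Year": [2000, 2001, 2000, 2000],
        "Obesity_Rate": [10.0, None, 20.0, 20.0],
    }
)

report = panel_quality_report(sample)
print(report)
```

On the merged dataset, the same helper could be called as `panel_quality_report(df_merged)`; a non-zero `duplicate_keys` would suggest that one of the source files repeats a country-year pair.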


## 3. Summary Statistics

Here we examine the statistical properties of our variables (Mean, Standard Deviation, Min/Max).
- **Range Check:** We check that the data ranges make sense (e.g., `Obesity_Rate` should be between 0 and 100).
- **Count:** Confirms that every variable has the same number of non-null observations.

In [11]:
# Generate descriptive statistics
print("Summary Statistics:")
display(df_merged.describe())

Summary Statistics:


|       | Year | Obesity_Rate | Health_Expenditure | Internet_Usage | Urban_Rate | GDP_Per_Capita |
|-------|------|--------------|--------------------|----------------|------------|----------------|
| count | 4107.0 | 4107.0 | 4107.0 | 4107.0 | 4107.0 | 4107.0 |
| mean | 2011.050889 | 18.424823 | 3.275499 | 36.880464 | 56.3774 | 21580.346714 |
| std | 6.608906 | 11.58694 | 2.298393 | 31.292948 | 22.650103 | 23352.011778 |
| min | 2000.0 | 0.28014 | 0.06221 | 0.0 | 8.246 | 711.9764 |
| 25% | 2005.0 | 8.76124 | 1.534732 | 6.8767 | 37.6535 | 4688.853 |
| 50% | 2011.0 | 18.20624 | 2.729109 | 29.6431 | 56.685 | 13249.065 |
| 75% | 2017.0 | 25.42587 | 4.572787 | 65.4774 | 74.6135 | 30661.5085 |
| max | 2022.0 | 70.18286 | 22.254263 | 100.0 | 100.0 | 145591.02 |
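
The range check on percentage variables can also be done programmatically instead of by reading the table. This is a minimal sketch: the helper `out_of_range_counts` and the sample values are hypothetical, and the 0 to 100 bound is only a coarse plausibility test that would not catch values that are valid percentages but still implausible.

```python
import pandas as pd

# Columns expressed as percentages, taken from the merged dataset's column names
PERCENT_COLUMNS = ["Obesity_Rate", "Health_Expenditure", "Internet_Usage", "Urban_Rate"]

def out_of_range_counts(df, columns=PERCENT_COLUMNS, low=0.0, high=100.0):
    """Count non-null values outside [low, high] for each listed column present in df."""
    counts = {}
    for col in columns:
        if col not in df.columns:
            continue  # skip columns the frame does not have
        values = df[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts

# Hypothetical sample with one bad value in each column
sample = pd.DataFrame(
    {
        "Obesity_Rate": [4.3, 70.2, 101.0],
        "Urban_Rate": [22.3, 100.0, -1.0],
    }
)

violations = out_of_range_counts(sample)
print(violations)
```

Applied to the merged data as `out_of_range_counts(df_merged)`, all counts should be zero if the minimum and maximum values in the table above are correct.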
