# Hypothesis Testing

## 1. Hypothesis Definitions

We aim to evaluate the drivers of obesity by testing three key relationships. For each, we define a **Null Hypothesis ($H_0$)** and an **Alternative Hypothesis ($H_1$)**.

### **Test 1: Health Expenditure vs. Obesity**
* **$H_0$ (Null):** There is no linear correlation between government health expenditure (% of GDP) and obesity rates.
* **$H_1$ (Alternative):** There is a correlation between the two. We originally predicted a negative one, i.e., more spending is associated with less obesity.

### **Test 2: The "Digital Lifestyle" (Internet vs. Obesity)**
* **$H_0$ (Null):** There is no correlation between internet usage and obesity rates.
* **$H_1$ (Alternative):** There is a positive correlation: higher internet usage is associated with higher obesity rates, which we attribute to more sedentary lifestyles.

### **Test 3: Urbanization vs. Obesity**
* **$H_0$ (Null):** Urbanization levels have no relationship with obesity prevalence.
* **$H_1$ (Alternative):** Higher urbanization is associated with higher obesity rates.

## 2. Load & Merge Data

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# 1. Load Data
df_health = pd.read_csv('public-healthcare-spending-share-gdp.csv', sep=',')
df_obesity = pd.read_csv('share-of-adults-defined-as-obese.csv', sep=';')
df_internet = pd.read_csv('share-of-individuals-using-the-internet.csv', sep=';')
df_urban = pd.read_csv('share-of-population-urban.csv', sep=';')
df_gdp = pd.read_csv('gdp-per-capita-worldbank.csv', sep=';')

# 2. Clean & Rename
df_health = df_health.rename(columns={'Domestic general government health expenditure (% of GDP)': 'Health_Expenditure'})
df_obesity.columns = ['Entity', 'Code', 'Year', 'Obesity_Rate']
df_internet.columns = ['Entity', 'Code', 'Year', 'Internet_Usage']
df_urban = df_urban.rename(columns={'Urban population (% of total population)': 'Urban_Rate'})
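# Keep only the first four columns; .copy() avoids pandas' SettingWithCopyWarning in the loop below
df_gdp = df_gdp.iloc[:, :4].copy()
df_gdp.columns = ['Entity', 'Code', 'Year', 'GDP_Per_Capita']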

# Coerce Year to integer; rows whose Year cannot be parsed as a number are dropped
for df in [df_health, df_obesity, df_internet, df_urban, df_gdp]:
    df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
    df.dropna(subset=['Year'], inplace=True)
    df['Year'] = df['Year'].astype(int)

# 3. Merge
df_merged = pd.merge(df_obesity[['Entity', 'Year', 'Obesity_Rate']], df_health[['Entity', 'Year', 'Health_Expenditure']], on=['Entity', 'Year'], how='inner')
df_merged = pd.merge(df_merged, df_internet[['Entity', 'Year', 'Internet_Usage']], on=['Entity', 'Year'], how='inner')
df_merged = pd.merge(df_merged, df_urban[['Entity', 'Year', 'Urban_Rate']], on=['Entity', 'Year'], how='inner')
df_merged = pd.merge(df_merged, df_gdp[['Entity', 'Year', 'GDP_Per_Capita']], on=['Entity', 'Year'], how='inner')

# Restrict the analysis to the 2016 cross-section
df_2016 = df_merged[df_merged['Year'] == 2016].copy()
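
Inner merges silently drop any country-year that is missing from one of the tables, so it can be worth checking what survives the merge and the year filter. The following is a small sketch on made-up toy data, not the datasets loaded above: the `toy_*` frames and their values are hypothetical, and `validate="one_to_one"` is an optional safeguard that the original merge does not use.

```python
import pandas as pd

# Toy stand-ins for two of the real tables (values are invented for illustration).
toy_obesity = pd.DataFrame({
    "Entity": ["A", "A", "B", "C"],
    "Year": [2015, 2016, 2016, 2016],
    "Obesity_Rate": [20.0, 21.0, 30.0, 10.0],
})
toy_health = pd.DataFrame({
    "Entity": ["A", "B", "B", "D"],
    "Year": [2016, 2015, 2016, 2016],
    "Health_Expenditure": [5.0, 3.9, 4.0, 2.0],
})

# validate="one_to_one" raises a MergeError if an (Entity, Year) key repeats in either table.
toy_merged = pd.merge(
    toy_obesity, toy_health,
    on=["Entity", "Year"], how="inner", validate="one_to_one",
)
toy_2016 = toy_merged[toy_merged["Year"] == 2016]

# Basic checks: how many rows remain, repeated countries, missing values.
n_rows = len(toy_2016)
n_dupes = int(toy_2016.duplicated(subset=["Entity"]).sum())
n_missing = int(toy_2016.isna().sum().sum())
print(f"rows: {n_rows}, duplicate entities: {n_dupes}, missing values: {n_missing}")
```

The same three checks could be run on `df_2016` before testing, since the number of rows is the sample size behind each p-value.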

## 4. Pearson Correlation Test

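We compute the **Pearson correlation coefficient ($r$)** and its two-sided **p-value** for each relationship, using the 2016 data.
* **$r$**: Strength and direction of the linear relationship, from $-1$ to $+1$.
* **p-value**: Probability of observing a correlation at least this strong if $H_0$ were true. We use a significance level of $0.05$: if $p < 0.05$, we reject the Null Hypothesis.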
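
To make the two quantities concrete, here is a minimal sketch that computes $r$ by hand and derives the two-sided p-value from the statistic $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom, then compares both against `scipy.stats.pearsonr`. The data are synthetic (an assumption for illustration only), and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic data, for illustration only (not the obesity dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

n = len(x)
xc, yc = x - x.mean(), y - y.mean()

# Pearson's r: covariance of x and y divided by the product of their spreads.
r_manual = (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

# Under H0 (no correlation), t follows a Student t distribution with n - 2 degrees of freedom.
t_stat = r_manual * np.sqrt((n - 2) / (1 - r_manual**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

r_scipy, p_scipy = stats.pearsonr(x, y)
print(f"manual: r={r_manual:.4f}, p={p_manual:.4e}")
print(f"scipy:  r={r_scipy:.4f}, p={p_scipy:.4e}")
```

The p-value depends on both $r$ and the sample size $n$, so a moderate $r$ can still produce a very small p-value when many countries are included.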

In [6]:
def run_pearson_test(x_col, y_col, label):
    """Print Pearson's r and the two-sided p-value for two columns of df_2016."""
    r, p = stats.pearsonr(df_2016[x_col], df_2016[y_col])
    print(f"--- {label} ---")
    print(f"Correlation (r): {r:.4f}")
    print(f"P-Value (p):     {p:.4e}")  # Scientific notation for very small numbers
    # Compare against the 0.05 significance level
    if p < 0.05:
        print("Result: Significant (Reject H0)\n")
    else:
        print("Result: Not Significant (Fail to reject H0)\n")

run_pearson_test('Health_Expenditure', 'Obesity_Rate', 'Test 1: Health Spend vs. Obesity')
run_pearson_test('Internet_Usage', 'Obesity_Rate', 'Test 2: Internet vs. Obesity')
run_pearson_test('Urban_Rate', 'Obesity_Rate', 'Test 3: Urbanization vs. Obesity')

--- Test 1: Health Spend vs. Obesity ---
Correlation (r): 0.4391
P-Value (p):     5.0476e-10
Result: Significant (Reject H0)

--- Test 2: Internet vs. Obesity ---
Correlation (r): 0.4617
P-Value (p):     4.7607e-11
Result: Significant (Reject H0)

--- Test 3: Urbanization vs. Obesity ---
Correlation (r): 0.3452
P-Value (p):     1.7032e-06
Result: Significant (Reject H0)
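
The p-values above are two-sided, while the alternative hypotheses in Tests 2 and 3 name a direction (and Test 1 came out positive, opposite to the sign we originally predicted). A directional test is one way to match the test to such a hypothesis. The sketch below assumes SciPy 1.9 or newer, where `pearsonr` accepts an `alternative` argument. The helper `one_sided_pearson` and the synthetic data are hypothetical and not part of the analysis above.

```python
import numpy as np
from scipy import stats

def one_sided_pearson(x, y, direction="greater"):
    """Return (r, p) for H1: correlation > 0 ('greater') or < 0 ('less').

    Hypothetical helper; relies on the `alternative` argument of
    scipy.stats.pearsonr (assumed available, SciPy >= 1.9).
    """
    r, p = stats.pearsonr(x, y, alternative=direction)
    return r, p

# Synthetic, positively related data for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.4 * x + rng.normal(size=100)

r_two, p_two = stats.pearsonr(x, y)  # default: two-sided
r_greater, p_greater = one_sided_pearson(x, y, "greater")
r_less, p_less = one_sided_pearson(x, y, "less")

print(f"r = {r_two:.4f}")
print(f"two-sided p = {p_two:.4e}")
print(f"H1 positive: p = {p_greater:.4e}")
print(f"H1 negative: p = {p_less:.4e}")
```

On the real data, the call would look like `one_sided_pearson(df_2016['Internet_Usage'], df_2016['Obesity_Rate'], 'greater')`. When the observed $r$ has the predicted sign, the one-sided p-value is half the two-sided one; when it has the opposite sign, the one-sided p-value is large and $H_0$ is not rejected in favor of that directional alternative.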

