# Economic Freedom & Happiness (2019)  
## Exploratory Data Analysis, Hypothesis Testing & Machine Learning

**Course/Project:** DSA210  
**Author:** *Yusuf Efe*  
**Date:** *09.01.2026*

> This notebook is the **final report** for the project investigating the relationship between **Economic Freedom** and **Happiness** in **2019**.  

---

### Table of Contents
1. Abstract  
2. Motivation & Research Question  
3. Data Sources (2019)  
4. Phase 2 — Data Preparation & Merge  
5. Exploratory Data Analysis (EDA)  
6. Hypothesis Testing  
7. Phase 3 — Machine Learning  
8. Discussion & Limitations  
9. Conclusion  
10. References  
11. Appendix (Reproducibility)

---


## 1) Abstract
This project examines whether countries with higher **economic freedom** tend to have higher **happiness** in **2019**.  
We combine the **World Happiness Report 2019** dataset with the **Index of Economic Freedom 2019** dataset at the country level, perform cleaning and standardization, then analyze the relationship using:
- **EDA** (scatter plots and distributions)
- **Hypothesis testing** using a *median split* (High vs Low Freedom)
- **Machine learning** (regression + optional classification) to assess predictive power and feature importance.

**Key finding (rank-based):** Countries in the **High Freedom** group have a substantially better (lower) average happiness rank than countries in the **Low Freedom** group (≈ 47 vs ≈ 104); the hypothesis test in Section 6 reports this difference as statistically significant at p < 0.05.

---


## 2) Motivation & Research Question
Happiness is a multi-dimensional measure of well-being and is not explained solely by GDP.  
Economic freedom—capturing aspects such as **property rights**, **government integrity**, and **business freedom**—may be associated with improved living conditions and institutions.

### Research Question
> **Do countries with higher economic freedom tend to be happier in 2019?**

### Operational Definitions
- **Happiness** is measured via:
  - `Happiness_Score` (higher = happier), and/or
  - `Happiness_Rank` (lower = happier).
- **Economic freedom** is measured via:
  - Overall index score (`Index Score` / `Index_Score`), and
  - sub-scores (e.g., `Property Rights`, `Government Integrity`, `Business Freedom`, etc.).

---


## 3) Data Sources (2019)
This project uses only **2019** data to maintain temporal consistency.

### Dataset A — World Happiness Report 2019 (Kaggle)
Key columns:
- `Country`
- `Happiness_Rank`
- `Happiness_Score`

### Dataset B — Index of Economic Freedom (Heritage / Kaggle)
Key columns:
- `Country`
- `Year` (filtered to 2019)
- `Index Score` (overall freedom score)
- sub-scores (e.g., `Property Rights`, `Business Freedom`, `Government Integrity`, …)

---


## 4) Phase 2 — Data Preparation & Merge
The Phase 2 pipeline is responsible for producing a clean, merged dataset.

### Steps
1. Load the two datasets (handle potential encoding differences).
2. Standardize country names (examples: **Turkiye → Turkey**, **Korea, South → South Korea**).
3. Filter the economic freedom dataset to **Year = 2019**.
4. Perform an **inner join** on `Country`.
5. Save the merged output as `data/final_dataset_2019.csv` (the path the later cells in this notebook expect).
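
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Phase 2 code: the `NAME_FIXES` mapping, the helper names `standardize_country` and `merge_2019`, and the toy rows (including their values) are assumptions made for the example. Only the column names `Country`, `Year`, `Index Score`, `Happiness_Rank` and `Happiness_Score` come from this report.

```python
import pandas as pd

# Hypothetical mapping; extend it with whatever mismatches appear in the real files.
NAME_FIXES = {"Turkiye": "Turkey", "Korea, South": "South Korea"}

def standardize_country(names: pd.Series) -> pd.Series:
    """Trim whitespace and apply the manual country-name fixes."""
    return names.astype(str).str.strip().replace(NAME_FIXES)

def merge_2019(happiness: pd.DataFrame, freedom: pd.DataFrame) -> pd.DataFrame:
    """Standardize names, keep Year == 2019, and inner-join on Country."""
    happiness = happiness.assign(Country=standardize_country(happiness["Country"]))
    freedom = freedom.assign(Country=standardize_country(freedom["Country"]))
    freedom_2019 = freedom[freedom["Year"] == 2019]
    # validate= raises an error if a country appears more than once on either side
    return happiness.merge(freedom_2019, on="Country", how="inner",
                           validate="one_to_one")

# Toy frames standing in for the two source files (all values are made up).
happiness = pd.DataFrame({
    "Country": ["Finland", "Turkey", "South Korea", "Atlantis"],
    "Happiness_Rank": [1, 3, 2, 4],
    "Happiness_Score": [7.0, 5.0, 6.0, 4.0],
})
freedom = pd.DataFrame({
    "Country": ["Finland", "Turkiye", "Korea, South", "Finland"],
    "Year": [2019, 2019, 2019, 2018],
    "Index Score": [75.0, 65.0, 72.0, 74.0],
})

merged = merge_2019(happiness, freedom)
print(merged[["Country", "Happiness_Rank", "Index Score"]])
# merged.to_csv("data/final_dataset_2019.csv", index=False)  # optional export
```

In the toy data, "Atlantis" has no match and the 2018 row is filtered out, so only three countries survive the inner join.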

### Resulting Dataset
- A merged country-level dataset containing roughly **145 countries**, representing the intersection of both sources.
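
Because an inner join silently drops unmatched countries, it can help to list which names fail to match before settling on the final row count. The sketch below uses the `indicator=True` option of `DataFrame.merge`; the helper name `unmatched_countries` and the toy country lists are hypothetical and only illustrate the idea.

```python
import pandas as pd

def unmatched_countries(left: pd.DataFrame, right: pd.DataFrame, key: str = "Country") -> dict:
    """Return the key values that appear in only one of the two frames."""
    both = left[[key]].drop_duplicates().merge(
        right[[key]].drop_duplicates(), on=key, how="outer", indicator=True
    )
    return {
        "left_only": sorted(both.loc[both["_merge"] == "left_only", key]),
        "right_only": sorted(both.loc[both["_merge"] == "right_only", key]),
    }

# Toy inputs (made up) with one spelling mismatch and one country per side missing.
happiness = pd.DataFrame({"Country": ["Finland", "Turkey", "Somaliland"]})
freedom = pd.DataFrame({"Country": ["Finland", "Turkiye", "Cuba"]})

report = unmatched_countries(happiness, freedom)
print(report)
```

Names that show up on both sides of the report under different spellings (here "Turkey" and "Turkiye") are candidates for the name-standardization step.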

---


## 5) Exploratory Data Analysis (EDA)
EDA is used to:
- visualize the relationship between economic freedom and happiness,
- detect outliers,
- check whether the relationship appears roughly linear.

### Suggested Figures
- **Figure 1:** `Index Score` vs `Happiness_Score` (scatter)
- **Figure 2:** `Index Score` vs `Happiness_Rank` (scatter; invert y-axis so “happier” is higher)
- **Figure 3:** Distribution of `Index Score` by High vs Low freedom group
- **Figure 4:** Distribution of `Happiness_Score` / `Happiness_Rank` by group

> Add exported plots under `reports/figures/` if you want the notebook to look polished on GitHub.

---


In [None]:
# Optional: quick EDA loader (safe to run even if files are missing)
from pathlib import Path
import pandas as pd

DATA_PATH = Path("data/final_dataset_2019.csv")  # recommended location

if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
    display(df.head())
    print("Rows:", len(df), "| Columns:", df.shape[1])
else:
    print("Could not find:", DATA_PATH)
    print("Tip: place the merged dataset here, or update DATA_PATH.")

In [None]:
# Optional: EDA plots (run after df is loaded)
import matplotlib.pyplot as plt

def plot_scatter(df, x, y, invert_y=False):
    plt.figure()
    plt.scatter(df[x], df[y])
    plt.xlabel(x)
    plt.ylabel(y)
    if invert_y:
        plt.gca().invert_yaxis()
    plt.title(f"{x} vs {y}")
    plt.show()

if 'df' in globals():
    # Try common column naming variations:
    x_candidates = ['Index Score', 'Index_Score', 'Index_Score_2019']
    y_score_candidates = ['Happiness_Score', 'Happiness Score', 'Score']
    y_rank_candidates = ['Happiness_Rank', 'Happiness Rank', 'Overall rank', 'Rank']

    x = next((c for c in x_candidates if c in df.columns), None)
    y_score = next((c for c in y_score_candidates if c in df.columns), None)
    y_rank = next((c for c in y_rank_candidates if c in df.columns), None)

    if x and y_score:
        plot_scatter(df, x, y_score, invert_y=False)
    if x and y_rank:
        plot_scatter(df, x, y_rank, invert_y=True)
else:
    print("Load df first (run the previous cell).")

## 6) Hypothesis Testing
We test whether **High Freedom** countries are significantly happier than **Low Freedom** countries.

### Group Definition (Median Split)
- Compute the median of the freedom score (overall index).
- **High Freedom:** `Index Score ≥ median`
- **Low Freedom:** `Index Score < median`

### Tests Used
- **Welch Two-Sample t-test** (robust to unequal variances)
- *(Optional robustness)* **Mann–Whitney U test** (non-parametric)
- *(Optional)* Effect size: **Cohen’s d**
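
For reference, the two main statistics can be written out as below, using subscripts $H$ and $L$ for the High and Low Freedom groups, $\bar{x}$ for the group mean of the happiness measure, $s^2$ for the sample variance, and $n$ for the group size. The pooled-standard-deviation form of Cohen's d is the one implemented by the `cohens_d` helper in this notebook; other variants of the effect size exist.

$$
t = \frac{\bar{x}_H - \bar{x}_L}{\sqrt{\dfrac{s_H^2}{n_H} + \dfrac{s_L^2}{n_L}}}
$$

$$
d = \frac{\bar{x}_H - \bar{x}_L}{s_p},
\qquad
s_p = \sqrt{\frac{(n_H - 1)\,s_H^2 + (n_L - 1)\,s_L^2}{n_H + n_L - 2}}
$$

When the measure is `Happiness_Rank` (lower = happier), negative values of $t$ and $d$ mean the High Freedom group ranks better.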

### Reported Outcome (Rank-based)
- High Freedom mean happiness rank ≈ **47**
- Low Freedom mean happiness rank ≈ **104**
- The difference is statistically significant at p < 0.05.

---


In [None]:
# Optional: compute High/Low groups + run Welch t-test (if scipy is available)
import numpy as np

try:
    from scipy.stats import ttest_ind, mannwhitneyu
except ImportError as e:
    print("scipy not available:", e)
    ttest_ind = None
    mannwhitneyu = None

def cohens_d(a, b):
    """Cohen's d (a - b) using the pooled standard deviation; NaNs are dropped."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        return np.nan  # sample variance is undefined for fewer than 2 values
    sa2, sb2 = np.var(a, ddof=1), np.var(b, ddof=1)
    sp = np.sqrt(((na - 1) * sa2 + (nb - 1) * sb2) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / sp if sp != 0 else np.nan

if 'df' in globals():
    # detect columns
    x_candidates = ['Index Score', 'Index_Score', 'Index_Score_2019']
    y_rank_candidates = ['Happiness_Rank', 'Happiness Rank', 'Overall rank', 'Rank']
    x = next((c for c in x_candidates if c in df.columns), None)
    y = next((c for c in y_rank_candidates if c in df.columns), None)

    if not x or not y:
        print("Could not find required columns. Found x:", x, "y:", y)
    else:
        median = df[x].median()
        high = df[df[x] >= median][y].astype(float)
        low  = df[df[x] <  median][y].astype(float)

        print("High freedom mean rank:", high.mean())
        print("Low  freedom mean rank:", low.mean())

        if ttest_ind:
            t_stat, p_val = ttest_ind(high, low, equal_var=False, nan_policy='omit')
            print("\nWelch t-test: t =", t_stat, "p =", p_val)

        if mannwhitneyu:
            u_stat, p_u = mannwhitneyu(high.dropna(), low.dropna(), alternative='two-sided')
            print("Mann–Whitney U: U =", u_stat, "p =", p_u)

        # Rank: lower = happier, so a negative d means the High Freedom group ranks better.
        print("Cohen's d (high - low):", cohens_d(high, low))
else:
    print("Load df first.")

## 7) Phase 3 — Machine Learning
Phase 3 extends the analysis by evaluating how well economic freedom metrics predict happiness and by identifying which freedom dimensions are most informative.

### 7.1 Regression
- **Target:** `Happiness_Rank` (lower = happier)
- **Models:** Linear Regression, Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting  
- **Validation:** Train/test split + 5-fold CV  
- **Metrics:** MAE, RMSE, R²  
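
A minimal sketch of the cross-validation part of this setup is shown below, assuming scikit-learn is available. The features and target are synthetic, only three of the listed models are included to keep it short, and the hyperparameters are placeholders rather than the values used in Phase 3.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the freedom sub-scores and the happiness rank.
rng = np.random.default_rng(42)
n = 145
X = pd.DataFrame({
    "Property Rights": rng.uniform(10, 95, n),
    "Government Integrity": rng.uniform(10, 95, n),
    "Business Freedom": rng.uniform(30, 95, n),
})
y = (150 - 0.8 * X["Property Rights"] - 0.5 * X["Government Integrity"]
     + rng.normal(0, 15, n))

models = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "MAE": "neg_mean_absolute_error",
    "RMSE": "neg_root_mean_squared_error",
    "R2": "r2",
}

rows = []
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    rows.append({
        "model": name,
        "MAE": -scores["test_MAE"].mean(),    # scorers are negated errors
        "RMSE": -scores["test_RMSE"].mean(),
        "R2": scores["test_R2"].mean(),
    })

results = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(results)
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold.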

### 7.2 Classification (Optional)
- Convert happiness into a binary label using the median rank:
  - `Rank <= median` → “High happiness” (1)
  - else → 0
- **Models:** Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier  
- **Metrics:** Accuracy, F1, ROC-AUC
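
The labeling rule and metrics above might look like this in code. It is a sketch on synthetic data, with logistic regression as the only model; the feature values and the generated ranks are assumptions for illustration, not project data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a synthetic rank (1 = happiest).
rng = np.random.default_rng(7)
n = 145
X = pd.DataFrame({
    "Property Rights": rng.uniform(10, 95, n),
    "Government Integrity": rng.uniform(10, 95, n),
})
latent = 150 - 0.9 * X["Property Rights"] + rng.normal(0, 15, n)
rank = latent.rank(method="first")

# Rank <= median -> "High happiness" (1), else 0.
label = (rank <= rank.median()).astype(int)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
metrics = ["accuracy", "f1", "roc_auc"]
scores = cross_validate(clf, X, label, cv=cv, scoring=metrics)

summary = {m: scores[f"test_{m}"].mean() for m in metrics}
print(summary)
```

A median split yields nearly balanced classes, so accuracy is a reasonable headline metric here; with an unbalanced threshold, F1 and ROC-AUC would carry more weight.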

### 7.3 Interpretability
- Use **Permutation Importance** on the best regression model to rank feature importance.
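
As a rough illustration, permutation importance can be computed with `sklearn.inspection.permutation_importance` on a held-out split. The data below is synthetic, and a random forest stands in for "the best regression model", which this report does not name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on one feature, weakly on a
# second, and not at all on the third.
rng = np.random.default_rng(0)
n = 145
X = pd.DataFrame({
    "Property Rights": rng.uniform(10, 95, n),
    "Government Integrity": rng.uniform(10, 95, n),
    "Business Freedom": rng.uniform(30, 95, n),
})
y = (150 - 1.0 * X["Property Rights"] - 0.2 * X["Government Integrity"]
     + rng.normal(0, 10, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the test set and record the average drop in R^2.
result = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0, scoring="r2"
)
importance = pd.Series(result.importances_mean, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Scoring on held-out data measures how much each feature contributes to generalization rather than to fitting the training set. Strongly correlated sub-scores can share importance between them, which is worth keeping in mind when reading the ranking.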

---


In [None]:
# Optional: load Phase 3 outputs (if you saved them under outputs/)
from pathlib import Path
import pandas as pd

OUT = Path("outputs")

files = [
    "ml_regression_cv_results.csv",
    "ml_classification_cv_results.csv",
    "ml_feature_importance.csv",
]

for f in files:
    p = OUT / f
    if p.exists():
        print("Loaded:", p)
        display(pd.read_csv(p).head())
    else:
        print("Missing:", p)

## 8) Discussion & Limitations
### Discussion
The hypothesis testing results suggest that countries with higher economic freedom are significantly happier (rank-based) in 2019.  
The ML results provide a complementary view by testing predictive performance and identifying which freedom dimensions contribute most.

### Limitations
1. **Single-year analysis (2019)** limits generalization across time.  
2. **Inner join coverage** removes countries not present in both datasets (~145 remain).  
3. **Association vs causality:** results are correlational and may be influenced by confounders (income, institutions, conflict, inequality, etc.).  

---


## 9) Conclusion
Using 2019 data, this project finds a strong association between **economic freedom** and **happiness** across countries.  
High Freedom countries have substantially better happiness ranks than Low Freedom countries, and ML modeling provides a reproducible pipeline to evaluate prediction quality and interpret the most informative freedom dimensions.

---


## 10) References
- World Happiness Report 2019 (Kaggle)  
- Index of Economic Freedom (Heritage / Kaggle)

---


## 11) Appendix — Reproducibility
### Recommended Repo Paths
- `data/final_dataset_2019.csv`  ← output of Phase 2
- `outputs/`  ← Phase 3 generated results (CSV + joblib)

### Minimal Run Instructions
1. Run **Phase 2** notebook to generate the merged dataset.  
2. Run **Phase 3** script/notebook to generate ML outputs.  
3. Open this notebook and run its code cells; they load `data/` + `outputs/` to populate the results in Sections 5–7.

---
