## Final Report — Global Happiness & Economic Freedom (2019)

### 1. Abstract

This project investigates whether countries with higher **economic freedom** tend to be **happier** in **2019**, using the *World Happiness Report 2019* and the *Index of Economic Freedom* datasets. The workflow combines (i) data cleaning and merging into a unified country-level dataset, (ii) exploratory analysis to visualize the relationship between freedom and happiness, (iii) hypothesis testing using a median-based split into High and Low freedom groups, and (iv) machine learning models to assess predictive power and identify which freedom dimensions matter most.

---

### 2. Introduction & Research Question

Happiness is often discussed alongside economic development, but GDP alone does not fully explain differences in well-being. This study focuses on “structural” aspects of economic freedom, such as **property rights**, **government integrity**, and **business freedom**, and tests whether these indicators are associated with a better (numerically lower) happiness rank, a higher happiness score, or both.

**Core research question:**
Do countries with higher economic freedom have higher happiness in 2019 (measured by better rank / higher score)?

---

### 3. Data Sources & Variables (2019)

All analysis is restricted to **2019** for temporal consistency.

**Dataset 1: World Happiness Report 2019 (Kaggle)**
Key fields used include:

* `Country`
* `Happiness_Rank` (ordinal; 1 = happiest)
* `Happiness_Score` (continuous; higher = happier)

**Dataset 2: Index of Economic Freedom (Heritage / Kaggle)**
Key fields used include:

* `Country`, `Year` (filtered to 2019)
* Overall freedom score (`Index_Score` / “Index Score”)
* Sub-scores like `Property Rights`, `Business Freedom`, `Government Integrity`, etc.

---

### 4. Phase 2 — Data Preparation, Cleaning, and Merge

Phase 2 implements the full data engineering pipeline:

1. **Programmatic download** using `kagglehub`.
2. **Robust CSV loading** to handle encoding issues (UTF-8 vs Latin-1).
3. **Country name standardization** (e.g., “Turkiye” → “Turkey”, “Korea, South” → “South Korea”) to ensure consistent join keys.
4. **Filter economic freedom dataset to Year = 2019**.
5. **Inner join on `Country`**, producing a final merged dataset of approximately **145 countries** that exist in both sources.
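Steps 3 to 5 above can be sketched on toy data as follows. This is a minimal illustration, not the project's actual code: the `NAME_FIXES` map and the `standardize_country` helper are hypothetical names, only the two alias pairs come from the report, and every numeric value is made up.

```python
import pandas as pd

# Hypothetical alias map; only these two pairs are named in the report.
NAME_FIXES = {"Turkiye": "Turkey", "Korea, South": "South Korea"}


def standardize_country(names: pd.Series) -> pd.Series:
    """Trim whitespace, then map known aliases to one canonical name."""
    cleaned = names.astype(str).str.strip()
    return cleaned.replace(NAME_FIXES)


# Tiny synthetic frames standing in for the two datasets (values are invented).
happiness = pd.DataFrame({
    "Country": ["Turkey", "South Korea", "Atlantis"],
    "Happiness_Rank": [1, 2, 3],
    "Happiness_Score": [7.0, 6.5, 6.0],
})
freedom = pd.DataFrame({
    "Country": ["Turkiye", "Korea, South ", "Turkiye"],
    "Year": [2019, 2019, 2018],
    "Index_Score": [64.0, 72.0, 65.0],
})

# Step 3: make the join keys consistent in both sources.
happiness["Country"] = standardize_country(happiness["Country"])
freedom["Country"] = standardize_country(freedom["Country"])

# Step 4: keep only the 2019 rows of the freedom data.
freedom_2019 = freedom[freedom["Year"] == 2019]

# Step 5: inner join; validate= raises if a country appears twice on either side.
merged_df = happiness.merge(
    freedom_2019, on="Country", how="inner", validate="one_to_one"
)
print(merged_df)
```

Passing `validate="one_to_one"` is an optional safeguard: it turns a silently duplicated country into an error instead of extra rows. "Atlantis" has no match in the freedom data, so the inner join drops it.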

**Output artifact:** A clean merged dataset intended to be saved and reused for downstream analysis (Phase 3 expects `final_dataset_2019.csv` produced in Phase 2).
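One possible shape for the encoding fallback from step 2, applied to the saved artifact, is sketched below. The helper name `read_csv_robust` and the sample row are assumptions for illustration; the report only states that UTF-8 and Latin-1 are both handled.

```python
import os
import tempfile

import pandas as pd


def read_csv_robust(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return the first successful parse."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next encoding
    raise last_error


# Demo: write a Latin-1 file, which a plain UTF-8 read would reject.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_dataset_2019.csv")
    sample = pd.DataFrame({"Country": ["Côte d'Ivoire"], "Index_Score": [60.0]})  # invented value
    sample.to_csv(path, index=False, encoding="latin-1")
    reloaded = read_csv_robust(path)

print(reloaded)
```

Latin-1 can decode any byte sequence, so it works as a last resort but may silently mis-decode files that are actually in some third encoding.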

---

### 5. Exploratory Data Analysis (EDA)

EDA is used to examine the raw relationship between economic freedom and happiness and to sanity-check distributions and outliers.

Planned/implemented visuals include:

* `Index_Score` vs `Happiness_Score` (main continuous relationship)
* `Index_Score` vs `Happiness_Rank` as a robustness view, with an inverted y-axis so “happier” (lower rank) appears higher in plots
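The two views listed above could be drawn roughly as follows. The data here is randomly generated only so the snippet runs on its own; the trend it shows says nothing about the real datasets.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged dataset (not real values).
rng = np.random.default_rng(0)
n = 145
index_score = rng.uniform(30, 90, n)
happiness_score = 2.5 + 0.05 * index_score + rng.normal(0, 0.6, n)
df = pd.DataFrame({"Index_Score": index_score, "Happiness_Score": happiness_score})
# Rank 1 = highest score, matching the "1 = happiest" convention.
df["Happiness_Rank"] = (
    df["Happiness_Score"].rank(ascending=False, method="first").astype(int)
)

fig, (ax_score, ax_rank) = plt.subplots(1, 2, figsize=(10, 4))

ax_score.scatter(df["Index_Score"], df["Happiness_Score"], alpha=0.7)
ax_score.set_xlabel("Index_Score")
ax_score.set_ylabel("Happiness_Score")
ax_score.set_title("Freedom vs happiness score")

ax_rank.scatter(df["Index_Score"], df["Happiness_Rank"], alpha=0.7)
ax_rank.set_xlabel("Index_Score")
ax_rank.set_ylabel("Happiness_Rank (1 = happiest)")
ax_rank.set_title("Freedom vs happiness rank")
ax_rank.invert_yaxis()  # put rank 1 at the top so "happier" reads as "higher"

fig.tight_layout()
```

With the rank axis inverted, both panels slope in the same visual direction, which makes them easier to compare side by side. Calling `fig.savefig(...)` would write the figure to disk.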

Interpretation at the EDA stage is descriptive (trend visibility, linearity/nonlinearity, and whether extreme outliers drive the relationship) rather than causal.

---

### 6. Phase 2 — Hypothesis Testing

#### 6.1 Grouping Strategy

Countries are split into **High Freedom** and **Low Freedom** groups using the **median** of the economic freedom score:

* `Index_Score ≥ median` → High Freedom
* `Index_Score < median` → Low Freedom
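A minimal sketch of this split, using five invented countries; the `Freedom_Group` column name is an assumption, not taken from the project code.

```python
import numpy as np
import pandas as pd

# Five invented rows, just to show the rule.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Index_Score": [55.0, 60.0, 65.0, 70.0, 75.0],
})

median_score = df["Index_Score"].median()
df["Freedom_Group"] = np.where(
    df["Index_Score"] >= median_score, "High Freedom", "Low Freedom"
)
print(df)
```

Because the rule uses `>=`, a country sitting exactly on the median lands in the High Freedom group, so with an odd number of countries the High group is one larger.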

#### 6.2 Statistical Tests

The main test described and used is the **Welch Two-Sample t-test**, chosen because it does not assume equal variances between groups.
The README also proposes a Mann–Whitney U test as a nonparametric robustness check, which suits the ordinal nature of `Happiness_Rank`.
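Both tests are available in SciPy. The sketch below runs them on synthetic rank-like samples; the group means and spreads are arbitrary and are not meant to reproduce the project's results.

```python
import numpy as np
from scipy import stats

# Synthetic outcome values for two groups (arbitrary parameters).
rng = np.random.default_rng(42)
high_freedom = rng.normal(loc=40, scale=10, size=73)
low_freedom = rng.normal(loc=70, scale=20, size=72)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
welch = stats.ttest_ind(high_freedom, low_freedom, equal_var=False)

# Mann-Whitney U: compares the two distributions using ranks only.
mwu = stats.mannwhitneyu(high_freedom, low_freedom, alternative="two-sided")

print(f"Welch t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.3g}")
```

If the two tests agree, the conclusion does not hinge on the t-test's reliance on group means, which is the point of reporting both.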

#### 6.3 Key Finding

Using `Happiness_Rank` as the outcome, the project reports a large and statistically significant difference:

* High Freedom mean rank ≈ **47** (better/happier)
* Low Freedom mean rank ≈ **104** (worse/less happy)
* p-value reported as extremely small, rejecting the null hypothesis at conventional levels

**Conclusion from inference:** In 2019, countries with higher economic freedom tend to be significantly happier by rank-based measures.

---

### 7. Phase 3 — Machine Learning (Predictive + Interpretability)

Phase 3 extends beyond significance testing and asks: *How well can freedom metrics predict happiness, and which freedom dimensions are most informative?*

#### 7.1 Data Input & Consistency with Phase 2

Phase 3 explicitly loads the merged dataset “exactly like Phase 2,” either by reusing `merged_df` in memory or loading `final_dataset_2019.csv`.
It validates that required columns exist before modeling (at minimum `Happiness_Rank` and `Index Score`).
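A loading-and-validation step of this kind might look like the sketch below. The function names `load_merged` and `validate_columns` are hypothetical; only the file name and the two required column names come from the report.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Happiness_Rank", "Index Score"]


def validate_columns(df, required=REQUIRED_COLUMNS):
    """Raise early, with a clear message, if a required column is absent."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return df


def load_merged(merged_df=None, path="final_dataset_2019.csv"):
    """Reuse an in-memory frame when given, otherwise read the saved CSV."""
    df = merged_df if merged_df is not None else pd.read_csv(path)
    return validate_columns(df)


# Usage with an in-memory frame (values are invented).
example = pd.DataFrame({"Happiness_Rank": [1, 2], "Index Score": [80.0, 75.0]})
checked = load_merged(example)
```

Failing before any model is trained keeps a renamed or dropped column from surfacing later as an obscure error inside a pipeline.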

#### 7.2 Features and Target

**Target (implemented in Phase 3):** `Happiness_Rank` (regression target).
**Features:** Primarily economic freedom metrics (overall index plus sub-scores when present), with a fallback to “all numeric columns except target” if expected subscores are missing.
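The feature-selection rule with its fallback could be written along these lines. The `EXPECTED_FEATURES` list is an assumption built from the sub-scores named earlier in this report, and `select_features` is a hypothetical helper.

```python
import pandas as pd

TARGET = "Happiness_Rank"
# Assumed list; the report names these sub-scores but not the full set.
EXPECTED_FEATURES = [
    "Index Score",
    "Property Rights",
    "Government Integrity",
    "Business Freedom",
]


def select_features(df, target=TARGET, expected=EXPECTED_FEATURES):
    """Prefer the expected freedom metrics; otherwise use all numeric non-target columns."""
    present = [col for col in expected if col in df.columns]
    has_subscores = any(col != "Index Score" for col in present)
    if has_subscores:
        return present
    numeric_cols = df.select_dtypes(include="number").columns
    return [col for col in numeric_cols if col != target]


# Frame with sub-scores: the expected columns are used.
with_sub = pd.DataFrame({
    "Country": ["A"], "Happiness_Rank": [1],
    "Index Score": [80.0], "Property Rights": [90.0],
})
# Frame without sub-scores: the numeric fallback is used.
without_sub = pd.DataFrame({
    "Country": ["A"], "Happiness_Rank": [1],
    "Index Score": [80.0], "Year": [2019],
})

print(select_features(with_sub))
print(select_features(without_sub))
```

One caveat of a fallback like this: if `Happiness_Score` is present in the frame, it would be picked up as a feature and leak the target. The report does not say whether the project guards against that.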

#### 7.3 Validation Strategy & Metrics

Because the dataset is small (~145 rows), Phase 3 uses:

* **Train/test split** (20% test) and **5-fold cross-validation**
* Regression metrics: **MAE, RMSE, R²**
* Classification metrics (optional track): **Accuracy, F1, ROC-AUC**
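The regression side of this validation strategy can be sketched with scikit-learn as follows. The data is synthetic, and running cross-validation on the training portion only is an assumption here; the report does not say which rows the folds are drawn from.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic features and rank-like target (not real data).
rng = np.random.default_rng(0)
X = rng.uniform(30, 90, size=(145, 3))
y = 150 - 1.5 * X[:, 0] + rng.normal(0, 10, 145)

# Hold out 20% for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold CV on the training portion (assumed design).
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()
fold_rmse = -cross_val_score(
    model, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
)
cv_rmse = float(fold_rmse.mean())

# Refit on all training rows, then score once on the held-out set.
model.fit(X_train, y_train)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)

print(f"CV RMSE={cv_rmse:.2f}  Test MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

With roughly 145 rows, a single 20% split leaves about 29 test countries, so the test metrics are noisy; the cross-validated score is the steadier of the two.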

#### 7.4 Models Trained (Implemented)

**Regression models:** LinearRegression, Ridge, Lasso, ElasticNet, RandomForestRegressor, GradientBoostingRegressor.
Linear models include scaling; tree/boosting models omit scaling (pipelines are built accordingly).
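A sketch of how such pipelines might be assembled is shown below. The `build_regressors` function and the step names `"scale"` and `"model"` are illustrative choices, and all hyperparameters are scikit-learn defaults rather than the project's settings.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_regressors(random_state=42):
    """Return one pipeline per model; only the linear family gets a scaler."""
    linear_models = {
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(),
        "Lasso": Lasso(),
        "ElasticNet": ElasticNet(),
    }
    tree_models = {
        "RandomForestRegressor": RandomForestRegressor(random_state=random_state),
        "GradientBoostingRegressor": GradientBoostingRegressor(random_state=random_state),
    }
    pipelines = {
        name: Pipeline([("scale", StandardScaler()), ("model", est)])
        for name, est in linear_models.items()
    }
    pipelines.update(
        {name: Pipeline([("model", est)]) for name, est in tree_models.items()}
    )
    return pipelines


regressors = build_regressors()
print(list(regressors))
```

Keeping the scaler inside the pipeline means it is refit on each training fold during cross-validation, so no information from validation rows leaks into the scaling. Tree-based models split on thresholds and are unaffected by feature scale, which is why they skip the step.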

**Classification (optional):** High vs Low happiness classification based on whether a country’s rank is ≤ the median rank (note: lower rank means happier).
Models: LogisticRegression, RandomForestClassifier, GradientBoostingClassifier.
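The label construction and a cross-validated classifier can be sketched as below, on synthetic data. The `High_Happiness` column name and the stratified folds are assumptions; only the "rank at or below the median" rule and the metric names come from the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not real values).
rng = np.random.default_rng(1)
n = 145
index_score = rng.uniform(30, 90, n)
noisy_signal = pd.Series(index_score + rng.normal(0, 10, n))
df = pd.DataFrame({
    "Index Score": index_score,
    "Happiness_Rank": noisy_signal.rank(ascending=False, method="first").astype(int),
})

# Lower rank means happier, so "high happiness" is rank <= median rank.
median_rank = df["Happiness_Rank"].median()
df["High_Happiness"] = (df["Happiness_Rank"] <= median_rank).astype(int)

clf = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    clf,
    df[["Index Score"]],
    df["High_Happiness"],
    cv=cv,
    scoring=["accuracy", "f1", "roc_auc"],
)

for metric in ("accuracy", "f1", "roc_auc"):
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```

A median split yields nearly balanced classes, so accuracy is a reasonable headline metric here; the direction of the inequality is the easy part to get wrong, since a higher rank number means a less happy country.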

#### 7.5 Model Selection, Outputs, and Reproducibility

* Best regression model is chosen by lowest cross-validated RMSE, then evaluated on the held-out test set; results are saved to CSV and the best model is saved as a `.joblib` artifact.
* Phase 3 computes **permutation feature importance** (30 repeats) for interpretability and saves it to `ml_feature_importance.csv`.
* Classification results are also cross-validated and the best classifier is saved similarly.
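The selection, importance, and saving steps above might fit together as in this sketch. The data is synthetic, only two candidate models are compared for brevity, the `best_regressor.joblib` file name is hypothetical, and computing importance on the held-out test set is an assumption rather than something the report states.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data in which only "Property Rights" carries signal (by construction).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.uniform(0, 100, size=(145, 3)),
    columns=["Property Rights", "Government Integrity", "Business Freedom"],
)
y = 150 - 1.2 * X["Property Rights"] + rng.normal(0, 8, 145)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Pick the model with the lowest mean cross-validated RMSE.
cv_rmse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
best_name = min(cv_rmse, key=cv_rmse.get)
best_model = candidates[best_name].fit(X_train, y_train)

# Permutation importance: drop in score when one column is shuffled, 30 repeats.
result = permutation_importance(
    best_model, X_test, y_test, n_repeats=30, random_state=42
)
importance = pd.DataFrame({
    "feature": X.columns,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std,
}).sort_values("importance_mean", ascending=False)

# Save artifacts (to a temp directory here; a real run would use the project folder).
out_dir = tempfile.mkdtemp()
importance_path = os.path.join(out_dir, "ml_feature_importance.csv")
model_path = os.path.join(out_dir, "best_regressor.joblib")
importance.to_csv(importance_path, index=False)
joblib.dump(best_model, model_path)

print(best_name)
print(importance)
```

Permutation importance reflects how much the chosen model relies on each column. When features are correlated, as freedom sub-scores plausibly are, shuffling one can understate its importance because the others still carry similar information.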

> Note: The Phase 3 file defines the evaluation framework and saving behavior, but the actual numeric results (e.g., CV_RMSE, Test R², Test AUC) depend on running the pipeline on the produced dataset.

---

### 8. Discussion

The project’s statistical testing indicates a clear difference in happiness rank between high- and low-freedom groups in 2019. Phase 3 complements this by measuring predictive strength and producing a ranked list of the most predictive freedom dimensions via permutation importance.

**Interpretation guidance:**

* These results demonstrate *association*, not causality. Cross-country happiness is influenced by many confounders (income, institutions, health, conflict, inequality, etc.), some of which are partially correlated with freedom indicators.
* Using rank as the primary target is defensible (robust, interpretable), but score is continuous and may improve regression behavior; the README frames score as “preferred,” while Phase 3 currently targets rank.

---

### 9. Limitations

1. **Single-year scope (2019):** Limits generalizability across time; relationships may change year-to-year.
2. **Country coverage:** Inner join keeps only overlapping countries (~145), which may bias toward countries with complete reporting.
3. **Potential confounding:** Economic freedom correlates with development and institutional quality; controlling variables would be required for causal claims.
4. **Metric mismatch:** README emphasizes `Happiness_Score` as primary continuous target, but Phase 3 implementation currently predicts `Happiness_Rank`.

---

### 10. Conclusion

Using 2019 data, this project finds that countries with **higher economic freedom** are **significantly happier** when happiness is measured by global happiness rank. The ML extension adds a predictive framework with cross-validation, model comparison, and interpretability via permutation feature importance. This moves the project from “is there a difference?” to “which freedom factors matter most, and how predictable is happiness from them?”

