# Steps 2 & 3: Earnings and Costs Data

## Objective
Enrich the panel with economic determinants: Earnings (Benefits) and Costs (Tuition + Living).

## Data Sources
1.  **ILO / World Bank**: Median earnings by country.
2.  **Education Costs Database**: Tuition fees for international students.
3.  **Numbeo / World Bank**: Cost of living indices.

## Methodology
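- **Currency Conversion**: Earnings enter in PPP-adjusted international dollars (the `GDPPerCapitaPPPInt_*` columns) and cost items in USD (the `*_USD` columns), so each variable is measured in a common unit across countries.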
- **Inflation Adjustment**: Values are adjusted to constant base-year dollars.
- **Gap Calculation**: For each origin-destination pair, we compute the log *difference* between Destination and Origin:
    - `log_earnings_diff` = $\ln(Earnings_{dest}) - \ln(Earnings_{orig})$
    - `log_tuition_diff` = $\ln(Tuition_{dest}) - \ln(Tuition_{orig})$
    - `log_living_diff` = $\ln(Living_{dest}) - \ln(Living_{orig})$
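
The gap variables defined above could be computed roughly as follows. This is a minimal sketch, not the notebook's own code: the helper `log_gap` and the toy rows are hypothetical, and the column names assume the `*_dest` / `*_orig` pattern of the saved fact table. Because some countries report zero tuition (Argentina and Uruguay in the cost table), the sketch masks non-positive values to `NaN` rather than letting `ln(0)` produce `-inf`. Whether to drop, mask, or shift such observations is a modelling choice left open here.

```python
import numpy as np
import pandas as pd


def log_gap(dest: pd.Series, orig: pd.Series) -> pd.Series:
    """Return ln(dest) - ln(orig), with NaN where either value is not positive."""
    dest = dest.where(dest > 0)  # non-positive values become NaN
    orig = orig.where(orig > 0)
    return np.log(dest) - np.log(orig)


# Toy origin-destination rows (illustrative numbers, not project data).
toy = pd.DataFrame(
    {
        "earnings_dest": [60000.0, 40000.0],
        "earnings_orig": [20000.0, 40000.0],
        "cost_tuition_dest": [12000.0, 0.0],
        "cost_tuition_orig": [3000.0, 1500.0],
        "cost_living_dest": [18000.0, 9000.0],
        "cost_living_orig": [6000.0, 9000.0],
    }
)

toy["log_earnings_diff"] = log_gap(toy["earnings_dest"], toy["earnings_orig"])
toy["log_tuition_diff"] = log_gap(toy["cost_tuition_dest"], toy["cost_tuition_orig"])
toy["log_living_diff"] = log_gap(toy["cost_living_dest"], toy["cost_living_orig"])
```

In the second toy row the destination tuition is zero, so `log_tuition_diff` is `NaN` while the other two gaps are exactly zero.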

In [1]:
from pathlib import Path
import re

import numpy as np
import pandas as pd
from IPython.display import display

BASE_DIR = Path("/Users/simonedinato/Documents/Classes/Applied Econometrics/Project")
DATA_DIR = BASE_DIR / "Datasets"

## 2.1 Earnings Data
We load the median-income file and use GDP per capita (PPP) as the earnings measure: the 2022 value is taken first, and the 2023 value fills in where 2022 is missing.

In [2]:
EARNINGS_PATH = DATA_DIR / "02_earnings_gap"
earn_df = pd.read_csv(EARNINGS_PATH / "median-income-by-country-2025.csv")

In [3]:
# Earnings proxy: 2022 GDP per capita (PPP), with the 2023 value as fallback.
earn_df["earnings_val"] = earn_df["GDPPerCapitaPPPInt_2022"]
earn_df["earnings_val"] = earn_df["earnings_val"].fillna(earn_df["GDPPerCapitaPPPInt_2023"])

In [4]:
earnings_country = earn_df.set_index("country")["earnings_val"]
earnings_country

country
Luxembourg                  143381.90
United Arab Emirates         73777.74
Norway                      123150.20
Switzerland                  91326.13
United States                77860.91
                              ...    
Turks and Caicos Islands     30220.82
Marshall Islands              7332.30
San Marino                   75941.49
Palau                        16581.02
Nauru                        13346.98
Name: earnings_val, Length: 200, dtype: float64

## 2.2 Costs Data
We load tuition and living cost data. Tuition figures are those charged to *international* students, which are often higher than domestic tuition. Living costs are proxied by annual rent plus annualized visa and insurance fees.

In [5]:
COSTS_PATH = DATA_DIR / "03_cost_gap"
costs = pd.read_csv(COSTS_PATH / "International_Education_Costs.csv")

In [6]:
# Harmonize country names with the spelling used in the earnings table.
costs["Country"] = costs["Country"].replace({"USA": "United States", "UK": "United Kingdom"})

In [7]:
# Annualize: tuition and visa/insurance fees are divided by program length,
# monthly rent is multiplied by 12.
costs["Annual_Tuition"] = costs["Tuition_USD"] / costs["Duration_Years"]
costs["Annual_Rent"] = costs["Rent_USD"] * 12
costs["Annual_Visas_Ins"] = (costs["Visa_Fee_USD"] + costs["Insurance_USD"]) / costs["Duration_Years"]
costs["Annual_Total_Cost"] = (
    costs["Annual_Tuition"]
    + costs["Annual_Rent"]
    + costs["Annual_Visas_Ins"]
)


In [8]:
# Collapse to one row per country using the median across programs.
country_cost = (
    costs.groupby("Country")[["Annual_Total_Cost", "Annual_Tuition", "Annual_Rent", "Annual_Visas_Ins"]]
    .median()
)
# Living cost is the sum of the two medians, so tuition + living need not equal cost_val.
country_cost["cost_living"] = country_cost["Annual_Rent"] + country_cost["Annual_Visas_Ins"]
country_cost = country_cost.rename(columns={"Annual_Total_Cost": "cost_val", "Annual_Tuition": "cost_tuition"})
country_cost = country_cost[["cost_val", "cost_tuition", "cost_living"]]
country_cost

| Country | cost_val | cost_tuition | cost_living |
|---|---|---|---|
| Algeria | 2800.00 | 400.00 | 2493.33 |
| Argentina | 3425.00 | 0.00 | 3332.50 |
| Australia | 28887.50 | 12166.67 | 17166.67 |
| Austria | 10436.67 | 500.00 | 9696.67 |
| Bahrain | 10332.50 | 2250.00 | 7622.50 |
| ... | ... | ... | ... |
| United Kingdom | 25495.00 | 9633.33 | 13630.00 |
| United States | 38332.00 | 12875.00 | 23215.00 |
| Uruguay | 5510.00 | 0.00 | 4780.00 |
| Uzbekistan | 4305.00 | 1600.00 | 2845.00 |


## 2.3 Merge

In [9]:
MOBILITY_PATH = DATA_DIR / "01_mobility_OD" / "merged_mobility.csv"
fact = pd.read_csv(MOBILITY_PATH)

# Merge Earnings
fact = fact.merge(earnings_country.rename("earnings_dest"), left_on="destination_country", right_index=True, how="left")
fact = fact.merge(earnings_country.rename("earnings_orig"), left_on="origin_country", right_index=True, how="left")

# Merge Costs
fact = fact.merge(country_cost.add_suffix("_dest"), left_on="destination_country", right_index=True, how="left")
fact = fact.merge(country_cost.add_suffix("_orig"), left_on="origin_country", right_index=True, how="left")

# Save
FACT_DIR = DATA_DIR / "07_fact_tables"
FACT_DIR.mkdir(parents=True, exist_ok=True)
fact.to_csv(FACT_DIR / "od_fact_table.csv", index=False)
print(f"Saved fact table to {FACT_DIR / 'od_fact_table.csv'} with columns:", fact.columns.tolist())

Saved fact table to /Users/simonedinato/Documents/Classes/Applied Econometrics/Project/Datasets/07_fact_tables/od_fact_table.csv with columns: ['indicatorId', 'origin_country_code', 'year', 'students_outbound_total', 'qualifier', 'magnitude', 'origin_country', 'destination_country_code', 'destination_country', 'students_inbound_destination', 'share_inbound_destination', 'students_enrolled', 'students_graduated', 'students_new_entrants', 'flow_source', 'share_mobile_destination', 'share_mobile_origin', 'students_national_abroad', 'weight_od', 'earnings_dest', 'earnings_orig', 'cost_val_dest', 'cost_tuition_dest', 'cost_living_dest', 'cost_val_orig', 'cost_tuition_orig', 'cost_living_orig']
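
Left merges on country names silently leave `NaN` wherever the spelling differs between tables (recoding `USA` to `United States` in the costs cell handles two such cases). One way to see which names fail to match is sketched below. It is an illustration on toy data, not part of the original pipeline: the frames `flows` and `earnings` and the helper `unmatched_keys` are hypothetical stand-ins for `fact` and `earnings_country`, and the earnings values are made up.

```python
import pandas as pd


def unmatched_keys(left: pd.DataFrame, key: str, lookup: pd.Series) -> list:
    """List the distinct values of left[key] that are absent from lookup's index."""
    missing = ~left[key].isin(lookup.index)
    return sorted(left.loc[missing, key].dropna().unique().tolist())


# Toy stand-ins for the mobility table and the earnings lookup.
flows = pd.DataFrame(
    {
        "origin_country": ["Italy", "Viet Nam", "Italy"],
        "destination_country": ["United States", "Italy", "UK"],
    }
)
earnings = pd.Series(
    {"Italy": 50000.0, "United States": 78000.0, "Vietnam": 13000.0},
    name="earnings_val",
)

print(unmatched_keys(flows, "origin_country", earnings))       # ['Viet Nam']
print(unmatched_keys(flows, "destination_country", earnings))  # ['UK']

# validate="many_to_one" raises if the lookup has duplicate country names,
# which would otherwise duplicate rows in the merged table.
merged = flows.merge(
    earnings.rename("earnings_dest"),
    left_on="destination_country",
    right_index=True,
    how="left",
    validate="many_to_one",
)
```

Running the same check against the real tables before saving would show how many origin-destination pairs end up with missing earnings or cost values purely because of naming differences.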
