# Step 6: Macro and Gravity Variables

## Objective
Finalize the dataset by adding standard Gravity Model variables and Macroeconomic controls.

## Data Sources
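1.  **CEPII geographic data** (`geo_cepii.xls`):
    - **Distance**: Great-circle (Haversine) distance in km between the latitude/longitude coordinates the file reports for each country ($dist_{ij}$).
    - **Common Language**: Equals 1 when the two countries share at least one official language, based on `langoff_1` to `langoff_3` ($comlang\_off$).
    - **Colonial Ties**: Equals 1 when either country appears among the other's colonizers, based on `colonizer1` to `colonizer4` ($colony$).
2.  **World Bank WDI**: GDP per capita (PPP) for both origin and destination.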

## Methodology
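- **Gravity Variables**: Time-invariant, pair-specific determinants of migration. They are computed once per origin-destination pair and repeated across years.
- **GDP per Capita**: Used as a proxy for general economic development and quality of life. The notebook also derives the destination-origin gap `log_gdp_gap` and the ratio `gdp_pc_ratio`.
- **Final Merge**: All components (Mobility, Earnings, Costs, Employability, Gravity) are merged into the final `od_fact_table.csv` used for regression analysis. The GDP controls are appended to the same file, which is read and overwritten in place.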
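
As a quick sanity check on the distance definition, the sketch below mirrors the notebook's Haversine formula and applies it to two illustrative points. The function name `haversine_km` and the Paris and London coordinates are assumptions made for this example; they are not taken from the CEPII file. The result should come out at roughly 340 to 345 km.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlambda = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Approximate city coordinates, used only for illustration
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
d_paris_london = haversine_km(*paris, *london)
print(f"Paris to London: {d_paris_london:.0f} km")
```

Because the function only uses NumPy ufuncs, it also accepts arrays, which allows a whole column of pairs to be processed in one call.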

In [1]:
import pandas as pd
import numpy as np
import re
from pathlib import Path
from IPython.display import display

PROJECT_ROOT = Path("/Users/simonedinato/Documents/Classes/Applied Econometrics/Project")
DATASETS_DIR = PROJECT_ROOT / "Datasets"
FACT_PATH = DATASETS_DIR / "07_fact_tables/od_fact_table.csv"
QC_PATH = DATASETS_DIR / "07_fact_tables/od_fact_table_macro_qc.csv"

pd.options.display.float_format = "{:,.4f}".format


In [2]:
fact = pd.read_csv(FACT_PATH)
print(f"Loaded fact table: {fact.shape[0]:,} rows × {fact.shape[1]} columns")

iso_codes = set(fact["origin_country_code"]).union(fact["destination_country_code"])
country_lookup = (
    pd.concat(
        [
            fact[["origin_country_code", "origin_country"]].rename(
                columns={"origin_country_code": "country_code", "origin_country": "country"}
            ),
            fact[["destination_country_code", "destination_country"]].rename(
                columns={"destination_country_code": "country_code", "destination_country": "country"}
            ),
        ],
        ignore_index=True,
    )
    .drop_duplicates(subset="country_code")
    .set_index("country_code")
    .to_dict()["country"]
)

print(f"Unique ISO3 codes in fact table: {len(iso_codes)}")


Loaded fact table: 134,820 rows × 27 columns
Unique ISO3 codes in fact table: 210


In [3]:
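# Construct Gravity Variables from Country Data
CEPII_PATH = DATASETS_DIR / "08_geo/geo_cepii.xls"
# Assumption: missing cells in the CEPII sheet may be stored as "." rather than left blank.
# Reading that token as NaN lets the dropna() calls below discard it, so two countries
# cannot "share" a placeholder as a common official language.
geo = pd.read_excel(CEPII_PATH, na_values=["."])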

# 1. Prepare Country Data
geo_cols = ["iso3", "lat", "lon", "langoff_1", "langoff_2", "langoff_3", 
            "colonizer1", "colonizer2", "colonizer3", "colonizer4"]
geo_clean = geo[geo_cols].copy()
geo_clean["iso3"] = geo_clean["iso3"].astype(str).str.upper()
geo_clean = geo_clean.drop_duplicates(subset="iso3").set_index("iso3")

# 2. Define Distance Function (Haversine)
def calculate_distance(lat1, lon1, lat2, lon2):
    R = 6371  # km
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))

# 3. Process Fact Table Rows
# We iterate over unique OD pairs in fact to avoid massive Cartesian product
od_pairs = fact[["origin_country_code", "destination_country_code"]].drop_duplicates()

gravity_data = []
for _, row in od_pairs.iterrows():
    o, d = row["origin_country_code"], row["destination_country_code"]
    if o not in geo_clean.index or d not in geo_clean.index:
        continue
    
    # Distance
    dist = calculate_distance(
        geo_clean.at[o, "lat"], geo_clean.at[o, "lon"],
        geo_clean.at[d, "lat"], geo_clean.at[d, "lon"]
    )
    
    # Common Language
    langs_o = set(geo_clean.loc[o, ["langoff_1", "langoff_2", "langoff_3"]].dropna())
    langs_d = set(geo_clean.loc[d, ["langoff_1", "langoff_2", "langoff_3"]].dropna())
    comlang = 1 if not langs_o.isdisjoint(langs_d) else 0
    
    # Colony (Direct)
    # Check if O colonized D or D colonized O
    cols_o = set(geo_clean.loc[o, ["colonizer1", "colonizer2", "colonizer3", "colonizer4"]].dropna())
    cols_d = set(geo_clean.loc[d, ["colonizer1", "colonizer2", "colonizer3", "colonizer4"]].dropna())
    
    colony = 0
    if o in cols_d or d in cols_o:
        colony = 1
        
    gravity_data.append({
        "origin_country_code": o,
        "destination_country_code": d,
        "dist": dist,
        "comlang_off": comlang,
        "colony": colony
    })

gravity_df = pd.DataFrame(gravity_data)

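# 4. Merge (drop gravity columns left by a previous run first, so re-executing
#    the notebook does not create dist_x / dist_y duplicates)
fact = fact.drop(columns=["dist", "comlang_off", "colony"], errors="ignore").merge(
    gravity_df, on=["origin_country_code", "destination_country_code"], how="left"
)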
print(f"Constructed gravity variables for {len(gravity_df)} pairs.")
print("New columns:", gravity_df.columns.tolist())

Constructed gravity variables for 27263 pairs.
New columns: ['origin_country_code', 'destination_country_code', 'dist', 'comlang_off', 'colony']
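
The `fact.head()` preview near the end of this notebook shows `comlang_off = 1.0` for Aruba paired with Albania, Andorra, the United Arab Emirates, Argentina and Armenia. One possible cause, not verified here, is that missing language cells hold a placeholder string that every country then appears to share. The sketch below shows the pair logic with such tokens filtered out. The toy table, the `PLACEHOLDERS` set and the helper names are hypothetical, and the toy values are illustrative rather than copied from CEPII.

```python
import pandas as pd

# Hypothetical mini extract that mimics the CEPII columns used above
toy_geo = pd.DataFrame(
    {
        "iso3": ["ABW", "NLD", "ALB"],
        "langoff_1": ["Dutch", "Dutch", "Albanian"],
        "langoff_2": [".", ".", "."],
        "colonizer1": ["NLD", ".", "TUR"],
    }
).set_index("iso3")

PLACEHOLDERS = {"", "."}  # assumed missing-value tokens


def clean_set(values):
    """Return the non-missing, non-placeholder values as a set of strings."""
    return {
        str(v).strip()
        for v in values
        if pd.notna(v) and str(v).strip() not in PLACEHOLDERS
    }


def pair_flags(geo, o, d):
    """Compute comlang_off and colony for one origin-destination pair."""
    langs_o = clean_set(geo.loc[o, ["langoff_1", "langoff_2"]])
    langs_d = clean_set(geo.loc[d, ["langoff_1", "langoff_2"]])
    cols_o = clean_set(geo.loc[o, ["colonizer1"]])
    cols_d = clean_set(geo.loc[d, ["colonizer1"]])
    return {
        "comlang_off": int(not langs_o.isdisjoint(langs_d)),
        "colony": int(o in cols_d or d in cols_o),
    }


abw_nld = pair_flags(toy_geo, "ABW", "NLD")
abw_alb = pair_flags(toy_geo, "ABW", "ALB")
print(abw_nld, abw_alb)
```

Without the placeholder filter, the `"."` entries in `langoff_2` would make every pair in the toy table look like a common-language pair.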


In [4]:
GDP_PRIMARY_PATH = DATASETS_DIR / "06_macro_ppp/GDP_per_capita/gdp_per_capita.csv"
GDP_WDI_PATH = DATASETS_DIR / (
    "06_macro_ppp/P_Data_Extract_From_World_Development_Indicators/"
    "8b934fbb-2876-49d1-8b77-887ed9f8f17b_Data.csv"
)

GDP_SERIES_PRIORITY = {
    "NY.GDP.PCAP.PP.KD": 0,  # PPP constant international $
    "NY.GDP.PCAP.PP.CD": 1,  # PPP current international $
    "NY.GDP.PCAP.KD": 2,     # Constant US$ (non-PPP)
    "NY.GDP.PCAP.CD": 3,     # Current US$
}

GDP_SOURCE_PRIORITY = {
    "gdp_per_capita.csv": 0,
    "WDI_extract.csv": 1,
}

GDP_YEAR_TOLERANCE = 1


def reshape_wide(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    year_cols = []
    for col in df.columns:
        col_str = str(col).strip()
        if re.match(r"^\d{4}(\s*\[YR\d{4}\])?$", col_str):
            year_cols.append(col)
    if not year_cols:
        return pd.DataFrame(columns=["country_code", "country", "series_name", "series_code", "year", value_name])
    tidy = df.melt(
        id_vars=[c for c in df.columns if c not in year_cols],
        value_vars=year_cols,
        var_name="year_raw",
        value_name=value_name,
    )
    tidy["year"] = tidy["year_raw"].astype(str).str.extract(r"^(\d{4})").astype(float)
    tidy = tidy.dropna(subset=["year"])
    tidy["year"] = tidy["year"].astype(int)
    tidy[value_name] = pd.to_numeric(tidy[value_name], errors="coerce")
    tidy = tidy.dropna(subset=[value_name])
    return tidy


def detect_ppp(series_name: str, series_code: str) -> int:
    tokens = f"{series_name} {series_code}".upper()
    flags = ["PPP", "PURCHASING POWER", "INT$", "INTERNATIONAL $"]
    return int(any(token in tokens for token in flags) or ".PP." in series_code.upper())


def series_priority(series_code: str, mapping: dict, default_start: int = 9) -> int:
    for key, priority in mapping.items():
        if series_code == key:
            return priority
    return default_start


def parse_oecd_like(df: pd.DataFrame) -> pd.DataFrame:
    code_candidates = ["LOCATION", "Country Code", "Code", "ISO3", "country_code"]
    year_candidates = ["TIME", "Year", "year", "Reference period"]
    value_candidates = ["Value", "OBS_VALUE", "value", "GDP_PC", "GDP_PER_CAPITA"]
    name_candidates = ["Country", "Country Name", "COUNTRY", "Label", "Country_Label"]
    measure_candidates = ["MEASURE", "Measure", "MEAS"]

    code_col = next((c for c in code_candidates if c in df.columns), None)
    year_col = next((c for c in year_candidates if c in df.columns), None)
    val_col = next((c for c in value_candidates if c in df.columns), None)

    if not all([code_col, year_col, val_col]):
        return pd.DataFrame(columns=[
            "country_code",
            "year",
            "gdp_pc_val",
            "country",
            "series_name",
            "series_code",
            "gdp_pc_ppp_flag",
            "series_priority",
            "source",
        ])

    out = df[[code_col, year_col, val_col]].copy()
    out.columns = ["country_code", "year", "gdp_pc_val"]
    out["country_code"] = out["country_code"].astype(str).str.upper()
    out["year"] = pd.to_numeric(out["year"], errors="coerce")
    out["gdp_pc_val"] = pd.to_numeric(out["gdp_pc_val"], errors="coerce")

    name_col = next((c for c in name_candidates if c in df.columns and c != code_col), None)
    if name_col:
        out["country"] = df.loc[out.index, name_col]
    else:
        out["country"] = out["country_code"]

    measure_col = next((c for c in measure_candidates if c in df.columns), None)
    if measure_col:
        measure = df.loc[out.index, measure_col].astype(str)
        out["gdp_pc_ppp_flag"] = measure.str.contains("PPP", case=False, na=False).astype(int)
    else:
        out["gdp_pc_ppp_flag"] = np.nan

    out = out.dropna(subset=["year", "gdp_pc_val"])
    out["year"] = out["year"].astype(int)

    out["series_name"] = "OECD GDP per capita"
    out["series_code"] = "OECD_GDPPERCAP"
    out["series_priority"] = 0
    out["source"] = "gdp_per_capita.csv"
    return out


def expand_with_year_tolerance(
    df: pd.DataFrame, tolerance: int = GDP_YEAR_TOLERANCE
) -> pd.DataFrame:
    if df.empty:
        return df.copy()

    frames = []
    base = df.copy()
    base["year_source"] = base["year"]
    base["year_distance"] = 0
    base["offset_priority"] = 0
    frames.append(base)

    if tolerance > 0:
        for offset in range(1, tolerance + 1):
            lag = df.copy()
            lag["year_source"] = lag["year"]
            lag["year"] = lag["year"] + offset
            lag["year_distance"] = offset
            lag["offset_priority"] = 1  # prefer lagged values before leads
            frames.append(lag)

            lead = df.copy()
            lead["year_source"] = lead["year"]
            lead["year"] = lead["year"] - offset
            lead = lead[lead["year"].notna()]
            lead = lead[lead["year"] > 0]
            if not lead.empty:
                lead["year_distance"] = offset
                lead["offset_priority"] = 2
                frames.append(lead)

    expanded = pd.concat(frames, ignore_index=True)
    expanded = expanded.dropna(subset=["year"])
    expanded["year"] = expanded["year"].astype(int)
    expanded = expanded.sort_values([
        "country_code",
        "year",
        "year_distance",
        "offset_priority",
    ])
    expanded = expanded.drop_duplicates(subset=["country_code", "year"], keep="first")
    expanded = expanded.drop(columns=["offset_priority"], errors="ignore")
    return expanded
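
The year-tolerance expansion can be illustrated on a toy series. The following is a compact re-implementation written for this example only (the name `nearest_year_fill` and the toy values are hypothetical); it is meant to show the selection rule, not to replace `expand_with_year_tolerance`.

```python
import pandas as pd


def nearest_year_fill(df, tolerance=1):
    """Let each (country_code, year) value also serve neighbouring years."""
    frames = []
    for offset in range(-tolerance, tolerance + 1):
        shifted = df.copy()
        shifted["year_source"] = shifted["year"]
        shifted["year"] = shifted["year"] + offset
        shifted["year_distance"] = abs(offset)
        # offset > 0: an earlier observation fills a later year (a lag), preferred over leads
        shifted["prefer"] = 0 if offset == 0 else (1 if offset > 0 else 2)
        frames.append(shifted)
    out = pd.concat(frames, ignore_index=True)
    out = out.sort_values(["country_code", "year", "year_distance", "prefer"])
    out = out.drop_duplicates(subset=["country_code", "year"], keep="first")
    return out.drop(columns="prefer").reset_index(drop=True)


toy = pd.DataFrame(
    {"country_code": ["AAA", "AAA"], "year": [2018, 2020], "gdp_pc_val": [100.0, 120.0]}
)
filled = nearest_year_fill(toy)
print(filled)
```

In this toy case the missing year 2019 has two candidates at distance one (2018 and 2020). The sort order picks the 2018 value, because lagged observations rank ahead of leads.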



In [5]:
gdp_frames = []

if GDP_PRIMARY_PATH.exists():
    gdp_primary = pd.read_csv(GDP_PRIMARY_PATH)
    if {"Series Name", "Series Code", "Country Name", "Country Code"}.issubset(gdp_primary.columns):
        gdp_primary = gdp_primary.rename(
            columns={
                "Series Name": "series_name",
                "Series Code": "series_code",
                "Country Name": "country",
                "Country Code": "country_code",
            }
        )
        gdp_primary_long = reshape_wide(gdp_primary, "gdp_pc_val")
        if not gdp_primary_long.empty:
            gdp_primary_long["gdp_pc_ppp_flag"] = gdp_primary_long.apply(
                lambda row: detect_ppp(row["series_name"], row["series_code"]), axis=1
            )
            gdp_primary_long["series_priority"] = gdp_primary_long["series_code"].apply(
                lambda code: series_priority(code, GDP_SERIES_PRIORITY)
            )
            gdp_primary_long["source"] = "gdp_per_capita.csv"
            gdp_frames.append(gdp_primary_long)
        else:
            print("Warning: gdp_per_capita.csv had no year columns recognised after reshape.")
    else:
        gdp_primary_long = parse_oecd_like(gdp_primary)
        if gdp_primary_long.empty:
            print("Warning: gdp_per_capita.csv format not recognised; skipping.")
        else:
            gdp_frames.append(gdp_primary_long)
else:
    print("Warning: GDP primary path missing")

if GDP_WDI_PATH.exists():
    wdi_raw = pd.read_csv(GDP_WDI_PATH)
    wdi_raw = wdi_raw.rename(
        columns={
            "Country Name": "country",
            "Country Code": "country_code",
            "Series Name": "series_name",
            "Series Code": "series_code",
        }
    )
    gdp_wdi = wdi_raw[
        wdi_raw["series_code"].str.startswith("NY.GDP.PCAP", na=False)
        & ~wdi_raw["series_name"].str.contains("growth", case=False, na=False)
    ].copy()
    if not gdp_wdi.empty:
        gdp_wdi_long = reshape_wide(gdp_wdi, "gdp_pc_val")
        gdp_wdi_long["gdp_pc_ppp_flag"] = gdp_wdi_long.apply(
            lambda row: detect_ppp(row["series_name"], row["series_code"]), axis=1
        )
        gdp_wdi_long["series_priority"] = gdp_wdi_long["series_code"].apply(
            lambda code: series_priority(code, GDP_SERIES_PRIORITY)
        )
        gdp_wdi_long["source"] = "WDI_extract.csv"
        gdp_frames.append(gdp_wdi_long)
else:
    print("Warning: WDI GDP path missing")

print(f"GDP tables loaded: {len(gdp_frames)} sources")


GDP tables loaded: 1 sources
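
The wide-to-long reshape applied when loading the GDP files can be sketched on a toy frame. The column values below are invented, and the `".."` cell assumes the usual World Bank marker for missing data; the regular expression is the one used in `reshape_wide`.

```python
import re

import pandas as pd

# Toy wide table mixing the two header styles the notebook accepts
toy_wide = pd.DataFrame(
    {
        "country_code": ["AAA", "BBB"],
        "series_code": ["NY.GDP.PCAP.PP.KD", "NY.GDP.PCAP.PP.KD"],
        "2018 [YR2018]": ["1000.5", ".."],  # ".." is assumed to mark a missing value
        "2019": [1100.0, 2100.0],
    }
)

year_pattern = re.compile(r"^\d{4}(\s*\[YR\d{4}\])?$")
year_cols = [c for c in toy_wide.columns if year_pattern.match(str(c).strip())]

toy_long = toy_wide.melt(
    id_vars=[c for c in toy_wide.columns if c not in year_cols],
    value_vars=year_cols,
    var_name="year_raw",
    value_name="gdp_pc_val",
)
# Keep the leading four digits as the year and coerce non-numeric cells to NaN
toy_long["year"] = toy_long["year_raw"].str.extract(r"^(\d{4})", expand=False).astype(int)
toy_long["gdp_pc_val"] = pd.to_numeric(toy_long["gdp_pc_val"], errors="coerce")
toy_long = toy_long.dropna(subset=["gdp_pc_val"]).reset_index(drop=True)
print(toy_long[["country_code", "year", "gdp_pc_val"]])
```

The row for `BBB` in 2018 drops out because its value cannot be parsed as a number.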


In [6]:
def select_best_series(group: pd.DataFrame, value_col: str, source_priority: dict) -> pd.Series:
    group = group.copy()
    flag_col = value_col.replace("_val", "_ppp_flag")
    has_flag = flag_col in group.columns
    if has_flag:
        group["flag_rank"] = 1 - group[flag_col].fillna(0)
    else:
        group["flag_rank"] = 1
    max_source_priority = max(source_priority.values()) if source_priority else 0
    group["source_rank"] = group["source"].map(source_priority).fillna(max_source_priority + 1)
    group = group.sort_values(["flag_rank", "series_priority", "source_rank"])  # lower is better
    best_priority = group.iloc[0]["series_priority"]
    best_subset = group[group["series_priority"] == best_priority]
    best_value = best_subset[value_col].median()
    top_row = best_subset.sort_values(["flag_rank", "source_rank"]).iloc[0]
    result = {
        value_col: best_value,
        "country": top_row.get("country"),
        "series_name": top_row["series_name"],
        "series_code": top_row["series_code"],
        "source": top_row["source"],
        "series_priority": best_priority,
    }
    if has_flag:
        result[flag_col] = int(best_subset[flag_col].max())
    return pd.Series(result)
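
The core idea of the series selection, keeping the highest-priority series available for each country-year, can be sketched with a sort followed by `drop_duplicates`. This is a deliberate simplification of `select_best_series`: it ignores the PPP flag rank, the source rank and the median step, and the toy values are hypothetical.

```python
import pandas as pd

SERIES_PRIORITY = {
    "NY.GDP.PCAP.PP.KD": 0,
    "NY.GDP.PCAP.PP.CD": 1,
    "NY.GDP.PCAP.KD": 2,
    "NY.GDP.PCAP.CD": 3,
}

toy = pd.DataFrame(
    {
        "country_code": ["AAA", "AAA", "BBB"],
        "year": [2020, 2020, 2020],
        "series_code": ["NY.GDP.PCAP.CD", "NY.GDP.PCAP.PP.KD", "NY.GDP.PCAP.CD"],
        "gdp_pc_val": [900.0, 1500.0, 700.0],
    }
)
# Unknown series codes get a low-priority default of 9, as in series_priority()
toy["series_priority"] = toy["series_code"].map(SERIES_PRIORITY).fillna(9).astype(int)

best = (
    toy.sort_values(["country_code", "year", "series_priority"])
    .drop_duplicates(subset=["country_code", "year"], keep="first")
    .reset_index(drop=True)
)
print(best)
```

Country `AAA` keeps its PPP series, while `BBB` falls back to the only series it has.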



In [7]:
if not gdp_frames:
    raise RuntimeError("No GDP per capita tables were loaded.")

gdp_all = pd.concat(gdp_frames, ignore_index=True)
gdp_all["country_code"] = gdp_all["country_code"].str.upper()
gdp_all = gdp_all[gdp_all["country_code"].str.fullmatch(r"[A-Z]{3}")]
gdp_all = gdp_all[gdp_all["country_code"].isin(iso_codes)]

best_gdp = (
    gdp_all.groupby(["country_code", "year"], group_keys=False)
    .apply(lambda grp: select_best_series(grp, "gdp_pc_val", GDP_SOURCE_PRIORITY))
    .reset_index()
)

best_gdp["country"] = best_gdp["country"].fillna(best_gdp["country_code"].map(country_lookup))
if "gdp_pc_ppp_flag" not in best_gdp.columns:
    best_gdp["gdp_pc_ppp_flag"] = np.nan

summary_cols = ["series_name", "series_code"]
if "gdp_pc_ppp_flag" in best_gdp.columns:
    summary_cols.append("gdp_pc_ppp_flag")

gdp_series_used = (
    best_gdp[summary_cols]
    .drop_duplicates()
    .sort_values("series_code")
    .reset_index(drop=True)
)

fact_years = sorted(set(fact["year"].unique()))
gdp_years = sorted(best_gdp["year"].dropna().unique())
missing_years = sorted(set(fact_years) - set(gdp_years))
missing_iso = sorted(iso_codes - set(best_gdp["country_code"]))

print(f"Normalized GDP entries: {best_gdp.shape[0]:,}")
print(gdp_series_used)
print(
    f"GDP year span: {gdp_years[0] if gdp_years else '—'} to {gdp_years[-1] if gdp_years else '—'}"
)
if missing_years:
    print(f"Fact years without GDP coverage: {missing_years}")
else:
    print("GDP coverage available for all fact years.")
if missing_iso:
    preview = missing_iso[:10]
    suffix = " ..." if len(missing_iso) > 10 else ""
    print(f"ISO3 codes missing GDP values: {preview}{suffix}")
else:
    print("All fact ISO3 codes have at least one GDP observation.")



Normalized GDP entries: 2,312
                                         series_name        series_code  \
0  GDP per capita, PPP (constant 2021 internation...  NY.GDP.PCAP.PP.KD   

   gdp_pc_ppp_flag  
0                1  
GDP year span: 1990 to 2024
GDP coverage available for all fact years.
ISO3 codes missing GDP values: ['AIA', 'COK', 'CUB', 'ERI', 'GIB', 'LIE', 'MCO', 'MSR', 'NIU', 'PRK'] ...




In [8]:
fact_macro = fact.drop(
    columns=[
        "gdp_pc_dest",
        "gdp_dest_ppp_flag",
        "gdp_dest_year_source",
        "gdp_dest_year_distance",
        "gdp_pc_orig",
        "gdp_orig_ppp_flag",
        "gdp_orig_year_source",
        "gdp_orig_year_distance",
        "log_gdp_gap",
        "gdp_pc_ratio",
    ],
    errors="ignore",
).copy()

# Merge GDP per capita for destination and origin
merge_cols = ["country_code", "year", "gdp_pc_val", "gdp_pc_ppp_flag"]

gdp_for_merge = expand_with_year_tolerance(best_gdp[merge_cols], tolerance=GDP_YEAR_TOLERANCE)

gdp_dest = gdp_for_merge.rename(
    columns={
        "country_code": "destination_country_code",
        "gdp_pc_val": "gdp_pc_dest",
        "gdp_pc_ppp_flag": "gdp_dest_ppp_flag",
        "year_source": "gdp_dest_year_source",
        "year_distance": "gdp_dest_year_distance",
    }
)

gdp_orig = gdp_for_merge.rename(
    columns={
        "country_code": "origin_country_code",
        "gdp_pc_val": "gdp_pc_orig",
        "gdp_pc_ppp_flag": "gdp_orig_ppp_flag",
        "year_source": "gdp_orig_year_source",
        "year_distance": "gdp_orig_year_distance",
    }
)

fact_macro = fact_macro.merge(
    gdp_dest,
    on=["destination_country_code", "year"],
    how="left",
)
fact_macro = fact_macro.merge(
    gdp_orig,
    on=["origin_country_code", "year"],
    how="left",
)

positive_mask = (fact_macro["gdp_pc_dest"] > 0) & (fact_macro["gdp_pc_orig"] > 0)
fact_macro.loc[positive_mask, "log_gdp_gap"] = (
    np.log(fact_macro.loc[positive_mask, "gdp_pc_dest"]) - np.log(fact_macro.loc[positive_mask, "gdp_pc_orig"])
)
fact_macro.loc[~positive_mask, "log_gdp_gap"] = np.nan
fact_macro.loc[positive_mask, "gdp_pc_ratio"] = (
    fact_macro.loc[positive_mask, "gdp_pc_dest"] / fact_macro.loc[positive_mask, "gdp_pc_orig"]
)
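
The two-sided merge and the gap calculation can be reproduced on a toy origin-destination table. The helper `attach` and all values below are hypothetical; `validate="many_to_one"` is added here as an optional safeguard that raises an error if the GDP table has duplicate country-year keys.

```python
import numpy as np
import pandas as pd

toy_fact = pd.DataFrame(
    {
        "origin_country_code": ["AAA", "BBB"],
        "destination_country_code": ["BBB", "CCC"],
        "year": [2020, 2020],
    }
)
toy_gdp = pd.DataFrame(
    {"country_code": ["AAA", "BBB"], "year": [2020, 2020], "gdp_pc_val": [1000.0, 4000.0]}
)


def attach(fact_df, gdp_df, side):
    """Left-merge GDP onto one side ('origin' or 'destination') of the OD table."""
    suffix = "orig" if side == "origin" else "dest"
    renamed = gdp_df.rename(
        columns={"country_code": f"{side}_country_code", "gdp_pc_val": f"gdp_pc_{suffix}"}
    )
    return fact_df.merge(
        renamed, on=[f"{side}_country_code", "year"], how="left", validate="many_to_one"
    )


toy_out = attach(attach(toy_fact, toy_gdp, "destination"), toy_gdp, "origin")
ok = (toy_out["gdp_pc_dest"] > 0) & (toy_out["gdp_pc_orig"] > 0)
toy_out["log_gdp_gap"] = np.where(
    ok, np.log(toy_out["gdp_pc_dest"] / toy_out["gdp_pc_orig"]), np.nan
)
print(toy_out)
```

The second row has no GDP value for destination `CCC`, so its gap stays missing, which is how coverage below 100% arises in the QC table.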



In [9]:
coverage_cols = ["gdp_pc_dest", "gdp_pc_orig", "log_gdp_gap"]
coverage = {
    col: fact_macro[col].notna().mean() for col in coverage_cols
}
coverage_df = (
    pd.Series(coverage)
    .to_frame(name="coverage_share")
    .assign(coverage_pct=lambda df: (df["coverage_share"] * 100).round(2))
)

weight_sums = fact_macro.groupby(["origin_country_code", "year"])["weight_od"].sum()
weight_violations = weight_sums[(weight_sums < 0.999) | (weight_sums > 1.001)]

duplicate_count = fact_macro.duplicated(
    subset=["origin_country_code", "destination_country_code", "year"]
).sum()

print("Coverage (% non-null):")
display(coverage_df)
print(f"Weight sum violations: {len(weight_violations)}")
print(f"Duplicate OD-year rows: {duplicate_count}")

if not weight_violations.empty:
    display(weight_violations)

if {"gdp_dest_year_source", "gdp_orig_year_source"}.issubset(fact_macro.columns):
    dest_year_diff = fact_macro["year"] - fact_macro["gdp_dest_year_source"]
    orig_year_diff = fact_macro["year"] - fact_macro["gdp_orig_year_source"]
    dest_valid = dest_year_diff.notna()
    orig_valid = orig_year_diff.notna()
    dest_shift_share = (
        ((dest_year_diff != 0) & dest_valid).sum() / dest_valid.sum()
        if dest_valid.any()
        else 0.0
    )
    orig_shift_share = (
        ((orig_year_diff != 0) & orig_valid).sum() / orig_valid.sum()
        if orig_valid.any()
        else 0.0
    )
    print(
        f"Fallback year usage (dest/orig): {dest_shift_share:.2%} / {orig_shift_share:.2%}"
    )
else:
    dest_shift_share = np.nan
    orig_shift_share = np.nan

top_positive = (
    fact_macro.dropna(subset=["log_gdp_gap"])
    .sort_values("log_gdp_gap", ascending=False)
    .head(10)
    [[
        "origin_country_code",
        "origin_country",
        "destination_country_code",
        "destination_country",
        "year",
        "log_gdp_gap",
    ]]
)

top_negative = (
    fact_macro.dropna(subset=["log_gdp_gap"])
    .sort_values("log_gdp_gap", ascending=True)
    .head(10)
    [[
        "origin_country_code",
        "origin_country",
        "destination_country_code",
        "destination_country",
        "year",
        "log_gdp_gap",
    ]]
)

print("Top 10 log_gdp_gap (dest > orig):")
display(top_positive)
print("Bottom 10 log_gdp_gap (dest < orig):")
display(top_negative)



Coverage (% non-null):


| | coverage_share | coverage_pct |
|---|---|---|
| gdp_pc_dest | 0.9751 | 97.51 |
| gdp_pc_orig | 0.9286 | 92.86 |
| log_gdp_gap | 0.9054 | 90.54 |


Weight sum violations: 0
Duplicate OD-year rows: 0
Fallback year usage (dest/orig): 0.16% / 0.08%


Top 10 log_gdp_gap (dest > orig):


| | origin_country_code | origin_country | destination_country_code | destination_country | year | log_gdp_gap |
|---|---|---|---|---|---|---|
| 8733 | BDI | Burundi | LUX | Luxembourg | 2021 | 5.0966 |
| 8865 | BDI | Burundi | SGP | Singapore | 2022 | 5.0817 |
| 8413 | BDI | Burundi | MAC | Macau (China) | 2018 | 5.081 |
| 8839 | BDI | Burundi | LUX | Luxembourg | 2022 | 5.0742 |
| 8757 | BDI | Burundi | SGP | Singapore | 2021 | 5.0658 |
| 8970 | BDI | Burundi | SGP | Singapore | 2023 | 5.0521 |
| 8519 | BDI | Burundi | MAC | Macau (China) | 2019 | 5.0503 |
| 8626 | BDI | Burundi | LUX | Luxembourg | 2020 | 5.0481 |
| 8946 | BDI | Burundi | LUX | Luxembourg | 2023 | 5.048 |
| 8517 | BDI | Burundi | LUX | Luxembourg | 2019 | 5.0441 |


Bottom 10 log_gdp_gap (dest < orig):


| | origin_country_code | origin_country | destination_country_code | destination_country | year | log_gdp_gap |
|---|---|---|---|---|---|---|
| 73196 | MAC | Macau (China) | BDI | Burundi | 2018 | -5.081 |
| 105835 | SGP | Singapore | BDI | Burundi | 2023 | -5.0521 |
| 72451 | LUX | Luxembourg | BDI | Burundi | 2023 | -5.048 |
| 71912 | LUX | Luxembourg | BDI | Burundi | 2018 | -5.0217 |
| 101341 | QAT | Qatar | BDI | Burundi | 2023 | -4.943 |
| 55759 | IRL | Ireland | BDI | Burundi | 2023 | -4.9374 |
| 105296 | SGP | Singapore | BDI | Burundi | 2018 | -4.9278 |
| 100802 | QAT | Qatar | BDI | Burundi | 2018 | -4.8851 |
| 73735 | MAC | Macau (China) | BDI | Burundi | 2023 | -4.8416 |
| 15955 | BMU | Bermuda | BDI | Burundi | 2023 | -4.8233 |
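
The weight-balance check in the QC cell above verifies that `weight_od` sums to one within each origin-year. A minimal sketch of the same test on toy data follows; the values are hypothetical, and `np.isclose` with an absolute tolerance of 0.001 stands in for the notebook's explicit 0.999 and 1.001 bounds.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "origin_country_code": ["AAA", "AAA", "BBB", "BBB"],
        "year": [2020, 2020, 2020, 2020],
        "weight_od": [0.6, 0.4, 0.7, 0.2],
    }
)

sums = toy.groupby(["origin_country_code", "year"])["weight_od"].sum()
# Flag origin-years whose destination weights do not add up to 1 (±0.001)
bad = sums[~np.isclose(sums, 1.0, rtol=0.0, atol=1e-3)]
print(bad)
```

Here only the `BBB` group is flagged, because its weights sum to 0.9.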


In [10]:
qc_metrics = {
    "rows": len(fact_macro),
    "gdp_pc_dest_coverage_pct": round(coverage["gdp_pc_dest"] * 100, 2),
    "gdp_pc_orig_coverage_pct": round(coverage["gdp_pc_orig"] * 100, 2),
    "log_gdp_gap_coverage_pct": round(coverage["log_gdp_gap"] * 100, 2),
    "weight_od_violations": int(len(weight_violations)),
    "log_gdp_gap_min": float(fact_macro["log_gdp_gap"].min(skipna=True)),
    "log_gdp_gap_max": float(fact_macro["log_gdp_gap"].max(skipna=True)),
    # pd.notna replaces the x == x idiom for detecting NaN shares
    "gdp_dest_fallback_pct": (
        round(dest_shift_share * 100, 2) if pd.notna(dest_shift_share) else np.nan
    ),
    "gdp_orig_fallback_pct": (
        round(orig_shift_share * 100, 2) if pd.notna(orig_shift_share) else np.nan
    ),
}

qc_df = pd.DataFrame([qc_metrics])
qc_df.to_csv(QC_PATH, index=False)
fact_macro.to_csv(FACT_PATH, index=False)

print(f"Saved enhanced fact table to {FACT_PATH}")
print(f"Saved QC summary to {QC_PATH}")
qc_df


Saved enhanced fact table to /Users/simonedinato/Documents/Classes/Applied Econometrics/Project/Datasets/07_fact_tables/od_fact_table.csv
Saved QC summary to /Users/simonedinato/Documents/Classes/Applied Econometrics/Project/Datasets/07_fact_tables/od_fact_table_macro_qc.csv


| | rows | gdp_pc_dest_coverage_pct | gdp_pc_orig_coverage_pct | log_gdp_gap_coverage_pct | weight_od_violations | log_gdp_gap_min | log_gdp_gap_max | gdp_dest_fallback_pct | gdp_orig_fallback_pct |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 134820 | 97.51 | 92.86 | 90.54 | 0 | -5.081 | 5.0966 | 0.16 | 0.08 |


**Notes**

- GDP per capita series are ranked by `GDP_SERIES_PRIORITY`: `NY.GDP.PCAP.PP.KD` (PPP, constant 2021 international dollars) from World Development Indicators comes first, followed by `NY.GDP.PCAP.PP.CD`, `NY.GDP.PCAP.KD`, and `NY.GDP.PCAP.CD` (current US dollars). Non-PPP fallbacks are flagged with `gdp_*_ppp_flag = 0`. In the run shown above, only `NY.GDP.PCAP.PP.KD` was selected.
- OECD-style long files are handled by a dedicated parser (`parse_oecd_like`), and year columns accept plain `YYYY` as well as `YYYY [YRYYYY]` headers so mixed exports melt correctly.
- A ±1-year tolerance fills gaps when the exact GDP year is missing, preferring the previous year's value over the following year's; helper columns (`gdp_*_year_source`, `gdp_*_year_distance`) record the year actually used.
- No standalone price level tables were found in the current macro data directory, so this step adds no `price_level_*` values or price-level ratio metrics.
- `log_gdp_gap` = `log(gdp_pc_dest) - log(gdp_pc_orig)` with natural logs; the ratio `gdp_pc_ratio` is reported when both GDP values are positive.
- QC cells above report coverage, year-fallback usage, weight-balance checks, and the most extreme destination-origin gaps for manual review.
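
The two derived GDP measures carry the same information, since one is the natural log of the other. In the notation below, the subscripts $o$, $d$ and $t$ (origin, destination, year) are introduced here for illustration only:

$$
\text{log\_gdp\_gap}_{od,t} = \ln\left(\text{gdp\_pc\_dest}_{d,t}\right) - \ln\left(\text{gdp\_pc\_orig}_{o,t}\right) = \ln\left(\text{gdp\_pc\_ratio}_{od,t}\right)
$$

As a worked example, the largest gap in the QC output, 5.0966 (Burundi to Luxembourg, 2021), corresponds to a ratio of about $e^{5.0966} \approx 163$, meaning destination GDP per capita is roughly 163 times the origin's.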


In [11]:
fact.head()

| | indicatorId | origin_country_code | year | students_outbound_total | qualifier | magnitude | origin_country | destination_country_code | destination_country | students_inbound_destination | ... | earnings_orig | cost_val_dest | cost_tuition_dest | cost_living_dest | cost_val_orig | cost_tuition_orig | cost_living_orig | dist | comlang_off | colony |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OE.5T8.40510 | ABW | 2018 | 365.0 | | | Aruba | ALB | Albania | 1969.0 | ... | 41649.45 | | | | | | | 9083.2253 | 1.0 | 0.0 |
| 1 | OE.5T8.40510 | ABW | 2018 | 365.0 | | | Aruba | AND | Andorra | 278.0 | ... | 41649.45 | | | | | | | 7565.6943 | 1.0 | 0.0 |
| 2 | OE.5T8.40510 | ABW | 2018 | 365.0 | | | Aruba | ARE | United Arab Emirates | 199958.0 | ... | 41649.45 | | | | | | | 12723.0823 | 1.0 | 0.0 |
| 3 | OE.5T8.40510 | ABW | 2018 | 365.0 | | | Aruba | ARG | Argentina | 109226.25 | ... | 41649.45 | 3425.0 | 0.0 | 3332.5 | | | | 5391.165 | 1.0 | 0.0 |
| 4 | OE.5T8.40510 | ABW | 2018 | 365.0 | | | Aruba | ARM | Armenia | 4598.0 | ... | 41649.45 | | | | | | | 11097.3784 | 1.0 | 0.0 |


In [12]:
fact.columns

Index(['indicatorId', 'origin_country_code', 'year', 'students_outbound_total',
       'qualifier', 'magnitude', 'origin_country', 'destination_country_code',
       'destination_country', 'students_inbound_destination',
       'share_inbound_destination', 'students_enrolled', 'students_graduated',
       'students_new_entrants', 'flow_source', 'share_mobile_destination',
       'share_mobile_origin', 'students_national_abroad', 'weight_od',
       'earnings_dest', 'earnings_orig', 'cost_val_dest', 'cost_tuition_dest',
       'cost_living_dest', 'cost_val_orig', 'cost_tuition_orig',
       'cost_living_orig', 'dist', 'comlang_off', 'colony'],
      dtype='object')