# Pre-Model Analysis and Variable Definitions

Before running our first regression, we define the variables used in our Gravity Model.

### Dependent Variable
- **`log_students`** ($\ln(1 + Students_{ij})$): The natural log of one plus the student flow from origin $i$ to destination $j$ (computed with `np.log1p`, so zero flows remain in the sample). This is our measure of international student mobility.

### Key Independent Variables
- **`log_tuition_diff`** ($\ln(1 + Tuition_j) - \ln(1 + Tuition_i)$): The log difference in tuition costs between destination and origin, computed with `np.log1p` so that zero tuition stays finite. We expect a negative coefficient (higher relative tuition in the destination should discourage students).
- **`log_earnings_diff`** ($\ln(Earnings_j) - \ln(Earnings_i)$): The log difference in expected earnings. We expect a positive coefficient (higher relative earnings in the destination should attract students).
- **`log_living_diff`** ($\ln(1 + Cost_j) - \ln(1 + Cost_i)$): The log difference in living costs, also computed with `np.log1p`. We expect a negative coefficient.
- **`log_dist`** ($\ln(Distance_{ij})$): The log of geographic distance. A proxy for migration costs (travel and psychological). We expect a negative coefficient.
- **`comlang_off`**: Dummy variable equal to 1 if countries share a common official language. Proxy for cultural proximity. We expect a positive coefficient.
- **`colony`**: Dummy variable equal to 1 if the pair has a colonial relationship. Proxy for historical ties. We expect a positive coefficient.
- **`log_gdp_dest`**: Log GDP per capita of the destination. A proxy for general economic opportunity and quality of life. We expect a positive coefficient.
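
To make the log-difference variables concrete, the sketch below computes them for a single hypothetical origin-destination pair. The numbers are invented for illustration and do not come from the dataset. It assumes, as in the data preparation code, that costs are transformed with `np.log1p` so that a zero cost remains finite.

```python
import numpy as np

# Hypothetical values for one origin-destination pair (not from the dataset).
tuition_orig, tuition_dest = 0.0, 9000.0      # origin charges no tuition
earnings_orig, earnings_dest = 30000.0, 45000.0

# A log difference is the log of a ratio: ln(a) - ln(b) = ln(a / b).
log_earnings_diff = np.log(earnings_dest) - np.log(earnings_orig)
assert np.isclose(log_earnings_diff, np.log(earnings_dest / earnings_orig))

# With a zero cost, the plain log is -inf and the difference is unusable...
with np.errstate(divide="ignore"):
    plain_diff = np.log(tuition_dest) - np.log(tuition_orig)

# ...whereas log1p(x) = ln(1 + x) maps zero to zero and stays finite.
log_tuition_diff = np.log1p(tuition_dest) - np.log1p(tuition_orig)

print(f"log_earnings_diff      = {log_earnings_diff:.3f}")
print(f"plain log tuition diff = {plain_diff}")
print(f"log1p tuition diff     = {log_tuition_diff:.3f}")
```

A positive value means the destination is higher than the origin on that variable, and a negative value means it is lower.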

### Data Cleaning
We drop observations with missing values or non-finite log transforms (for example, the log of a zero or negative value) in any regression variable, so that every specification is estimated on the same sample.

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from IPython.display import display

BASE_DIR = Path("/Users/simonedinato/Documents/Classes/Applied Econometrics/Project")
DATA_DIR = BASE_DIR / "Datasets"

fact_path = DATA_DIR / "07_fact_tables" / "od_fact_table.csv"

fact = pd.read_csv(fact_path)

In [2]:
# Prepare data for Regression Models
reg_data = fact.copy()

# Dependent Variable
reg_data["log_students"] = np.log1p(reg_data["students_enrolled"])

# Independent Variables
reg_data["log_earnings_diff"] = np.log(reg_data["earnings_dest"]) - np.log(reg_data["earnings_orig"])
reg_data["log_tuition_diff"] = np.log1p(reg_data["cost_tuition_dest"]) - np.log1p(reg_data["cost_tuition_orig"])
reg_data["log_living_diff"] = np.log1p(reg_data["cost_living_dest"]) - np.log1p(reg_data["cost_living_orig"])
reg_data["log_dist"] = np.log(reg_data["dist"])
reg_data["log_gdp_dest"] = np.log(reg_data["gdp_pc_dest"])

# Restricted Model Variables
reg_data["total_cost_dest"] = reg_data["cost_tuition_dest"] + reg_data["cost_living_dest"]
reg_data["total_cost_orig"] = reg_data["cost_tuition_orig"] + reg_data["cost_living_orig"]
reg_data["roi_dest"] = reg_data["earnings_dest"] / reg_data["total_cost_dest"]
reg_data["roi_orig"] = reg_data["earnings_orig"] / reg_data["total_cost_orig"]
reg_data["log_roi_diff"] = np.log(reg_data["roi_dest"]) - np.log(reg_data["roi_orig"])

regression_cols = [
    "log_students",
    "log_earnings_diff",
    "log_tuition_diff",
    "log_living_diff",
    "log_dist",
    "comlang_off",
    "colony",
    "log_gdp_dest",
    "log_roi_diff"
]

# Drop NaNs and Infinite values
reg_data = reg_data.replace([np.inf, -np.inf], np.nan)
reg_data = reg_data.dropna(subset=regression_cols)
print(f"Regression rows: {len(reg_data)}")

Regression rows: 17759
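
The row count above reflects every observation lost to a missing or non-finite value. A minimal sketch for checking which variables are responsible, using a small made-up frame in place of `reg_data` (the helper name `count_unusable` and the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def count_unusable(df: pd.DataFrame, cols: list[str]) -> pd.Series:
    """Count, per column, the rows that are NaN or +/-inf."""
    values = df[cols].replace([np.inf, -np.inf], np.nan)
    return values.isna().sum()

# Toy stand-in for reg_data (values are illustrative only).
toy = pd.DataFrame({
    "earnings_dest": [40000.0, 0.0, 52000.0, np.nan],
    "dist": [1200.0, 800.0, 0.0, 950.0],
})
with np.errstate(divide="ignore"):
    toy["log_earnings"] = np.log(toy["earnings_dest"])  # log(0) -> -inf
    toy["log_dist"] = np.log(toy["dist"])

bad = count_unusable(toy, ["log_earnings", "log_dist"])
print(bad)

# Rows surviving the same replace-then-dropna step used in the cell above.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna(
    subset=["log_earnings", "log_dist"]
)
print(f"Rows kept: {len(clean)} of {len(toy)}")
```

Running a check like this before `dropna` shows whether one sparsely populated variable is driving most of the sample loss.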



## Model 1: Base Gravity Model
Cost and earnings differentials, standard gravity variables, and Destination GDP (No Fixed Effects).

In [3]:
model1 = smf.ols(
    "log_students ~ log_tuition_diff + log_earnings_diff + log_living_diff + log_dist + comlang_off + colony + log_gdp_dest",
    data=reg_data
).fit(cov_type='HC1')
print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:           log_students   R-squared:                       0.149
Model:                            OLS   Adj. R-squared:                  0.148
Method:                 Least Squares   F-statistic:                     445.6
Date:                Thu, 04 Dec 2025   Prob (F-statistic):               0.00
Time:                        13:55:38   Log-Likelihood:                -33907.
No. Observations:               17759   AIC:                         6.783e+04
Df Residuals:                   17751   BIC:                         6.789e+04
Df Model:                           7                                         
Covariance Type:                  HC1                                         
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.8816      0.30

### Evaluation of Model 1
Model 1 yields an $R^2$ of approximately 0.15, which is low: the model explains about 15% of the variation in log flows. While some gravity variables (Distance, Language, Colony) have the expected signs, the cost and earnings variables may show counter-intuitive results (e.g., a positive tuition coefficient). This suggests that the specification omits important factors.

### Motivation for Model 2: Origin Fixed Effects
Model 1 likely suffers from **Omitted Variable Bias (OVB)**. It fails to account for unobserved characteristics of the origin country, such as:
- Quality of local education system
- Sending capacity (population size, demographics)
- General propensity to study abroad

To address this, we introduce **Origin Fixed Effects** (`C(origin_country_code)`). By controlling for all time-invariant origin factors, we expect a significant increase in explanatory power ($R^2$) and potentially more accurate estimates for our destination-specific variables.
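
To illustrate why origin effects matter, the following simulation builds a regressor that is correlated with an unobserved origin-level term and compares the pooled slope with the within-origin slope. The data are synthetic and unrelated to the project dataset. Demeaning by origin yields the same slope as including `C(origin_country_code)` dummies. All variable names and parameter values here are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_origins, n_per_origin = 20, 100

# Unobserved origin-level factor (e.g. general propensity to study abroad).
origin_effect = rng.normal(size=n_origins)
origin = np.repeat(np.arange(n_origins), n_per_origin)

# The regressor is correlated with the origin effect; the true slope is 1.0.
x = origin_effect[origin] + rng.normal(size=origin.size)
y = 1.0 * x + 3.0 * origin_effect[origin] + rng.normal(size=origin.size)
df = pd.DataFrame({"origin": origin, "x": x, "y": y})

def ols_slope(x: pd.Series, y: pd.Series) -> float:
    """Slope of a simple regression of y on x (with intercept)."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

# Pooled OLS ignores the origin effect and absorbs it into the slope.
pooled_slope = ols_slope(df["x"], df["y"])

# Within estimator: subtract origin means, which removes anything that is
# constant within an origin.
demeaned = df[["x", "y"]] - df.groupby("origin")[["x", "y"]].transform("mean")
fe_slope = ols_slope(demeaned["x"], demeaned["y"])

print(f"pooled slope:        {pooled_slope:.2f}")
print(f"within-origin slope: {fe_slope:.2f}")
```

In this setup the pooled slope is pushed well above the true value of 1.0, while the within-origin slope stays close to it. That is the mechanism by which fixed effects can change both the fit and the coefficients.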

## Model 2: Origin Fixed Effects Model
Same as Model 1 but replacing `log_gdp_dest` with Origin Fixed Effects `C(origin_country_code)`.

In [4]:
model2 = smf.ols(
    "log_students ~ log_tuition_diff + log_earnings_diff + log_living_diff + log_dist + comlang_off + colony + C(origin_country_code)",
    data=reg_data
).fit(cov_type='HC1')
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:           log_students   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.516
Method:                 Least Squares   F-statistic:                     292.8
Date:                Thu, 04 Dec 2025   Prob (F-statistic):               0.00
Time:                        13:55:38   Log-Likelihood:                -28855.
No. Observations:               17759   AIC:                         5.784e+04
Df Residuals:                   17693   BIC:                         5.836e+04
Df Model:                          65                                         
Covariance Type:                  HC1                                         
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

### Evaluation of Model 2
Including Origin Fixed Effects substantially improves the model fit: $R^2$ rises from about 0.15 to about 0.52. This indicates that origin-specific characteristics account for a large share of the variation in student flows.

However, we may still observe puzzling results for the specific cost and earnings variables (e.g., tuition might still be positively associated with flows, or earnings negatively). This could be because students do not view these costs in isolation.

### Motivation for Model 3: The ROI Hypothesis
Instead of separate cost and earnings variables, students might consider the **Return on Investment (ROI)**. They weigh the future benefits (Earnings) against the total costs (Tuition + Living).

We define:
$$ ROI_{j} = \frac{Earnings_j}{Tuition_j + Living_j} $$

Model 3 tests this hypothesis using `log_roi_diff`.
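
Written out, and assuming `log_roi_diff` is the destination-minus-origin difference of log ROI (as in the data preparation code), the variable decomposes as:

$$
\begin{aligned}
\text{log\_roi\_diff}_{ij}
  &= \ln(ROI_j) - \ln(ROI_i) \\
  &= \bigl[\ln(Earnings_j) - \ln(Earnings_i)\bigr]
   - \bigl[\ln(Tuition_j + Living_j) - \ln(Tuition_i + Living_i)\bigr]
\end{aligned}
$$

A single coefficient on `log_roi_diff` therefore forces the earnings differential and the total-cost differential to enter with equal and opposite effects, which is the sense in which Model 3 is restricted. Total cost enters as the log of a sum, so Model 3 is not an exact linear restriction of Model 2, which uses separate `log1p` tuition and living differentials.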

## Model 3: Restricted Gravity Model (ROI)
Using `log_roi_diff` (Earnings/Costs) to test the ROI hypothesis.

In [5]:
model3 = smf.ols(
    "log_students ~ log_roi_diff + log_dist + comlang_off + colony + C(origin_country_code)",
    data=reg_data
).fit(cov_type='HC1')
print(model3.summary())

                            OLS Regression Results                            
Dep. Variable:           log_students   R-squared:                       0.335
Model:                            OLS   Adj. R-squared:                  0.332
Method:                 Least Squares   F-statistic:                     145.2
Date:                Thu, 04 Dec 2025   Prob (F-statistic):               0.00
Time:                        13:55:38   Log-Likelihood:                -31717.
No. Observations:               17759   AIC:                         6.356e+04
Df Residuals:                   17695   BIC:                         6.406e+04
Df Model:                          63                                         
Covariance Type:                  HC1                                         
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

### Evaluation of Model 3
The ROI model attempts to combine economic incentives into a single metric. If this variable is significant and positive, it supports the idea that students act as rational investors. However, if the results are weak or the sign is unexpected, it suggests that our measurement of ROI (or the underlying cost data) might be noisy or that students prioritize other factors (like quality) over pure financial return. The $R^2$ of about 0.34 is well below Model 2's 0.52, so collapsing the three differentials into a single ratio costs considerable fit.

### Motivation for Model 4: Parsimonious Gravity Model
Given the potential measurement errors and endogeneity in the tuition and earnings data (e.g., high tuition often signals high quality, attracting more students despite the cost), we estimate a **Parsimonious Gravity Model**.

In this final specification, we drop the potentially biased cost/earnings variables and focus on the robust structural gravity determinants:
- **Destination GDP** (Economic Opportunity/Quality)
- **Distance**
- **Cultural/Historical Ties**
- **Origin Fixed Effects**

This model serves as our most robust baseline for the structural gravity components of international student flows.

## Model 4: Parsimonious Gravity Model
Using only GDP and gravity variables (dropping the potentially biased cost/earnings variables).

In [6]:
model4 = smf.ols(
    "log_students ~ log_gdp_dest + log_dist + comlang_off + colony + C(origin_country_code)",
    data=reg_data
).fit(cov_type='HC1')
print(model4.summary())

                            OLS Regression Results                            
Dep. Variable:           log_students   R-squared:                       0.412
Model:                            OLS   Adj. R-squared:                  0.410
Method:                 Least Squares   F-statistic:                     208.2
Date:                Thu, 04 Dec 2025   Prob (F-statistic):               0.00
Time:                        13:55:38   Log-Likelihood:                -30618.
No. Observations:               17759   AIC:                         6.136e+04
Df Residuals:                   17695   BIC:                         6.186e+04
Df Model:                          63                                         
Covariance Type:                  HC1                                         
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

### Evaluation of Model 4
Model 4, our Parsimonious Gravity Model, focuses on the structural determinants of student flows.
- **Goodness of Fit**: The $R^2$ is around 0.41. This is lower than Model 2 (0.52), but this is expected as we dropped the tuition and earnings variables.
- **Coefficients**:
    - **Destination GDP**: Positive and significant (+0.74). Economic opportunity is a major pull factor.
    - **Distance**: Surprisingly **positive** (+0.10). In standard gravity models, distance is negative. However, with Origin Fixed Effects, this might capture that students from certain regions prefer further destinations (e.g., Asian students going to US/UK vs nearby).
    - **Cultural Ties**: Common Language and Colonial ties are strongly positive.
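
Because both sides of the regression are in logs, these coefficients read roughly as elasticities of $1 + Students_{ij}$ (the outcome is `log1p` of enrolment). A small sketch of the arithmetic, taking the +0.74 and +0.10 reported above as given. The helper name `pct_change_in_outcome` is hypothetical.

```python
import math

def pct_change_in_outcome(coef: float, regressor_ratio: float) -> float:
    """Proportional change in (1 + students) when a logged regressor is
    multiplied by `regressor_ratio`, holding everything else fixed."""
    return math.exp(coef * math.log(regressor_ratio)) - 1.0

beta_gdp = 0.74    # coefficient on log_gdp_dest reported for Model 4
beta_dist = 0.10   # coefficient on log_dist reported for Model 4

gdp_effect = pct_change_in_outcome(beta_gdp, 1.10)    # 10% higher GDP per capita
dist_effect = pct_change_in_outcome(beta_dist, 2.00)  # twice the distance

print(f"10% higher destination GDP: {gdp_effect:+.1%}")
print(f"Doubling distance:          {dist_effect:+.1%}")
```

Both effects come out at roughly 7%, which puts the magnitudes in perspective: a 10% difference in destination GDP per capita is associated with about the same change in flows as a doubling of distance.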

### Conclusion: Which Model is "Best"?
- **Model 2 (Origin FE)** has the highest explanatory power ($R^2 \approx 0.52$). However, the positive coefficient on Tuition suggests **endogeneity** (tuition proxies for quality). It fits the data best in-sample, which makes it the natural choice for *prediction*, but it is potentially biased for *causal inference* regarding costs.
- **Model 4 (Parsimonious)** has a lower $R^2$ ($\approx 0.41$) but avoids the potentially biased cost variables. It provides cleaner estimates for the structural gravity forces (GDP, Culture).
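
The fit statistics behind this comparison can be collected side by side. The values below are copied from the four regression summaries printed earlier; only the table layout is new.

```python
import pandas as pd

# R-squared, adjusted R-squared, AIC and model degrees of freedom as printed
# in the four regression summaries.
fit = pd.DataFrame(
    {
        "r2": [0.149, 0.518, 0.335, 0.412],
        "adj_r2": [0.148, 0.516, 0.332, 0.410],
        "aic": [6.783e4, 5.784e4, 6.356e4, 6.136e4],
        "df_model": [7, 65, 63, 63],
    },
    index=["Model 1", "Model 2", "Model 3", "Model 4"],
)

# Lower AIC is better; higher R-squared is better.
fit["aic_rank"] = fit["aic"].rank().astype(int)
best_by_aic = fit["aic"].idxmin()

print(fit)
print(f"Lowest AIC: {best_by_aic}")
```

With the fitted results objects still in memory, the same numbers are available as the `rsquared`, `rsquared_adj` and `aic` attributes of each model, which avoids copying values by hand.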

**Recommendation**: We prefer **Model 4** as the structural baseline because it avoids the misleading "positive tuition" effect, even though it explains less total variance.

#### Justification for $R^2 \approx 0.41$
An $R^2$ of about **0.41** is respectable for a parsimonious gravity model of international student flows:

1.  **Cross-Sectional Noise**: Human migration data is inherently noisy and driven by countless unobservable factors. Explaining about **41% of the variation** with a handful of structural variables (GDP, Distance, Language, Colony) plus origin fixed effects is a strong result.
2.  **Trade-off for Robustness**: Model 4 is "Parsimonious" by design. We intentionally removed the tuition, earnings, and living-cost variables because they were potentially endogenous (e.g., high tuition correlating with high quality). By dropping them, we accept a lower $R^2$ in exchange for coefficients that are less exposed to that endogeneity. We are trading fit for interpretability.
3.  **Benchmark Standard**: In the gravity model literature, $R^2$ values often range between 0.30 and 0.60 depending on the granularity of the data. A value of 0.41 places this model within the acceptable range for a structural baseline.
4.  **It Captures the "Gravity"**: Reaching 0.41 with only economic size (GDP), friction (Distance/Culture), and origin effects is consistent with a gravity-type structure in student flows. The caveat is the positive distance coefficient noted above, which means the distance channel does not behave as the standard gravity model predicts.


## Personalized Estimation: Italy (Data Science)

We now estimate student flows for a specific user profile:
- **Origin**: Italy
- **Major**: Master's in Data Science
- **Tuition (Origin)**: €2,800 (Average Public University)
- **Earnings (Origin)**: €29,000 (Entry Level Gross)
- **Living Costs (Origin)**: €12,000 (Student Average)

We compare the predictions of our four models for flows to top destinations, using these "Data Science" origin values together with the general destination averages from our dataset.

In [7]:
# User Profile Data (Italy - Data Science)
user_origin = "ITA"
user_tuition = 2800
user_earnings = 29000
user_living = 12000
user_total_cost = user_tuition + user_living
user_roi = user_earnings / user_total_cost

# Select Top Destinations (Expanded List)
# Added: ESP, NLD, DNK, NOR, SWE, CHE, CHN, JPN, KOR, ARE, ITA
destinations = [
    "USA", "GBR", "DEU", "FRA", "CAN", 
    "ESP", "NLD", "DNK", "NOR", "SWE", "CHE", 
    "CHN", "JPN", "KOR", "ARE", "ITA"
]

# --- FETCH DATA FROM RAW FACT TABLE ---
# We use 'fact' instead of 'reg_data' because 'reg_data' dropped rows with missing values
# We filter for ITA origin and the selected destinations
personal_data = fact[
    (fact["origin_country_code"] == user_origin) &
    (fact["destination_country_code"].isin(destinations))
].copy()

# Filter for the latest year available for each destination
if "year" in personal_data.columns:
    personal_data = personal_data.sort_values("year", ascending=False).drop_duplicates(subset=["destination_country_code"])

# Ensure all structural variables are present (fill from reg_data logic if needed)
# The raw 'fact' table has 'gdp_pc_dest', 'dist', 'comlang_off', 'colony'
# We need to create the log variables

# --- INJECT MISSING DATA (WEB SCRAPED) ---
# Dictionary of missing data
# Format: Country Code: {col: value}
missing_data_map = {
    "CHN": {
        "cost_tuition_dest": 6500,   # ~Avg for Master's
        "earnings_dest": 53000,      # ~Entry Level Data Scientist
        "cost_living_dest": 12000    # ~Major city student living
    },
    "KOR": {
        "cost_tuition_dest": 13000,  # ~Avg Private Uni
        "earnings_dest": 67000,      # ~Entry Level
        "cost_living_dest": 9000     # ~Student avg
    },
    "ARE": {
        "cost_tuition_dest": 25000,  # ~Avg International Uni
        "cost_living_dest": 18000    # ~Mid-range student living
        # Earnings already in dataset (~73k)
    },
    "ITA": {
        "dist": 1,              # Hardcoded small distance for internal flow
        "cost_tuition_dest": user_tuition,
        "earnings_dest": user_earnings,
        "cost_living_dest": user_living
    }
}

# Apply the manual data
for country, data in missing_data_map.items():
    mask = personal_data["destination_country_code"] == country
    if mask.any():
        for col, val in data.items():
            personal_data.loc[mask, col] = val
    else:
        print(f"Warning: {country} not found in raw fact rows.")

# -----------------------------------------

# Now calculate the regression variables
# Dependent Variable (placeholder, not needed for prediction but good for consistency)
personal_data["log_students"] = np.log1p(personal_data["students_enrolled"])

# Independent Variables
personal_data["log_earnings_diff"] = np.log(personal_data["earnings_dest"]) - np.log(user_earnings)
personal_data["log_tuition_diff"] = np.log1p(personal_data["cost_tuition_dest"]) - np.log1p(user_tuition)
personal_data["log_living_diff"] = np.log1p(personal_data["cost_living_dest"]) - np.log1p(user_living)
personal_data["log_dist"] = np.log(personal_data["dist"])
personal_data["log_gdp_dest"] = np.log(personal_data["gdp_pc_dest"])

# Restricted Model Variables
personal_data["total_cost_dest"] = personal_data["cost_tuition_dest"] + personal_data["cost_living_dest"]
personal_data["roi_dest"] = personal_data["earnings_dest"] / personal_data["total_cost_dest"]
personal_data["log_roi_diff"] = np.log(personal_data["roi_dest"]) - np.log(user_roi)

# Rows may still have NaNs in some regressors. We keep them so that each model
# can predict wherever its own features are available:
#   - Model 4 only needs GDP, distance, and the cultural dummies.
#   - Models 1-3 also need the cost and earnings variables.

# Run Predictions
# If a row has NaN in a model's features, predict() returns NaN for that row.

personal_data["pred_m1"] = model1.predict(personal_data)
personal_data["pred_m2"] = model2.predict(personal_data)
personal_data["pred_m3"] = model3.predict(personal_data)
personal_data["pred_m4"] = model4.predict(personal_data)

# Convert log predictions back to student counts (exp - 1)
results = personal_data[["destination_country_code"]].copy()

for col in ["pred_m1", "pred_m2", "pred_m3", "pred_m4"]:
    # Calculate raw counts
    results[col + "_count"] = np.expm1(personal_data[col])
    
    # Calculate Probabilities (Weights)
    # The probability of choosing destination j is Count_j / Sum(Counts)
    # We ignore NaNs in the sum
    total_flow = results[col + "_count"].sum()
    results[col + "_prob"] = results[col + "_count"] / total_flow

# Save Results to CSV for Visualization
# FIX: Use ../Datasets because notebook runs in scripts/
results.to_csv("../Datasets/personalized_predictions.csv", index=False)
print("Results saved to ../Datasets/personalized_predictions.csv")

# Display Results as Probabilities
print("Estimated Probability of Choosing Each Destination (Weights):")
cols_prob = ["destination_country_code", "pred_m1_prob", "pred_m2_prob", "pred_m3_prob", "pred_m4_prob"]
display(results[cols_prob].set_index("destination_country_code").style.format({
    "pred_m1_prob": "{:.1%}",
    "pred_m2_prob": "{:.1%}",
    "pred_m3_prob": "{:.1%}",
    "pred_m4_prob": "{:.1%}"
}).background_gradient(cmap="Blues", axis=0))

Results saved to ../Datasets/personalized_predictions.csv
Estimated Probability of Choosing Each Destination (Weights):


| destination_country_code | pred_m1_prob | pred_m2_prob | pred_m3_prob | pred_m4_prob |
|---|---|---|---|---|
| USA | 9.2% | 15.8% | 6.6% | 8.4% |
| FRA | 6.0% | 5.7% | 6.1% | 5.5% |
| ARE | 8.2% | 11.1% | 7.5% | 7.6% |
| CAN | 6.8% | 8.0% | 6.2% | 7.0% |
| CHE | 8.8% | 10.8% | 6.1% | 7.2% |
| CHN | 4.5% | 5.9% | 5.9% | 3.5% |
| DNK | 5.6% | 3.1% | 5.0% | 7.0% |
| ESP | 5.1% | 4.1% | 5.7% | 5.1% |
| DEU | 5.2% | 3.0% | 5.2% | 6.2% |
| GBR | 6.4% | 7.3% | 7.0% | 5.5% |


### Analysis of Personalized Results

The table above shows the **estimated probability** (weight) of a student choosing each destination, computed as that destination's predicted flow divided by the total predicted flow across the selected destinations.

- **Interpretation**: A value of 20% for USA would mean that, within the set of destinations considered, the model predicts a 20% chance that the student chooses the USA. For example, Model 2 assigns the USA 15.8%.
- **"Staying" (ITA)**: The probability for ITA represents the likelihood of staying in Italy (internal flow) versus going to one of the other international destinations.
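
The conversion from log predictions to these weights is a back-transform followed by a normalisation. A minimal sketch with made-up log predictions (the destination labels and numbers are hypothetical, not model output):

```python
import numpy as np
import pandas as pd

# Hypothetical predicted values of log(1 + students) for four destinations;
# NaN marks a destination with a missing regressor.
log_pred = pd.Series({"AAA": 6.0, "BBB": 5.0, "CCC": np.nan, "DDD": 4.0})

# Undo log1p: expm1(x) = exp(x) - 1 gives predicted student counts.
counts = np.expm1(log_pred)

# Share of each destination in the total predicted flow.
# Series.sum() skips NaN, so a missing prediction stays NaN and does not
# distort the other shares.
shares = counts / counts.sum()

print(shares.round(3))
```

The shares depend on which destinations are in the list: adding or removing a country rescales every other weight, so they are best read as relative rankings within the chosen set.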
