# UT-ECE Data Science Final - Complete Solution Notebook

This notebook covers all parts (Q1-Q6) of the Global Tech Talent Migration assessment with executable code and concise written answers.

In [None]:
from pathlib import Path
import sys

HERE = Path.cwd().resolve()
CODE_ROOT = None
for candidate in [HERE, HERE.parent, HERE.parent.parent]:
    if (candidate / 'scripts' / 'full_solution_pipeline.py').exists():
        CODE_ROOT = candidate
        break
    if (candidate / 'code' / 'scripts' / 'full_solution_pipeline.py').exists():
        CODE_ROOT = candidate / 'code'
        break

if CODE_ROOT is None:
    raise FileNotFoundError('Could not locate project code root. Run notebook from repo root or code/notebooks.')

sys.path.insert(0, str(CODE_ROOT / 'scripts'))

from full_solution_pipeline import (
    load_dataset,
    leakage_diagnostics,
    simulate_optimizers,
    plot_ravine_paths,
    run_q4_svm_and_pruning,
    run_q5_unsupervised,
    run_q6_capstone,
    run_all,
)

DATA_PATH = CODE_ROOT / 'data' / 'GlobalTechTalent_50k.csv'
FIG_DIR = CODE_ROOT / 'figures'
SOL_DIR = CODE_ROOT / 'solutions'

print('CODE_ROOT:', CODE_ROOT)
print('DATA_PATH exists:', DATA_PATH.exists())

In [None]:
df = load_dataset(DATA_PATH)
print('Shape:', df.shape)
print('Columns:', list(df.columns))
df.head(3)

## Q1 - Data Engineering and Leakage

**Q1A SQL answer:**
```sql
WITH citation_velocity AS (
    SELECT UserID, Country_Origin, Year, Research_Citations,
           AVG(Research_Citations) OVER (
               PARTITION BY Country_Origin
               ORDER BY Year
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg_citations
    FROM Professionals_Data
)
SELECT *, DENSE_RANK() OVER (
    PARTITION BY Country_Origin ORDER BY moving_avg_citations DESC
) AS country_rank
FROM citation_velocity;
```
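
One way to sanity-check the query is to run it against an in-memory SQLite database using Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer). This is only a sketch: the table schema is assumed from the column names in the query, and the toy rows are made up for illustration, not taken from the dataset.

```python
import sqlite3

QUERY = """
WITH citation_velocity AS (
    SELECT UserID, Country_Origin, Year, Research_Citations,
           AVG(Research_Citations) OVER (
               PARTITION BY Country_Origin
               ORDER BY Year
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg_citations
    FROM Professionals_Data
)
SELECT *, DENSE_RANK() OVER (
    PARTITION BY Country_Origin ORDER BY moving_avg_citations DESC
) AS country_rank
FROM citation_velocity;
"""

# Illustrative rows only: (UserID, Country_Origin, Year, Research_Citations)
TOY_ROWS = [
    (1, "IR", 2019, 10), (2, "IR", 2020, 20),
    (3, "IR", 2021, 30), (4, "IR", 2022, 40),
    (5, "DE", 2020, 5), (6, "DE", 2021, 15),
]


def run_citation_query(rows):
    """Load rows into a throwaway SQLite table and run the Q1A query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Professionals_Data ("
            "UserID INTEGER, Country_Origin TEXT, "
            "Year INTEGER, Research_Citations REAL)"
        )
        conn.executemany("INSERT INTO Professionals_Data VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Index results by UserID; columns 4 and 5 are the moving average and the rank.
results = {row[0]: row for row in run_citation_query(TOY_ROWS)}
for user_id in sorted(results):
    print(results[user_id])
```

With one row per country and year, the window covers the current year and the two preceding ones. If several professionals share a country and year, the `ROWS` frame counts individual rows rather than years, which is worth keeping in mind when interpreting the average.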

**Q1B answer:** `Visa_Approval_Date` is direct leakage: it is recorded only after the outcome is known, so it is unavailable at prediction time. `Last_Login_Region` and `Passport_Renewal_Status` are temporally leaky if they were collected after the migration decision.
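
A simple leakage heuristic, sketched below, is to check how often "the feature is present" coincides with the positive label. A post-outcome field such as an approval date tends to be filled in exactly when the outcome is positive. The target column name `Migration_Status` and the toy rows are hypothetical, and this is not necessarily what `leakage_diagnostics` does.

```python
def presence_label_agreement(rows, feature, target):
    """Fraction of rows where 'feature is not None' equals the binary target."""
    agree = sum((row[feature] is not None) == bool(row[target]) for row in rows)
    return agree / len(rows)


# Made-up records for illustration; `Migration_Status` is an assumed label name.
toy_rows = [
    {"Visa_Approval_Date": "2023-04-01", "Last_Login_Region": "EU", "Migration_Status": 1},
    {"Visa_Approval_Date": None, "Last_Login_Region": "ME", "Migration_Status": 0},
    {"Visa_Approval_Date": "2022-11-15", "Last_Login_Region": None, "Migration_Status": 1},
    {"Visa_Approval_Date": None, "Last_Login_Region": "ME", "Migration_Status": 0},
]

visa_score = presence_label_agreement(toy_rows, "Visa_Approval_Date", "Migration_Status")
login_score = presence_label_agreement(toy_rows, "Last_Login_Region", "Migration_Status")
print("Visa_Approval_Date agreement:", visa_score)
print("Last_Login_Region agreement:", login_score)
```

An agreement near 1.0 means the missingness pattern alone reveals the label, which is a strong hint that the column should be dropped before training.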

In [None]:
q1_diag = leakage_diagnostics(df)
q1_diag

## Q2 - Statistical Inference and Elastic Net

For

$J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda_1\sum_j |\theta_j| + \frac{\lambda_2}{2}\sum_j \theta_j^2$,

the coordinate gradient is

$\nabla_{\theta_j}J = \frac{1}{m}\sum_i(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \lambda_1\partial|\theta_j| + \lambda_2\theta_j$

with subgradient $\partial|\theta_j| \in [-1,1]$ at $\theta_j=0$.

Given $\beta=0.52$, $p=0.003$, CI $[0.18,0.86]$: reject $H_0: \beta=0$; feature is statistically significant and positively associated with migration propensity.
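
Both results can be checked numerically with a short sketch. It assumes a linear hypothesis $h_\theta(x)=\theta^\top x$, picks the subgradient value 0 at $\theta_j=0$ (any value in $[-1,1]$ is valid there), and treats the reported interval as a 95% Wald interval under a normal approximation. None of these assumptions is stated in the problem, and the function names are illustrative.

```python
import math


def elastic_net_coord_subgradient(theta, X, y, j, lam1, lam2):
    """One valid subgradient of the Elastic Net cost w.r.t. theta[j] (linear model)."""
    m = len(y)
    residuals = [sum(t * x for t, x in zip(theta, row)) - yi for row, yi in zip(X, y)]
    data_term = sum(r * row[j] for r, row in zip(residuals, X)) / m
    # sign(theta_j) away from zero; 0 is our chosen subgradient at theta_j == 0.
    sign = (theta[j] > 0) - (theta[j] < 0)
    return data_term + lam1 * sign + lam2 * theta[j]


def wald_p_value(beta, ci_low, ci_high, z_crit=1.959964):
    """Recover the standard error from a (assumed 95%) CI and return (se, z, p)."""
    se = (ci_high - ci_low) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


# Tiny made-up design matrix to exercise the gradient formula.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
theta = [1.0, 0.0]
g0 = elastic_net_coord_subgradient(theta, X, y, 0, lam1=0.1, lam2=0.5)
g1 = elastic_net_coord_subgradient(theta, X, y, 1, lam1=0.1, lam2=0.5)

se, z, p = wald_p_value(0.52, 0.18, 0.86)
print("subgradients:", g0, g1)
print("se, z, p:", se, z, p)
```

Under these assumptions the recovered p-value rounds to the reported 0.003, so the coefficient, interval, and p-value are mutually consistent.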

## Q3 - Optimization (Ravine)

Momentum damps zig-zag oscillations in steep directions and accelerates along consistent gradients; Adam additionally adapts per-parameter step sizes using first and second moments.
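
The effect can be illustrated on a toy ravine, the quadratic $f(x,y)=\tfrac12(x^2+50y^2)$, which is steep in $y$ and shallow in $x$. This is a minimal sketch with hand-picked hyperparameters. It is not the code behind `simulate_optimizers`, and it covers plain gradient descent and heavy-ball momentum only, not Adam.

```python
A, B = 1.0, 50.0  # curvatures: shallow along x, steep along y (assumed toy values)


def loss(p):
    return 0.5 * (A * p[0] ** 2 + B * p[1] ** 2)


def grad(p):
    return (A * p[0], B * p[1])


def run_gd(start, lr, steps):
    """Plain gradient descent."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad((x, y))
        x, y = x - lr * gx, y - lr * gy
    return (x, y)


def run_momentum(start, lr, beta, steps):
    """Heavy-ball momentum: velocity is a decaying sum of past gradient steps."""
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad((x, y))
        vx = beta * vx - lr * gx
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy
    return (x, y)


START = (10.0, 1.0)
gd_end = run_gd(START, lr=0.035, steps=100)
mom_end = run_momentum(START, lr=0.035, beta=0.9, steps=100)
print("GD loss:      ", loss(gd_end))
print("Momentum loss:", loss(mom_end))
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one. In the velocity update, gradients that flip sign along $y$ partly cancel while the consistent gradients along $x$ accumulate, which is why momentum ends at a lower loss after the same number of steps in this setup.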

In [None]:
paths = simulate_optimizers()
q3_info = plot_ravine_paths(paths, FIG_DIR / 'q3_ravine_optimizers.png')
q3_info

## Q4 - Non-Linear Models

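For RBF-SVM overfitting, decrease `gamma`: a smaller `gamma` widens each support vector's region of influence and smooths the decision boundary. For cost-complexity pruning, a larger `alpha` (`ccp_alpha` in scikit-learn) penalizes leaf count more strongly, which yields a smaller tree with higher bias and lower variance.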
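
Both knobs can be illustrated without training a model. The sketch below evaluates the RBF kernel $k(x,z)=\exp(-\gamma\lVert x-z\rVert^2)$ for two values of `gamma`, then picks the subtree minimizing the cost-complexity objective $R_\alpha(T)=R(T)+\alpha|T|$ from a hypothetical list of (leaf count, training error) pairs. The candidate subtrees are made-up numbers, not results from the dataset.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity between two points; decays faster for larger gamma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def best_subtree(candidates, alpha):
    """Return the (n_leaves, error) pair minimizing error + alpha * n_leaves."""
    return min(candidates, key=lambda c: c[1] + alpha * c[0])


# Same pair of points, two kernel widths.
wide = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.1)
narrow = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=10.0)
print("gamma=0.1:", wide, " gamma=10:", narrow)

# Hypothetical nested subtrees: (number of leaves, training error).
CANDIDATES = [(1, 0.40), (3, 0.25), (8, 0.15), (20, 0.10)]
chosen_leaves = [best_subtree(CANDIDATES, a)[0] for a in (0.0, 0.01, 0.05, 0.1)]
print("leaves chosen as alpha grows:", chosen_leaves)
```

With a small `gamma`, a point one unit away in each coordinate still has substantial influence, while with a large `gamma` its influence is essentially zero, which lets the boundary wrap around individual training points. In the pruning half, the selected leaf count shrinks as `alpha` grows.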

In [None]:
q4_info = run_q4_svm_and_pruning(df, FIG_DIR)
q4_info

## Q5 - Unsupervised Learning

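Eigenvalues of the covariance matrix are the variances along the principal directions; the explained variance ratio for PC_k is $\lambda_k / \sum_i \lambda_i$. The elbow method is needed because the optimal WCSS never increases with K (it reaches 0 when K equals the number of points), so minimizing WCSS alone cannot select K; the elbow marks the K beyond which the marginal reduction becomes small.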
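
The two formulas reduce to a few lines of arithmetic. In this sketch the eigenvalues and WCSS values are invented for illustration. In practice they would come from the covariance matrix and from K-means runs on the data.

```python
def explained_variance_ratio(eigenvalues):
    """lambda_k / sum_i lambda_i for each principal component."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]


def marginal_wcss_drops(wcss):
    """Reduction in WCSS gained by moving from K to K+1 clusters."""
    return [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]


# Hypothetical covariance eigenvalues, sorted in decreasing order.
ratios = explained_variance_ratio([4.0, 2.0, 1.0, 1.0])
print("explained variance ratios:", ratios)

# Hypothetical WCSS for K = 1..5.
drops = marginal_wcss_drops([100, 40, 15, 12, 10])
print("marginal WCSS drops:", drops)
```

In this made-up curve the drop shrinks sharply after K = 3, which is where an elbow plot would bend.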

In [None]:
q5_info = run_q5_unsupervised(df, FIG_DIR)
q5_info

## Q6 - Capstone Explainability (SHAP)

`base_value` is the expected model output over background data. `output_value` is the candidate-specific output. Their difference equals the sum of local SHAP contributions.
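
The additivity property can be verified by brute force on a tiny model. The sketch below computes exact Shapley values by enumerating all feature subsets, with "missing" features filled in from background rows. It illustrates the definition and is not the algorithm the `shap` library uses; the model and data are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values for instance x; returns (base_value, phi list)."""
    n = len(x)

    def value(subset):
        # Expected output when features in `subset` are fixed to x
        # and the remaining features are drawn from the background rows.
        total = 0.0
        for b in background:
            z = [x[i] if i in subset else b[i] for i in range(n)]
            total += model(z)
        return total / len(background)

    phi = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {j}) - value(s))
        phi.append(contrib)
    return value(set()), phi


def toy_model(z):
    """Made-up model with one additive term and one interaction."""
    return 2.0 * z[0] + z[1] * z[2]


background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
candidate = [3.0, 2.0, 4.0]

base_value, phi = shapley_values(toy_model, candidate, background)
output_value = toy_model(candidate)
print("base_value:", base_value)
print("SHAP values:", phi)
print("output_value - base_value:", output_value - base_value)
```

The printed difference matches the sum of the per-feature contributions up to floating-point error, which is the relationship a force plot visualizes.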

In [None]:
q6_info = run_q6_capstone(df, FIG_DIR, SOL_DIR)
q6_info

## Full Run

Run the complete pipeline and regenerate all outputs (SQL file, figures, Markdown answer key, and summary JSON).

In [None]:
summary = run_all(DATA_PATH, FIG_DIR, SOL_DIR)
summary['answer_key_path'], summary['sql_path']

### Deliverables produced by this notebook/script

- `code/solutions/q1_moving_average.sql`
- `code/solutions/complete_solution_key.md`
- `code/solutions/run_summary.json`
- `code/figures/q3_ravine_optimizers.png`
- `code/figures/q4_svm_gamma_sweep.png`
- `code/figures/q4_tree_pruning_curve.png`
- `code/figures/q5_kmeans_elbow.png`
- `code/figures/q6_shap_force_plot.png`
- `code/figures/q6_shap_summary.png`