A machine learning system that predicts whether newly-arrived refugee families will struggle to afford monthly living expenses, helping resettlement caseworkers identify families who need proactive support.
- Author: Abbot Tubeine, Sattler College
- Course: BUS302 Advances in Data
- Dataset: 2022 Annual Survey of Refugees (ASR), distributed via ICPSR

| File | Purpose |
|---|---|
| `refugee_model.ipynb` | Training notebook: data cleaning, feature selection, model training, optimization, model export |
| `app.py` | Streamlit caseworker dashboard: intake form, risk prediction, explanations, recommendations |
| `requirements.txt` | Python dependencies |
| `asr_2022_data.dta` | Source data (download from ICPSR; rename from `2022 ASR_Public_Use_File.dta`) |
| `refugee_model.pkl` | Trained model bundle (generated by running the notebook) |

- Get the 2022 ASR data:
  - Visit ICPSR project E207021V1
  - Create a free account and download the zip
  - Extract `2022 ASR_Public_Use_File.dta`
  - Rename it to `asr_2022_data.dta`
- Open `refugee_model.ipynb` in Jupyter or Google Colab.
- Run all cells top to bottom. The final cells (Section 14) will save `refugee_model.pkl`, the bundle the Streamlit app needs.
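
The exact contents of `refugee_model.pkl` are defined in the notebook and are not documented here. As a rough sketch only, assuming the bundle is a plain dictionary holding the fitted model, the feature order, and the tuned threshold (all three keys and the feature names below are hypothetical), saving and reloading it could look like this:

```python
import pickle
import tempfile
from pathlib import Path


def save_bundle(bundle: dict, path: Path) -> None:
    """Write the bundle to disk with pickle."""
    with path.open("wb") as f:
        pickle.dump(bundle, f)


def load_bundle(path: Path) -> dict:
    """Read a bundle back. Only unpickle files you created yourself."""
    with path.open("rb") as f:
        return pickle.load(f)


# Hypothetical bundle layout. In the real notebook "model" would be the
# fitted estimator; a plain dict stands in for it here.
example_bundle = {
    "model": {"placeholder": True},
    "feature_names": ["household_size", "age", "years_in_camp"],
    "threshold": 0.35,
}

# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "refugee_model.pkl"
    save_bundle(example_bundle, bundle_path)
    restored = load_bundle(bundle_path)

print(restored["feature_names"], restored["threshold"])
```

Keeping the feature order and threshold next to the model means the dashboard does not have to hard-code either, which is one common reason to ship a bundle instead of a bare estimator.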
To run the dashboard locally:

    pip install -r requirements.txt
    streamlit run app.py

Then open the URL shown (typically http://localhost:8501) in your browser.
The fastest way to share the app:
- Push this directory to a GitHub repo (don't push `asr_2022_data.dta`; it's licensed)
- Go to share.streamlit.io
- Sign in with GitHub and click "New app"
- Point it at your repo and select `app.py` as the main file
- Click Deploy
The app will be live at a public URL within a couple of minutes.
The model uses only intake-collectable features — things a caseworker can know during the first conversation with a newly-arrived refugee. It does NOT use post-arrival information like current employment status, current English level, or current income, even though those would improve raw accuracy.
- Demographics: household size, sex, age
- Background: lived in refugee camp, years in camp, marital status
- Skills: education before U.S. arrival, native language literacy, English on arrival
- Pre-arrival history: work status in home country (Employed / Self-employed / Not working / Other)
- Placement: region of resettlement, year of arrival
- Excluded: country of birth, for fairness reasons (protected demographic; signal is largely captured by other features)
- Excluded: all post-arrival outcomes, which would not be available at intake time
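
One way to enforce the intake-only rule in code is to keep an explicit allow-list and build model inputs from it, so excluded fields cannot slip in. The sketch below assumes one record per family stored as a dictionary; every column name is a hypothetical stand-in, since the real ASR variable names are not listed here.

```python
# Hypothetical column names; the real ASR variable names differ.
INTAKE_FEATURES = [
    "household_size", "sex", "age",
    "lived_in_camp", "years_in_camp", "marital_status",
    "education_pre_arrival", "native_literacy", "english_on_arrival",
    "home_country_work_status",
    "resettlement_region", "arrival_year",
]

# Fields that must never reach the model (also hypothetical names).
EXCLUDED = {
    "country_of_birth",
    "current_employment", "current_english", "current_income",
}

# Guard against accidentally listing an excluded field as a feature.
assert not EXCLUDED.intersection(INTAKE_FEATURES)


def select_intake_features(record: dict) -> dict:
    """Keep only intake-collectable fields, in a fixed order."""
    missing = [name for name in INTAKE_FEATURES if name not in record]
    if missing:
        raise KeyError(f"missing intake fields: {missing}")
    return {name: record[name] for name in INTAKE_FEATURES}


# Example record that also carries fields the model must not see.
raw_record = {name: None for name in INTAKE_FEATURES}
raw_record.update({"household_size": 5, "age": 34, "arrival_year": 2021})
raw_record.update({"country_of_birth": "X", "current_income": 1800})

model_input = select_intake_features(raw_record)
print(sorted(set(raw_record) - set(model_input)))
```

An allow-list is used here instead of dropping known-bad columns because any new post-arrival variable added to the data later is then excluded by default.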
- Test F1 score: ~0.30–0.40 (after optimization)
- Test ROC-AUC: ~0.70–0.75
- Threshold tuned for F1 maximization given the 87/13 class imbalance
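
The notebook's own tuning code is not shown here. As an illustration of what F1-based threshold tuning generally involves, the following pure-Python sketch tries each distinct predicted score as a cutoff and keeps the one with the highest F1. The function names and the toy labels and scores are made up for the example.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when scores >= threshold are flagged."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0  # no true positives flagged, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, scores):
    """Try each distinct score as a cutoff and return (threshold, f1)."""
    candidates = sorted(set(scores))
    return max(
        ((t, f1_at_threshold(y_true, scores, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Toy validation data: 1 = could not afford monthly expenses.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.60, 0.15]

threshold, f1 = best_f1_threshold(y_true, scores)
print(f"best threshold = {threshold}, F1 = {f1:.2f}")
```

The cutoff is normally chosen on a validation split and then applied unchanged to the test set, so that the reported test F1 is not inflated by the tuning itself. For context, at a 13% positive rate, flagging every family gives precision 0.13 and recall 1.0, so F1 = 2(0.13)(1.0)/(1.13) ≈ 0.23.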
This is a decision-support tool, not a replacement for caseworker judgment:
- Small sample size — ~1,400 usable rows after cleaning
- Severe class imbalance — only ~13% of refugees in the data couldn't afford monthly expenses
- Region-level geography only — public-use file suppresses cities, so we can't capture city-level economic variation
- Single survey year (2022) — outcomes may shift across years
- Self-reported target — "can pay expenses" is subjective
- Survivorship bias — only refugees who stayed at their resettlement address were surveyed
All flagged cases should be reviewed by a qualified caseworker. The model is meant to surface families who may benefit from extra attention, not to make resource allocation decisions on its own.
Urban Institute. 2022 Annual Survey of Refugees. Inter-university Consortium for Political and Social Research [distributor], 2024-09-20. DOI: 10.3886/E207021V1