## Ukraine Longitudinal Survey

_WIP - NOT FOR DISTRIBUTION_

_Proof-of-concept data structure: Bellingcat OSINT [Civilian Harm in Ukraine](https://ukraine.bellingcat.com/) geocoded event-level data $\rightarrow$ Ukraine Longitudinal Survey (ULS) cross-sectional survey data 1:$n$ merge._
> `uls_scratchpad.ipynb`<br>
> Simone J. Skeen x Claude Code CLI (01-11-2026)


### 1. Prepare
_Imports requisite packages; customizes outputs._

> **Dependencies:** Install via `pip install -r requirements.txt` from project root before running.

In [1]:
# ============================================================================
# IMPORTS
# ----------------------------------------------------------------------------
# Load requisite packages; configure display settings
# ============================================================================

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings

from geopy.geocoders import Nominatim
from pathlib import Path
from time import sleep

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

pd.options.mode.copy_on_write = True

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

for c in (FutureWarning, UserWarning):
    warnings.simplefilter(action='ignore', category=c)

In [2]:
# ============================================================================
# CONFIGURATION
# ----------------------------------------------------------------------------
# All user-adjustable parameters in one place for replicability
# ============================================================================

# Project paths
PROJECT_ROOT = Path.home() / 'anaconda_projects' / 'ukraine_longitudinal_survey'
SUBDIRS = [
    'data', 
    'figures', 
    'tables', 
    'temp',
    ]
FONT_FILE = 'Arial.ttf'

# Bellingcat data source
# API (for future use): https://bellingcat-embeds.ams3.cdn.digitaloceanspaces.com/production/ukr/timemap/api.json
BELLINGCAT_CSV = 'ukr-civharm-2026-01-09.csv'

# ULS merge parameters
ULS_START_DATE = '2025-04-08'  # earliest observation in ULS child survey
DATE_FORMAT_INPUT = '%m/%d/%Y'  # format in source data (if CSV fallback)
DATE_FORMAT_ISO = '%Y-%m-%d'    # ISO 8601 for internal use

# Geocoding
NOMINATIM_USER_AGENT = 'ukraine_postcode_geocoder'
NOMINATIM_DELAY_SEC = 1  # rate limit compliance

In [3]:
# ============================================================================
# SET WORKING DIRECTORY
# ----------------------------------------------------------------------------
# Use pathlib for cross-environment replicability
# ============================================================================

import os
os.chdir(PROJECT_ROOT)
print(f"Working directory: {Path.cwd()}")

Working directory: /Users/sskeen/anaconda_projects/ukraine_longitudinal_survey


In [4]:
# ============================================================================
# CREATE SUBDIRECTORIES
# ----------------------------------------------------------------------------
# Initialize project folder structure
# ============================================================================

for subdir in SUBDIRS:
    (PROJECT_ROOT / subdir).mkdir(exist_ok=True)

print("Created subdirectories:", SUBDIRS)

Created subdirectories: ['data', 'figures', 'tables', 'temp']


In [5]:
# ============================================================================
# INSTALL ARIAL FONT
# ----------------------------------------------------------------------------
# Register custom font for matplotlib figures
# ============================================================================

fm.fontManager.addfont(str(PROJECT_ROOT / FONT_FILE))
plt.rcParams['font.family'] = 'Arial'

### 2. Import / explore: Bellingcat OSINT Civilian Harm in Ukraine
_Imports, cleans, describes level-2 aggregate conflict data. Acquired via CSV export: https://ukraine.bellingcat.com/._

In [6]:
# ============================================================================
# IMPORT BELLINGCAT CIVILIAN HARM DATA
# ----------------------------------------------------------------------------
# Source: https://ukraine.bellingcat.com/ (manual CSV export)
# Docs: https://github.com/bellingcat/ukraine-timemap
# ============================================================================

d_lvl2 = pd.read_csv(PROJECT_ROOT / 'data' / BELLINGCAT_CSV)

# Add ascending numerical index
d_lvl2['index'] = range(len(d_lvl2))
d_lvl2 = d_lvl2.set_index('index')

# Drop imprecise location column
d_lvl2 = d_lvl2.drop('location', axis=1, errors='ignore')

# Filter to observations on or before ULS start date
# TODO: Convert to Stata datetime and confirm before production merge
d_lvl2['date'] = pd.to_datetime(
    d_lvl2['date'], 
    format=DATE_FORMAT_INPUT,
    errors='coerce',
)

uls_startdate = pd.to_datetime(ULS_START_DATE)
d_lvl2 = d_lvl2[d_lvl2['date'] <= uls_startdate]

# Inspect
d_lvl2.shape
d_lvl2.info()
d_lvl2.head(2)
d_lvl2.tail(2)

(2446, 7)

<class 'pandas.core.frame.DataFrame'>
Index: 2446 entries, 0 to 2445
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            2446 non-null   object        
 1   date          2446 non-null   datetime64[ns]
 2   latitude      2444 non-null   float64       
 3   longitude     2444 non-null   float64       
 4   description   2446 non-null   object        
 5   sources       2445 non-null   object        
 6   associations  2405 non-null   object        
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 152.9+ KB


Unnamed: 0_level_0,id,date,latitude,longitude,description,sources,associations
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,CIV0098,2022-02-24,49.212119,38.905921,"Individual injured by shelling, ambulance resp...",https://www.facebook.com/story.php?story_fbid=...,"Type of area affected=Residential,Weapon Syste..."
1,CIV0013,2022-02-24,48.055395,37.7783,Apparent strike on hospital in separatist held...,https://twitter.com/City_Donetsk/status/149687...,"Type of area affected=Healthcare,Weapon System..."


Unnamed: 0_level_0,id,date,latitude,longitude,description,sources,associations
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2444,S23173,2025-04-07,48.50781,37.747111,At least one person killed and two injured inc...,"https://t.me/astrapress/78387,https://t.me/ast...","Type of area affected=Residential,Type of area..."
2445,W4DR1R,2025-04-08,50.775975,35.251465,Heavily damaged residential buildings followin...,"https://t.me/suspilnesumy/32400,https://t.me/s...",Type of area affected=Residential


In [7]:
# ============================================================================
# DUMMY CODE: AREA TYPE AFFECTED
# ----------------------------------------------------------------------------
# Create binary indicators for area types
# ============================================================================

# Residential areas
d_lvl2['afct_residential'] = d_lvl2['associations'].str.contains(
    r'Type of area affected=Residential',
    case=False,
    na=False,
    regex=True,
).astype(int)

# Schools / childcare facilities
d_lvl2['afct_school'] = d_lvl2['associations'].str.contains(
    r'Type of area affected=School or childcare',
    case=False,
    na=False,
    regex=True,
).astype(int)

# Verify coding
print("Area type affected counts:")
print(f"  Residential: {d_lvl2['afct_residential'].sum()}")
print(f"  School/childcare: {d_lvl2['afct_school'].sum()}")
print(f"\nSample rows with afct_school=1:")
d_lvl2[d_lvl2['afct_school'] == 1][['id', 'associations', 'afct_residential', 'afct_school']].head(3)

Area type affected counts:
  Residential: 1094
  School/childcare: 326

Sample rows with afct_school=1:


Unnamed: 0_level_0,id,associations,afct_residential,afct_school
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
13,CIV1921,"Type of area affected=School or childcare,Weap...",0,1
17,CIV0378,"Type of area affected=School or childcare,Weap...",0,1
29,CIV0024,"Type of area affected=School or childcare,Weap...",0,1


#### 2a. Reverse geocode: latitude / longitude $\rightarrow$ UA postcode
_Reverse geocodes event coordinates to Ukrainian postcodes via Nominatim API._

In [8]:
# ============================================================================
# REVERSE GEOCODE COORDINATES
# ----------------------------------------------------------------------------
# Convert lat/lon to Ukrainian postcodes via Nominatim API
# ============================================================================

geolocator = Nominatim(user_agent=NOMINATIM_USER_AGENT)

def get_postcode(lat, lon):
    """
    Reverse geocode latitude/longitude to get postcode.
    Returns None if postcode not found.
    """
    try:
        location = geolocator.reverse(f"{lat}, {lon}", language='en')
        if location and location.raw.get('address'):
            postcode = location.raw['address'].get('postcode')
            return postcode
        return None
    except Exception as e:
        print(f"Error geocoding ({lat}, {lon}): {e}")
        return None

# Apply geocoding to each row with rate-limited delay
postcodes = []
total_rows = len(d_lvl2)

for idx, row in d_lvl2.iterrows():
    lat = row['latitude']
    lon = row['longitude']
    
    postcode = get_postcode(lat, lon)
    postcodes.append(postcode)
    
    if (idx + 1) % 10 == 0:
        print(f"Processed {idx + 1}/{total_rows} rows...")
    
    sleep(NOMINATIM_DELAY_SEC)

d_lvl2['postcode'] = postcodes

print(f"\nGeocoding complete!")
print(f"Postcodes found: {d_lvl2['postcode'].notna().sum()}/{len(d_lvl2)}")
print(f"\nSample results:")
print(d_lvl2[['latitude', 'longitude', 'postcode']].head(10))

Processed 10/2446 rows...
Processed 20/2446 rows...
Processed 30/2446 rows...
Processed 40/2446 rows...
Processed 50/2446 rows...
Processed 60/2446 rows...
Processed 70/2446 rows...
Processed 80/2446 rows...
Processed 90/2446 rows...
Processed 100/2446 rows...
Processed 110/2446 rows...
Processed 120/2446 rows...
Processed 130/2446 rows...
Processed 140/2446 rows...
Processed 150/2446 rows...
Processed 160/2446 rows...
Processed 170/2446 rows...
Processed 180/2446 rows...
Processed 190/2446 rows...
Processed 200/2446 rows...
Processed 210/2446 rows...
Processed 220/2446 rows...
Processed 230/2446 rows...
Processed 240/2446 rows...
Processed 250/2446 rows...
Processed 260/2446 rows...
Processed 270/2446 rows...
Processed 280/2446 rows...
Processed 290/2446 rows...
Processed 300/2446 rows...
Processed 310/2446 rows...
Processed 320/2446 rows...
Processed 330/2446 rows...
Processed 340/2446 rows...
Processed 350/2446 rows...
Processed 360/2446 rows...
Processed 370/2446 rows...
Processed 

In [9]:
# ============================================================================
# POSTCODE SUMMARY
# ----------------------------------------------------------------------------
# Enumerate unique postcodes in d_lvl2
# ============================================================================

def count_unique_postcodes(df, col='postcode'):
    """
    Returns count of unique non-null values in specified column.
    """
    unique_vals = df[col].dropna().unique()
    return len(unique_vals)

n_unique = count_unique_postcodes(d_lvl2)
print(f"Unique postcodes in d_lvl2: {n_unique}")

Unique postcodes in d_lvl2: 948


In [10]:
# ============================================================================
# ADMIN UNIT EXTRACTION
# ----------------------------------------------------------------------------
# Extract first 2 digits of postcode as administrative unit identifier
# ============================================================================

d_lvl2['admin_unit'] = d_lvl2['postcode'].astype(str).str[:2]

# Replace 'na' (from NaN conversion) with actual NaN
d_lvl2.loc[d_lvl2['admin_unit'] == 'na', 'admin_unit'] = np.nan

# Enumerate unique admin units
n_unique_admin = count_unique_postcodes(d_lvl2, col='admin_unit')
print(f"Unique admin units in d_lvl2: {n_unique_admin}")
print(f"\nAdmin unit distribution:")
d_lvl2['admin_unit'].value_counts().sort_index()

Unique admin units in d_lvl2: 82

Admin unit distribution:


admin_unit
01     19
02     16
03     26
04     33
06      2
07     28
08     64
09      3
10      1
11      6
12      3
14     20
15      5
16      1
17      3
18      2
19      1
20      2
21      6
22      1
23      1
24      5
27      2
28     12
29      3
30     68
31      1
33      1
34     11
35      5
36      5
38      1
40     20
41     26
42     34
43      1
45      2
46      2
48      1
49     36
50     29
51      3
52      4
53     40
54     91
55      2
56      6
57     16
61    326
62    104
63     80
64     30
65     42
67     17
68     10
69     62
70     40
71     30
72      3
73    241
74     81
75     53
76      1
77      1
78      1
79     10
80      2
81      2
83    103
84    174
85    143
86     36
87     73
88      3
91      3
92      6
93     54
94      7
96      1
99      1
MD      1
No     35
Name: count, dtype: int64

In [11]:
# ============================================================================
# MAP ADMIN UNITS TO OBLAST NAMES
# ----------------------------------------------------------------------------
# Source: Ukrposhta postal code system
# Ref: https://en.wikipedia.org/wiki/Postal_codes_in_Ukraine
# ============================================================================

ADMIN_UNIT_TO_OBLAST = {
    # Kyiv city (01-06)
    '01': 'Kyiv', '02': 'Kyiv', '03': 'Kyiv', 
    '04': 'Kyiv', '05': 'Kyiv', '06': 'Kyiv',
    # Kyiv Oblast (07-09)
    '07': 'Kyiv Oblast', '08': 'Kyiv Oblast', '09': 'Kyiv Oblast',
    # Zhytomyr Oblast (10-13)
    '10': 'Zhytomyr Oblast', '11': 'Zhytomyr Oblast', 
    '12': 'Zhytomyr Oblast', '13': 'Zhytomyr Oblast',
    # Chernihiv Oblast (14-17)
    '14': 'Chernihiv Oblast', '15': 'Chernihiv Oblast',
    '16': 'Chernihiv Oblast', '17': 'Chernihiv Oblast',
    # Cherkasy Oblast (18-22)
    '18': 'Cherkasy Oblast', '19': 'Cherkasy Oblast', '20': 'Cherkasy Oblast',
    '21': 'Cherkasy Oblast', '22': 'Cherkasy Oblast',
    # Vinnytsia Oblast (23-24)
    '23': 'Vinnytsia Oblast', '24': 'Vinnytsia Oblast',
    # Kirovohrad Oblast (25-28)
    '25': 'Kirovohrad Oblast', '26': 'Kirovohrad Oblast',
    '27': 'Kirovohrad Oblast', '28': 'Kirovohrad Oblast',
    # Khmelnytskyi Oblast (29-32)
    '29': 'Khmelnytskyi Oblast', '30': 'Khmelnytskyi Oblast',
    '31': 'Khmelnytskyi Oblast', '32': 'Khmelnytskyi Oblast',
    # Rivne Oblast (33-35)
    '33': 'Rivne Oblast', '34': 'Rivne Oblast', '35': 'Rivne Oblast',
    # Poltava Oblast (36-39)
    '36': 'Poltava Oblast', '37': 'Poltava Oblast',
    '38': 'Poltava Oblast', '39': 'Poltava Oblast',
    # Sumy Oblast (40-42)
    '40': 'Sumy Oblast', '41': 'Sumy Oblast', '42': 'Sumy Oblast',
    # Volyn Oblast (43-45)
    '43': 'Volyn Oblast', '44': 'Volyn Oblast', '45': 'Volyn Oblast',
    # Ternopil Oblast (46-48)
    '46': 'Ternopil Oblast', '47': 'Ternopil Oblast', '48': 'Ternopil Oblast',
    # Dnipropetrovsk Oblast (49-53)
    '49': 'Dnipropetrovsk Oblast', '50': 'Dnipropetrovsk Oblast',
    '51': 'Dnipropetrovsk Oblast', '52': 'Dnipropetrovsk Oblast',
    '53': 'Dnipropetrovsk Oblast',
    # Mykolaiv Oblast (54-57)
    '54': 'Mykolaiv Oblast', '55': 'Mykolaiv Oblast',
    '56': 'Mykolaiv Oblast', '57': 'Mykolaiv Oblast',
    # Chernivtsi Oblast (58-60)
    '58': 'Chernivtsi Oblast', '59': 'Chernivtsi Oblast', '60': 'Chernivtsi Oblast',
    # Kharkiv Oblast (61-64)
    '61': 'Kharkiv Oblast', '62': 'Kharkiv Oblast',
    '63': 'Kharkiv Oblast', '64': 'Kharkiv Oblast',
    # Odesa Oblast (65-68)
    '65': 'Odesa Oblast', '66': 'Odesa Oblast',
    '67': 'Odesa Oblast', '68': 'Odesa Oblast',
    # Zaporizhzhia Oblast (69-72)
    '69': 'Zaporizhzhia Oblast', '70': 'Zaporizhzhia Oblast',
    '71': 'Zaporizhzhia Oblast', '72': 'Zaporizhzhia Oblast',
    # Kherson Oblast (73-75)
    '73': 'Kherson Oblast', '74': 'Kherson Oblast', '75': 'Kherson Oblast',
    # Ivano-Frankivsk Oblast (76-78)
    '76': 'Ivano-Frankivsk Oblast', '77': 'Ivano-Frankivsk Oblast',
    '78': 'Ivano-Frankivsk Oblast',
    # Lviv Oblast (79-82)
    '79': 'Lviv Oblast', '80': 'Lviv Oblast',
    '81': 'Lviv Oblast', '82': 'Lviv Oblast',
    # Donetsk Oblast (83-87)
    '83': 'Donetsk Oblast', '84': 'Donetsk Oblast', '85': 'Donetsk Oblast',
    '86': 'Donetsk Oblast', '87': 'Donetsk Oblast',
    # Zakarpattia Oblast (88-90)
    '88': 'Zakarpattia Oblast', '89': 'Zakarpattia Oblast', '90': 'Zakarpattia Oblast',
    # Luhansk Oblast (91-94)
    '91': 'Luhansk Oblast', '92': 'Luhansk Oblast',
    '93': 'Luhansk Oblast', '94': 'Luhansk Oblast',
    # AR Crimea (95-98) & Sevastopol (99)
    '95': 'AR Crimea', '96': 'AR Crimea', '97': 'AR Crimea', '98': 'AR Crimea',
    '99': 'Sevastopol',
}

# Map admin units to oblast names
d_lvl2['oblast'] = d_lvl2['admin_unit'].map(ADMIN_UNIT_TO_OBLAST)

# Verify mapping
print(f"Mapped oblasts: {d_lvl2['oblast'].notna().sum()}/{len(d_lvl2)}")
print(f"Unmapped admin units: {d_lvl2[d_lvl2['oblast'].isna()]['admin_unit'].unique()}")
print(f"\nOblast distribution:")
d_lvl2['oblast'].value_counts()

Mapped oblasts: 2410/2446
Unmapped admin units: ['No' 'MD']

Oblast distribution:


oblast
Kharkiv Oblast            540
Donetsk Oblast            529
Kherson Oblast            375
Zaporizhzhia Oblast       135
Mykolaiv Oblast           115
Dnipropetrovsk Oblast     112
Kyiv                       96
Kyiv Oblast                95
Sumy Oblast                80
Khmelnytskyi Oblast        72
Luhansk Oblast             70
Odesa Oblast               69
Chernihiv Oblast           29
Rivne Oblast               17
Kirovohrad Oblast          14
Lviv Oblast                14
Cherkasy Oblast            12
Zhytomyr Oblast            10
Vinnytsia Oblast            6
Poltava Oblast              6
Zakarpattia Oblast          3
Ternopil Oblast             3
Volyn Oblast                3
Ivano-Frankivsk Oblast      3
AR Crimea                   1
Sevastopol                  1
Name: count, dtype: int64

In [12]:
# ============================================================================
# AGGREGATE TO OBLAST LEVEL
# ----------------------------------------------------------------------------
# Roll up event-level data to oblast-level summary
# ============================================================================

# Extract admin_units mapping BEFORE aggregation
admin_units_map = d_lvl2.groupby('oblast')['admin_unit'].apply(
    lambda x: ', '.join(sorted(x.dropna().unique()))
).reset_index().rename(columns={'admin_unit': 'admin_units'})

# Aggregate counts
d_lvl2 = d_lvl2.groupby('oblast', as_index=False).agg({
    'id': 'count',
    'afct_residential': 'sum',
    'afct_school': 'sum',
})

# Rename columns
d_lvl2 = d_lvl2.rename(columns={'id': 'n_events'})

# Merge admin_units
d_lvl2 = d_lvl2.merge(admin_units_map, on='oblast', how='left')

# Reorder columns and sort by total events descending
d_lvl2 = d_lvl2[['oblast', 'admin_units', 'n_events', 'afct_residential', 'afct_school']]
d_lvl2 = d_lvl2.sort_values('n_events', ascending=False).reset_index(drop=True)

# Save to CSV
d_lvl2.to_csv(PROJECT_ROOT / 'data' / 'd_lvl2.csv', index=False)

print(f"Oblast-level aggregation: {len(d_lvl2)} oblasts")
print(f"Saved to: {PROJECT_ROOT / 'data' / 'd_lvl2.csv'}")
d_lvl2

Oblast-level aggregation: 26 oblasts
Saved to: /Users/sskeen/anaconda_projects/ukraine_longitudinal_survey/data/d_lvl2.csv


Unnamed: 0,oblast,admin_units,n_events,afct_residential,afct_school
0,Kharkiv Oblast,"61, 62, 63, 64",540,264,94
1,Donetsk Oblast,"83, 84, 85, 86, 87",529,240,72
2,Kherson Oblast,"73, 74, 75",375,147,63
3,Zaporizhzhia Oblast,"69, 70, 71, 72",135,43,16
4,Mykolaiv Oblast,"54, 55, 56, 57",115,56,9
5,Dnipropetrovsk Oblast,"49, 50, 51, 52, 53",112,61,11
6,Kyiv,"01, 02, 03, 04, 06",96,55,8
7,Kyiv Oblast,"07, 08, 09",95,42,5
8,Sumy Oblast,"40, 41, 42",80,36,16
9,Khmelnytskyi Oblast,"29, 30, 31",72,38,4
