## Ukraine Longitudinal Survey

_WIP - NOT FOR DISTRIBUTION_

_Proof-of-concept data structure: Bellingcat OSINT [Civilian Harm in Ukraine](https://ukraine.bellingcat.com/) geocoded event-level data $\rightarrow$ Ukraine Longitudinal Survey (ULS) cross-sectional survey data 1:$n$ merge._
> `uls_scratchpad.ipynb`<br>
> Simone J. Skeen (01-09-2026)
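
As a rough illustration of the target structure, the sketch below left-merges toy survey rows onto toy event rows so that one survey row can match $n$ events. The frames, the column names `respondent_id` and `event_id`, and the use of `postcode` as the merge key are assumptions for illustration, not the ULS schema. Postcodes are kept as strings so leading zeros survive.

```python
import pandas as pd

# Hypothetical survey rows: one respondent per postcode (assumption)
d_survey = pd.DataFrame({
    'respondent_id': ['r1', 'r2', 'r3'],
    'postcode': ['01001', '61000', '79000'],
})

# Hypothetical geocoded events: several events may share a postcode
d_events = pd.DataFrame({
    'event_id': ['e1', 'e2', 'e3'],
    'postcode': ['01001', '01001', '61000'],
})

# Left merge keeps every survey row; validate raises if the survey-side
# key is not unique, and indicator flags rows with no matching event
d_merged = d_survey.merge(
    d_events,
    on = 'postcode',
    how = 'left',
    validate = 'one_to_many',
    indicator = True,
)

print(d_merged)
```

If several respondents share a postcode, `validate = 'one_to_many'` raises a `MergeError`; in that case the relationship is many-to-many and either the key or the aggregation level would need rethinking.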


### 1. Prepare
_Imports requisite packages; customizes outputs._

> **Dependencies:** Install via `pip install -r requirements.txt` from project root before running.

In [None]:
import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings

from geopy.geocoders import Nominatim
from pathlib import Path
from time import sleep

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

pd.options.mode.copy_on_write = True

pd.set_option(
    'display.max_columns',
    None,
    )

pd.set_option(
    'display.max_rows',
    None,
    )

for c in (FutureWarning, UserWarning):
    warnings.simplefilter(
        action = 'ignore',
        category = c,
        )

In [None]:
# ============================================================================
# CONFIGURATION
# ----------------------------------------------------------------------------
# All user-adjustable parameters in one place for replicability
# ============================================================================

# Project paths
PROJECT_ROOT = Path.home() / 'anaconda_projects' / 'ukraine_longitudinal_survey'
SUBDIRS = [
    'data', 
    'figures', 
    'tables', 
    'temp',
    ]
FONT_FILE = 'Arial.ttf'

# Bellingcat data source
# API (for future use): https://bellingcat-embeds.ams3.cdn.digitaloceanspaces.com/production/ukr/timemap/api.json
BELLINGCAT_CSV = 'ukr-civharm-2026-01-09.csv'

# ULS merge parameters
ULS_START_DATE = '2025-04-08'  # earliest observation in ULS child survey
DATE_FORMAT_INPUT = '%m/%d/%Y'  # format in source data (if CSV fallback)
DATE_FORMAT_ISO = '%Y-%m-%d'    # ISO 8601 for internal use

# Geocoding
NOMINATIM_USER_AGENT = 'ukraine_postcode_geocoder'
NOMINATIM_DELAY_SEC = 1  # rate limit compliance

**Set working directory**<br>


`~/anaconda_projects/ukraine_longitudinal_survey/`<br>

In [3]:
# Set working directory to PROJECT_ROOT (a pathlib.Path, defined in CONFIGURATION)
import os
os.chdir(PROJECT_ROOT)
print(f"Working directory: {Path.cwd()}")

Working directory: /Users/sskeen/anaconda_projects/ukraine_longitudinal_survey


**Create subdirectories**<br>


`~/anaconda_projects/ukraine_longitudinal_survey/`<br>
`├──data`<br>
`├──figures`<br>
`├──tables`<br>
`└──temp`<br>

In [None]:
# Create subdirectories if they don't exist (no-op for those already present)
for subdir in SUBDIRS:
    (PROJECT_ROOT / subdir).mkdir(exist_ok = True)

print("Subdirectories present:", SUBDIRS)

**Register Arial font with matplotlib**

In [None]:
fm.fontManager.addfont(str(PROJECT_ROOT / FONT_FILE))
plt.rcParams['font.family'] = 'Arial'

### 2. Import / explore: Bellingcat OSINT Civilian Harm in Ukraine
_Imports, cleans, describes level-2 aggregate conflict data. Acquired via CSV export: https://ukraine.bellingcat.com/._

In [None]:
# ============================================================================
# IMPORT BELLINGCAT CIVILIAN HARM DATA
# ----------------------------------------------------------------------------
# Source: https://ukraine.bellingcat.com/ (manual CSV export)
# Docs: https://github.com/bellingcat/ukraine-timemap
# ============================================================================

d_lvl2 = pd.read_csv(PROJECT_ROOT / 'data' / BELLINGCAT_CSV)

# add ascending numerical idx: 'index'

d_lvl2['index'] = range(len(d_lvl2))
d_lvl2 = d_lvl2.set_index('index')

# drop imprecise location col

d_lvl2 = d_lvl2.drop(
    'location', 
    axis = 1,
    errors = 'ignore',
    )

# drop obs _later_ than ULS_START_DATE (earliest ULS child obs: 8-Apr-2025 14:56:10.00);
# cutoff is date-level (midnight), so events dated 8-Apr-2025 are retained

### SJS 1/9: based on _quick_ + dirty `list in` command; convert to Stata datetime + confirm before true merge

d_lvl2['date'] = pd.to_datetime(
    d_lvl2['date'], 
    format = DATE_FORMAT_INPUT,
    errors = 'coerce',
    )

uls_startdate = pd.to_datetime(ULS_START_DATE)

d_lvl2 = d_lvl2[d_lvl2['date'] <= uls_startdate]

# inspect - initial glance

d_lvl2.shape
d_lvl2.info()
d_lvl2.head(2)
d_lvl2.tail(2)
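
Because `errors = 'coerce'` turns any unparseable date into `NaT`, and `NaT <= cutoff` evaluates to `False`, the filter above silently drops rows whose dates did not match the expected format. A minimal sketch of an audit step on toy data follows; the date strings are illustrative, not taken from the Bellingcat export.

```python
import pandas as pd

# Toy stand-in for the raw 'date' column (values are illustrative)
raw = pd.DataFrame({'date': ['04/07/2025', '04/09/2025', '2025-04-01', None]})

parsed = pd.to_datetime(raw['date'], format = '%m/%d/%Y', errors = 'coerce')

# Rows that failed to parse (wrong format or missing) become NaT
n_unparsed = int(parsed.isna().sum())

cutoff = pd.to_datetime('2025-04-08')
keep = parsed <= cutoff  # comparisons against NaT are False

print(f"unparsed: {n_unparsed}, kept: {int(keep.sum())} of {len(raw)}")
```

Counting `NaT` values before filtering separates rows dropped for falling after the cutoff from rows dropped for failing to parse.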

In [None]:
# ============================================================================
# DUMMY CODE: AREA TYPE AFFECTED
# ----------------------------------------------------------------------------
# Create binary indicators for area types
# ============================================================================

# Residential areas
d_lvl2['afct_residential'] = d_lvl2['associations'].str.contains(
    r'Type of area affected=Residential',
    case=False,
    na=False,
    regex=True,
).astype(int)

# Schools / childcare facilities
d_lvl2['afct_school'] = d_lvl2['associations'].str.contains(
    r'Type of area affected=School or childcare',
    case=False,
    na=False,
    regex=True,
).astype(int)

# Verify coding
print("Area type affected counts:")
print(f"  Residential: {d_lvl2['afct_residential'].sum()}")
print(f"  School/childcare: {d_lvl2['afct_school'].sum()}")
print(f"\nSample rows with afct_school=1:")
d_lvl2[d_lvl2['afct_school'] == 1][['id', 'associations', 'afct_residential', 'afct_school']].head(3)
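
If more area types are needed, the two near-identical blocks above could be folded into one helper. The sketch below is one possible shape: `flag_area_type` is a hypothetical name, and the toy strings (including the comma delimiter) are guesses at the `associations` format. It matches the label literally (`regex = False`), so labels containing regex metacharacters cannot misfire.

```python
import pandas as pd

def flag_area_type(associations: pd.Series, label: str) -> pd.Series:
    """Return a 0/1 indicator for 'Type of area affected=<label>'."""
    pattern = f'Type of area affected={label}'
    return associations.str.contains(
        pattern,
        case = False,
        na = False,     # missing associations are coded 0
        regex = False,  # literal substring match
    ).astype(int)

# Illustrative values only; the real column may be delimited differently
toy = pd.Series([
    'Type of area affected=Residential',
    'Type of area affected=School or childcare,Type of area affected=Residential',
    None,
])

print(flag_area_type(toy, 'Residential').tolist())
print(flag_area_type(toy, 'School or childcare').tolist())
```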

#### 2a. Reverse geocode: latitude / longitude $\rightarrow$ UA postcode
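_Reverse geocodes event coordinates to postcodes via Nominatim (OpenStreetMap); piloted on a 5-row test subset before the full run._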

In [None]:

# d_test = first 5 rows of d_lvl2

d_test = d_lvl2.head(5)

print(f"d_test shape: {d_test.shape}")
d_test.head(8)

In [None]:
# Initialize the geocoder with a user agent
geolocator = Nominatim(user_agent=NOMINATIM_USER_AGENT)

def get_postcode(lat, lon):
    """
    Reverse geocode latitude/longitude to get postcode.
    Returns None if postcode not found.
    """
    try:
        # Reverse geocode the coordinates
        location = geolocator.reverse(f"{lat}, {lon}", language='en')
        
        # Extract postcode from address
        if location and location.raw.get('address'):
            postcode = location.raw['address'].get('postcode')
            return postcode
        return None
    except Exception as e:
        print(f"Error geocoding ({lat}, {lon}): {e}")
        return None

# Apply geocoding to each row, pausing NOMINATIM_DELAY_SEC between requests
postcodes = []
total_rows = len(d_test)

# enumerate supplies a positional counter; idx is an index label and is not
# guaranteed to be consecutive after the date filter
for n, (idx, row) in enumerate(d_test.iterrows(), start = 1):
    lat = row['latitude']
    lon = row['longitude']
    
    postcode = get_postcode(lat, lon)
    postcodes.append(postcode)
    
    # Progress indicator
    if n % 10 == 0:
        print(f"Processed {n}/{total_rows} rows...")
    
    # Respect Nominatim usage policy - configurable delay between requests
    sleep(NOMINATIM_DELAY_SEC)

# Add postcodes to dataframe
d_test['postcode'] = postcodes

print(f"\nGeocoding complete!")
print(f"Postcodes found: {d_test['postcode'].notna().sum()}/{len(d_test)}")
print(f"\nSample results:")
print(d_test[['latitude', 'longitude', 'postcode']].head(10))
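
At one request per `NOMINATIM_DELAY_SEC`, the full event set will take a while, and events at identical coordinates would be looked up repeatedly. The sketch below shows one way to call the lookup once per distinct coordinate pair. `geocode_unique` and `fake_lookup` are hypothetical names; `fake_lookup` stands in for `get_postcode()` so the example makes no network calls, and its return values are invented.

```python
from typing import Callable, Optional

def geocode_unique(
    coords: list[tuple[float, float]],
    lookup: Callable[[float, float], Optional[str]],
) -> list[Optional[str]]:
    """Call `lookup` once per distinct (lat, lon); reuse results for repeats."""
    cache: dict[tuple[float, float], Optional[str]] = {}
    out = []
    for lat, lon in coords:
        key = (lat, lon)
        if key not in cache:
            # In the notebook, the rate-limit sleep would belong here,
            # so that cached repeats incur no delay
            cache[key] = lookup(lat, lon)
        out.append(cache[key])
    return out

# Stand-in for get_postcode(); records calls instead of querying Nominatim
calls = []

def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return '01001' if lat > 50 else None

demo_coords = [(50.45, 30.52), (50.45, 30.52), (46.48, 30.72)]
demo_postcodes = geocode_unique(demo_coords, fake_lookup)

print(demo_postcodes, len(calls))
```

In the notebook, the coordinate list could be built with `list(zip(d_test['latitude'], d_test['longitude']))` and `get_postcode` passed as `lookup`.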