# Project 1

Due: 11/13/25

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Description

**Describe the data clearly -- particularly any missing data that might impact your analysis -- and the provenance of your dataset. Who collected the data and why? (10/100 pts)**

In [3]:
ukraine = pd.read_csv('ukraine-damages.csv', sep='|', header=0, index_col=False)
ukraine.head()

Unnamed: 0,damage_id,iso3,country,gid_1,oblast,rayon,type_of_infrastructure,if_other_what,date_of_event,source_name,source_date,source_link,additional_sources,extent_of_damage,_internal_filter_date,_weights,access_subindicator,pcode
0,D0011,UKR,Ukraine,['UKR.15_1'],Luhanska,Siverskodonetskyi,Warehouse,,2022-03-25,OCHA,2022-03-28,https://reliefweb.int/report/ukraine/ukraine-h...,,Destroyed,2022-03-25,0.7,['7.2'],UA44
1,D0012,UKR,Ukraine,['UKR.15_1'],Luhanska,Siverskodonetskyi,Warehouse,,2022-03-26,OCHA,2022-03-28,https://reliefweb.int/report/ukraine/ukraine-h...,,Partially damaged,2022-03-26,0.7,['7.2'],UA44
2,D0015,UKR,Ukraine,['UKR.14_1'],Lvivska,,Warehouse,,2022-03-26,OCHA,2022-03-28,https://reliefweb.int/report/ukraine/ukraine-h...,,Unknown,2022-03-26,1.0,['7.2'],UA46
3,D0016,UKR,Ukraine,['UKR.14_1'],Lvivska,,Aircraft repair plant,Aircraft repair plan,2022-03-26,OCHA,2022-03-28,https://reliefweb.int/report/ukraine/ukraine-h...,,Destroyed,2022-03-26,1.0,['7.2'],UA46
4,D0017,UKR,Ukraine,['UKR.12_1'],Kyivska,,Bridge,,2022-03-22,OCHA,2022-03-28,https://reliefweb.int/report/ukraine/ukraine-h...,,Destroyed,2022-03-22,1.0,['9.2'],UA32


In [4]:
ukraine.to_csv('ukraine-data.csv', index=False)

### Basic Overview

In [4]:
ukraine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24269 entries, 0 to 24268
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   damage_id               24269 non-null  object 
 1   iso3                    24269 non-null  object 
 2   country                 24269 non-null  object 
 3   gid_1                   24261 non-null  object 
 4   oblast                  24261 non-null  object 
 5   rayon                   15549 non-null  object 
 6   type_of_infrastructure  24265 non-null  object 
 7   if_other_what           3107 non-null   object 
 8   date_of_event           17918 non-null  object 
 9   source_name             24267 non-null  object 
 10  source_date             22935 non-null  object 
 11  source_link             24266 non-null  object 
 12  additional_sources      3121 non-null   object 
 13  extent_of_damage        24266 non-null  object 
 14  _internal_filter_date   24269 non-null

Notes about the data:
- damage_id
- iso3 gives the country's unique three-letter code from the International Organization for Standardization (ISO)
    - https://acleddata.com/methodology/guide-2023-acled-column-changes
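- oblast is a term that refers to the primary (highest-level) administrative division (region) in Ukraine. There are 24 of them in Ukraine; the Autonomous Republic of Crimea and the cities of Kyiv and Sevastopol have separate first-level status, which is why the `oblast` column has more than 24 distinct values.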
- it appears that the `_internal_filter_date` column equals the `date_of_event` column when an event date is present, and otherwise equals `source_date`.
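The last note is only an observation, so it is worth checking before relying on it. A minimal sketch of such a check follows; the helper name `check_filter_date_rule` and the toy frame are hypothetical, and only the three column names come from the dataset.

```python
import pandas as pd

def check_filter_date_rule(df: pd.DataFrame) -> pd.Series:
    """True where _internal_filter_date matches the hypothesized rule:
    date_of_event if present, otherwise source_date."""
    event = pd.to_datetime(df["date_of_event"], errors="coerce")
    source = pd.to_datetime(df["source_date"], errors="coerce")
    internal = pd.to_datetime(df["_internal_filter_date"], errors="coerce")
    expected = event.fillna(source)  # fall back to source_date when the event date is missing
    return internal.eq(expected)     # rows where both dates are missing compare as False

# Made-up rows, only to show the call; these are not real records.
toy = pd.DataFrame({
    "date_of_event": ["2022-03-25", None, "2022-03-22"],
    "source_date": ["2022-03-28", "2022-03-28", "2022-03-28"],
    "_internal_filter_date": ["2022-03-25", "2022-03-28", "2022-03-23"],
})
matches = check_filter_date_rule(toy)
print(f"{matches.mean():.1%} of rows follow the rule")
```

Running `check_filter_date_rule(ukraine)` and inspecting the rows where the result is `False` would show whether the rule holds everywhere or only for most records.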

In [5]:
# Convert 'date_of_event' and 'source_date' to datetime
ukraine['date_of_event'] = pd.to_datetime(ukraine['date_of_event'], errors='coerce')
ukraine['source_date'] = pd.to_datetime(ukraine['source_date'], errors='coerce')

In [6]:
print(ukraine['date_of_event'].min())
print(ukraine['date_of_event'].max())

2022-02-24 00:00:00
2025-10-01 00:00:00


In [15]:
# Categorical Variables Summary
print(ukraine['iso3'].value_counts())


iso3
UKR    24269
Name: count, dtype: int64


In [16]:
print(ukraine['oblast'].value_counts())


oblast
Donetska                         6095
Dnipropetrovska                  4053
Kharkivska                       3393
Khersonska                       2305
Mykolaivska                      2110
Sumska                           1540
Zaporizka                         803
Odeska                            671
Kyivska                           645
Luhanska                          588
Chernihivska                      516
Kyiv                              498
Zhytomyrska                       204
Khmelnytska                       177
Poltavska                         148
Cherkaska                         133
Lvivska                           128
Kirovohradska                      67
Vinnytska                          53
Ternopilska                        37
Rivnenska                          22
Ivano-Frankivska                   19
Volynska                           16
Autonomous Republic of Crimea      16
Zakarpatska                        12
Chernivetska                        7
Sevas

In [17]:
print(ukraine['rayon'].value_counts())


rayon
Nikopolskyi     2474
Kramatorskyi    2274
Kharkivskyi     1308
Khersonskyi     1187
Pokrovskyi       733
                ... 
Khmilnytskyi       1
Yaltynskyi         1
Zhmerynskyi        1
Sambirskyi         1
Kovelskyi          1
Name: count, Length: 113, dtype: int64


In [18]:
print(ukraine['type_of_infrastructure'].value_counts())

type_of_infrastructure
Industrial/Business/Enterprise facilities     5056
Education facility (school, etc.)             5021
Electricity supply system                     3016
Government facilities                         2060
Gas supply system                             1968
Cultural facilities (museum, theater etc.)    1210
Health facility (hospital, health clinic)     1169
Warehouse                                      834
Agricultural facilities                        805
Other                                          724
Railway                                        556
Water supply system                            280
Religious facilities                           277
Road / Highway                                 245
Heating and water facility                     244
Fuel depot                                     203
Telecommunications                             171
Bridge                                         146
Power plant                                     74
Harbor  

In [20]:
print(ukraine['extent_of_damage'].value_counts())

extent_of_damage
Partially damaged               20566
Unknown                          2457
Destroyed                        1242
Partially damaged, Destroyed        1
Name: count, dtype: int64


In [None]:
# Temporal Analysis
ukraine_copy = ukraine.copy()
ukraine_copy['year'] = ukraine_copy['date_of_event'].dt.year

print("Year Distribution:")
year_counts = ukraine_copy['year'].value_counts().sort_index()
for year, count in year_counts.items():
    if pd.notna(year):
        pct = count / len(ukraine) * 100
        print(f"{int(year)}: {count:6,} records ({pct:5.2f}%)")

print("\nMonthly Distribution:")     
ukraine_copy['year_month'] = ukraine_copy['date_of_event'].dt.to_period('M')
monthly_counts = ukraine_copy['year_month'].value_counts().sort_index()
for month, count in monthly_counts.items():
    if pd.notna(month):
        pct = count / len(ukraine) * 100
        print(f"{month}: {count:6,} records ({pct:5.2f}%)")

Year Distribution:
2022:  1,620 records ( 6.68%)
2023:  3,703 records (15.26%)
2024:  5,469 records (22.53%)
2025:  7,126 records (29.36%)

Monthly Distribution:
2022-02:     36 records ( 0.15%)
2022-03:    148 records ( 0.61%)
2022-04:    187 records ( 0.77%)
2022-05:    226 records ( 0.93%)
2022-06:    227 records ( 0.94%)
2022-07:    162 records ( 0.67%)
2022-08:    134 records ( 0.55%)
2022-09:     14 records ( 0.06%)
2022-10:     60 records ( 0.25%)
2022-11:    176 records ( 0.73%)
2022-12:    250 records ( 1.03%)
2023-01:    258 records ( 1.06%)
2023-02:    194 records ( 0.80%)
2023-03:    191 records ( 0.79%)
2023-04:    222 records ( 0.91%)
2023-05:    458 records ( 1.89%)
2023-06:    375 records ( 1.55%)
2023-07:    302 records ( 1.24%)
2023-08:    388 records ( 1.60%)
2023-09:    336 records ( 1.38%)
2023-10:    356 records ( 1.47%)
2023-11:    262 records ( 1.08%)
2023-12:    361 records ( 1.49%)
2024-01:    435 records ( 1.79%)
2024-02:    406 records ( 1.67%)
2024-03:    4

In [24]:
# Summary Statistics

summary = {
    'total_records': len(ukraine),
    'total_columns': len(ukraine.columns),
    'unique_oblasts': ukraine['oblast'].nunique(),
    'unique_rayons': ukraine['rayon'].nunique(),
    'unique_infrastructure_types': ukraine['type_of_infrastructure'].nunique(),
    'date_range': f"{ukraine['date_of_event'].min()} to {ukraine['date_of_event'].max()}",
    'missing_dates_pct': (ukraine['date_of_event'].isna().sum() / len(ukraine) * 100),
    'missing_rayons_pct': (ukraine['rayon'].isna().sum() / len(ukraine) * 100)
}

for key, value in summary.items():
    print(f"{key}: {value}")


total_records: 24269
total_columns: 18
unique_oblasts: 27
unique_rayons: 113
unique_infrastructure_types: 25
date_range: 2022-02-24 00:00:00 to 2025-10-01 00:00:00
missing_dates_pct: 26.169187028719765
missing_rayons_pct: 35.93061106761712


This dataset documents infrastructure damage across Ukraine from the ongoing war. It contains 24,269 records, with event dates spanning February 24, 2022 to October 1, 2025. The dataset provides granular documentation of damage to civilian and critical infrastructure across all 27 first-level regions of Ukraine.

### Key Dataset Characteristics:

**Size:** 24,269 damage records across 27 oblast-level regions, with event dates spanning February 2022 to October 2025

**Geographic Focus:** Damage is most heavily concentrated in the eastern and southern front-line regions (Donetska, Dnipropetrovska, Kharkivska, Khersonska)

**Infrastructure Types:** The data documents damage across 25 categories, predominantly industrial facilities (21%), education facilities (21%), and electricity systems (12%)
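The percentages above can be reproduced directly from the category counts. The sketch below uses a hypothetical helper, `category_shares`, on a made-up series. Note that `value_counts(normalize=True)` divides by the number of non-missing values, which differs slightly from dividing by the total row count when a few labels are missing.

```python
import pandas as pd

def category_shares(values: pd.Series, top: int = 3) -> pd.Series:
    """Percentage of non-missing records in each category, largest first."""
    shares = values.value_counts(normalize=True).mul(100).round(1)
    return shares.head(top)

# Made-up labels, only to show the call; these are not real records.
toy_types = pd.Series(["Warehouse", "Bridge", "Warehouse", "Railway", None, "Warehouse"])
print(category_shares(toy_types))
```

On the real data the call would be `category_shares(ukraine['type_of_infrastructure'])`.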

### Missing Data

In [7]:
missing_data = pd.DataFrame({
    'Column': ukraine.columns,
    'Missing Count': ukraine.isnull().sum(),
    'Missing Percentage': (ukraine.isnull().sum() / len(ukraine) * 100).round(2)
})

In [8]:
print(missing_data.to_string(index=False))

                Column  Missing Count  Missing Percentage
             damage_id              0                0.00
                  iso3              0                0.00
               country              0                0.00
                 gid_1              8                0.03
                oblast              8                0.03
                 rayon           8720               35.93
type_of_infrastructure              4                0.02
         if_other_what          21162               87.20
         date_of_event           6351               26.17
           source_name              2                0.01
           source_date           1334                5.50
           source_link              3                0.01
    additional_sources          21148               87.14
      extent_of_damage              3                0.01
 _internal_filter_date              0                0.00
              _weights              0                0.00
   access_subi

In [9]:
oblast_rayon = ukraine.groupby('oblast')['rayon'].apply(lambda x: x.isnull().sum())
total_by_oblast = ukraine.groupby('oblast').size()
missing_pct = (oblast_rayon / total_by_oblast * 100).round(2)

rayon_analysis = pd.DataFrame({
    'Oblast': oblast_rayon.index,
    "Total Records": total_by_oblast.values,
    "Missing Rayon": oblast_rayon.values,
    "Missing %": missing_pct.values
})

rayon_analysis_sorted = rayon_analysis.sort_values('Missing %', ascending=False)

print(rayon_analysis_sorted.to_string(index=False))

                       Oblast  Total Records  Missing Rayon  Missing %
                  Zhytomyrska            204            176      86.27
                  Mykolaivska           2110           1738      82.37
                    Rivnenska             22             17      77.27
                    Vinnytska             53             37      69.81
                      Kyivska            645            445      68.99
                     Luhanska            588            405      68.88
             Ivano-Frankivska             19              9      47.37
                 Chernihivska            516            234      45.35
                     Donetska           6095           2716      44.56
                    Zaporizka            803            317      39.48
                  Khmelnytska            177             63      35.59
                   Khersonska           2305            814      35.31
                         Kyiv            498            157      31.53
      

In [None]:
# Date validity
valid_event_dates = ukraine['date_of_event'].notna().sum()
valid_event_pct = (valid_event_dates / len(ukraine) * 100)
print(f"Records with valid event dates: {valid_event_dates:,} ({valid_event_pct:.2f}%)")
print(f"Records with missing event dates: {ukraine['date_of_event'].isna().sum():,} "
      f"({(ukraine['date_of_event'].isna().sum()/len(ukraine)*100):.2f}%)")

Records with valid event dates: 17,918 (73.83%)
Records with missing event dates: 6,351 (26.17%)


### Critical Missing Data That Will Impact Analysis

1. **Date of Event (26.17% missing):** This is the most significant gap, with over 6,000 records lacking event dates. This severely limits temporal analysis and trend identification.

2. **Rayon/District (35.93% missing):** This reduces geographic precision, with missing rates varying dramatically by oblast (from 0% to 86% depending on region).

3. **Additional Sources (87.14% missing):** Most records cite only a single source, limiting cross-verification capabilities.

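4. **"If Other What" (87.20% missing):** This is not concerning. The field is meant to elaborate on records classified as "Other" infrastructure, which make up only 3% of the data. It is filled for 3,107 records, more than the 724 labeled "Other", so it is evidently also used for some other categories.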
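How much these gaps bias an analysis depends on whether they are spread evenly or concentrated in particular regions or sources. The sketch below generalizes the rayon-by-oblast breakdown above to any column pair; the helper name `missing_rate_by_group` and the toy rows are hypothetical.

```python
import pandas as pd

def missing_rate_by_group(df: pd.DataFrame, target: str, by: str) -> pd.Series:
    """Percent of rows with a missing `target`, within each level of `by`."""
    return (
        df[target].isna()          # True where the target value is missing
        .groupby(df[by])           # group the flags by the other column
        .mean()                    # mean of booleans = share missing
        .mul(100)
        .round(2)
        .sort_values(ascending=False)
    )

# Made-up rows, only to show the call; these are not real records.
toy = pd.DataFrame({
    "oblast": ["Donetska", "Donetska", "Kyivska", "Kyivska"],
    "date_of_event": ["2023-01-05", None, None, None],
})
rates = missing_rate_by_group(toy, "date_of_event", "oblast")
print(rates)
```

For example, `missing_rate_by_group(ukraine, 'date_of_event', 'source_name')` would show whether missing event dates come mostly from a handful of sources. Rows whose grouping value is itself missing are dropped by `groupby` by default.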

## SOURCES TO CONSULT:

https://acleddata.com/methodology/ukraine-civilian-infrastructure-automated-tags

^^ USE OF LLM IN TAGGING THESE ATTACKS

https://acleddata.com/about/history

https://acleddata.com/methodology/ukraine

https://acleddata.com/methodology/acled-codebook

https://acleddata.com/methodology/tags-data

https://acleddata.com/methodology/sourcing

https://acleddata.com/methodology/guide-2023-acled-column-changes

## Data Provenance: Who collected the data and why?

This data is a multi-source humanitarian tracking dataset compiled from 577 different organizations, including:
- Major Ukrainian media (Suspilne, Ukrinform, together contributing 42% of records)
- UN agencies (OCHA, UNOSAT, UNESCO, IAEA)
- Ukrainian government (regional military administrations, ministries)
- International organizations (Human Rights Watch, Amnesty International)
- International media (BBC, CNN, Reuters, AP News)

**Purpose:** This data was collected for humanitarian response planning, needs assessment, protection monitoring, historical documentation, and coordination among aid organizations. The data supports identifying areas requiring assistance, tracking civilian infrastructure damage, and planning reconstruction efforts.

**Methods:** The data combines satellite imagery analysis, on-the-ground reporting, humanitarian assessments, open-source intelligence, and official government reports.

The process by which these events were given their civilian infrastructure tags involves both AI and human oversight (SUMMARIZE THE INFORMATION FROM THE LINK ABOUT THIS AND CITE)

According to the ACLED (Armed Conflict Location and Event Dataset) webpage for the Ukraine Conflict Monitor, the file of attacks on Ukrainian infrastructure (ukraine-damages.csv) contains events in Ukraine that carry at least one of the following civilian infrastructure tags: energy, health, education, and residential infrastructure. (CITE!!!! (and probably quote or rephrase further))

## Phenomena

**What phenomenon are you modeling? Provide a brief background on the topic, including definitions and details that are relevant to your analysis. Clearly describe its main features, and support those claims with data where appropriate. (10/100 pts)**

## Describe Your Non-Parametric Model

Describe your non-parametric model (empirical cumulative distribution functions, kernel density function, local constant least squares regression, Markov transition models). How are you fitting your model to the phenomenon to get realistic properties of the data? What challenges did you have to overcome? (15/100 pts)

remember to cut the gordian knot. don't stay stuck.

IDEA: MARKOV CHAIN

"Given that a warehouse was damaged, what is t
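One way to make the Markov chain idea concrete is to estimate a first-order transition matrix from consecutive records. The sketch below assumes the state is `type_of_infrastructure` and that records have already been sorted by date. Both are assumptions for illustration, and the helper name `transition_matrix` and the toy sequence are hypothetical.

```python
import pandas as pd

def transition_matrix(states: pd.Series) -> pd.DataFrame:
    """Estimated P(next state | current state) from one ordered sequence."""
    current = states.iloc[:-1].reset_index(drop=True)
    nxt = states.iloc[1:].reset_index(drop=True)
    # Count (current, next) pairs and normalize each row to sum to 1.
    return pd.crosstab(current, nxt, rownames=["current"],
                       colnames=["next"], normalize="index")

# Made-up ordered sequence, only to show the call; not real records.
toy_sequence = pd.Series(["Warehouse", "Bridge", "Warehouse", "Warehouse", "Railway"])
P = transition_matrix(toy_sequence)
print(P)
```

On the real data the input might be `ukraine.dropna(subset=['date_of_event']).sort_values('date_of_event')['type_of_infrastructure']`, possibly within a single oblast. Records that share a date have no natural order, so the estimated transitions would depend on how ties are broken.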

## Use model/Bootstrap

Either use your model to create new sequences (if the model is more generative) or bootstrap a quantity of interest (if the model is more inferential). (15/100 pts)
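For the inferential route, a percentile bootstrap is one simple option. The sketch below is a minimal version under the assumption that records can be resampled independently; the quantity (share of records marked "Destroyed"), the helper name `bootstrap_ci`, and the toy values are all hypothetical choices for illustration.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap interval for `stat` of a 1-D sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample n observations with replacement and recompute the statistic.
        resample = values[rng.integers(0, n, size=n)]
        estimates[b] = stat(resample)
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return stat(values), lower, upper

# Made-up indicator values (1 = destroyed), only to show the call.
toy_destroyed = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
point, lower, upper = bootstrap_ci(toy_destroyed)
print(f"share destroyed: {point:.2f} (interval {lower:.2f} to {upper:.2f})")
```

On the real data the input could be `(ukraine['extent_of_damage'] == 'Destroyed').astype(int)`. Damage records are likely clustered in time and space, so resampling individual rows probably understates the uncertainty. Resampling whole blocks (for example, all records from one month or one oblast) would be a more cautious variant.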

## Critical Evaluation

Critically evaluate your work in part 4. Do your sequences have the properties of the training data, and if not, why not? Are your estimates credible and reliable, or is there substantial uncertainty in your results? (15/100 pts)

## Conclusion

Write a conclusion that explains the limitations of your analysis and potential for future work on this topic. (10/100 pts)

# Other

In addition, submit a GitHub repo containing your code and a description of how to obtain the original data from the source. Make sure the code is commented, where appropriate. Include a .gitignore file. We will look at your commit history briefly to determine whether everyone in the group contributed. (10/100 pts)

In class, we'll briefly do presentations and criticize each other's work, and participation in your group's presentation and constructively critiquing the other groups' presentations accounts for the remaining 15/100 pts.