# Data Pipeline

This notebook contains the preprocessing steps for the UN resolutions dataset. The goal is to create a clean and structured dataset that can be used for analysis and modeling.

The current pipeline roughly looks like this:

1. **Fetch Resolutions**: Download the raw UN resolutions data from the source.
2. **Transform Resolutions**: Change structure to one row per resolution.
3. **Parse Subjects**: Subjects are currently stored as a single string that may contain multiple subjects. We create one row per resolution-subject pair.
4. **Fetch Thesaurus**: Download the thesaurus data from the source (TTL file).
5. **Parse Thesaurus Graph**: Parse the TTL file to extract the hierarchical relationships between subjects (using SKOS broader/narrower relationships).
6. **Create Subject Lookup Table**: Build a subject reference table with subject_id (URI), labels in different languages, and any other metadata from the thesaurus.
7. **Transform Subjects**: Replace each subject string with its subject_id, which allows labels in multiple languages.
8. **Normalize Dataframe**: Create separate tables for resolutions and subjects, and a mapping table for resolution-subject pairs.
9. **Build Hierarchy Graph**: Create a directed graph structure from the thesaurus where edges represent parent-child relationships.
10. **Generate Closure Table**: Create a closure table that contains all ancestor-descendant pairs with their depths. This includes:
    - Self-references (each subject to itself at depth 0)
    - All transitive relationships (every ancestor-descendant pair with their distance)
11. **Index Tables**: Create indexes on foreign keys and frequently queried columns (resolution_id, subject_id, ancestor_id, descendant_id) for performance.
12. **Implement Filter Functions**: Create query functions that use the closure table to efficiently filter resolutions by any category level (including all descendants).
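
Steps 9 and 10 can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation: the function name `build_closure_table` and the toy `(parent, child)` edges are hypothetical, and depth is assumed to be the shortest distance when a subject has more than one parent.

```python
from collections import deque


def build_closure_table(edges):
    """Return (ancestor, descendant, depth) rows for a list of (parent, child) edges.

    Depth 0 is the self-reference; otherwise depth is the shortest distance
    between ancestor and descendant (an assumption for multi-parent subjects).
    """
    parents = {}
    nodes = set()
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        nodes.update((parent, child))

    rows = []
    for node in sorted(nodes):
        # Breadth-first walk upwards, so the first visit gives the minimum depth.
        depth_of = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in parents.get(current, ()):
                if parent not in depth_of:
                    depth_of[parent] = depth_of[current] + 1
                    queue.append(parent)
        rows.extend(
            (ancestor, node, depth) for ancestor, depth in sorted(depth_of.items())
        )
    return rows


# Hypothetical three-level hierarchy: A is the parent of B, B is the parent of C.
closure = build_closure_table([("A", "B"), ("B", "C")])
for row in closure:
    print(row)
```

Because the self-reference rows are stored, a lookup on the ancestor column returns the subject itself together with all of its descendants, which is what filtering at any category level needs.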

Note: some subjects are in different schemes, for example subject 1002319.

In [1]:
from resolution_analyzer import UNResolutionAnalyzer

In [2]:
# Examples of how to use the UNResolutionAnalyzer class

# 1. Basic initialization with default configuration
analyzer = UNResolutionAnalyzer(config_path='config/data_sources.yaml')


INFO - Logging setup complete.
INFO - Initializing UNResolutionAnalyzer
INFO - Loaded data from local source successfully.




In [3]:
# 2. Query all resolutions (no filters)
all_resolutions = analyzer.query()
print(f"Total resolutions: {len(all_resolutions)}")


INFO - 
Final result: 5534 resolutions
Total resolutions: 5534


In [4]:
# 3. Query by date range
date_filtered = analyzer.query(start_date='2000-01-01', end_date='2010-12-31')
print(f"Resolutions from 2000-2010: {len(date_filtered)}")


INFO - 
Final result: 835 resolutions
Resolutions from 2000-2010: 835


In [5]:
# 4. Query by subject with descendants
# Using 'Political and Legal Questions' subject
political_legal_questions_resolutions = analyzer.query(
    subject_ids=['http://metadata.un.org/thesaurus/01'],
    include_descendants=True
)
print(f"Political and Legal Questions related resolutions: {len(political_legal_questions_resolutions)}")

INFO - Expanded 1 subjects to 1157 (including descendants)
INFO - After subject filter: 1872 resolutions
INFO - 
Final result: 1872 resolutions
Political and Legal Questions related resolutions: 1872
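
The expansion logged above (1 subject to 1157) is the kind of lookup a closure table makes cheap. The sketch below shows one way it could work. It is not the actual `UNResolutionAnalyzer.query` code, and the helper names, tuple layouts, and toy IDs are assumptions made for illustration.

```python
def expand_subjects(subject_ids, closure_rows):
    """Return the given subjects plus every descendant found in the closure table."""
    wanted = set(subject_ids)
    descendants = {
        descendant
        for ancestor, descendant, _depth in closure_rows
        if ancestor in wanted
    }
    # Keep the requested IDs even if they are missing from the closure table.
    return wanted | descendants


def resolutions_for_subjects(subject_ids, closure_rows, resolution_subjects,
                             include_descendants=True):
    """Return the IDs of resolutions tagged with any of the (expanded) subjects."""
    if include_descendants:
        subjects = expand_subjects(subject_ids, closure_rows)
    else:
        subjects = set(subject_ids)
    return {res_id for res_id, subject in resolution_subjects if subject in subjects}


# Hypothetical toy data: (ancestor, descendant, depth) and (resolution, subject) rows.
closure_rows = [("root", "root", 0), ("root", "child", 1), ("child", "child", 0)]
resolution_subjects = [("R1", "root"), ("R2", "child"), ("R3", "other")]

broad = resolutions_for_subjects(["root"], closure_rows, resolution_subjects)
strict = resolutions_for_subjects(["root"], closure_rows, resolution_subjects,
                                  include_descendants=False)
print(sorted(broad), sorted(strict))
```

With `include_descendants=False` only the exact subject matches, which mirrors the strict query in the next cell.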


In [6]:
# 5. Query by subject without descendants
palestine_questions_resolutions = analyzer.query(
    subject_ids=['http://metadata.un.org/thesaurus/1004700'],
    include_descendants=False
)
print(f"Palestine Questions (strict) resolutions: {len(palestine_questions_resolutions)}")


INFO - After subject filter: 172 resolutions
INFO - 
Final result: 172 resolutions
Palestine Questions (strict) resolutions: 172


In [7]:
# 6. Combined query (date range and subject)
science_technology_recent = analyzer.query(
    start_date='2015-01-01',
    subject_ids=['http://metadata.un.org/thesaurus/16'],  # Science and technology
    include_descendants=True
)
print(f"Recent science and technology resolutions: {len(science_technology_recent)}")

INFO - Expanded 1 subjects to 723 (including descendants)
INFO - After subject filter: 33 resolutions
INFO - 
Final result: 33 resolutions
Recent science and technology resolutions: 33
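
A combined query like the one above can be thought of as a chain of optional filters. The following sketch assumes that filters are combined with AND and that an omitted argument disables its filter. The function name `filter_resolutions` and the record fields `id`, `date`, and `subjects` are hypothetical and do not come from the analyzer.

```python
from datetime import date


def filter_resolutions(resolutions, start_date=None, end_date=None, subject_ids=None):
    """Keep resolutions that pass every filter that is set; None disables a filter."""
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None
    wanted = set(subject_ids) if subject_ids is not None else None

    kept = []
    for res in resolutions:
        if start is not None and res["date"] < start:
            continue
        if end is not None and res["date"] > end:
            continue
        # A resolution matches if it shares at least one subject with the filter.
        if wanted is not None and not (wanted & res["subjects"]):
            continue
        kept.append(res)
    return kept


# Hypothetical records; the field names are made up for this example.
resolutions = [
    {"id": "R1", "date": date(2014, 6, 1), "subjects": {"s16"}},
    {"id": "R2", "date": date(2016, 3, 9), "subjects": {"s16"}},
    {"id": "R3", "date": date(2018, 1, 5), "subjects": {"s01"}},
]

recent = filter_resolutions(resolutions, start_date="2015-01-01", subject_ids=["s16"])
print([res["id"] for res in recent])
```

Leaving `end_date` unset keeps the range open-ended, as in the query above that only passes `start_date`.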
