# PolicyEngine Dataset classes

This notebook documents the `SingleYearDataset` and `MultiYearDataset` classes in PolicyEngine Data. These classes hold structured data for policy analysis and microsimulation.

Guidance on integrating these classes with PolicyEngine Core and the country-specific data packages will be added as the package develops.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
from tables import NaturalNameWarning

from policyengine_data.single_year_dataset import SingleYearDataset
from policyengine_data.multi_year_dataset import MultiYearDataset

## SingleYearDataset

The `SingleYearDataset` class holds data for a single year, organized by entity (typically "person" and "household", among others). Each entity is stored as a pandas DataFrame containing the variables relevant to that entity.

### Key features:
- Stores data for a single time period
- Organizes data by entities (person, household, etc.)
- All data in a given entity is combined into a single table
- Enforces data shape validation when the dataset is created, since each entity must fit into a single table
- Supports basic functionality from the legacy `Dataset` class, such as loading and saving, but drops the complexity of multiple data formats and loading to the cloud
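
The shape validation mentioned above follows largely from the table format itself: pandas refuses to build a DataFrame from columns of different lengths. A minimal sketch in plain pandas (it does not call `SingleYearDataset`, and the column values are illustrative):

```python
import pandas as pd

# Columns of equal length form a valid entity table.
person = pd.DataFrame({"person_id": [0, 1, 2], "age": [34, 51, 27]})
print(person.shape)

# Columns of unequal length cannot be combined into one table.
try:
    pd.DataFrame({"person_id": [0, 1, 2], "age": [34, 51]})
    mismatch_rejected = False
except ValueError as e:
    mismatch_rejected = True
    print(f"Rejected: {e}")
```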

### Creating a SingleYearDataset

There are three main ways to create a `SingleYearDataset`:

1. **From entity DataFrames**: Create directly from a dictionary of entity DataFrames
2. **From HDF5 file**: Load from an existing HDF5 file
3. **From simulation**: Create from a PolicyEngine Core microsimulation

#### Method 1: From entity DataFrames

In [2]:
# Create sample data for demonstration
np.random.seed(42)

# Person-level data
person_data = pd.DataFrame({
    'person_id': range(1000),
    'age': np.random.randint(18, 80, 1000),
    'income': np.random.normal(50000, 15000, 1000),
    'household_id': np.repeat(range(400), [3, 2, 3, 2] * 100)  # Varying household sizes
})

# Household-level data
household_data = pd.DataFrame({
    'household_id': range(400),
    'household_size': np.random.randint(1, 6, 400),
    'housing_cost': np.random.normal(1200, 300, 400),
    'state': np.random.choice(['CA', 'TX', 'NY', 'FL'], 400)
})

# Create entities dictionary
entities = {
    'person': person_data,
    'household': household_data
}

# Create SingleYearDataset
dataset_2023 = SingleYearDataset(
    entities=entities,
    time_period=2023
)

print(f"Dataset created for year: {dataset_2023.time_period}")
print(f"Available entities: {list(dataset_2023.entities.keys())}")
print(f"Person data shape: {dataset_2023.entities['person'].shape}")
print(f"Household data shape: {dataset_2023.entities['household'].shape}")

Dataset created for year: 2023
Available entities: ['person', 'household']
Person data shape: (1000, 4)
Household data shape: (400, 4)
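
Because the two tables are linked through `household_id`, it can help to check that link before building a dataset. The sketch below uses plain pandas on a small stand-in for the sample data; the tiny tables and the `actual_size` name are illustrative and not part of PolicyEngine Data. In the sample above, `household_size` is drawn at random, so it will generally not match the number of person rows per household.

```python
import pandas as pd

person = pd.DataFrame({
    "person_id": [0, 1, 2, 3, 4],
    "household_id": [0, 0, 0, 1, 1],
})
household = pd.DataFrame({"household_id": [0, 1]})

# Every person should point at a household that exists.
orphans = ~person["household_id"].isin(household["household_id"])
print(f"Persons without a household: {orphans.sum()}")

# Derive household size from the person table instead of sampling it.
actual_size = person.groupby("household_id").size()
household["household_size"] = household["household_id"].map(actual_size)
print(household)
```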


#### Method 2: Loading from HDF5 file

In [3]:
# Save the dataset to an HDF5 file
file_path = "sample_dataset_2023.h5"
dataset_2023.save(file_path)
print(f"Dataset saved to {file_path}")

# Load the dataset from the HDF5 file
loaded_dataset = SingleYearDataset(file_path=file_path)
print(f"Dataset loaded from file for year: {loaded_dataset.time_period}")
print(f"Loaded entities: {list(loaded_dataset.entities.keys())}")

# Verify the data is the same
print(f"Original person data shape: {dataset_2023.entities['person'].shape}")
print(f"Loaded person data shape: {loaded_dataset.entities['person'].shape}")
print(f"Data integrity check: {dataset_2023.entities['person'].equals(loaded_dataset.entities['person'])}")

Dataset saved to sample_dataset_2023.h5
Dataset loaded from file for year: 2023
Loaded entities: ['household', 'person']
Original person data shape: (1000, 4)
Loaded person data shape: (1000, 4)
Data integrity check: True
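
Note that the loaded entities come back in a different order (`['household', 'person']`) than they were created in, so comparisons are safer by entity name than by position. A small helper along these lines would do that; `entities_equal` is a hypothetical name, not part of PolicyEngine Data, and it takes plain dictionaries of DataFrames:

```python
import pandas as pd

def entities_equal(a: dict, b: dict) -> bool:
    """Compare two {entity_name: DataFrame} dicts, ignoring key order."""
    if set(a) != set(b):
        return False
    return all(a[name].equals(b[name]) for name in a)

original = {
    "person": pd.DataFrame({"person_id": [0, 1], "age": [30, 40]}),
    "household": pd.DataFrame({"household_id": [0]}),
}
# Same tables, inserted in a different order.
reloaded = {name: original[name].copy() for name in ["household", "person"]}

print(entities_equal(original, reloaded))
```

With the objects from this notebook, the call would presumably be `entities_equal(dataset_2023.entities, loaded_dataset.entities)`.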


#### Method 3: From a PolicyEngine Microsimulation

In [3]:
from policyengine_us import Microsimulation

start_year = 2023
dataset = "hf://policyengine/policyengine-us-data/cps_2023.h5"

sim = Microsimulation(dataset=dataset)

single_year_dataset = SingleYearDataset.from_simulation(sim, time_period=start_year)
single_year_dataset.time_period = start_year

print(f"Dataset created from PolicyEngine US microdata stored in {dataset}")
print(f"Dataset created for time period: {single_year_dataset.time_period}")

Dataset created from PolicyEngine US microdata stored in hf://policyengine/policyengine-us-data/cps_2023.h5
Dataset created for time period: 2023


### Main functionalities of SingleYearDataset

#### 1. Data access and properties

In [None]:
# Access entity data
print("Person entity columns:", dataset_2023.entities['person'].columns.tolist())
print("Household entity columns:", dataset_2023.entities['household'].columns.tolist())

# Get variables by entity
print("\nVariables by entity:")
variables = dataset_2023.variables
for entity, vars_list in variables.items():
    print(f"{entity}: {vars_list}")

# Access basic properties
print(f"\nTime period: {dataset_2023.time_period}")
print(f"Data format: {dataset_2023.data_format}")
print(f"Table names: {dataset_2023.table_names}")
print(f"Number of tables: {len(dataset_2023.tables)}")

Person entity columns: ['person_id', 'age', 'income', 'household_id']
Household entity columns: ['household_id', 'household_size', 'housing_cost', 'state']

Variables by entity:
person: ['person_id', 'age', 'income', 'household_id']
household: ['household_id', 'household_size', 'housing_cost', 'state']

Time period: 2023
Data format: arrays
Table names: ('person', 'household')
Number of tables: 2


Note that the `data_format` property will be removed once the legacy code built on the old `Dataset` classes is fully retired, since only entity tables stored as DataFrames will be supported.

#### 2. Data loading and copying

In [5]:
# Load data as a flat dictionary (useful for PolicyEngine Core)
loaded_data = dataset_2023.load()
print("Loaded data keys (first 10):", list(loaded_data.keys())[:10])
print("Sample variable 'age' shape:", loaded_data['age'].shape)

# Create a copy of the dataset
dataset_copy = dataset_2023.copy()
print(f"\nOriginal dataset time period: {dataset_2023.time_period}")
print(f"Copied dataset time period: {dataset_copy.time_period}")
print(f"Are they the same object? {dataset_2023 is dataset_copy}")
print(f"Do they have the same data? {dataset_2023.entities['person'].equals(dataset_copy.entities['person'])}")

Loaded data keys (first 10): ['person_id', 'age', 'income', 'household_id', 'household_size', 'housing_cost', 'state']
Sample variable 'age' shape: (1000,)

Original dataset time period: 2023
Copied dataset time period: 2023
Are they the same object? False
Do they have the same data? True
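
To make the flat structure concrete, here is a rough sketch of how entity tables can be flattened into a single dictionary of arrays. It is not the library's implementation. Note that `household_id` exists in both tables but appears only once in the keys above; in this sketch the later table simply overwrites the earlier one, which is an assumption rather than a description of what `load()` does.

```python
import pandas as pd

entities = {
    "person": pd.DataFrame({"person_id": [0, 1, 2], "household_id": [0, 0, 1]}),
    "household": pd.DataFrame({"household_id": [0, 1], "state": ["CA", "TX"]}),
}

# One entry per column name; a repeated name keeps its first position
# in the dict but takes the values of the last table that defines it.
flat = {}
for table in entities.values():
    for column in table.columns:
        flat[column] = table[column].to_numpy()

print(list(flat.keys()))
print(flat["person_id"].shape)
```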


#### 3. Data validation

In [6]:
# Validate the dataset (checks for NaN values)
try:
    dataset_2023.validate()
    print("Dataset validation passed - no NaN values found")
except ValueError as e:
    print(f"Validation failed: {e}")

# Create a dataset with NaN values to demonstrate validation
invalid_person_data = person_data.copy()
invalid_person_data.loc[0, 'income'] = np.nan

invalid_entities = {
    'person': invalid_person_data,
    'household': household_data
}

invalid_dataset = SingleYearDataset(
    entities=invalid_entities,
    time_period=2023
)

# Try to validate the invalid dataset
try:
    invalid_dataset.validate()
    print("Invalid dataset validation passed")
except ValueError as e:
    print(f"Validation correctly failed: {e}")

Dataset validation passed - no NaN values found
Validation correctly failed: Column 'income' contains NaN values.
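
The check performed by `validate()` can be approximated in plain pandas. The sketch below assumes that validation scans every column of every entity table for missing values; `check_no_nans` is a hypothetical helper, and its error message merely mirrors the one printed above.

```python
import numpy as np
import pandas as pd

def check_no_nans(entities: dict) -> None:
    """Raise ValueError on the first column that contains a NaN."""
    for table in entities.values():
        for column in table.columns:
            if table[column].isna().any():
                raise ValueError(f"Column '{column}' contains NaN values.")

tables = {"person": pd.DataFrame({"age": [30, 41], "income": [52000.0, np.nan]})}

try:
    check_no_nans(tables)
    message = None
except ValueError as e:
    message = str(e)

print(message)
```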


## MultiYearDataset

The `MultiYearDataset` class holds data across multiple years as a collection of `SingleYearDataset` instances. This keeps all the data needed for multi-year analysis in a single object, rather than requiring you to load and manage a separate `Dataset` object for each year.

### Key features:
- Stores multiple `SingleYearDataset` instances indexed by year
- Maintains consistency across years for entity structures
- Supports copying and data extraction across all years

### Creating a MultiYearDataset

There are two main ways to create a `MultiYearDataset`:

1. **From a list of SingleYearDatasets**: Create from existing SingleYearDataset instances
2. **From HDF5 file**: Load from an existing multi-year HDF5 file

#### Method 1: From SingleYearDataset list

In [8]:
# Create datasets for multiple years
datasets_by_year = []

for year in [2021, 2022, 2023, 2024]:
    # Create slightly different data for each year (e.g., income growth)
    year_person_data = person_data.copy()
    year_person_data['income'] = year_person_data['income'] * (1.03 ** (year - 2023))  # 3% annual growth
    
    year_household_data = household_data.copy()
    year_household_data['housing_cost'] = year_household_data['housing_cost'] * (1.05 ** (year - 2023))  # 5% annual growth
    
    year_entities = {
        'person': year_person_data,
        'household': year_household_data
    }
    
    year_dataset = SingleYearDataset(
        entities=year_entities,
        time_period=year
    )
    datasets_by_year.append(year_dataset)

# Create MultiYearDataset
multi_year_dataset = MultiYearDataset(datasets=datasets_by_year)

print(f"Multi-year dataset created with years: {sorted(multi_year_dataset.datasets.keys())}")
print(f"Earliest time period present: {multi_year_dataset.time_period}")
print(f"Data format: {multi_year_dataset.data_format}")

Multi-year dataset created with years: [2021, 2022, 2023, 2024]
Earliest time period present: 2021
Data format: time_period_arrays
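
The scaling in the loop above is anchored at 2023: the exponent `year - 2023` is negative for earlier years, so 2021 and 2022 values are deflated and 2024 values are inflated. The income factors can be computed directly:

```python
base_year = 2023
income_growth = 0.03  # 3% annual growth, as in the loop above

# Factor applied to 2023 income to obtain each year's income.
factors = {
    year: (1 + income_growth) ** (year - base_year)
    for year in [2021, 2022, 2023, 2024]
}

for year, factor in factors.items():
    print(f"{year}: {factor:.4f}")
```

For instance, the 2022 average income of $49201.98 reported further down is the 2023 average of $50678.04 divided by 1.03.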


#### Method 2: Saving to and loading from an HDF5 file

In [15]:
warnings.filterwarnings("ignore", category=NaturalNameWarning)

# Save the multi-year dataset to an HDF5 file
multi_year_file_path = "sample_multi_year_dataset.h5"
multi_year_dataset.save(multi_year_file_path)
print(f"Multi-year dataset saved to {multi_year_file_path}")

# Load the multi-year dataset from the HDF5 file
loaded_multi_year = MultiYearDataset(file_path=multi_year_file_path)
print(f"Multi-year dataset loaded with years: {sorted(loaded_multi_year.datasets.keys())}")

# Verify the data integrity
original_2022_income = multi_year_dataset[2022].entities['person']['income'].mean()
loaded_2022_income = loaded_multi_year[2022].entities['person']['income'].mean()

print(f"Original 2022 average income: ${original_2022_income:.2f}")
print(f"Loaded 2022 average income: ${loaded_2022_income:.2f}")
print(f"Data integrity check: {abs(original_2022_income - loaded_2022_income) < 0.01}")

Multi-year dataset saved to sample_multi_year_dataset.h5
Multi-year dataset loaded with years: [2021, 2022, 2023, 2024]
Original 2022 average income: $49201.98
Loaded 2022 average income: $49201.98
Data integrity check: True


### Main functionalities of MultiYearDataset

#### 1. Accessing data by year

In [16]:
# Access specific years using get_year() method
dataset_2022 = multi_year_dataset.get_year(2022)
print(f"2022 dataset time period: {dataset_2022.time_period}")
print(f"2022 person data shape: {dataset_2022.entities['person'].shape}")

# Access specific years using indexing operator []
dataset_2024 = multi_year_dataset[2024]
print(f"2024 dataset time period: {dataset_2024.time_period}")

# Try to access a year that doesn't exist
try:
    dataset_2025 = multi_year_dataset.get_year(2025)
except ValueError as e:
    print(f"Error accessing non-existent year: {e}")

# List all available years
print(f"Available years: {sorted(multi_year_dataset.datasets.keys())}")

2022 dataset time period: 2022
2022 person data shape: (1000, 4)
2024 dataset time period: 2024
Error accessing non-existent year: No dataset found for year 2025.
Available years: [2021, 2022, 2023, 2024]
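
The two access styles shown above are typically two spellings of the same lookup. A minimal sketch of how such a container could be written follows; the `YearIndexed` class is hypothetical and stores arbitrary objects, and it is not the `MultiYearDataset` implementation.

```python
class YearIndexed:
    """Toy container mapping years to per-year objects."""

    def __init__(self, items_by_year: dict):
        self.datasets = dict(items_by_year)

    def get_year(self, year: int):
        if year not in self.datasets:
            raise ValueError(f"No dataset found for year {year}.")
        return self.datasets[year]

    def __getitem__(self, year: int):
        # Indexing delegates to get_year, so both raise the same error.
        return self.get_year(year)

store = YearIndexed({2022: "data for 2022", 2024: "data for 2024"})
print(store[2022])

try:
    store[2025]
except ValueError as e:
    error_message = str(e)
    print(error_message)
```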


#### 2. Variables and data structure

In [17]:
# Get variables across all years
variables_by_year = multi_year_dataset.variables
print("Variables by year and entity:")
for year, entities in variables_by_year.items():
    print(f"\nYear {year}:")
    for entity, vars_list in entities.items():
        print(f"  {entity}: {vars_list}")

Variables by year and entity:

Year 2021:
  person: ['person_id', 'age', 'income', 'household_id']
  household: ['household_id', 'household_size', 'housing_cost', 'state']

Year 2022:
  person: ['person_id', 'age', 'income', 'household_id']
  household: ['household_id', 'household_size', 'housing_cost', 'state']

Year 2023:
  person: ['person_id', 'age', 'income', 'household_id']
  household: ['household_id', 'household_size', 'housing_cost', 'state']

Year 2024:
  person: ['person_id', 'age', 'income', 'household_id']
  household: ['household_id', 'household_size', 'housing_cost', 'state']


#### 3. Data loading and copying

In [18]:
# Load all data as a time-period indexed dictionary
all_data = multi_year_dataset.load()
print("Sample of loaded data structure:")
for var_name, year_data in list(all_data.items())[:2]:  # Show first 2 variables
    print(f"\nVariable '{var_name}':")
    for year, data_array in year_data.items():
        print(f"  Year {year}: shape {data_array.shape}, mean = {data_array.mean():.2f}")

# Create a copy of the multi-year dataset
multi_year_copy = multi_year_dataset.copy()
print(f"\nOriginal dataset years: {sorted(multi_year_dataset.datasets.keys())}")
print(f"Copied dataset years: {sorted(multi_year_copy.datasets.keys())}")
print(f"Are they the same object? {multi_year_dataset is multi_year_copy}")

# Verify that the copy holds the same data as the original
original_2023_income_mean = multi_year_dataset[2023].entities['person']['income'].mean()
copy_2023_income_mean = multi_year_copy[2023].entities['person']['income'].mean()
print(f"Original 2023 income mean: ${original_2023_income_mean:.2f}")
print(f"Copy 2023 income mean: ${copy_2023_income_mean:.2f}")
print(f"Data integrity check: {abs(original_2023_income_mean - copy_2023_income_mean) < 0.01}")

Sample of loaded data structure:

Variable 'person_id':
  Year 2021: shape (1000,), mean = 499.50
  Year 2022: shape (1000,), mean = 499.50
  Year 2023: shape (1000,), mean = 499.50
  Year 2024: shape (1000,), mean = 499.50

Variable 'age':
  Year 2021: shape (1000,), mean = 49.86
  Year 2022: shape (1000,), mean = 49.86
  Year 2023: shape (1000,), mean = 49.86
  Year 2024: shape (1000,), mean = 49.86

Original dataset years: [2021, 2022, 2023, 2024]
Copied dataset years: [2021, 2022, 2023, 2024]
Are they the same object? False
Original 2023 income mean: $50678.04
Copy 2023 income mean: $50678.04
Data integrity check: True
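
The nested layout returned by `load()` maps each variable name to a dictionary of per-year arrays. A rough sketch of building that shape from per-year tables, using plain pandas and made-up values (not the library's code):

```python
import pandas as pd

tables_by_year = {
    2023: pd.DataFrame({"age": [30, 40], "income": [50000.0, 60000.0]}),
    2024: pd.DataFrame({"age": [30, 40], "income": [51500.0, 61800.0]}),
}

# Pivot {year: table} into {variable: {year: array}}.
nested = {}
for year, table in tables_by_year.items():
    for column in table.columns:
        nested.setdefault(column, {})[year] = table[column].to_numpy()

for name, by_year in nested.items():
    for year, values in by_year.items():
        print(f"{name} {year}: shape {values.shape}, mean = {values.mean():.2f}")
```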


In [19]:
# Clean up temporary files
import os

temp_files = ["sample_dataset_2023.h5", "sample_multi_year_dataset.h5"]
for file in temp_files:
    if os.path.exists(file):
        os.remove(file)
        print(f"Cleaned up {file}")

Cleaned up sample_dataset_2023.h5
Cleaned up sample_multi_year_dataset.h5
