In [1]:
#! pip install tqdm

In [2]:
import numpy as np
import yaml  # or json
from pathlib import Path
import csv
import matplotlib.pyplot as plt
from multiprocessing import Pool, shared_memory
import os
from tqdm import tqdm 
import json




### **Synthetic Population Generation via Categorical Constraint Matching**

#### **Core Methodology**
This implementation transforms raw census constraints and individual microdata into optimized numerical representations for synthetic population generation, using a three-stage process:

1. **Constraint Processing**  
   - Ingests structured CSV constraints (e.g., `age.csv`, `sex.csv`)  
   - Generates:  
     - A **label list** (e.g., `['age%16-24', 'age%25-34', 'sex%m', 'sex%f']`)  
     - A **target matrix** of per-zone category proportions, normalized from the census counts (`np.array` shape: `[n_zones, n_categories]`)  

2. **Microdata Encoding**  
   - Converts individual records (`microdata.csv`) into a binary matrix where:  
     - Rows = Individuals  
     - Columns = Constraint categories  
     - Values = `1` (matches category) or `0` (no match/missing)  
   - *Example*: With six age bands followed by two sex categories, an individual with `age=25-34` and `sex=m` encodes as `[0,1,0,0,0,0,1,0]`  

3. **Memory-Efficient Design**  
   - Uses pure NumPy arrays (no Pandas) for:  
     - Zero-copy sharing in multiprocessing  
     - O(1) incremental updates during annealing  
   - Handles missing data implicitly via zero-padding  
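
To make the encoding in stage 2 concrete, here is a minimal sketch that rebuilds the example vector by hand. The label list and the `encode_record` helper are illustrative assumptions, not functions defined in this notebook.

```python
import numpy as np

# Hypothetical label list: six age bands followed by two sex categories.
labels = ["age%16-24", "age%25-34", "age%35-44", "age%45-54",
          "age%55-64", "age%65-74", "sex%m", "sex%f"]
label_to_idx = {label: i for i, label in enumerate(labels)}

def encode_record(record):
    """Return a 0/1 vector with one flag per matching constraint label."""
    vec = np.zeros(len(labels), dtype=np.int8)
    for column, value in record.items():
        idx = label_to_idx.get(f"{column}%{value}")
        if idx is not None:  # missing or unknown values stay 0
            vec[idx] = 1
    return vec

encoded = encode_record({"age": "25-34", "sex": "m"})
print(encoded.tolist())  # [0, 1, 0, 0, 0, 0, 1, 0]
```

A record with an empty or unrecognized value simply leaves the corresponding columns at zero, which is the "zero-padding" behaviour described in stage 3.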

#### **Key Innovations**  
- **Deterministic Labeling**: Human-readable category prefixes (`age%`, `sex%`) ensure traceability  
- **Compact Encoding**: Binary flags stored as `int8` take one byte per cell  
- **Annealing-Ready**: Optimized for rapid constraint violation checks during optimization  
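
Fast constraint checks of this kind usually depend on keeping a running vector of category counts and adjusting it when one individual is swapped for another, instead of re-summing the whole selection. The sketch below illustrates that idea under stated assumptions: `swap_individual` and the toy matrix are hypothetical, and the cost per swap depends on the number of categories rather than on the number of selected individuals.

```python
import numpy as np

# Toy encoded microdata: 4 individuals x 3 categories (hypothetical values).
encoded = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 0, 0],
                    [0, 1, 0]], dtype=np.int8)

selected = [0, 1]  # indices of individuals currently assigned to the zone
counts = encoded[selected].sum(axis=0).astype(np.int64)  # running category totals

def swap_individual(counts, selected, position, new_idx):
    """Replace selected[position] with new_idx and update counts in place."""
    old_idx = selected[position]
    counts -= encoded[old_idx]  # remove the outgoing individual's flags
    counts += encoded[new_idx]  # add the incoming individual's flags
    selected[position] = new_idx

swap_individual(counts, selected, position=0, new_idx=3)
print(counts.tolist())  # [0, 2, 1]
```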

#### **Technical Highlights**  
```python
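# Pseudocode of data flow (both functions return dicts)
constraints = build_constraint_arrays(config)  # From age/sex CSVs
constraint_labels = constraints["constraint_labels"]
constraint_targets = np.array(constraints["constraint_targets"])  # Proportions per zone
microdata_encoded = encode_microdata(config, constraint_labels)["microdata_encoded"]  # Binary matrix

# During annealing (calculate_error is not defined in this notebook):
current_error = calculate_error(microdata_encoded, constraint_targets)  # L1/Chi-squared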
```
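
`calculate_error` is not defined anywhere in this notebook. As one possible reading of the `L1` option named in the comment, the sketch below computes the total absolute difference between the category proportions of a selected group and one zone's target proportions. The signature and the toy numbers are assumptions.

```python
import numpy as np

def calculate_error(selected_rows, zone_targets):
    """L1 distance between observed category proportions and target proportions."""
    observed = selected_rows.sum(axis=0) / len(selected_rows)
    return float(np.abs(observed - zone_targets).sum())

# Toy data: 4 selected individuals, 2 categories (e.g. sex%m, sex%f).
selected_rows = np.array([[1, 0], [1, 0], [1, 0], [0, 1]], dtype=np.int8)
zone_targets = np.array([0.5, 0.5])

error = calculate_error(selected_rows, zone_targets)
print(error)  # 0.5
```

Here the group is 75% `m` and 25% `f` against a 50/50 target, so the two absolute differences of 0.25 add up to 0.5.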

---

### **Visualization of Data Flow**  
```mermaid
graph LR
    A[age.csv] --> C[Constraint Processor]
    B[sex.csv] --> C
    C --> D[Constraint Labels]
    C --> E[Target Matrix]
    F[microdata.csv] --> G[Microdata Encoder]
    D --> G
    G --> H[Binary Encoded Matrix]
```

---

### **Why This Works**  
- **Scalability**: At one byte per cell (`int8`), 100,000 individuals across 8 categories occupy about 0.8 MB  
- **Flexibility**: New constraints require only YAML updates (no code changes)  
- **Reproducibility**: Explicit category mapping avoids hidden assumptions  
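
The notebook imports `multiprocessing.shared_memory` but does not show how the encoded matrix would be shared between workers. A minimal sketch of the standard-library pattern follows: the matrix is copied once into a shared block, and any process that knows the block's name can view it as a NumPy array without a further copy. The array contents are placeholders, and the second handle is opened in the same process only to keep the example short.

```python
import numpy as np
from multiprocessing import shared_memory

# Placeholder encoded matrix (int8 keeps each flag to one byte).
encoded = np.array([[0, 1, 1, 0],
                    [1, 0, 0, 1]], dtype=np.int8)

# Producer side: allocate a shared block and copy the matrix into it once.
shm = shared_memory.SharedMemory(create=True, size=encoded.nbytes)
shared = np.ndarray(encoded.shape, dtype=encoded.dtype, buffer=shm.buf)
shared[:] = encoded

# Consumer side: a worker attaches by name; shape and dtype must be passed along.
worker_shm = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(encoded.shape, dtype=encoded.dtype, buffer=worker_shm.buf)
view_copy = view.copy()           # copied only so it can be inspected after cleanup
shared[0, 0] = 1                  # a write through one handle...
seen_by_worker = int(view[0, 0])  # ...is visible through the other

# Drop the array views before closing, then free the block exactly once.
del view, shared
worker_shm.close()
shm.close()
shm.unlink()
print(seen_by_worker)  # 1
```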



## Example YAML Configuration

```yaml
# Required microdata source
microdata:
  file: "data/microdata.csv"  # Path to individual records
  id_column: "ID"            # Unique identifier column (read by encode_microdata)

# List of constraint definitions
constraints:
  # Age distribution constraints
  - file: "data/age.csv"             # Census data file
    microdata_id: "Age"              # Matching column in microdata
    constraint_prefix: "Age%"        # Label prefix for categories
    geography_column: "GEO_CODE"    # Zone identifier column
    set_as_population_total: true    # Row sums of this file give zone populations

  # Sex distribution constraints  
  - file: "data/sex.csv"
    microdata_id: "Sex"
    constraint_prefix: "Sex%"
    geography_column: "GEO_CODE"
    set_as_population_total: false
```

In [3]:
def load_config(config_path):
    """Load a YAML or JSON config file and validate its top-level structure."""
    config_path = Path(config_path)
    with open(config_path) as f:
        # Accept both common YAML extensions; anything else is parsed as JSON
        if config_path.suffix.lower() in ('.yaml', '.yml'):
            config = yaml.safe_load(f)
        else:
            config = json.load(f)  # json is imported at the top of the notebook

    # Validate config structure
    assert "microdata" in config, "Config missing 'microdata' section"
    assert "constraints" in config and len(config["constraints"]) > 0, "No constraints defined"
    return config

## `load_config(config_path)`

**Purpose**:  
Loads and validates a configuration file (YAML or JSON) that defines the microdata and constraints structure.

**Inputs**:
- `config_path` (str or Path): Path to the configuration file (`.yaml` or `.json`)

**Returns**:
- `dict`: Parsed configuration with keys `'microdata'` and `'constraints'`

**Key Features**:
- Automatically detects file format (YAML/JSON) from extension
- Validates presence of required sections:
  - `'microdata'`: File path for microdata CSV
  - `'constraints'`: List of constraint definitions
- Raises `AssertionError` if structure is invalid
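
`load_config` checks only the two top-level sections, while later cells index into each constraint with keys such as `file` and `geography_column`. The sketch below shows one way to check those keys up front. `REQUIRED_CONSTRAINT_KEYS` and `find_missing_keys` are hypothetical names, and the key list is inferred from how the functions in this notebook read the config.

```python
REQUIRED_CONSTRAINT_KEYS = ("file", "microdata_id", "constraint_prefix", "geography_column")

def find_missing_keys(config):
    """Return a list of (constraint_index, missing_key) pairs; empty means OK."""
    problems = []
    for i, constraint in enumerate(config.get("constraints", [])):
        for key in REQUIRED_CONSTRAINT_KEYS:
            if key not in constraint:
                problems.append((i, key))
    return problems

config = {
    "microdata": {"file": "data/microdata.csv", "id_column": "ID"},
    "constraints": [
        {"file": "data/age.csv", "microdata_id": "Age",
         "constraint_prefix": "Age%", "geography_column": "GEO_CODE"},
        {"file": "data/sex.csv", "microdata_id": "Sex"},  # two keys missing
    ],
}
problems = find_missing_keys(config)
print(problems)  # [(1, 'constraint_prefix'), (1, 'geography_column')]
```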

**Example Config**:
```yaml
microdata:
  file: "data/microdata.csv"
  id_column: "ID"
constraints:
  - file: "data/age.csv"
    microdata_id: "age"
    constraint_prefix: "age%"
    geography_column: "GEO_CODE"
    set_as_population_total: true
```

In [4]:
config = load_config('testdata/config.yaml')
print(config)

{'microdata': {'file': 'testdata/microdata.csv', 'id_column': 'ID'}, 'constraints': [{'file': 'testdata/age.csv', 'microdata_id': 'Age', 'constraint_prefix': 'Age%', 'geography_column': 'GEO_CODE', 'set_as_population_total': True}, {'file': 'testdata/sex.csv', 'microdata_id': 'Sex', 'constraint_prefix': 'Sex%', 'geography_column': 'GEO_CODE', 'set_as_population_total': False}]}


In [None]:
def build_constraint_arrays(config):
    """
    Read each constraint CSV and convert its counts to per-zone proportions.

    The constraint flagged with set_as_population_total supplies the
    population size of each zone (the sum of its category counts).

    Returns:
        {
            "constraint_labels": List[str],
            "constraint_targets": List[List[float]],  # proportions per zone
            "geography_codes": List[str],
            "population_totals": List[float],
            "table": List[list]  # header row followed by one row per zone
        }
    """
    constraint_labels = []
    constraint_targets = None
    geography_codes = []
    population_totals = []

    for constraint in config["constraints"]:
        with open(constraint["file"], mode='r', newline='') as f:
            reader = csv.reader(f)
            headers = next(reader)
            data = list(reader)

        # A missing flag is treated as False
        poptotal_constraint = constraint.get("set_as_population_total", False)
        print(poptotal_constraint)

        geo_col = constraint["geography_column"]
        geo_idx = headers.index(geo_col)

        # Store GEOIDs on first pass
        if not geography_codes:
            geography_codes = [row[geo_idx] for row in data]

        # Process categories: every column except the geography column
        categories = [h for i, h in enumerate(headers) if i != geo_idx]
        prefix = constraint["constraint_prefix"]
        constraint_labels.extend(f"{prefix}{cat}" for cat in categories)

        # Extract targets as proportions of each zone's total
        target_rows = []
        for row in data:
            target_values = [float(row[i]) for i in range(len(headers)) if i != geo_idx]
            total_population = sum(target_values)
            if poptotal_constraint:
                population_totals.append(total_population)
            # Guard against empty zones to avoid division by zero
            if total_population > 0:
                target_values = [v / total_population for v in target_values]
            target_rows.append(target_values)

        targets = np.array(target_rows)
        constraint_targets = targets if constraint_targets is None else np.hstack([constraint_targets, targets])

    # The table below needs exactly one population total per zone
    assert len(population_totals) == len(geography_codes), \
        "Exactly one constraint must set set_as_population_total: true"

    header = ['geography_code', 'population_total'] + constraint_labels
    print(header)
    table = [[code, total] + row.tolist()
             for code, total, row in zip(geography_codes, population_totals, constraint_targets)]
    table.insert(0, header)
    return {
        "constraint_labels": constraint_labels,
        "constraint_targets": constraint_targets.tolist(),
        "geography_codes": geography_codes,
        "population_totals": population_totals,
        "table": table
    }

In [6]:
constraints_dict = build_constraint_arrays(config)

True
False
['geography_code', 'population_total', 'Age%16-24', 'Age%25-34', 'Age%35-44', 'Age%45-54', 'Age%55-64', 'Age%65-74', 'Sex%m', 'Sex%f']


In [7]:
constraints_dict['table']

[['geography_code',
  'population_total',
  'Age%16-24',
  'Age%25-34',
  'Age%35-44',
  'Age%45-54',
  'Age%55-64',
  'Age%65-74',
  'Sex%m',
  'Sex%f'],
 ['E05001341',
  11345.0,
  0.11899515204936095,
  0.13494931687968267,
  0.18386954605553107,
  0.21533715293080652,
  0.20308505949757602,
  0.14376377258704276,
  0.49114147201410313,
  0.5088585279858969],
 ['E05001342',
  13422.0,
  0.1280733124720608,
  0.1805245119952317,
  0.18417523468931604,
  0.2021308299806288,
  0.18305766651765756,
  0.12203844434510505,
  0.4979138727462375,
  0.5020861272537624],
 ['E05001343',
  13132.0,
  0.12899786780383796,
  0.13904964971063052,
  0.18885166006701187,
  0.21893085592445932,
  0.18626256472738348,
  0.13790740176667682,
  0.48522692659153216,
  0.5147730734084679],
 ['E05001344',
  11466.0,
  0.1741671027385313,
  0.19945927088784232,
  0.19344147915576487,
  0.19623233908948196,
  0.14346764346764346,
  0.09323216466073608,
  0.49293563579277866,
  0.5070643642072213],
 ['E050013

In [8]:
for key, value in constraints_dict.items():
    print(f"Key: {key:>10} Type: {type(value).__name__}")

Key: constraint_labels Type: list
Key: constraint_targets Type: list
Key: geography_codes Type: list
Key: population_totals Type: list
Key:      table Type: list


In [9]:
# Persist the constraint arrays as JSON for later steps
with open('constraints.json', 'w') as f:
    json.dump(constraints_dict, f)

## `build_constraint_arrays(config)`

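**Purpose**:  
Transforms constraint CSV files (like age/sex distributions) into labeled, per-zone target proportions for synthetic population generation.

### Inputs
- `config` (dict): Configuration dictionary containing:
  - `constraints`: List of constraint definitions (file path, prefix, geography column, and the `set_as_population_total` flag)

### Returns
A `dict` with the following keys:
- `constraint_labels` (List[str]): Formatted category labels  
  Example: `["age%16-24", "age%25-34", "sex%m", "sex%f"]`
- `constraint_targets` (List[List[float]]): Category proportions per zone, normalized within each constraint  
  Shape: `(n_geographies, n_constraints)`
- `geography_codes` (List[str]): Zone identifiers, in file order
- `population_totals` (List[float]): Zone population sizes, taken from the constraint flagged `set_as_population_total`
- `table` (List[list]): Header row followed by one row per zone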

### Key Features
-  **File Processing**:
  - Reads CSV files without Pandas (vanilla Python `csv` module)
  - Treats the first row as headers and excludes the geography column from the categories
-  **Smart Labeling**:
  - Combines constraint prefixes with category names  
    (e.g., `"age%" + "25-34" → "age%25-34"`)
-  **Array Construction**:
  - Builds a consolidated NumPy array by horizontally stacking constraints
  - Automatically converts string values to floats and divides each row by its total
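
As a worked illustration of the array construction, the sketch below turns two small count tables into row-wise proportions and stacks them side by side, mirroring what the function does for each constraint file. The counts and the `to_proportions` helper are made up for illustration.

```python
import numpy as np

# Made-up counts for 2 zones: three age bands, then two sex categories.
age_counts = np.array([[20.0, 30.0, 50.0],
                       [10.0, 10.0, 20.0]])
sex_counts = np.array([[40.0, 60.0],
                       [25.0, 15.0]])

def to_proportions(counts):
    """Divide each row by its total so that every row sums to 1."""
    return counts / counts.sum(axis=1, keepdims=True)

# Each constraint is normalized on its own, then stacked column-wise.
targets = np.hstack([to_proportions(age_counts), to_proportions(sex_counts)])
population_totals = age_counts.sum(axis=1)  # totals come from one chosen constraint

print(targets.shape)               # (2, 5)
print(population_totals.tolist())  # [100.0, 40.0]
```

Because each constraint is normalized separately, every block of columns sums to 1 per zone, so a full row of `targets` sums to the number of constraints rather than to 1.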

### Example Workflow
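```python
config = {
    "constraints": [
        {
            "file": "data/age.csv",
            "constraint_prefix": "age%",
            "geography_column": "GEO_CODE",
            "set_as_population_total": True
        }
    ]
}
result = build_constraint_arrays(config)
labels, targets = result["constraint_labels"], result["constraint_targets"]
```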

#### **Key Points**
- **`np.hstack`**: Short for "horizontal stack," it concatenates arrays column-wise.  
  Example:
  ```python
  import numpy as np
  a = np.array([[1, 2], [3, 4]])
  b = np.array([[5], [6]])
  np.hstack([a, b])  # Result: [[1, 2, 5], [3, 4, 6]]
  ```

In [10]:
def encode_microdata(config, constraint_labels):
    """
    Encode microdata into a one-hot-like array where missing values are 0.
    Returns a dict with:
        "microdata_encoded": np.array of shape (n_individuals, n_constraints)
        "ids": list of IDs read from the configured id_column
        "table": header row followed by one [id, flags...] row per individual
    """
    # Step 1: Load microdata from CSV
    with open(config["microdata"]["file"], mode='r') as f:
        reader = csv.DictReader(f)  # Reads header and rows as dictionaries
        microdata = list(reader)    # Convert to list of dicts

    n_individuals = len(microdata)
    n_constraints = len(constraint_labels)

    # Step 2: Create label-to-index mapping
    label_to_idx = {label: idx for idx, label in enumerate(constraint_labels)}

    # Step 3: Initialize output array (all zeros)
    microdata_encoded = np.zeros((n_individuals, n_constraints), dtype=np.int8)

    # Step 4: Extract IDs
    ids = [row[config["microdata"]["id_column"]] for row in microdata]

    # Step 5: Encode each constraint
    for constraint in config["constraints"]:
        col = constraint["microdata_id"]
        prefix = constraint["constraint_prefix"]

        for row_idx, row in enumerate(microdata):
            value = row.get(col)  # Get value for the current constraint column

            # Skip missing values (leave as 0)
            if value is not None and value.strip() != '':  # Check for non-empty strings
                label = f"{prefix}{value}"
                if label in label_to_idx:  # Ensure label exists in constraints
                    microdata_encoded[row_idx, label_to_idx[label]] = 1
    header = ['id']+constraint_labels
    table = [[ids[i]] + microdata_encoded[i].tolist() for i in range(len(microdata_encoded))]
    table.insert(0,header)
    return {
        "microdata_encoded":microdata_encoded, 
        "ids":ids,
        "table":  table
    }

In [11]:
microdata_dict = encode_microdata(config,constraints_dict["constraint_labels"])

In [12]:
microdata_dict['table']

[['id',
  'Age%16-24',
  'Age%25-34',
  'Age%35-44',
  'Age%45-54',
  'Age%55-64',
  'Age%65-74',
  'Sex%m',
  'Sex%f'],
 ['1', 0, 1, 0, 0, 0, 0, 1, 0],
 ['2', 0, 0, 0, 0, 1, 0, 0, 1],
 ['3', 0, 0, 0, 1, 0, 0, 0, 1],
 ['4', 0, 0, 0, 1, 0, 0, 1, 0],
 ['5', 0, 0, 0, 1, 0, 0, 1, 0],
 ['6', 0, 0, 0, 1, 0, 0, 0, 1],
 ['7', 1, 0, 0, 0, 0, 0, 0, 1],
 ['8', 0, 0, 0, 0, 0, 1, 1, 0],
 ['9', 0, 0, 1, 0, 0, 0, 0, 1],
 ['10', 0, 1, 0, 0, 0, 0, 1, 0],
 ['11', 1, 0, 0, 0, 0, 0, 0, 1],
 ['12', 1, 0, 0, 0, 0, 0, 0, 1],
 ['13', 0, 0, 0, 0, 0, 1, 0, 1],
 ['14', 0, 0, 0, 0, 1, 0, 0, 1],
 ['15', 0, 1, 0, 0, 0, 0, 1, 0],
 ['16', 0, 0, 0, 0, 0, 1, 0, 1],
 ['17', 0, 1, 0, 0, 0, 0, 0, 1],
 ['18', 0, 0, 0, 0, 1, 0, 0, 1],
 ['19', 0, 0, 0, 0, 1, 0, 1, 0],
 ['20', 0, 0, 0, 1, 0, 0, 1, 0],
 ['21', 0, 0, 0, 0, 0, 1, 1, 0],
 ['22', 0, 1, 0, 0, 0, 0, 1, 0],
 ['23', 0, 0, 0, 0, 1, 0, 1, 0],
 ['24', 0, 0, 0, 0, 0, 1, 1, 0],
 ['25', 0, 0, 0, 0, 1, 0, 0, 1],
 ['26', 0, 0, 0, 0, 0, 1, 1, 0],
 ['27', 0, 0, 0, 0, 0, 1, 1, 0

##  `encode_microdata(config, constraint_labels)`

**Purpose**:  
Converts individual microdata records into a binary matrix matching census constraints, with automatic handling of missing values.

### Inputs
- `config` (dict): Configuration dictionary with the microdata file path, its `id_column`, and the constraint definitions
- `constraint_labels` (List[str]): Pre-generated labels from `build_constraint_arrays()`

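### Returns
A `dict` with three keys:
- `microdata_encoded` (np.array): Binary matrix where:
  - Rows = Individuals
  - Columns = Constraint categories
  - Values = `1` (present) or `0` (missing/not applicable)
- `ids` (list): Values of the ID column, in row order
- `table` (list): Header row followed by one `[id, flags...]` row per individual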

### Key Features
-   **Smart Encoding**:
  - Converts categorical values (e.g., `"m"`, `"25-34"`) to binary flags
  - Preserves relationships between original data and constraint categories
-   **Missing Data Handling**:
  - Empty/missing values remain `0` by default
  - Silent skipping of undefined categories
-   **Efficient Construction**:
  - Pre-allocates NumPy array for performance
  - Uses memory-efficient `int8` dtype

### Example Transformation
**Input Microdata**:
```csv
ID,sex,age
1,m,25-34
2,f,55-64
3,,45-54
```

In [17]:
def to_file(file_path, data):
    """Write a list of rows (header row first) to a CSV file."""
    # newline='' lets the csv module control line endings
    with open(file_path, mode='w', newline='') as file:
        writer = csv.writer(file)

        # Write the data to the CSV file
        writer.writerows(data)

In [18]:
to_file('testdata/microdata_encoded.csv', microdata_dict["table"])
to_file('testdata/constraint_targets.csv', constraints_dict["table"])