In [None]:
import numpy as np
import pandas as pd

# Creating a Synthetic Population Using Iterative Proportional Fitting (IPF)

Iterative Proportional Fitting (IPF) is a common technique in demographic modeling and spatial analysis for building synthetic populations. It reweights a base population so that its totals match given marginal totals for several demographic characteristics (e.g., age, gender, income). The steps below outline the process:

## Step 1: Define the Base Population

Start with a base population that includes individual records with various demographic attributes. This base population can be generated randomly or based on existing data.
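
As one illustration of a randomly generated base population, the sketch below draws records with NumPy. The category labels, the record count, and the helper name `make_base_population` are assumptions chosen for this example; a real base population would more likely come from survey microdata.

```python
import numpy as np
import pandas as pd

def make_base_population(n_records, seed=0):
    """Draw a random base population with a unit weight per record (illustrative only)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'age': rng.choice([20, 30, 40, 50, 60], size=n_records),
        'gender': rng.choice(['male', 'female'], size=n_records),
        'income': rng.choice(['low', 'medium', 'high'], size=n_records),
        'weight': np.ones(n_records),  # every record starts with weight 1.0
    })

base = make_base_population(100)
print(base.head())
```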

## Step 2: Define Marginal Totals

Specify the marginal totals for each demographic characteristic that you want the synthetic population to match. These marginal totals are usually derived from census data or other reliable sources.
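
Marginal totals can be held as one pandas Series per characteristic. The sketch below also checks that every set of totals describes the same population size, since IPF cannot match all marginals at once when they disagree. The variable names `marginals`, `sizes`, and `consistent` are illustrative.

```python
import pandas as pd

marginals = {
    'gender': pd.Series({'male': 3, 'female': 2}),
    'income': pd.Series({'low': 2, 'medium': 1, 'high': 2}),
}

# All marginals must add up to the same population size to be satisfiable together
sizes = {name: int(totals.sum()) for name, totals in marginals.items()}
consistent = len(set(sizes.values())) == 1
print(sizes, consistent)
```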

## Step 3: Initialize the Population

Initialize the synthetic population as a copy of the base population and give every record a starting weight (for example, 1.0). IPF adjusts these weights iteratively to match the marginal totals; the records themselves do not change.

## Step 4: Iterative Proportional Fitting

Run the IPF algorithm: cycle through the demographic characteristics, rescaling the weights so that each characteristic's weighted totals match its marginal totals, and repeat the cycle until all marginals are matched within a chosen tolerance.


## Step 5: Validate the Results

After running the IPF algorithm, validate the synthetic population by computing its weighted total for every category of every characteristic and comparing each one to the specified marginal total. The largest remaining difference should be within your tolerance.
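
A minimal sketch of such a check follows. It assumes the population is a DataFrame with a `weight` column and that the targets are pandas Series keyed by category; the helper name `max_marginal_error` is hypothetical.

```python
import pandas as pd

def max_marginal_error(population, marginal_totals):
    """Largest absolute gap between weighted and target totals over all categories."""
    worst = 0.0
    for char, targets in marginal_totals.items():
        achieved = (
            population.groupby(char)['weight'].sum()
            .reindex(targets.index, fill_value=0.0)
        )
        worst = max(worst, float((achieved - targets).abs().max()))
    return worst

population = pd.DataFrame({
    'gender': ['male', 'female', 'male'],
    'weight': [1.5, 2.0, 1.5],
})
targets = {'gender': pd.Series({'male': 3, 'female': 2})}
print(max_marginal_error(population, targets))  # 0.0: both categories match
```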

## Step 6: Use the Synthetic Population

Once validated, the synthetic population can be used for various analyses, such as simulating demographic changes, spatial analysis, or policy impact assessments.

### Notes:
- The base population should be large enough to capture the diversity of the target population.
- The marginal totals should be accurate and reliable.
- The IPF algorithm may require multiple iterations to converge, and the number of iterations can be adjusted based on the desired level of accuracy.
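- The weights indicate how many individuals of the target population each base record represents. Dividing each weight by the total weight gives the probability of drawing that record when sampling individuals for the synthetic population.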
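
The last note can be illustrated by sampling individuals in proportion to their weights. This is only one possible way to turn weights into whole individuals; the data, the sample size of 1,000, and the seed are assumptions for the example.

```python
import numpy as np
import pandas as pd

weighted = pd.DataFrame({
    'age': [20, 30, 40],
    'weight': [2.0, 1.0, 1.0],
})

# Convert weights to selection probabilities that sum to 1
probabilities = weighted['weight'] / weighted['weight'].sum()

# Draw 1,000 row positions with replacement, proportional to weight
rng = np.random.default_rng(42)
rows = rng.choice(len(weighted), size=1000, p=probabilities.to_numpy())
synthetic_individuals = weighted.iloc[rows].drop(columns='weight').reset_index(drop=True)
print(synthetic_individuals['age'].value_counts())
```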



### **Iterative Adjustment**

The IPF algorithm works by iteratively adjusting the population to match the marginal totals for each characteristic. The process involves the following steps:

#### a. **Select a Characteristic**

Choose one demographic characteristic to adjust. For example, you might start with 'age'.

#### b. **Calculate Current Marginal Totals**

Calculate the current marginal totals for the selected characteristic in the synthetic population by summing the weights of the individuals in each category. While every weight is still 1.0, this is the same as counting the individuals in each category.

#### c. **Calculate Adjustment Factors**

Compare the current marginal totals to the specified marginal totals. Calculate adjustment factors for each category of the characteristic. The adjustment factor for a category is the ratio of the specified marginal total to the current marginal total for that category.
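
As a small worked example with made-up totals for `gender`, the adjustment factors are an element-wise ratio of two aligned Series:

```python
import pandas as pd

current = pd.Series({'male': 4.0, 'female': 1.0})  # totals in the synthetic population
target = pd.Series({'male': 3.0, 'female': 2.0})   # specified marginal totals

# pandas aligns the two Series on their index before dividing
factors = target / current
print(factors.to_dict())  # {'male': 0.75, 'female': 2.0}
```

A factor below 1 shrinks the weights of an over-represented category, and a factor above 1 grows the weights of an under-represented one.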

#### d. **Adjust the Population**

Multiply the weights of individuals in each category by the corresponding adjustment factor. This step adjusts the population to better match the specified marginal totals for the selected characteristic.

#### e. **Normalize Weights**

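After adjusting the weights, check that the total weight still equals the intended population size. If every set of marginal totals sums to that size, the adjustment in the previous step already guarantees this and no rescaling is needed. Otherwise, multiply all weights by one common factor to restore the total. Rescaling the weights to sum to 1 turns them into proportions, so the weighted totals then no longer equal the specified counts.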
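
A sketch of such a rescaling, with hypothetical weights and a hypothetical target population size of 5:

```python
import pandas as pd

weights = pd.Series([1.0, 2.0, 1.0])
target_total = 5.0  # intended population size (assumed for this example)

# One common factor restores the total without changing relative weights
normalized = weights * (target_total / weights.sum())
print(normalized.tolist())  # [1.25, 2.5, 1.25]
```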

#### f. **Repeat for All Characteristics**

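Repeat the above steps for all demographic characteristics; one such pass is a single iteration. Because adjusting one characteristic disturbs the totals of the others, keep making passes until every characteristic matches its marginal totals within the tolerance or the maximum number of iterations is reached. The order in which characteristics are adjusted can affect how quickly the algorithm converges, but it typically converges regardless of the order, provided the marginal totals are mutually consistent and the base population contains records in every category that has a nonzero target.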
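
The loop described above can be sketched on a two-way table of weights, where rows and columns are the two characteristics. This is a simplified illustration rather than the record-level function used in this notebook; the name `ipf_table` is hypothetical and the sketch assumes a seed table with strictly positive entries.

```python
import numpy as np

def ipf_table(seed, row_targets, col_targets, max_iterations=100, tolerance=1e-9):
    """Scale a 2-D table of weights until its row and column sums match the targets."""
    table = np.asarray(seed, dtype=float).copy()
    for _ in range(max_iterations):
        table *= (row_targets / table.sum(axis=1))[:, None]  # fit the row characteristic
        table *= (col_targets / table.sum(axis=0))[None, :]  # fit the column characteristic
        # Columns are exact after the last step, so only rows need checking
        if np.abs(table.sum(axis=1) - row_targets).max() < tolerance:
            break
    return table

seed = np.ones((2, 3))                   # rows: gender, columns: income
row_targets = np.array([3.0, 2.0])       # male, female
col_targets = np.array([2.0, 1.0, 2.0])  # low, medium, high
fitted = ipf_table(seed, row_targets, col_targets)
print(fitted.round(3))
```

With a uniform seed, the fitted cells come out proportional to the product of their row and column targets, so this particular example converges after a single pass.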



In [1]:
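def iterative_proportional_fitting(base_population, marginal_totals, max_iterations=1000, tolerance=1e-6):
    # Convert base population to a DataFrame
    df = pd.DataFrame(base_population)

    # Initialize the synthetic population; default to unit weights if none were supplied
    synthetic_population = df.copy()
    if 'weight' not in synthetic_population.columns:
        synthetic_population['weight'] = 1.0
    synthetic_population['weight'] = synthetic_population['weight'].astype(float)

    for _ in range(max_iterations):
        max_error = 0.0

        # Iterate over each demographic characteristic
        for char, targets in marginal_totals.items():
            # Calculate the current marginal totals as weighted sums per category
            current_totals = (
                synthetic_population.groupby(char)['weight'].sum()
                .reindex(targets.index, fill_value=0.0)
            )
            max_error = max(max_error, (targets - current_totals).abs().max())

            # Calculate the adjustment factors; categories with zero current
            # weight cannot be rescaled, so they keep a factor of 1
            adjustment_factors = (targets / current_totals.where(current_totals > 0)).fillna(1.0)

            # Adjust the weights; records in categories without a target are left unchanged
            synthetic_population['weight'] *= synthetic_population[char].map(adjustment_factors).fillna(1.0)

        # No separate normalization: with consistent marginal totals, the
        # adjustment already sets the total weight to the population size.
        # Stop once every marginal was within tolerance before its adjustment.
        if max_error < tolerance:
            break

    return synthetic_population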



```python
current_totals = synthetic_population[char].value_counts().reindex(marginal_totals[char].index, fill_value=0)
```

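This expression calculates the current marginal totals for a specific demographic characteristic (`char`) by counting rows in the synthetic population. Note that `.value_counts()` ignores the `weight` column, so the counts equal the weighted totals only while every weight is 1.0; a weighted variant sums `weight` within each category instead and keeps the same `.reindex()` step. Here's a detailed explanation of each part: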

### `synthetic_population[char]`

- **Purpose**: Selects the column corresponding to the demographic characteristic `char` from the `synthetic_population` DataFrame.
- **Example**: If `char` is `'age'`, this will select the `'age'` column from the `synthetic_population` DataFrame.

### `.value_counts()`

- **Purpose**: Counts the frequency of each unique value in the selected column.
- **Example**: If the `'age'` column has values `[20, 30, 40, 50, 60]`, `.value_counts()` will return a Series with counts for each age, e.g., `20: 1, 30: 1, 40: 1, 50: 1, 60: 1`.

### `.reindex(marginal_totals[char].index, fill_value=0)`

- **Purpose**: Reindexes the resulting Series to match the index of the `marginal_totals` for the same characteristic, filling any missing values with `0`.
- **Example**: If `marginal_totals['age']` has an index `[20, 30, 40, 50, 60, 70]` and the `value_counts()` result only has indices `[20, 30, 40, 50, 60]`, `.reindex()` will ensure the resulting Series has indices `[20, 30, 40, 50, 60, 70]`, with the count for `70` being `0`.
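
A quick way to see the effect of `.reindex()` is to run it on a small Series. The ages below are made up for the illustration:

```python
import pandas as pd

ages = pd.Series([20, 30, 30, 50])
counts = ages.value_counts()  # counts only the ages that actually occur

# Align to the target categories; 40 is absent, so it is filled with 0
aligned = counts.reindex([20, 30, 40, 50], fill_value=0)
print(aligned.to_dict())  # {20: 1, 30: 2, 40: 0, 50: 1}
```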

### Putting It All Together

- **Step 1**: Select the column for the demographic characteristic `char` from the `synthetic_population` DataFrame.
- **Step 2**: Count the frequency of each unique value in that column.
- **Step 3**: Reindex the resulting Series to match the index of the `marginal_totals` for the same characteristic, filling any missing values with `0`.

### Example

Suppose `synthetic_population` looks like this:

| age | gender | income | weight |
|-----|--------|--------|--------|
| 20  | male   | low    | 1.0    |
| 30  | female | high   | 1.0    |
| 40  | male   | medium | 1.0    |
| 50  | female | low    | 1.0    |
| 60  | male   | high   | 1.0    |

And `marginal_totals` for `'age'` is:

```python
pd.Series({20: 1, 30: 1, 40: 1, 50: 1, 60: 1, 70: 0})
```

- **Step 1**: `synthetic_population['age']` selects the `'age'` column: `[20, 30, 40, 50, 60]`.
- **Step 2**: `.value_counts()` results in `pd.Series({20: 1, 30: 1, 40: 1, 50: 1, 60: 1})`.
- **Step 3**: `.reindex(marginal_totals['age'].index, fill_value=0)` ensures the resulting Series has indices `[20, 30, 40, 50, 60, 70]`, with the count for `70` being `0`.

The final `current_totals` will be:

```python
pd.Series({20: 1, 30: 1, 40: 1, 50: 1, 60: 1, 70: 0})
```

This ensures that the current marginal totals match the structure of the `marginal_totals` for the given characteristic, making it easier to compare and adjust the synthetic population iteratively.
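
Once the weights are no longer all 1.0, row counts and weighted totals differ. A weighted counterpart of the expression above could look like the following sketch, which uses made-up data and the illustrative names `population` and `target_index`:

```python
import pandas as pd

population = pd.DataFrame({
    'age': [20, 30, 30],
    'weight': [0.5, 1.5, 2.0],
})
target_index = pd.Index([20, 30, 40])

# Sum the weights per category, then align to the target categories
weighted_totals = (
    population.groupby('age')['weight'].sum()
    .reindex(target_index, fill_value=0.0)
)
print(weighted_totals.to_dict())  # {20: 0.5, 30: 3.5, 40: 0.0}
```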

In [None]:




# Example usage
base_population = [
    {'age': 20, 'gender': 'male', 'income': 'low'},
    {'age': 30, 'gender': 'female', 'income': 'high'},
    {'age': 40, 'gender': 'male', 'income': 'medium'},
    {'age': 50, 'gender': 'female', 'income': 'low'},
    {'age': 60, 'gender': 'male', 'income': 'high'}
]

# Define marginal totals
marginal_totals = {
    'age': pd.Series({20: 1, 30: 1, 40: 1, 50: 1, 60: 1}),
    'gender': pd.Series({'male': 3, 'female': 2}),
    'income': pd.Series({'low': 2, 'medium': 1, 'high': 2})
}

# Initialize weights
for record in base_population:
    record['weight'] = 1.0

# Perform IPF
synthetic_population = iterative_proportional_fitting(base_population, marginal_totals)

print(synthetic_population)
