# Data

PolicyEngine-UK uses a combination of data sources to maximise the accuracy of the microsimulation model. Like most other UK tax-benefit models, we use the [Family Resources Survey](https://www.gov.uk/government/collections/family-resources-survey--2) (FRS) as the core dataset. The FRS is the highest-quality source of information on household structure, household incomes and benefits, but it is not enough to give a full picture of the UK household population. It does not offer sufficient information on consumption or wealth, and the information it does provide often doesn't match administrative statistics. For example, benefit and investment incomes do not gross up to match UK totals from more reliable datasets, indicating they are either under-reported or under-sampled.

To overcome this, we apply several different techniques to enhance our Family Resources Survey-based dataset (which we term the "Enhanced FRS"). These techniques fall into two categories: *imputation* and *calibration*:
* Imputation involves predicting variables that are not already present in the survey: for example, predicting fuel spending for households in the FRS.
* Calibration involves adjusting the survey weights to correct for sampling bias: for example, increasing the weights of high-income households, which are less likely to respond to the FRS and are therefore under-sampled.

## Imputations

We impute:

* Income by source from the [Survey of Personal Incomes](https://www.gov.uk/government/collections/personal-income-by-tax-year)
* Wealth by asset type from the [Wealth and Assets Survey](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/debt/methodologies/wealthandassetssurveyqmi)
* Consumption by category from the [Living Costs and Food Survey](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/expenditure/methodologies/livingcostsandfoodsurveytechnicalreportfinancialyearsendingmarch2018andmarch2019)

All of our imputations use the same method:

1. Start with the FRS
2. Find a survey which has:

    a. Our "target" variables not in the FRS (e.g. wealth)
    
    b. Some "source" variables in common with the FRS (e.g. income)
    
3. Train a random forest model to predict target variable distributions from source variables in the non-FRS survey
4. Apply the trained model to predict target variables from the source variables in the FRS

This process is illustrated below:

<img src="imputation.png" alt="drawing" width="600"/>
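
The four steps above can be sketched in a few lines. This is a minimal illustration rather than the PolicyEngine-UK implementation: it assumes scikit-learn's `RandomForestRegressor`, uses synthetic data in place of the real surveys, and predicts a single point value per record instead of a full distribution. All variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Step 2: a hypothetical donor survey (standing in for e.g. the Wealth and
# Assets Survey) with a source variable shared with the FRS (income) and a
# target variable the FRS lacks (wealth). The data here is synthetic.
donor_income = rng.lognormal(mean=10, sigma=0.5, size=500)
donor_wealth = 5 * donor_income + rng.normal(0, 10_000, size=500)

# Step 1: a hypothetical FRS extract that has the source variable only.
frs_income = rng.lognormal(mean=10, sigma=0.5, size=200)

# Step 3: train the model on the donor survey.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(donor_income.reshape(-1, 1), donor_wealth)

# Step 4: predict the target variable for each FRS record.
frs_wealth = model.predict(frs_income.reshape(-1, 1))
```

A point prediction like this gives every record the conditional mean, which understates the spread of the imputed variable. Predicting distributions, as described above, avoids that.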

### Income

The Family Resources Survey is known to under-sample high incomes in particular. This fundamentally limits the accuracy of microsimulation models which use it, even with re-weighting: around 10% of Income Tax revenue comes from individuals with over £1m in total income, who do not appear in the FRS. We train our random forest model on the Survey of Personal Incomes (SPI) to predict the distribution of a range of income variables (listed below) from demographic variables:

* Employment income
* Self-employment profit
* Savings interest income
* Dividend income
* Pension income
* Employment expenses
* Property income
* Gift aid
* Pension contributions


We do not override income variables in the FRS. Instead, we clone the entire dataset, replace the incomes in the clone with imputed values and set the clone's weights to zero. This ensures that the weighted dataset is unchanged, while the later re-weighting procedure can still make use of the new records if they provide more useful samples to meet statistical targets.
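
The clone-and-zero-weight step can be illustrated with a small pandas sketch. The column names and income figures are made up for illustration; in practice the clone's incomes come from the SPI-trained model.

```python
import pandas as pd

# Hypothetical FRS extract: one row per person.
frs = pd.DataFrame({
    "employment_income": [20_000.0, 35_000.0, 50_000.0],
    "weight": [1_200.0, 900.0, 400.0],
})

# Clone every record, then overwrite the clone's income with imputed values
# (made-up numbers standing in for model predictions).
clone = frs.copy()
clone["employment_income"] = [150_000.0, 400_000.0, 1_200_000.0]
clone["weight"] = 0.0  # zero weight: the clone does not affect totals yet

enhanced = pd.concat([frs, clone], ignore_index=True)


def weighted_total(df):
    """Weighted sum of employment income across all records."""
    return (df["employment_income"] * df["weight"]).sum()
```

Because the cloned records start with zero weight, `weighted_total(enhanced)` equals `weighted_total(frs)`. The clones only influence aggregates once calibration assigns them positive weights.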

Note: before training, we first expand the SPI to add its missing population. The SPI samples individuals who appear in administrative tax datasets, which cover only around 75% of all individuals. We identify the FRS individuals who could realistically represent the missing population (the lowest-income 25% of individuals, including children), construct SPI records for them and add them to the SPI dataset.

### Wealth

We impute from the Wealth and Assets Survey:

* Total property wealth
* Total corporate wealth
* Gross financial wealth
* Net financial wealth
* Value of main residence
* Value of other residences
* Value of non-residential property
* Value of land-only property

We impute these variables for every record in the dataset (including the cloned records with SPI-imputed incomes).

### Consumption

From the Living Costs and Food Survey, we impute spending on:

* Food and non-alcoholic beverages
* Alcohol and tobacco
* Clothing and footwear
* Housing, water and electricity
* Household furnishings
* Health
* Transport
* Communication
* Recreation
* Education
* Restaurants and hotels
* Miscellaneous goods and services
* Petrol
* Diesel

## Calibration

We apply gradient descent to maximise the accuracy of the resulting survey microdata (after imputations). The procedure is:

1. Construct a loss function f: (household weights) -> (accuracy score, lower is better)
2. Initialise the weights to their original values
3. Compute the gradient of the loss function with respect to each household weight (how much the loss would change if that weight increased)
4. Move each household weight in the direction that reduces the loss
5. Repeat steps 3-4 until the loss no longer decreases

The results of this process can be seen in the [re-weighting section](/model/reweighting).
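
As a rough illustration of the gradient descent procedure, the sketch below re-weights four hypothetical households to hit two made-up targets. It is not the production code: the loss function (mean squared relative error), the hand-derived gradient, the learning rate and the stopping tolerance are all assumptions chosen to keep the example small.

```python
import numpy as np

# Hypothetical data: 4 households, 2 statistical targets.
# Column 0 = household count, column 1 = household income in pounds.
values = np.array([
    [1.0, 20_000.0],
    [1.0, 40_000.0],
    [1.0, 80_000.0],
    [1.0, 300_000.0],
])
targets = np.array([1_000.0, 90_000_000.0])  # made-up official totals
weights = np.array([250.0, 250.0, 250.0, 250.0])  # step 2: original weights


def loss(w):
    # Step 1: mean squared relative error of weighted totals against targets.
    rel_error = (w @ values) / targets - 1
    return np.mean(rel_error ** 2)


def gradient(w):
    # Step 3: derivative of the loss with respect to each household weight.
    rel_error = (w @ values) / targets - 1
    return (values / targets) @ (2 * rel_error) / len(targets)


initial_loss = loss(weights)
learning_rate = 1e4
current = initial_loss
for _ in range(10_000):
    # Step 4: move against the gradient, keeping weights non-negative.
    weights = np.clip(weights - learning_rate * gradient(weights), 0, None)
    previous, current = current, loss(weights)
    # Step 5: stop once the loss no longer decreases meaningfully.
    if previous - current < 1e-12:
        break
final_loss = current
```

In this toy case the weight of the highest-income household falls and the others rise, since the starting weights overshoot the income target while already matching the household count.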

Note: before re-weighting, we duplicate the entire dataset and migrate each household in the duplicate to Universal Credit, initialising the duplicate's weights to zero. This ensures that households on legacy benefits still provide useful information after legacy benefits are completely phased out.