# Project Title:

## Abstract

## Introduction:



## Methods
In this project, we rely on two main methodologies: spatial microsimulation, to address the lack of granular data, and geodemographic clustering, to demonstrate the uses and benefits of this granular data even when it is synthetically generated. Here, we use the synthetic data to create clusters of areas where certain social and political attitudes prevail.

Spatial microsimulation is a relatively recent development in the discipline of microsimulation, with most of the pioneering studies combining microsimulation with demography published in the 1980s (Tanton and Edwards, 2013). It involves taking a dataset of individual-level granular data and using aggregate data with a geographic component (usually taken from a census) to reweight and project the individual data, effectively 'creating' a population of synthetic individuals. This doesn't create any 'new data' as such; instead, it calculates how representative each individual is of each zone and replicates individuals to make up the population of each zone according to what we already know to be true of that zone. These microsimulations can be either static (creating a snapshot of the population) or dynamic (moving that population through time and certain phenomena). Although it is a young methodology, its importance at all levels of planning cannot be overstated. Granular data is incredibly valuable for observing trends within populations on a spatial and temporal scale, but collecting data on every individual in a study area is expensive and arduous. Spatial microsimulation allows granular data to be synthesised where it did not previously exist, and so makes it possible to study entire populations without collecting data from every member of them.
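The reweighting step described above is commonly done with iterative proportional fitting (IPF). The block below is a minimal sketch of IPF for a single zone, not necessarily the procedure used in this project: the function name `ipf_weights`, the five respondents, and the ward totals are all hypothetical and chosen only for illustration.

```python
import numpy as np


def ipf_weights(indiv_cats, zone_targets, n_iter=50):
    """Fit one weight per survey individual for a single zone.

    indiv_cats:   one integer array per constraint; entry i is the
                  category code of individual i for that constraint.
    zone_targets: one array per constraint holding the zone's census
                  count for each category.
    """
    weights = np.ones(len(indiv_cats[0]))
    for _ in range(n_iter):
        for cats, targets in zip(indiv_cats, zone_targets):
            targets = np.asarray(targets, dtype=float)
            # Weighted number of individuals currently in each category.
            current = np.bincount(cats, weights=weights, minlength=len(targets))
            # Scale factor per category; the `where` guard avoids dividing
            # by zero for categories with no survey respondents.
            ratio = np.divide(targets, current,
                              out=np.zeros_like(targets), where=current > 0)
            # Each individual is rescaled by the factor for their category.
            weights = weights * ratio[cats]
    return weights


# Hypothetical survey of five respondents.
age = np.array([0, 0, 1, 1, 0])   # 0 = aged 18-64, 1 = aged 65 and over
sex = np.array([0, 1, 0, 1, 1])   # 0 = male, 1 = female

# Hypothetical census counts for one ward (both constraints sum to 100).
age_targets = np.array([60, 40])
sex_targets = np.array([48, 52])

weights = ipf_weights([age, sex], [age_targets, sex_targets])
print(weights.round(2))
print(np.bincount(age, weights=weights).round(2))
```

Each weight can be read as the number of people in the ward that a respondent stands in for. Repeating the fit for every zone gives a weight matrix of individuals by zones.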

The method has been successfully used in a range of health studies. For example, a recently published study used spatial microsimulation to simulate the prevalence of back pain across small areas in England (Smalley and Edwards, 2024), and dynamic spatial microsimulation was widely used during the COVID-19 pandemic to create models that simulated the spread of coronavirus on a local scale (Spooner *et al.*, 2021). It has also been used in studies of urban dynamics: Amrhein and MacKinnon (1988) used microsimulation to simulate labour markets in order to understand migration flows within an area, taking into account individual characteristics such as age and training. As a method of creating granular data, its applications are numerous and valuable.

Our second methodology is k-means clustering, which we use to create a geodemographic classification of electoral zones in York. This is a method of grouping areas by certain characteristics to create 'clusters' of neighbourhoods (or even households, if we were to look at a smaller scale). Taking a top-down approach to creating clusters, k-means is an unsupervised machine learning algorithm that places a chosen number of centroids at random, assigns each object to its closest centroid, and then moves each centroid to the mean of the objects assigned to it. These two steps repeat until the assignments stop changing.
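The assign-then-update loop can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard k-means iteration, not the code used in the project (a library implementation would normally be preferred); the function names and the six example zones with two made-up attributes are assumptions.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Return, for every row of X, the index of its closest centroid."""
    # Pairwise Euclidean distances, shape (n_objects, n_centroids).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct objects picked at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            # Move the centroid to the mean of its members; a centroid
            # with no members is left where it is.
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids have stopped moving
        centroids = new_centroids
    return nearest_centroid(X, centroids), centroids


# Hypothetical zones described by two proportions, e.g. share aged 65 and
# over and share interested in politics (values invented for illustration).
zones = np.array([
    [0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
    [0.80, 0.90], [0.85, 0.80], [0.90, 0.95],
])

labels, centroids = kmeans(zones, k=2)
print(labels)
print(centroids.round(3))
```

Because the starting centroids are random, the cluster numbers themselves are arbitrary; only the grouping of zones is meaningful.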

Geodemographic classification relies on the idea that individuals will choose to live in areas with people similar to them, and it has broad applications, e.g., in marketing, to create clusters of different types of consumers. So long as the data exists, however, we can use it to create clusters based on any characteristics we choose. The methodology requires granular-level data, which is why we rely on the data created with the previous method of spatial microsimulation.


In this study, we will use a static spatial microsimulation to create a population estimate of York based on responses to the 2020 British Social Attitudes Survey. The British Social Attitudes Survey is an annual survey, running since 1983, which asks a representative sample of the British population for their opinions on pressing social matters. Using a subset of the 2020 dataset (NatCen Social Research, 2023), we will create a




### Data Sources

| Dataset | Description |
| ------ |  --- |
| British Social Attitudes Survey 2020 | Individual-level data for microsimulation. Contains a range of variables pertaining to the participants' social attitudes and beliefs (NatCen Social Research, 2023). |
| UK Census Data | Data from the 2011 UK census (Office for National Statistics, 2011) at the electoral ward scale for York. Only aggregate counts of age and sex for all adults in York are downloaded, to form the aggregate constraints for microsimulation. |
| Electoral Zones Shapefile | Shapefile of York's electoral boundaries from 2011 (before changes that occurred after our census data was collected) (UK Data Service, 2024). |

*Table 1: data sources*

To carry out these methods we use three data sources, as shown in Table 1 above. Each of them needs a significant amount of cleaning before we can use it. The British Social Attitudes Survey is originally provided as a .sav file. For use in our spatial microsimulation, the file needs to be converted into a CSV, which we do with an <a href="Cleaning.ipynb#SPSS-to-CSV">RStudio script</a>. This new CSV dataset has 3964 rows and 210 columns. We <a href="Cleaning.ipynb#Step-3:-Subsetting">create a subset</a> of the data, including only entries for individuals located in Yorkshire and the Humber, so that we can more accurately simulate a population of York. At the same time, we <a href="Cleaning.ipynb#Step-4:-Cleaning-the-new-CSV">remove all unnecessary columns</a> in order to reduce processing times. After further removing all rows containing NaN values, we are left with a complete dataset of 276 rows and 19 columns. To finish, variables are renamed to more descriptive names.
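The subsetting, column selection, NaN removal, and renaming steps can be chained in pandas. The sketch below uses a four-row toy frame in place of the converted CSV; every column name, category label, and the `rename` mapping is hypothetical and does not reflect the real BSA variable names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the converted survey CSV (hypothetical columns and values).
bsa = pd.DataFrame({
    "region": ["Yorkshire and the Humber", "London",
               "Yorkshire and the Humber", "Yorkshire and the Humber"],
    "RAgeCat": ["18-64", "65+", "65+", np.nan],
    "RSex": ["Male", "Female", "Female", "Male"],
    "unused_admin_field": [1, 2, 3, 4],
})

keep = ["RAgeCat", "RSex"]                          # columns needed later
rename = {"RAgeCat": "age_band", "RSex": "sex"}     # more descriptive names

subset = (
    bsa.loc[bsa["region"] == "Yorkshire and the Humber", keep]  # regional subset
    .dropna()                                                   # complete cases only
    .rename(columns=rename)
    .reset_index(drop=True)
)

print(subset)
print(subset.shape)
```

Selecting the columns before calling `dropna` matters: a row is only discarded if it is missing a value in a column that is actually kept.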

The UK Census Data from 2011 is accessed through the UK Data Service, and aggregate counts of age and sex for all adults, grouped by electoral ward in York, are downloaded. These have been manually cleaned in Microsoft Excel to form the constraints for our spatial microsimulation. Age is merged into two categories (18 to 64, and 65 and over); sex is left as male or female.
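Although this step was done by hand in Excel, the same merge of age bands could be scripted so that it is repeatable. The following is a sketch only: the ward labels, the source age bands, and all counts are invented, and the real census download will have different column names.

```python
import pandas as pd

# Hypothetical census extract: adult counts per ward in finer age bands.
census = pd.DataFrame({
    "ward": ["Ward A", "Ward B"],
    "age_18_24": [100, 80],
    "age_25_64": [500, 420],
    "age_65_74": [90, 110],
    "age_75_plus": [60, 70],
})

working_age_cols = ["age_18_24", "age_25_64"]
older_cols = ["age_65_74", "age_75_plus"]

# Collapse the fine bands into the two categories used as constraints.
constraints = pd.DataFrame({
    "ward": census["ward"],
    "age_18_64": census[working_age_cols].sum(axis=1),
    "age_65_plus": census[older_cols].sum(axis=1),
})

print(constraints)
```

A quick check that the two new columns still add up to the original adult total for each ward guards against a band being dropped or counted twice.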

The electoral zones shapefile also requires a degree of cleaning before use. As downloaded, it is missing the geocodes needed to match its electoral zones to the census data zones; to fix this, we <a href="Fixing the shapefile.ipynb">add the codes</a> from the census data to the shapefile, merging on the ward names, to ensure that both files contain the geocodes necessary for later analysis.
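Merging on names is fragile, because a stray space or a difference in capitalisation silently leaves a ward without a code. The sketch below shows one way to guard against that with plain pandas; the frames, the placeholder geometry strings, and the codes are hypothetical stand-ins, and a GeoDataFrame read with geopandas should accept the same `merge` call.

```python
import pandas as pd

# Stand-in for the shapefile attribute table (geometry shown as placeholders).
shapes = pd.DataFrame({
    "ward_name": ["Acomb", "Clifton ", "Heworth"],   # note the trailing space
    "geometry": ["polygon_1", "polygon_2", "polygon_3"],
})

# Stand-in for the census table, which carries the geocodes (made-up codes).
census_codes = pd.DataFrame({
    "ward_name": ["Acomb", "Clifton", "Heworth"],
    "geo_code": ["CODE_A", "CODE_B", "CODE_C"],
})


def normalise(names):
    """Trim whitespace and lower-case names so that near-matches line up."""
    return names.str.strip().str.lower()


shapes["key"] = normalise(shapes["ward_name"])
census_codes["key"] = normalise(census_codes["ward_name"])

merged = shapes.merge(
    census_codes[["key", "geo_code"]],
    on="key",
    how="left",              # keep every shape, even if no code is found
    validate="one_to_one",   # raise if a ward name is duplicated on either side
    indicator=True,          # adds a "_merge" column recording the match status
)

# Wards that did not receive a geocode and need fixing by hand.
unmatched = merged.loc[merged["_merge"] != "both", "ward_name"].tolist()
print(unmatched)
```

An empty `unmatched` list means every boundary received a geocode; anything else is a list of names to reconcile before continuing.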

Once we've carried out our spatial microsimulation and have a resulting dataframe filled with granular-level data representing the complete adult population of York, we also need to transform this data into aggregate counts for each zone in order to carry out our k-means clustering. We do this by creating <a href="Cleaning.ipynb#Step-3:-Creating-a-new-database-of-aggregate-counts">new pivoted dataframes</a> and adding the counts to a new dataframe. We then <a href="Cleaning.ipynb#Step-4:-Validation-and-checking-the-data-to-ensure-we've-aggregated-it-correctly">validate these new columns</a> by comparing the counts across our aggregate columns for each category, ensuring that each category still represents our population, before exporting the dataframe as a CSV for use in geodemographics.
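As an illustration of the aggregation and the validation check, the sketch below counts synthetic individuals per zone for each variable and then confirms that every variable's columns sum to the zone population. It uses `pd.crosstab` rather than the pivot-based approach in the notebook, and the zone codes, variables, and values are hypothetical.

```python
import pandas as pd

# Hypothetical synthetic population: one row per simulated individual.
synthetic = pd.DataFrame({
    "geo_code": ["Z1", "Z1", "Z1", "Z2", "Z2"],
    "age_band": ["18-64", "65+", "18-64", "18-64", "65+"],
    "voted": ["Yes", "No", "Yes", "No", "No"],
})
variables = ["age_band", "voted"]


def counts_by_zone(df, variable):
    """Count individuals per zone for each category of one variable."""
    table = pd.crosstab(df["geo_code"], df[variable])
    table.columns = [f"{variable}_{category}" for category in table.columns]
    return table


# One row per zone, one column per variable category.
aggregate = pd.concat([counts_by_zone(synthetic, v) for v in variables], axis=1)

# Validation: within each variable, the category counts must add up to
# the number of synthetic individuals in the zone.
zone_totals = synthetic.groupby("geo_code").size()
checks = {
    v: bool((aggregate.filter(like=f"{v}_").sum(axis=1) == zone_totals).all())
    for v in variables
}

print(aggregate)
print(checks)
```

If any entry in `checks` is `False`, individuals have been lost or double counted for that variable, usually because of missing values or an unexpected category.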

### Spatial microsimulation

In this study, we use a static microsimulation to reweight individual-level data from the British Social Attitudes Survey to represent the population of York, UK.
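Reweighting generally produces fractional weights, whereas a static snapshot needs whole individuals. One option is the 'truncate, replicate, sample' method of Lovelace and Ballas (2013), listed in the references. Whether this project uses it is an assumption; the sketch below, with a hypothetical function name and made-up weights, only illustrates the idea for one zone.

```python
import numpy as np


def integerise_trs(weights, seed=0):
    """Turn fractional weights into a list of replicated individual ids."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)

    # Truncate: keep the whole-number part of each weight.
    whole = np.floor(weights).astype(int)
    remainder = weights - whole

    # Number of individuals still missing from the zone population.
    deficit = int(round(weights.sum())) - int(whole.sum())

    counts = whole.copy()
    if deficit > 0:
        # Sample: top up without replacement, with probability
        # proportional to the fractional part that was cut off.
        extra = rng.choice(len(weights), size=deficit, replace=False,
                           p=remainder / remainder.sum())
        counts[extra] += 1

    # Replicate: repeat each individual's id as many times as its count.
    return np.repeat(np.arange(len(weights)), counts)


# Hypothetical fractional weights for four respondents in one zone.
zone_weights = np.array([1.6, 0.4, 2.5, 0.5])
ids = integerise_trs(zone_weights)
print(ids)
print(np.bincount(ids, minlength=len(zone_weights)))
```

Each id in the result points back to a survey respondent, so the synthetic population for the zone is simply the survey rows selected by `ids`.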


### Geodemographics

In order to carry out geodemographic clustering, we must further process the data we made using spatial microsimulation. The data produced by the previous methodology is granular, representing individuals in York. However, k-means clustering (as we use here) needs the data in aggregate format. We do this in our <a href="cleaning.ipynb#Cleaning-for-Geodemographics">cleaning section</a> by subsetting the BSA dataframe by category (age, sex, whether they voted in the last election, etc.), using the pandas 'pivot' function to change the index to 'geo_codes', effectively aggregating the table, and then grouping the data by value and electoral zone and adding the results to a new dataframe.

Next we go about selecting our variables. We're trying to create a geodemographic cluster model

| Data | Topic | Description |
| ------ | --- | --- |
| Age | Demographic | Including a categorical measure of age to see if there are differences in individuals over the age of 65 versus the rest of the population. Age is also a core demographic characteristic, and is often considered necessary to include in all studies of demography. |
| Sex | Demographic | Same as age, sex is often regarded as a core demographic characteristic to be included in all demographic studies. |
| Party support | Political | Indication of which political party the individual most aligned themselves with at the time of the survey. |
| Political interest | Political/social | Does the participant consider themselves to be interested in politics? |
| Welfare opinions | Political/social | What does the participant think of the current welfare system? Are they pro welfare or anti? |
| Left/right | Social | Does the participant consider themselves more on the left, right, or centre of the political scale? |
| Libertarian/authoritarian | Social | Does the participant consider themselves to be more of a libertarian or an authoritarian? |
| Religion | | |
| National identity | | |
| Racial origin | | |
| Disability | | |


## Results

## Discussion

## Conclusion

---

## References

- Amrhein, C.G., and MacKinnon, R.D. (1988) 'A Microsimulation Model of a Spatial Labor Market', *Annals of the Association of American Geographers*. 78(1). pp.112-131. doi: 10.1111/j.1467-8306.1988.tb00194.x.
- Haven (2024) *Read and write SPSS files*. Available at: https://haven.tidyverse.org/reference/read_spss.html. (Accessed on: 22nd March 2024).
- Lovelace, R. and Ballas, D. (2013) '‘Truncate, replicate, sample’: A method for creating integer weights for spatial microsimulation', *Computers, Environment and Urban Systems*. 41. pp.1-11. doi: 10.1016/j.compenvurbsys.2013.03.004.
- NatCen Social Research (2023) *British Social Attitudes Survey, 2020* [data collection]. UK Data Service. SN: 9005, doi: 10.5255/UKDA-SN-9005-1 (Accessed on: 22nd March 2024).
- Office for National Statistics (2011) *Census aggregate data*. UK Data Service. Edition: February 2017. doi: 10.5257/census/aggregate-2011-2.
- Readr (2024) *Write a data frame to a delimited file*. Available at: https://readr.tidyverse.org/reference/write_delim.html. (Accessed on: 22nd March 2024). 
- Smalley, H. and Edwards, K. (2024) 'Chronic back pain prevalence at small area level in England - the design and validation of a 2-stage static spatial microsimulation model', *Spatial and Spatio-temporal Epidemiology*. 48. pp.1-10. doi: 10.1016/j.sste.2023.100633.
- Spooner, F., Abrams, J.F., Morrissey, K., Shaddick, G., Batty, M., Milton, R., Dennett, A., Lomax, N., Malleson, N., Nelissen, N., Coleman, A., Nur, J., Jin, Y., Greig, R., Shenton, C., and Birkin, M. (2021) 'A dynamic microsimulation model for epidemics'. *Social Science & Medicine*. 291. pp.1-12. doi: 10.1016/j.socscimed.2021.114461.
- Tanton, R., and Edwards, K.L. (2013) 'Introduction to Spatial Microsimulation: History, Methods and Applications', in Tanton, R., and Edwards, K.L. (eds), *Spatial Microsimulation: A Reference Guide for Users*. Dordrecht: Springer. pp.3-8.
- UK Data Service (2024) *Boundary Data Selector*. Available at: https://borders.ukdataservice.ac.uk/bds.html. (Accessed on: 22nd March 2024). 
