# Data acquisition

This document collates the three main datasets used in this capsule: the Energy Performance Certificates (EPC), the UPRN locations, and the Spatial Signature polygons. We first link building age, taken from the EPC data, to UPRN locations through a table join, and then bring in the Spatial Signatures. The two are subsequently joined on the GPU in a [separate notebook](gpu_spatial_join.ipynb). Each section details the origin of the data.

In [1]:
import pandas
import geopandas
import dask_geopandas
from pyogrio import read_dataframe
import warnings # To silence some known warnings below

uprn_p = '/home/jovyan/data/uk_os_openuprn/osopenuprn_202210.gpkg'
epc_p = '/home/jovyan/data/uk_epc_certificates/'
ss_p = '/home/jovyan/data/tmp/spatial_signatures_GB.gpkg'

ERROR 1: PROJ: proj_create_from_database: Open of /opt/conda/share/proj failed


Some of the computations will be run in parallel through Dask, so we set up a client for a local cluster with 16 workers (as many as there are threads on the machine where this is run):

In [2]:
import dask.dataframe as ddf
from dask.distributed import LocalCluster, Client

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    client = Client(LocalCluster(n_workers=16))
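
The worker count above is hard-coded to match the machine used here. On different hardware, a minimal sketch for deriving it from the available CPU count could look like the following. Note that `pick_n_workers` is a hypothetical helper (not part of Dask) and the cap of 16 is an assumption carried over from the cell above.

```python
import os

def pick_n_workers(cap=16):
    """Return a worker count between 1 and `cap`, bounded by the CPU count.

    Hypothetical helper for illustration; not part of Dask.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, available))

n_workers = pick_n_workers()
print(n_workers)
```

The result could then be passed on as `LocalCluster(n_workers=n_workers)`.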

## EPC certificates

These need to be downloaded manually from the official website ([https://epc.opendatacommunities.org/](https://epc.opendatacommunities.org/)). Once unzipped, the download is a collection of `.csv` files that can be processed efficiently with Dask. Here we specify the computation lazily:

In [3]:
# Read the three columns we need, all as strings
dtypes = {
    'CONSTRUCTION_AGE_BAND': 'str',
    'UPRN': 'str',
    'LMK_KEY': 'str'
}
certs_all = ddf.read_csv(
    f'{epc_p}*/certificates.csv',
    dtype=dtypes,
    usecols=list(dtypes)  # Column names only
)

We then execute it on the Dask cluster (local, in this case) to load the result into RAM (NOTE: this will take a significant amount of RAM on your machine). We drop rows with missing (`N/A`) values in any of the three columns, as we need observations where all three are valid.

In [4]:
%%time
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    certs = certs_all.dropna().compute()

CPU times: user 12.4 s, sys: 3.1 s, total: 15.5 s
Wall time: 38.6 s
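
To illustrate the effect of `dropna` here, the following is a minimal sketch on a toy table using plain `pandas` rather than Dask. The rows and age band labels are made up for illustration and are not taken from the EPC data. By default, `pandas.read_csv` parses both empty fields and the literal string `N/A` as missing values, so a row is dropped if any of its three columns is missing.

```python
import io
import pandas

# Toy CSV with invented values (not real EPC records)
csv_text = (
    "LMK_KEY,UPRN,CONSTRUCTION_AGE_BAND\n"
    "a1,100,1950-1966\n"
    "a2,,1967-1975\n"   # missing UPRN
    "a3,300,N/A\n"      # missing age band
)
toy = pandas.read_csv(io.StringIO(csv_text), dtype="str")

# Keep only the rows where all three columns are valid
clean = toy.dropna()
print(clean)
```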


## UPRN coords

UPRNs (Unique Property Reference Numbers) are unique identifiers for properties in Great Britain. We source their coordinates from the Ordnance Survey's Open UPRN product ([https://www.ordnancesurvey.co.uk/business-government/products/open-uprn](https://www.ordnancesurvey.co.uk/business-government/products/open-uprn)), which also needs to be downloaded manually. We use the GPKG format, which already contains a point geometry for each UPRN.

To consume them, we load them up in RAM (NOTE - this will take a significant amount of memory on your machine):

::: {.column-margin}
The approach using `pyogrio` seems to beat a multi-core implementation with `dask-geopandas`, possibly because the latter relies on `geopandas.read_file`, even though it spreads the computation across cores. In case it is of interest, here's the code:

```python
uprn = dask_geopandas.read_file(
    uprn_p, npartitions=16
).compute()
```
:::

In [5]:
%%time
uprn = read_dataframe(uprn_p, columns=['UPRN', 'geometry'])
uprn['UPRN'] = uprn['UPRN'].astype(str) 

CPU times: user 56.1 s, sys: 8.79 s, total: 1min 4s
Wall time: 1min 10s


## Merge UPRN-EPC

With both tables in memory, we merge them on their UPRNs to attach a point geometry to each EPC certificate. Because this is a left join, certificates whose UPRN is missing from the Open UPRN table are kept, with an empty geometry.

In [6]:
%%time
db = geopandas.GeoDataFrame(
    certs.merge(
        uprn, left_on='UPRN', right_on='UPRN', how='left'
    ), crs=uprn.crs
)

CPU times: user 40.9 s, sys: 3.58 s, total: 44.4 s
Wall time: 43.4 s
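
As a rough check on how many certificates end up without a location, the merge can be run with the `indicator` option of `pandas`. The sketch below uses two toy tables with invented values, and placeholder strings stand in for the point geometries.

```python
import pandas

# Toy tables with invented values; geometries are placeholder strings
certs_toy = pandas.DataFrame(
    {"LMK_KEY": ["a1", "a2", "a3"], "UPRN": ["100", "200", "999"]}
)
uprn_toy = pandas.DataFrame(
    {
        "UPRN": ["100", "200", "300"],
        "geometry": ["POINT (0 0)", "POINT (1 1)", "POINT (2 2)"],
    }
)

# Left join keeps every certificate; `_merge` flags where each row came from
merged = certs_toy.merge(uprn_toy, on="UPRN", how="left", indicator=True)

# Certificates whose UPRN has no match in the location table
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(n_unmatched)
```

Both key columns are strings in this sketch, mirroring the cast of `UPRN` to `str` above so that the keys compare equal.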


After the merge, we write the table to disk so it can be loaded later on for the spatial join:

In [7]:
db.to_parquet('/home/jovyan/data/tmp/epc_uprn.pq')


This metadata specification does not yet make stability promises.  We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.

  db.to_parquet('/home/jovyan/data/tmp/epc_uprn.pq')


## Spatial Signatures

For the Spatial Signature boundaries, we rely on the official open data product, which is hosted on [Figshare](https://figshare.com/articles/dataset/Geographical_Characterisation_of_British_Urban_Form_and_Function_using_the_Spatial_Signatures_Framework/16691575/1) and can be downloaded programmatically with:

In [8]:
! rm -f $ss_p # Remove if existing
! wget -O $ss_p https://figshare.com/ndownloader/files/30904861

--2022-12-21 17:30:16--  https://figshare.com/ndownloader/files/30904861
Resolving figshare.com (figshare.com)... 54.194.88.49, 52.17.229.77, 2a05:d018:1f4:d003:376b:de5c:3a42:a610, ...
Connecting to figshare.com (figshare.com)|54.194.88.49|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/30904861/spatial_signatures_GB.gpkg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221221/eu-west-1/s3/aws4_request&X-Amz-Date=20221221T173017Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=6c7b771aaa9d3262e8c5d21388e81b74dd21b6d622d36a17bac818dc7fe6a71e [following]
--2022-12-21 17:30:17--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/30904861/spatial_signatures_GB.gpkg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221221/eu-west-1/s3/aws4_request&X-Amz-Date=20221221T173017Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=6c7b771aaa9d3262

In [9]:
%%time
ss = read_dataframe(ss_p)

CPU times: user 1.46 s, sys: 794 ms, total: 2.26 s
Wall time: 2.24 s


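The polygons are very detailed, which slows down later operations considerably, so we first simplify them with a tolerance of 10, expressed in the units of the layer's coordinate reference system: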

In [10]:
%%time
sss = ss.simplify(10)

CPU times: user 1min 17s, sys: 1.04 s, total: 1min 19s
Wall time: 1min 10s
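
To give an intuition for what the tolerance does, the following is a pure-Python sketch of the Douglas-Peucker idea that, as far as we are aware, underlies `simplify`. It is an illustration on a toy line with invented coordinates, not the implementation used by `geopandas`, and the function names are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment between a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # Degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Position of the projection of p along the segment, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(points, tolerance):
    """Drop vertices closer than `tolerance` to the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior vertex farthest from the first-last segment
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [first, last]  # Everything in between is within tolerance
    # Otherwise keep the farthest vertex and recurse on both halves
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right

line = [(0, 0), (50, 4), (100, 0), (150, 60), (200, 0)]
simplified = douglas_peucker(line, tolerance=10)
print(simplified)
```

In this toy case, the small bump at `(50, 4)` falls within the tolerance and is removed, while the larger spike at `(150, 60)` is kept.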


Now we can write a Parquet table with the simplified geometries to disk, for later consumption on the GPU:

In [11]:
ss.assign(geometry=sss).to_parquet('/home/jovyan/data/tmp/sss.pq')


This metadata specification does not yet make stability promises.  We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.

  ss.assign(geometry=sss).to_parquet('/home/jovyan/data/tmp/sss.pq')
