Make sure the MOH dataset is complete #5670

Open
alexwlchan opened this issue Mar 20, 2023 · 0 comments

alexwlchan commented Mar 20, 2023

We provide snapshots of the MOH reports (Medical Officer of Health reports) at https://developers.wellcomecollection.org/docs/datasets#london-moh-reports

You can download the data tables in several formats (CSV, HTML, TXT, XML) or the full corpus as raw text.

I don't know how these datasets were created, but they're missing some files. While I was shutting down the Systems Strategy (Legacy) account (#5669), I found the S3 bucket that used to serve these snapshots. After unpacking the various zips and comparing files, I found two kinds of gap in the current snapshots:

  • The full text corpuses don't match, e.g. we have two copies of Barking.1958.b1978448x.txt, and BethnalGreen.1921.b18219962.txt is entirely missing from our current snapshots
  • The data tables in our current snapshots only go up to 1972, but we have tables from 1973 to 1978 in the old bucket, e.g. CityofLondon.1973.b18253908.csv
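
For anyone who wants to repeat this check, here is a minimal sketch of one way to compare two sets of unpacked files by name. It is not necessarily how the comparison above was done. It assumes both the old bucket contents and the current snapshots have already been unpacked into local directories, and the function names are hypothetical. The year parsing assumes file names shaped like Borough.YEAR.bnumber.ext, as in the examples above.

```python
from collections import Counter
from pathlib import Path


def file_names(root):
    """Count how often each file name occurs anywhere under root."""
    return Counter(p.name for p in Path(root).rglob("*") if p.is_file())


def missing_names(old_root, current_root):
    """Names that appear under old_root but nowhere under current_root."""
    return sorted(set(file_names(old_root)) - set(file_names(current_root)))


def duplicated_names(root):
    """Names that appear more than once under root."""
    return sorted(name for name, count in file_names(root).items() if count > 1)


def report_year(name):
    """Year from a name shaped like Borough.YEAR.bnumber.ext (assumed
    pattern), or None if the name doesn't fit that shape."""
    parts = name.split(".")
    if len(parts) >= 4 and parts[1].isdigit():
        return int(parts[1])
    return None
```

Running `missing_names` on the old and current directories would list candidates such as BethnalGreen.1921.b18219962.txt, and taking the maximum of `report_year` over each directory's data tables would show whether the coverage stops at 1972 or 1978. This only compares names, so two files with the same name but different contents would not be flagged.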

I've uploaded all the files that aren't in the current snapshots to https://eu-west-1.console.aws.amazon.com/s3/buckets/wellcomecollection-assets-workingstorage?region=eu-west-1&prefix=moh-reports/&showversions=false, to make them easier to find.

We should fold these back into the public snapshots so everyone can get this data! Or work out why they were excluded (but since the digitised files are available online, I can't see why they would be).

alexwlchan changed the title from "Make sure the MOH dataset contains all the MOH data we have" to "Make sure the MOH dataset is complete" on Mar 20, 2023