Supplementary material for "The UK Local BMD: A Full Name Onomastic Resource".

sjbush/uk_bmd

uk_bmd

This repository contains the datasets described in the paper The UK Local BMD: A Full Name Onomastic Resource, alongside the scripts necessary to generate them.

The raw data (not included in this repo but available on request) are sourced from the UK Local BMD, a volunteer project to transcribe the birth, marriage and death records of England and Wales (more specifically, from the 12 cities, counties and regions of Bath, Berkshire, Cheshire, Cumbria, Kingston-upon-Thames, Lancashire, North Wales, Shropshire, Staffordshire, West Midlands, Wiltshire, and Yorkshire), and processed to generate a rare onomastic resource - one which contains unredacted full names.

The two subdirectories in this repo, 'dataset_B' and 'dataset_D', represent data processed from 25,213,860 birth and 9,887,244 death records, respectively, with the contents of each file described in the paper. These include the total count of each name registered per year. Birth records span the period 1837 - 2022 and death records 1733 - 2009, in both cases representing an assumed unbiased population sample.

IMPORTANT: the years shown in both datasets represent the known or imputed year of birth, and in this respect, the two datasets are directly comparable. To repeat: the years shown in the 'death' dataset do not represent the year of death; they represent the year of birth (i.e. year of death - age at death, unless otherwise specified). Unfortunately, age at death was not provided for records in Cumbria, Shropshire, the West Midlands and the vast majority of data from North Wales, and as such, no (or very few) death records could be used from those regions.
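
As a minimal illustration of the imputation described above (year of death minus age at death), the following Python sketch shows the arithmetic. It is not taken from the repository's Perl scripts; the function name and the use of `None` for a missing age are assumptions made for the example.

```python
from typing import Optional


def imputed_birth_year(year_of_death: int, age_at_death: Optional[int]) -> Optional[int]:
    """Return year of death minus age at death, or None when the age is missing.

    Hypothetical helper: records without an age at death (as in most of the
    Cumbria, Shropshire, West Midlands and North Wales data) cannot be used.
    """
    if age_at_death is None:
        return None
    return year_of_death - age_at_death
```

Because only whole years are available, a year imputed this way can differ from the true year of birth by one, depending on whether the death fell before or after the birthday in that year.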

A Note on the Name

The "UK BMD" is somewhat misleadingly named as it does not contain records from Scotland or Northern Ireland. This is due to differences in their legislative frameworks and history of civil registration relative to England and Wales. In England and Wales, civil registration began on 1st July 1837, although it only became compulsory from 1st January 1875 with the passing of the Births and Deaths Registration Act 1874.

Prior to 1837, records of baptisms, marriages, and funerals can be found in local parish registers although as these were maintained by Anglican clergy, they wouldn’t have included non-conformists (among others).

Record Parsing Criteria

The BMD records contain both 'forename(s)' and 'surname' fields, with the latter always capitalised.

These scripts parse the 'forename(s)' field to produce one 'first name' and zero or more 'middle names' on the basis that spaces separate individual names.

The name before the first space was considered the first name and all subsequent names, space delimited, were considered middle names. For example, Ellen Sarah Jane SMITH has the first name Ellen, two middle names, Sarah and Jane, and the surname SMITH but Sarahjane Ellen SMITH has one first name, Sarahjane, one middle name, Ellen, and one surname, SMITH.
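
The space-delimited rule above can be sketched in a few lines of Python. This is an illustration only, not the repository's Perl code, and `split_forenames` is a hypothetical name.

```python
from typing import List, Tuple


def split_forenames(forenames: str) -> Tuple[str, List[str]]:
    """Split a 'forename(s)' field into (first name, list of middle names).

    The name before the first space is the first name; every later
    space-delimited token is treated as a middle name.
    """
    parts = forenames.split()  # splits on runs of whitespace
    if not parts:
        raise ValueError("forename(s) field is empty")
    return parts[0], parts[1:]


print(split_forenames("Ellen Sarah Jane"))  # first name Ellen, middle names Sarah and Jane
print(split_forenames("Sarahjane Ellen"))   # first name Sarahjane, middle name Ellen
```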

Nevertheless, there are exceptions to this. The records were parsed using a small number of criteria, some to exclude them from consideration and others to make light edits. These are as follows:

Criteria for exclusion

Records were excluded if the 'forename(s)' field:

  • did not contain at least one alphabetical character (i.e. A to Z, irrespective of case)
  • contained a number
  • contained any of the symbols: ? & ( ) [ ]
    • these generally denote ambiguous transcription or a description appended to the name, such as John Son of William & Mary
  • exactly matched any of the following phrases: Newborn, Not Named, Un-named, Unnamed, Unknown, Undetermined, No Name, No First Name, Name Not Given, Registered, Re-registered, Infant Girl, Infant Boy, Infant Female, Infant Male, Male Infant, Female Infant, Baby Female, Baby Male, Unchristened, Deceased
  • contained the word " of " or " Of " (note the space on either side of the word)
    • this invariably matches a descriptive phrase instead of a name, such as in birth records beginning Child of, Son of or Daughter of, and in death records beginning Late of
  • for some regions (Bath, Cumbria, Shropshire, West Midlands, Wiltshire, Yorkshire), each record has a unique reference number, but for others (Berkshire, Cheshire, Kingston, Lancashire, North Wales, Staffordshire) multiple different records can share the same reference. In the latter group, the reference number refers to a processing batch, often (but not always) of around five records at a time.
    • for the first 6 regions, where there is meant to be a one-to-one correspondence between record and reference number, what do we do when more than one record has the same number?
    • we could either pick one at random and exclude the others, or exclude them all as a matter of course, but on manual inspection we may find there is a good reason why this has happened
    • it is often either because the name is complex and there is ambiguity between the "forenames" and "surname" fields, or because one record did not include all the names or (especially in the case of death records) used preferred names as opposed to the literal birth name
    • e.g. category one (transcription error): Brian Armstrong- CLIFFORD and Brian ARMSTRONG-CLIFFORD
    • e.g. category two (same individual but different names): John ALEXANDER and Jack ALEXANDER, Katherine Helen ANGUS and Kit ANGUS, and Henry Frederick John ANDREWS and Harry ANDREWS
    • only manual inspection can rescue the "real" category two record (which we can't automate so must instead pick one at random), but we can automate rescue of the category one records
    • the way we do this is to convert all possible names associated with a given reference into a gapless capitalised string, and then count the number of strings associated with that reference
    • if there is only one string (e.g. BRIANARMSTRONG-CLIFFORD), then what this means is that irrespective of the number of records associated with that reference ID, the names are the same
    • in that case, we allow one of these records to be processed, randomly chosen, and exclude the others
    • for category two records, we just pick one at random. This is the preferable option because it means we aren't throwing out records arbitrarily - for instance, with John/Jack ALEXANDER, we know that there is an individual who goes by the name of either John or Jack, so it is not inaccurate to include one of them in the output
    • it's also important that we pick one record rather than include every record as there are some egregious examples of many records for the same name
    • e.g. a birth record of Edward C R F M ROSPIGLIOSI-PALLAVACINI with parallel records for Edward C R F M ROSPIGLIOSI and Edward C R F M PALLAVACINI, and whose mother's maiden name (after a presumed remarriage) is either Acton, Dalbert, Dalbert-Acton, Lyon, Lyon-Dalbert or Lyonacton. There are records containing combinations of all these names, all with the same reference number, making 17 records for the same boy. By picking one at random, we don't artificially inflate the number of births named Edward by 16
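
The exclusion criteria and the handling of shared reference numbers described above can be sketched as follows. This is a hedged Python illustration, not the repository's Perl implementation: the function names, the record layout (reference, forenames, surname) and the case-sensitive phrase match are all assumptions.

```python
import random
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

EXCLUDED_PHRASES = {
    "Newborn", "Not Named", "Un-named", "Unnamed", "Unknown", "Undetermined",
    "No Name", "No First Name", "Name Not Given", "Registered", "Re-registered",
    "Infant Girl", "Infant Boy", "Infant Female", "Infant Male", "Male Infant",
    "Female Infant", "Baby Female", "Baby Male", "Unchristened", "Deceased",
}
FORBIDDEN_SYMBOLS = set("?&()[]")


def is_excluded(forenames: str) -> bool:
    """Apply the forename-based exclusion criteria listed above."""
    if not re.search(r"[A-Za-z]", forenames):
        return True  # no alphabetical character at all
    if any(ch.isdigit() for ch in forenames):
        return True  # contains a number
    if any(ch in FORBIDDEN_SYMBOLS for ch in forenames):
        return True  # ambiguous transcription or appended description
    if forenames in EXCLUDED_PHRASES:
        return True  # exact match (case sensitivity is an assumption)
    return " of " in forenames or " Of " in forenames


def name_key(forenames: str, surname: str) -> str:
    """Gapless capitalised string, e.g. 'BRIANARMSTRONG-CLIFFORD'."""
    return "".join((forenames + surname).split()).upper()


def distinct_name_count(group: Iterable[Tuple[str, str]]) -> int:
    """Number of distinct gapless strings among records sharing a reference."""
    return len({name_key(f, s) for f, s in group})


def pick_one_per_reference(
    records: Iterable[Tuple[str, str, str]], rng: random.Random
) -> Dict[str, Tuple[str, str]]:
    """Keep one randomly chosen (forenames, surname) per reference number."""
    by_ref: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for ref, forenames, surname in records:
        by_ref[ref].append((forenames, surname))
    return {ref: rng.choice(group) for ref, group in by_ref.items()}
```

Here `distinct_name_count` returning 1 corresponds to category one (the same name transcribed differently), and any larger value to category two; in both cases a single record is kept, so the count of births or deaths is not inflated.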

Criteria for editing

Records were edited according to the following criteria:

  • if the forenames field began with any of the following - Colonel, Corporal, Countess, Doctor, General, H R H Prince, Lady, Lord, Major, Prince, Reverend, Sergeant, Sir, Sister - the text was removed
    • this is a unique complication with death records: they sometimes contain titles which must first be omitted, otherwise they may mistakenly be interpreted as names
  • if the forenames field had one first name ending in a hyphen and only one middle name, then the latter is appended to the former - unless the middle name is an initial (because there is not enough information to go on)
    • e.g. Ann- Marie FLYNN is edited to Ann-Marie FLYNN
  • if the forenames field had one middle name and that name is "-" then it is treated as a placeholder character meaning "not applicable", and so removed
    • e.g. Ann - FLYNN is edited to Ann FLYNN
  • if the middle name contains multiple hyphens, e.g. E-----, then remove all of them and leave only the initial
    • this is a sign that the transcription was incomplete
  • if either first or middle name was a conventional abbreviation, it was expanded, and if an obvious typo, amended (noting the subjective nature of this edit)
    • for example, Wm to William, Edwd to Edward, Geo to George
    • for the full list, see the two-column file "names_to_revise.txt"; the leftmost column is a name as it appears in the BMD, the rightmost how it is amended
  • if the forenames field had multiple middle names, the last of which ended in a hyphen, then we assume that is the first part of a compound surname - unless any middle name is an initial (because there is not enough information to go on) or the surname is already hyphenated
    • e.g. Ann Marie Hucklebury- FLYNN is edited to Ann Marie HUCKLEBURY-FLYNN
    • note that there are many instances (mostly historical) where a familial surname has been used in a middle name position; we would assume Ann Marie Hucklebury FLYNN (no hyphen) is one of them
    • further support for the assumption that the last middle name, if ending in a hyphen, is actually a compound surname comes from the presence of multiple people from the same area with the same compound, dying in close proximity (these are probably siblings)
    • e.g. the deaths of Courtenay Sandilands Wynell- MAYOW and Robert Lawrence Wynell- MAYOW
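
The editing criteria above can be sketched as a single Python function. This is an illustration under stated assumptions, not the repository's Perl code: the order in which the rules are applied, the subset of titles, the three-entry abbreviation table (the full list is in "names_to_revise.txt") and the reading of "multiple hyphens" as one letter followed by a run of hyphens are all choices made for the example.

```python
import re
from typing import List, Tuple

TITLES = ("H R H Prince", "Colonel", "Doctor", "Lady", "Lord", "Reverend", "Sir")  # subset only
ABBREVIATIONS = {"Wm": "William", "Edwd": "Edward", "Geo": "George"}  # examples from the text


def is_initial(name: str) -> bool:
    """True for a single letter, ignoring any attached hyphens."""
    return len(name.strip("-")) == 1


def edit_record(forenames: str, surname: str) -> Tuple[str, List[str], str]:
    """Return (first name, middle names, surname) after the light edits."""
    for title in TITLES:  # strip a leading title
        if forenames.startswith(title + " "):
            forenames = forenames[len(title) + 1:]
            break
    parts = forenames.split()
    if not parts:
        raise ValueError("no name left after removing title")
    first, middles = parts[0], parts[1:]

    if middles == ["-"]:  # "-" as a placeholder for "not applicable"
        middles = []
    if first.endswith("-") and len(middles) == 1 and not is_initial(middles[0]):
        first, middles = first + middles[0], []  # Ann- Marie -> Ann-Marie
    # E----- -> E (incomplete transcription)
    middles = [re.sub(r"^([A-Za-z])-{2,}$", r"\1", m) for m in middles]
    first = ABBREVIATIONS.get(first, first)
    middles = [ABBREVIATIONS.get(m, m) for m in middles]

    if (
        len(middles) >= 2
        and middles[-1].endswith("-")
        and "-" not in surname
        and not any(is_initial(m) for m in middles)
    ):
        # last middle name is the first half of a compound surname
        surname = (middles[-1] + surname).upper()
        middles = middles[:-1]
    return first, middles, surname
```

For instance, `edit_record("Ann Marie Hucklebury-", "FLYNN")` gives the first name Ann, the middle name Marie and the surname HUCKLEBURY-FLYNN, whereas the unhyphenated `"Ann Marie Hucklebury"` is left with two middle names.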

There will inevitably be a number of errors remaining in the final, processed, dataset.

Scripts

The scripts, which should be run in numbered order, perform the following processing steps:

Gender prediction

1a.predict_gender_of_name_using_US_SSA.pl

1b.predict_gender_of_name_using_UK_ONS.pl

1c.predict_gender_of_name_using_UK_NRS.pl

1d.predict_gender_of_name_using_Canada_Alberta.pl

These scripts parse four birth registration datasets, each of which represents official government statistics, to record the gender (more precisely, sex assigned at birth) associated with each name.

These scripts replicate the method described by Blevins and Mullen 2015 and implemented, with guidelines for responsible use, here. Although a pragmatic approach to predicting gender, it is, to quote their paper, a "blunt tool to study a complex subject". Note in particular that these are state-generated datasets which acknowledge only two genders, and that the method can only provide population-level classifications of the gender of each name - individual usage may differ.

The four datasets used by these scripts are obtained from the US Social Security Administration, the UK Office for National Statistics (ONS), the National Records of Scotland, and the Government of Alberta, Canada.

The US dataset contains the first names and gender of all US Americans with a social security number, with names registered to fewer than 5 people a year excluded, whilst the UK and Canadian datasets are full population samples of all live births in their respective regions. The UK ONS dataset excludes names registered to fewer than 3 people a year whereas the NRS and Alberta datasets do not require a minimum number of births per name. For further details as to how these datasets were compiled and their coverage, please refer to their respective URLs.

2.predict_gender_of_name_by_combining_datasets.pl

This script pools the count data from the four aforementioned datasets to make one of five gender classifications per name:

  • male - the name was more frequently assigned to males for every year in which it was recorded

  • female - the name was more frequently assigned to females for every year in which it was recorded

  • mostly male - the name was not more frequently assigned to males for every year in which it was recorded but by total number of birth records (summed across all years), the name was more commonly given to males than females

  • mostly female - the name was not more frequently assigned to females for every year in which it was recorded but by total number of birth records (summed across all years), the name was more commonly given to females than males

  • undetermined - the name could not be automatically gender-typed because there was an equal number of male and female records

This pooled list of gender classifications is used in subsequent scripts to gender-type names in the UK BMD (within which, gender was not recorded). Names present in the UK BMD but not this list are, by default, "undetermined" gender.
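
The five-way classification above can be sketched in Python as follows. This is not the repository's Perl script: the data layout (a mapping from year to a pair of male and female counts for one name) and the pooling by simple per-year summation are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

YearCounts = Dict[int, Tuple[int, int]]  # year -> (male count, female count)


def pool_counts(datasets: Iterable[YearCounts]) -> YearCounts:
    """Sum the per-year counts for one name across several datasets."""
    pooled = defaultdict(lambda: [0, 0])
    for dataset in datasets:
        for year, (male, female) in dataset.items():
            pooled[year][0] += male
            pooled[year][1] += female
    return {year: (m, f) for year, (m, f) in pooled.items()}


def classify_gender(counts_by_year: YearCounts) -> str:
    """Classify one name from its pooled per-year counts."""
    if not counts_by_year:
        return "undetermined"  # name absent from the pooled list
    if all(m > f for m, f in counts_by_year.values()):
        return "male"
    if all(f > m for m, f in counts_by_year.values()):
        return "female"
    total_m = sum(m for m, _ in counts_by_year.values())
    total_f = sum(f for _, f in counts_by_year.values())
    if total_m > total_f:
        return "mostly male"
    if total_f > total_m:
        return "mostly female"
    return "undetermined"  # equal totals
```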

Birth and death record parsing

3a.parse_birth_records.pl

3b.parse_death_records.pl

These scripts parse the UK BMD's birth and death records, respectively, generating a number of tab-separated plain text files containing summary statistics including the absolute and relative (percentage) count of each first and middle name per year.
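
As an illustration of the absolute and relative counts mentioned above, the following Python sketch tallies names per year. It is not the repository's Perl code: the input layout (pairs of year and name), the choice of all records in that year as the denominator, and the column order of the tab-separated lines are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def counts_per_year(
    records: Iterable[Tuple[int, str]]
) -> Dict[int, Dict[str, Tuple[int, float]]]:
    """Return {year: {name: (absolute count, percentage of that year's records)}}."""
    per_year: Dict[int, Counter] = defaultdict(Counter)
    for year, name in records:
        per_year[year][name] += 1
    result = {}
    for year, counter in per_year.items():
        total = sum(counter.values())
        result[year] = {
            name: (n, 100.0 * n / total) for name, n in counter.items()
        }
    return result


def to_tsv_lines(table: Dict[int, Dict[str, Tuple[int, float]]]) -> List[str]:
    """Format the table as tab-separated lines: year, name, count, percentage."""
    lines = []
    for year in sorted(table):
        for name in sorted(table[year]):
            n, pct = table[year][name]
            lines.append(f"{year}\t{name}\t{n}\t{pct:.2f}")
    return lines
```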

Internal and external validation

4a.compare_total_cts_per_year_for_B_vs_D.pl

4b.compare_total_cts_per_year_for_B_vs_ONS.pl

Sanity tests of the data are performed both internally (by comparing subsets of the data with each other) and externally (by comparing the data to an outside source) using Pearson correlations. More specifically:

Script 4a compares the total number of records per forename in the ‘B’ and ‘D’ datasets to find (as expected) a strong positive correlation between them. This is a crude sanity-test of the data and although not controlling for differences in either temporal or geographical scope, nevertheless suggests that both datasets are, correctly, randomly sampling from the same population.

Script 4b compares the total number of records per forename in the ‘B’ dataset to the total number of records per forename in the UK ONS dataset, again finding a strong positive correlation between the two. This analysis was restricted to the years 1996 - 2007 as in this period the two datasets had a substantive number of records in common (the ‘B’ dataset has >10,000 records per year for each of these years). Unfortunately, it was not possible to compare the ‘D’ dataset with the ONS dataset due to paucity of records in the years they have in common.
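
The Pearson correlation used in these sanity tests can be sketched in Python as below. This is an illustration, not the repository's code: restricting the comparison to names present in both datasets is an assumption, and `correlate_shared_names` is a hypothetical helper.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlate_shared_names(totals_a: Dict[str, int], totals_b: Dict[str, int]) -> float:
    """Correlate per-name totals over the names found in both datasets."""
    shared = sorted(set(totals_a) & set(totals_b))
    return pearson([totals_a[n] for n in shared], [totals_b[n] for n in shared])
```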

5.count_how_many_first_and_middle_names_in_B_and_D.pl

This script determines how many different first names and middle names there are in each of the B and D datasets and how many are unique to either.
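
The comparison described above amounts to a few set operations, sketched here in Python. The function and key names are hypothetical and this is not the repository's Perl script.

```python
from typing import Dict, Iterable


def compare_name_sets(names_b: Iterable[str], names_d: Iterable[str]) -> Dict[str, int]:
    """Count distinct names in each dataset and those unique to either."""
    b, d = set(names_b), set(names_d)
    return {
        "in_B": len(b),
        "in_D": len(d),
        "only_in_B": len(b - d),
        "only_in_D": len(d - b),
        "in_both": len(b & d),
    }
```

The same function would be applied once to first names and once to middle names.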

Figures

The four figures of the paper were created using R with packages ggplot2 and scales. Code is available as comments within the following scripts:

Figure 1: 3b.parse_death_records.pl

Figure 2: 4a.compare_total_cts_per_year_for_B_vs_D.pl

Figure 3: 4b.compare_total_cts_per_year_for_B_vs_ONS.pl

Figure 4: 3b.parse_death_records.pl

Date the raw records were last downloaded

Data in both the dataset_B and dataset_D subdirectories - which contain the output of scripts 3a and 3b - were generated using birth/death records last obtained in September 2023. Note that this is the date of last access - not the same as the date these records were last updated.

Birth records:

  • Bath: 16th April 2023

  • Berkshire: 15th August 2021

  • Cheshire: 25th April 2023

  • Cumbria: 15th August 2021

  • Kingston: 13th September 2023

  • Lancashire: 13th September 2023

  • North Wales: 16th August 2021

  • Shropshire: 13th September 2023

  • Staffordshire: 13th September 2023

  • West Midlands: 15th August 2021

  • Wiltshire: 16th August 2021

  • Yorkshire: 16th August 2021

Death records:

  • Bath: 17th June 2023

  • Berkshire: 15th August 2021

  • Cheshire: 17th June 2023

  • Cumbria: 17th June 2023

  • Kingston: 8th June 2023

  • Lancashire: 13th September 2023

  • North Wales: 17th June 2023

  • Shropshire: 13th September 2023

  • Staffordshire: 13th September 2023

  • West Midlands: 17th June 2023

  • Wiltshire: 17th June 2023

  • Yorkshire: 17th June 2023

Copyright Statement

The website hosting the UK local BMD project is operated by Weston Technologies Limited (Crewe, Cheshire, UK). This company is the owner or license-holder of the intellectual property constituting the raw birth and death records, as detailed here.

Under section 29A of the UK Copyright, Designs and Patents Act 1988, a copyright exception permits copies to be made of lawfully accessible material in order to conduct text and data mining for non-commercial research.

Consistent with this, the processed datasets presented here are the result of text-mining and neither reproduce any given birth or death record in their entirety, nor make it possible for them to be reconstructed.

The processed data in this repo are made available for non-commercial research purposes only.

Acknowledgements

I thank the volunteers and contributors to each of the UK local BMD projects for making their data publicly available.
