# Domesday Data

This notebook retrieves the Domesday data directly from the Hull University website and processes it for use in the analysis.ipynb notebook. For full reproducibility, run this notebook before running analysis.ipynb; it can be skipped, however, because the processed data are included in the repository. Note that some of the code below calls an external command-line tool, because the storage format used by Hull (MS Access) is not well supported by modern Python libraries. You will therefore need to run this notebook on a system with appropriate support (e.g., macOS or Linux), or, on Windows, either use WSL2 or extract and export the relevant data manually.
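
If you only need the processed output, a quick check along the following lines can tell you whether the download and export steps need to run at all. This is a minimal sketch: the path `../Data/doomsday_places.csv` is the one written by the final cell of this notebook, and `needs_processing` is a hypothetical helper name introduced here for illustration, not part of the original workflow.

```python
from pathlib import Path


def needs_processing(processed_csv):
    """Return True when the processed CSV is missing or empty."""
    path = Path(processed_csv)
    # A missing or zero-byte file means the notebook has to be run
    return not path.is_file() or path.stat().st_size == 0


# Path written by the final cell of this notebook
if needs_processing("../Data/doomsday_places.csv"):
    print("Processed data not found: run the cells below.")
else:
    print("Processed data found: this notebook can be skipped.")
```

The check only looks at the file's presence and size; it does not verify that the contents are current.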

In [14]:
import pandas as pd
import os
import requests
# pip install osgb
import osgb
import geopandas as gpd
from pyproj import Transformer
from bs4 import BeautifulSoup  # For HTML parsing
from urllib.parse import urljoin, urlparse  # For URL handling


In [15]:
def download_filtered_apache_directory(base_url, local_folder, visited=None, depth=0):
    """
    Recursively download files from an Apache-style directory listing.

    Parameters:
    -----------
    base_url : str
        The base URL of the Apache directory listing.
    local_folder : str
        The local folder where files will be saved.
    visited : set
        Set of URLs already visited to prevent infinite loops.
    depth : int
        Current depth of recursion for pretty-printing (default is 0).

    Returns:
    --------
    None
    """
    if visited is None:
        visited = set()

    # Avoid revisiting the same URL
    if base_url in visited:
        return
    visited.add(base_url)

    # Create an ignore list from the base URL's path segments
    parsed_base = urlparse(base_url)
    ignore_list = parsed_base.path.strip('/').split('/')

    try:
        response = requests.get(base_url)
        if response.status_code != 200:
            print(f"{'  ' * depth}Failed to access {base_url} (Status code: {response.status_code})")
            return

        print(f"{'  ' * depth}Directory: {base_url}")
        soup = BeautifulSoup(response.text, 'html.parser')

        for link in soup.find_all('a'):
            href = link.get('href')

            # Skip links starting with non-alphanumeric characters
            if not href or not href[0].isalnum():
                continue

            # Skip links containing any segment from the base URL
            if any(segment in href for segment in ignore_list):
                continue

            # Construct the full URL
            full_url = urljoin(base_url, href)

            if href.endswith('/'):  # It's a folder
                # Recursively process the subfolder
                subfolder = os.path.join(local_folder, href.strip('/'))
                download_filtered_apache_directory(full_url, subfolder, visited, depth + 1)
            else:  # It's a file
                # Download the file
                local_file_path = os.path.join(local_folder, href)
                print(f"{'  ' * (depth + 1)}Downloading: {full_url} -> {local_file_path}")
                download_file(full_url, local_file_path)

    except Exception as e:
        print(f"{'  ' * depth}Error processing {base_url}: {e}")


def download_file(url, local_path):
    """
    Download a file from a URL and save it to a local path.

    Parameters:
    -----------
    url : str
        The URL of the file to download.
    local_path : str
        The local file path where the file will be saved.

    Returns:
    --------
    None
    """
    try:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            # Ensure the local directory exists
            os.makedirs(os.path.dirname(local_path), exist_ok=True)

            with open(local_path, 'wb') as file:
                for chunk in response.iter_content(chunk_size=1024):
                    file.write(chunk)
        else:
            print(f"Failed to download {url} (Status code: {response.status_code})")
    except Exception as e:
        print(f"Error downloading {url}: {e}")


# Example usage
base_url = "https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/"
local_folder = "../Data/Doomsday"
download_filtered_apache_directory(base_url, local_folder)

Directory: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/
  Directory: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-456/
    Downloading: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-456/1-Dataset.mda -> ../Data/Doomsday/hull-456/1-Dataset.mda
    Downloading: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-456/metadata.txt -> ../Data/Doomsday/hull-456/metadata.txt
    Downloading: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-456/solr.xml -> ../Data/Doomsday/hull-456/solr.xml
  Directory: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-457/
    Downloading: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-457/1-Dataset.mda -> ../Data/Doomsday/hull-457/1-Dataset.mda
    Downloading: https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/hull-457/metadata.txt -> ../Data/Doomsday/hull-457/me
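
The two filtering rules inside `download_filtered_apache_directory` are easier to see in isolation. The sketch below applies the same checks to a handful of made-up `href` values of the kind an Apache index page typically contains (sort links, parent-directory links). `keep_link` is a hypothetical helper written for illustration only, and the sample hrefs are assumptions rather than the actual content of the Hull listing.

```python
from urllib.parse import urlparse


def keep_link(href, base_url):
    """Apply the same two filters used by the downloader above."""
    # Path segments of the base URL; links containing them point back up the tree
    ignore_list = urlparse(base_url).path.strip("/").split("/")
    # Rule 1: skip empty links and links starting with '?', '/', '.', etc.
    if not href or not href[0].isalnum():
        return False
    # Rule 2: skip links that contain any segment of the base path
    if any(segment in href for segment in ignore_list):
        return False
    return True


base = "https://api.library.hull.ac.uk/hydra-contents/hull-domesdayDatabases/"
samples = ["?C=N;O=D", "/hydra-contents/", "../", "hull-456/", "metadata.txt"]
for href in samples:
    print(f"{href!r:22} keep={keep_link(href, base)}")
```

Note that rule 2 is a substring test, so a file whose name happened to contain one of the base-path segments would also be skipped.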

## Export the Microsoft Access Data
The Domesday records are stored in MS Access data files. Once the download has finished, the table of Domesday places, with OSGB location codes and the other variables we need, is in the following file:

angkorclusters/Data/Doomsday/hull-462/1-Dataset.mda

We need command-line tools to convert the table to a CSV file for processing, so the next cell uses the `subprocess` module and assumes the notebook is running on Linux or macOS. This step requires `mdbtools`, a set of command-line utilities for reading Microsoft Access `.mdb` and `.mda` files. Install it with `sudo apt install mdbtools` (Debian/Ubuntu) or `brew install mdbtools` (macOS). Windows users must use WSL or export the data manually.
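
Before running the export, it can help to confirm that `mdbtools` is actually installed. The following is a minimal sketch using only the standard library; `mdbtools_available` and `list_tables` are hypothetical helper names introduced here, and `list_tables` assumes the `mdb-tables` utility that ships with `mdbtools` (its `-1` flag prints one table name per line).

```python
import shutil
import subprocess


def mdbtools_available():
    """Return True if the mdb-export executable is on the PATH."""
    return shutil.which("mdb-export") is not None


def list_tables(mdb_path):
    """List table names in an Access file, one per line, via mdb-tables."""
    result = subprocess.run(
        ["mdb-tables", "-1", mdb_path],
        capture_output=True, text=True, check=True,
    )
    return [name for name in result.stdout.splitlines() if name]


if mdbtools_available():
    print("mdbtools found; the export cell below should work.")
else:
    print("mdbtools not found; install it or export the table manually.")
```

With the tools installed and the data downloaded, calling `list_tables("../Data/Doomsday/hull-462/1-Dataset.mda")` should let you confirm that a table named `Places` is present before exporting it.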

In [16]:
import subprocess

# Define paths
input_file = "../Data/Doomsday/hull-462/1-Dataset.mda"
output_file = "../Data/Doomsday/domesday_places.csv"
table_name = "Places"

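# Run mdb-export and write its stdout to the CSV file.
# The context manager closes the file handle once the command has finished.
with open(output_file, "w") as csv_file:
    result = subprocess.run(
        ["mdb-export", input_file, table_name],
        stdout=csv_file,
        check=True,
    )
result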

CompletedProcess(args=['mdb-export', '../Data/Doomsday/hull-462/1-Dataset.mda', 'Places'], returncode=0)

In [19]:
# Load the CSV into a pandas DataFrame
df = pd.read_csv(output_file)

# Preview the DataFrame
print(df.head())


   PlacesIdx County           Phillimore        Hundred        Vill Area  \
0          1    WOR                 15,8  `Doddingtree'    Abberley  NaN   
1          6    ESS  20,20. 24,51. 34,16     `Winstree'    Abberton  NaN   
2         11    WOR                 9,1a       Pershore    Abberton  NaN   
3         16    DOR                 13,1   `Uggescombe'  Abbotsbury  NaN   
4         21    DEV                  5,6         Merton   Abbotsham  NaN   

  XRefs  OSrefs OScodes  
0   NaN  SO7567     NaN  
1   NaN  TL9919     NaN  
2   NaN  SO9953     NaN  
3   NaN  SY5785     NaN  
4   NaN  SS4226     NaN  


In [20]:
def process_os_refs(df, grid_column):
    """
    Convert OS Grid References to coordinates (lat/lon and OSGB easting/northing).

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing OS grid references.
    grid_column : str
        Name of the column with OS grid references.

    Returns:
    --------
    pandas.DataFrame:
        The input DataFrame, modified in place, with added lat, lon, easting, and northing columns.
    """
    # Initialize lists to store converted coordinates
    latitudes, longitudes, eastings, northings = [], [], [], []

    for grid_ref in df[grid_column]:
        try:
            # Convert OS grid reference to easting/northing
            easting, northing = osgb.gridder.parse_grid(grid_ref)
            lat, lon = osgb.convert.grid_to_ll(easting, northing)

            # Append to lists
            latitudes.append(lat)
            longitudes.append(lon)
            eastings.append(easting)
            northings.append(northing)

        except Exception as e:
            # Handle invalid grid references
            print(f"Error processing grid_ref {grid_ref}: {e}")
            latitudes.append(None)
            longitudes.append(None)
            eastings.append(None)
            northings.append(None)

    # Add new columns to DataFrame
    df['lat'] = latitudes
    df['lon'] = longitudes
    df['easting'] = eastings
    df['northing'] = northings

    return df

# Apply the function to the DataFrame
df = process_os_refs(df, grid_column="OSrefs")

Error processing grid_ref nan: I can't read a grid reference from this -> nan

In [21]:
df['OSrefs'].notna().sum(), df['OSrefs'].isna().sum()

(13458, 1309)
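
To make the conversion step less opaque, the sketch below shows how an Ordnance Survey grid reference such as `SO7567` maps to an easting and northing in metres. It is a hand-written illustration of the standard National Grid lettering scheme, not the `osgb` package's implementation; `parse_os_grid_ref` is a hypothetical name, and the notebook itself continues to rely on `osgb.gridder.parse_grid`.

```python
def parse_os_grid_ref(grid_ref):
    """Convert a reference such as 'SO7567' to (easting, northing) in metres."""
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not letters.isalpha() or "I" in letters:
        raise ValueError(f"Bad grid letters in {grid_ref!r}")
    if len(digits) % 2 != 0 or (digits and not digits.isdigit()):
        raise ValueError(f"Bad grid digits in {grid_ref!r}")

    # Each letter indexes a 5x5 grid whose alphabet omits 'I'
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Indices of the 100 km square, measured from the grid's false origin
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Split the digits in half and right-pad each half to metre precision
    half = len(digits) // 2
    easting = e100km * 100_000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100_000 + int(digits[half:].ljust(5, "0"))
    return easting, northing


for ref in ["SO7567", "TL9919", "SY5785"]:
    print(ref, parse_os_grid_ref(ref))
```

For `SO7567` this gives `(375000, 267000)`, which matches the easting and northing that appear for Abberley in the processed table. A four-figure reference like this identifies a 1 km square, and the returned point is that square's south-west corner.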

In [22]:
doomsday_places = df[df['OSrefs'].notna()]

In [23]:
# Convert to a GeoDataFrame
gdf = gpd.GeoDataFrame(
    doomsday_places,
    geometry=gpd.points_from_xy(doomsday_places['lon'], doomsday_places['lat']),
    crs="EPSG:4326"  # Set the coordinate reference system to WGS 84
)

# Write to a GeoPackage
output_path = "../Output/doomsday_gis.gpkg"
gdf.to_file(output_path, layer="doomsday_places", driver="GPKG")

print(f"GeoPackage written to {output_path}")

GeoPackage written to ../Output/doomsday_gis.gpkg


In [24]:
# Assume start and end dates based on historical information about the survey period.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (doomsday_places was created as a filtered slice of df).
doomsday_places = doomsday_places.assign(
    start_date=1066,  # Default start date
    end_date=1086,  # Default end date
)
doomsday_places


| | PlacesIdx | County | Phillimore | Hundred | Vill | Area | XRefs | OSrefs | OScodes | lat | lon | easting | northing | start_date | end_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | WOR | 15,8 | `Doddingtree' | Abberley | NaN | NaN | SO7567 | NaN | 52.300561 | -2.368032 | 375000.0 | 267000.0 | 1066 | 1086 |
| 1 | 6 | ESS | 20,20. 24,51. 34,16 | `Winstree' | Abberton | NaN | NaN | TL9919 | NaN | 51.834157 | 0.886905 | 599000.0 | 219000.0 | 1066 | 1086 |
| 2 | 11 | WOR | 9,1a | Pershore | Abberton | NaN | NaN | SO9953 | NaN | 52.175269 | -2.016034 | 399000.0 | 253000.0 | 1066 | 1086 |
| 3 | 16 | DOR | 13,1 | `Uggescombe' | Abbotsbury | NaN | NaN | SY5785 | NaN | 50.663064 | -2.609752 | 357000.0 | 85000.0 | 1066 | 1086 |
| 4 | 21 | DEV | 5,6 | Merton | Abbotsham | NaN | NaN | SS4226 | NaN | 51.011615 | -4.253705 | 242000.0 | 126000.0 | 1066 | 1086 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14762 | 73861 | STS | 222 | Offlow | Yoxall | NaN | NaN | SK1419 | NaN | 52.768434 | -1.793938 | 414000.0 | 319000.0 | 1066 | 1086 |
| 14763 | 73866 | SUF | 7,18. 44,4 | `Blything' | Yoxford | NaN | NaN | TM3968 | NaN | 52.258228 | 1.500634 | 639000.0 | 268000.0 | 1066 | 1086 |
| 14764 | 73871 | CHS | FT1,4 | Ati's Cross | Ysceifiog | Ati's Cross | NaN | SJ1571 | NaN | 53.229229 | -3.274776 | 315000.0 | 371000.0 | 1066 | 1086 |
| 14765 | 73876 | DEV | 63 | North Tawton | Zeal Monachorum | NaN | NaN | SS7103 | NaN | 50.812135 | -3.832415 | 271000.0 | 103000.0 | 1066 | 1086 |


In [None]:
# Define the path for the output CSV file
output_path = "../Data/doomsday_places.csv"
# Write the DataFrame to CSV
doomsday_places.to_csv(output_path, index=False)