# National Archives of Australia Digitisation Dashboard

The National Archives of Australia's online database, RecordSearch, includes a list of recently digitised files, but this list only includes files digitised in the last month, so it's not possible to examine long-term changes. Since March 2021, I've captured weekly harvests of this list and saved them to [a GitHub repository](https://github.com/wragge/naa-recently-digitised). This dashboard aggregates the weekly harvests to display digitisation progress in the last week, the current year, and since harvests began in 2021.

A dataset containing annual compilations of the weekly harvests is [available from Zenodo](https://doi.org/10.5281/zenodo.14744049).

<p class="alert alert-danger"><a href="https://updates.timsherratt.org/2025/05/19/no-more-harvesting-data-from.html">Changes to RecordSearch in May 2025</a> have blocked the weekly harvests and made it impossible to update this dashboard. The last data harvest was on 11 May 2025.</p>

<style>.jp-RenderedHTMLCommon table { table-layout: auto;}</style>

In [35]:
import re
from pathlib import Path

import altair as alt
import arrow
import pandas as pd
from IPython.display import HTML, Markdown, display
from recordsearch_data_scraper.scrapers import RSSeries

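def add_series_titles(df):
    """Look up the title of each series in RecordSearch and add it as a `series_title` column."""
    series_list = list(df["series"].unique())

    cited_series = []
    for series in series_list:
        # The digitised count and access status aren't used here, so don't request them
        data = RSSeries(
            series, include_number_digitised=False, include_access_status=False
        ).data
        # Series with no title in the scraped data get an empty string
        cited_series.append({"series": series, "series_title": data.get("title", "")})

    # A left merge keeps every file, whatever the result of its series lookup
    df_titles = pd.merge(df, pd.DataFrame(cited_series), how="left", on="series")

    return df_titles

def top_series(df):
    """Display the 25 series with the most digitised files as a styled table."""
    # Show full series titles rather than truncating them
    with pd.option_context("display.max_colwidth", None):
        df = (
            df.value_counts(["series", "series_title"]).to_frame().reset_index()
        )
        df.columns = ["series", "series_title", "total"]
        # Link each series number to its RecordSearch entry
        df["series"] = df["series"].apply(lambda x: f"<a href='http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number={x}'>{x}</a>")
        display(
            df[:25]
            .style.set_properties(**{"text-align": "left"})
            .set_table_styles([{"selector": "th", "props": "text-align: left;"}])
            .hide()
            .format(thousands=",")
        )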

most_recent_harvest = sorted(Path("data").glob("*.csv"), reverse=True)[0]
total_harvests = len(list(Path("data").glob("*.csv")))
df_most_recent = pd.read_csv(most_recent_harvest, parse_dates=["date_digitised"], keep_default_na=False)
df_week = add_series_titles(df_most_recent)

dfs = [df_week]
for year in Path("years").glob("digitised_*.parquet"):
    df = pd.read_parquet(year)
    #df["date_digitised"] = pd.to_datetime(df["date_digitised"].astype(str), format='%Y-%d-%m').dt.date
    dfs.append(df)
df_all = pd.concat(dfs)
df_all.drop_duplicates(inplace=True)

df_year = df_all.copy().loc[df_all["date_digitised"] >= f"{arrow.now().year}-01-01"]
df_year.drop_duplicates(inplace=True)

start_date = arrow.get(df_all['date_digitised'].min()).format('D MMMM YYYY')
week_end = arrow.get(re.search(r"(\d{8})", most_recent_harvest.name).group(1), "YYYYMMDD")
week_start = week_end.shift(weeks=-1)
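
The week range above is derived from the eight-digit date in the harvest file name. As a minimal sketch of the same idea using only the standard library, the hypothetical helper `harvest_week` below (not part of the notebook) parses the date and raises a clear error when a file name contains no date, rather than failing on `.group(1)`. The file name pattern is assumed from the Datasette link used later in this notebook.

```python
import re
from datetime import date, datetime, timedelta


def harvest_week(filename: str) -> tuple[date, date]:
    """Return (week_start, week_end) for a harvest file name containing YYYYMMDD."""
    match = re.search(r"(\d{8})", filename)
    if match is None:
        raise ValueError(f"No YYYYMMDD date found in {filename!r}")
    week_end = datetime.strptime(match.group(1), "%Y%m%d").date()
    # Each harvest is treated as covering the seven days up to the harvest date
    return week_end - timedelta(weeks=1), week_end


week_start, week_end = harvest_week("digitised-week-ending-20250511.csv")
print(week_start.isoformat(), week_end.isoformat())  # 2025-05-04 2025-05-11
```

This version returns plain `datetime.date` objects, so it would need adapting to work with the `arrow` formatting calls used in the rest of the notebook.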

In [36]:
Markdown(f"There have been **{total_harvests} weekly harvests** since March 2021. The most recent harvest was on {week_end.format('D MMMM YYYY')}.")

There have been **211 weekly harvests** since March 2021. The most recent harvest was on 11 May 2025.

In [37]:
Markdown(f"## Files digitised last week ({week_start.format('D MMM YYYY')} – {week_end.format('D MMM YYYY')})")

## Files digitised last week (4 May 2025 – 11 May 2025)

In [38]:
Markdown(f"**{df_week.shape[0]:,} files** were digitised last week. Here is the number of files digitised per day.")

**5,012 files** were digitised last week. Here is the number of files digitised per day.

In [39]:
df_week

| | title | item_id | series | control_symbol | date_range | date_digitised | series_title |
|---|---|---|---|---|---|---|---|
| 0 | Volume 4 - Number 8 Service Flying Training Sc... | 30946910 | A10605 | 863/3 | 1943 – 1943 | 2025-05-10 | Personnel Occurrence Reports |
| 1 | Volume 3 - Number 8 Service Flying Training Sc... | 30946909 | A10605 | 863/2 | 1942 – 1943 | 2025-05-10 | Personnel Occurrence Reports |
| 2 | Volume 2 - Number 8 Service Flying Training Sc... | 30946908 | A10605 | 863/1 | 1942 – 1942 | 2025-05-10 | Personnel Occurrence Reports |
| 3 | Volume 12 - Number 1 Initial Flying Training S... | 30946896 | A10605 | 858/4 | 1953 – 1953 | 2025-05-10 | Personnel Occurrence Reports |
| 4 | PINTI Stefano [Application for Landing Permit ... | 7085371 | PP96/1 | W1955/3012 | 1954 – 1956 | 2025-05-10 | Correspondence [client] files, annual single n... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5007 | Ballarat Emergency Landing Ground [Landing gro... | 203940986 | B3712 | DRAWER 29 FOLDER 13 | 1920 – 1970 | 2025-05-05 | Folders of construction drawings, numerical se... |
| 5008 | Ballarat Emergency Landing Ground [Ballarat - ... | 203940984 | B3712 | DRAWER 29 FOLDER 13 | 1920 – 1970 | 2025-05-05 | Folders of construction drawings, numerical se... |
| 5009 | Ballarat Emergency Landing Ground [Landing gro... | 203940985 | B3712 | DRAWER 29 FOLDER 13 | 1920 – 1970 | 2025-05-05 | Folders of construction drawings, numerical se... |
| 5010 | Ballarat Emergency Landing Ground [Emergency l... | 203940983 | B3712 | DRAWER 29 FOLDER 13 | 1920 – 1970 | 2025-05-05 | Folders of construction drawings, numerical se... |


In [41]:
df_days = df_week.groupby(["date_digitised"])["item_id"].count().to_frame().reset_index()

alt.Chart(df_days).mark_bar().encode(
    x=alt.X("day(date_digitised):O", title="day of the week"),
    y=alt.Y("item_id:Q", title="number of files digitised"),
    tooltip=[alt.Tooltip("date_digitised", title="date"), alt.Tooltip("day(date_digitised)", title="day"), alt.Tooltip("item_id", title="number digitised", format=",")]
).properties(padding=20, width=300, height=200, title=f"Files digitised {week_start.format('D MMM YYYY')} – {week_end.format('D MMM YYYY')}" )

In [42]:
Markdown(f"The files came from **{df_week['series'].nunique():,} different series**. Here are the 25 series which contain the most files digitised last week.")

The files came from **76 different series**. Here are the 25 series which contain the most files digitised last week.

In [43]:
top_series(df_week)

| series | series_title | total |
|---|---|---|
| A8746 | Photographic colour negatives, chronological series with 'KN' or 'RKN' prefix and a single number suffix | 2211 |
| A11016 | Construction of Snowy Mountains Hydro-Electric Scheme, black and white photographic negatives, single number series | 1401 |
| D1989 | Application forms, medical examination documents and related papers of British and Foreign Immigrants (including Ex Service) in receipt of free and assisted passages, chronological order of ship arrival. | 359 |
| B6288 | | 260 |
| D3481 | Photographs (black and white, colour) of buildings, installations, sites, etc | 166 |
| PP246/4 | Personal Statement and Declaration forms, alphabetical order within nationality | 147 |
| B883 | Second Australian Imperial Force Personnel Dossiers, 1939-1947 | 98 |
| A10605 | Personnel Occurrence Reports | 84 |
| A2403 | | 70 |
| A705 | Correspondence files, multiple number (Melbourne) series (Primary numbers 1-323) | 50 |
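
B6288 and A2403 appear without titles in this table, which suggests the RecordSearch lookup returned no title for them. A minimal sketch, on made-up data, of one way to fill blank titles from rows where the same series already has a title, such as the annual compilations. The function name `fill_blank_titles` and the sample rows are hypothetical and not part of the notebook.

```python
import pandas as pd


def fill_blank_titles(df_new, df_known):
    """Fill empty series titles in df_new using titles found in df_known."""
    # Build a series -> title lookup from rows that do have a title
    known = (
        df_known.loc[df_known["series_title"].ne(""), ["series", "series_title"]]
        .drop_duplicates(subset="series")
        .set_index("series")["series_title"]
    )
    filled = df_new.copy()
    blank = filled["series_title"].eq("")
    # Series that have no title anywhere keep their empty title
    filled.loc[blank, "series_title"] = (
        filled.loc[blank, "series"].map(known).fillna("")
    )
    return filled


# Made-up sample rows for illustration only
df_new = pd.DataFrame(
    {"series": ["B6288", "A2403", "B883"], "series_title": ["", "", "Personnel dossiers"]}
)
df_known = pd.DataFrame(
    {"series": ["B6288", "B883"], "series_title": ["Photographs and negatives", "Personnel dossiers"]}
)
result = fill_blank_titles(df_new, df_known)
print(result["series_title"].tolist())
```

In this sketch B6288 picks up its title from `df_known`, while A2403 stays blank because no title for it exists in either frame.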


In [44]:
HTML(f'<div class="alert alert-block alert-info"><a href="https://glam-workbench.net/datasette-lite/?csv=https%3A%2F%2Fgithub.com%2Fwragge%2Fnaa-recently-digitised%2Fblob%2Fmaster%2Fdata%2Fdigitised-week-ending-{week_end.format("YYYYMMDD")}.csv&install=datasette-homepage-table&fts=title">Explore files digitised last week using Datasette</a></div>')

In [45]:
Markdown(f"## Files digitised this year ({arrow.now().year})")

## Files digitised this year (2025)

In [46]:
Markdown(f"**{df_year.shape[0]:,} files** have been digitised this year. Here is the number of files digitised per month.")

**76,057 files** have been digitised this year. Here is the number of files digitised per month.

In [47]:
df_months = df_year['date_digitised'].groupby(df_year['date_digitised'].dt.to_period("M")).agg('count').to_frame()
df_months.index = df_months.index.strftime('%Y-%m')
df_months.columns = ["count"]
df_months.reset_index(inplace=True)

In [48]:
alt.Chart(df_months).mark_bar().encode(
    x=alt.X("month(date_digitised):O", title="month"),
    y=alt.Y("count:Q", title="number of files digitised"),
    tooltip=[alt.Tooltip("month(date_digitised)", title="month"), alt.Tooltip("count", title="number digitised", format=",")]
).properties(width=300, height=200, padding=20, title=f"Files digitised {arrow.now().year}")

In [49]:
Markdown(f"The files came from **{df_year['series'].nunique():,} different series**. Here are the 25 series which contain the most files digitised so far this year.")

The files came from **737 different series**. Here are the 25 series which contain the most files digitised so far this year.

In [50]:
top_series(df_year)

| series | series_title | total |
|---|---|---|
| B883 | Second Australian Imperial Force Personnel Dossiers, 1939-1947 | 18050 |
| A714 | Books of duplicate certificates of naturalization A(1)[Individual person] series | 10409 |
| B6288 | Photographs and negatives of Commonwealth building sites and DHC/ACS departmental activities, single number series with 'V' (Victoria) prefix | 6123 |
| BP308/1 | Wacol Migrant Centre accommodation history cards, alphabetical series | 5658 |
| PP246/4 | Personal Statement and Declaration forms, alphabetical order within nationality | 3968 |
| A705 | Correspondence files, multiple number (Melbourne) series (Primary numbers 1-323) | 3290 |
| SP300/1 | ABC Talk Scripts - General | 2562 |
| A471 | Courts-Martial files [including war crimes trials], single number series | 2509 |
| A8746 | Photographic colour negatives, chronological series with 'KN' or 'RKN' prefix and a single number suffix | 2278 |
| A8746 | | 2211 |
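
A8746 is listed twice above, once with its title and once without. One possible cause, offered here as an assumption rather than a diagnosis, is that `drop_duplicates()` compares every column, so rows for the same item that differ only in `series_title` are both kept and are then counted as separate groups. The sketch below uses made-up rows to show one way of keeping a single row per `item_id`, preferring the row that has a title.

```python
import pandas as pd

# Made-up rows: item 1 appears twice, once without a series title
df = pd.DataFrame(
    {
        "item_id": [1, 1, 2, 3],
        "series": ["A8746", "A8746", "A8746", "B883"],
        "series_title": [
            "",
            "Photographic colour negatives",
            "Photographic colour negatives",
            "Personnel dossiers",
        ],
    }
)

# With no arguments, every column is compared, so both copies of item 1 survive
print(len(df.drop_duplicates()))  # 4

# Put titled rows first, then keep the first row for each item_id
deduped = (
    df.assign(has_title=df["series_title"].ne(""))
    .sort_values("has_title", ascending=False, kind="stable")
    .drop_duplicates(subset="item_id")
    .drop(columns="has_title")
    .sort_values("item_id")
)
print(len(deduped))  # 3
print(deduped.value_counts(["series", "series_title"]))
```

Deduplicating on `item_id` assumes that each file has a single, stable identifier across harvests, which would need checking against the real data.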


In [51]:
Markdown(f"## Files digitised since {start_date}")

## Files digitised since 4 February 2021

In [52]:
Markdown(f"**{df_all.shape[0]:,} files** have been digitised since {start_date}. Here is the number of files digitised per year.")

**1,478,687 files** have been digitised since 4 February 2021. Here is the number of files digitised per year.

In [53]:
df_years = df_all['date_digitised'].groupby(df_all['date_digitised'].dt.to_period("Y")).agg('count').to_frame()
df_years.index = df_years.index.strftime('%Y')
df_years.columns = ["count"]
df_years.reset_index(inplace=True)

In [54]:
alt.Chart(df_years).mark_bar().encode(
    x=alt.X("date_digitised:O", title="year"),
    y=alt.Y("count:Q", title="number of files digitised"),
    tooltip=[alt.Tooltip("date_digitised", title="year"), alt.Tooltip("count", title="number digitised", format=",")]
).properties(width=300, height=200, padding=20, title=f"Files digitised since {start_date}")

In [55]:
Markdown(f"The files came from **{df_all['series'].nunique():,} different series**. Here are the 25 series which contain the most files digitised since {start_date}.")

The files came from **2,949 different series**. Here are the 25 series which contain the most files digitised since 4 February 2021.

In [56]:
top_series(df_all)

| series | series_title | total |
|---|---|---|
| B883 | Second Australian Imperial Force Personnel Dossiers, 1939-1947 | 331133 |
| B884 | Citizen Military Forces Personnel Dossiers, 1939-1947 | 284168 |
| A9301 | RAAF Personnel files of Non-Commissioned Officers (NCOs) and other ranks, 1921-1948 | 189849 |
| A2571 | Name Index Cards, Migrants Registration [Bonegilla] | 162204 |
| A2572 | Name Index Cards, Migrants Registration [Bonegilla] | 101983 |
| A9300 | RAAF Officers Personnel files, 1921-1948 | 33909 |
| A714 | Books of duplicate certificates of naturalization A(1)[Individual person] series | 24012 |
| A6135 | Photographic colour transparencies positives, daily single number series with 'K' [Colour Transparencies] prefix | 20158 |
| A13150 | Specifications, examiners reports and correspondence relating to the Registration of Victorian Patents - Second system | 17546 |
| D3481 | Photographs (black and white, colour) of buildings, installations, sites, etc | 13555 |


In [57]:
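if week_start.year != week_end.year:
    # The latest harvest spans two years, so save a final copy of the previous year's data
    df_last_year = df_all.copy().loc[(df_all["date_digitised"] >= f"{week_start.year}-01-01") & (df_all["date_digitised"] <= f"{week_start.year}-12-31")]
    df_last_year.drop_duplicates(inplace=True)
    df_last_year.to_csv(f"years/digitised_{week_start.year}.csv", index=False)
    # Also refresh the Parquet file, because the annual data is loaded from digitised_*.parquet
    df_last_year.to_parquet(f"years/digitised_{week_start.year}.parquet", index=False)

# Save the current year's data for the next run
df_year.to_parquet(f"years/digitised_{week_end.year}.parquet", index=False)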

----

Created by [Tim Sherratt](https://timsherratt.au) for the [GLAM Workbench](https://glam-workbench.net).