View a summary of the harvested data in the Trove Newspapers Data Dashboard.
This repository contains an automated git scraper that uses the Trove API to save information about the number of digitised newspaper articles currently available through Trove. It runs every week and updates two datasets:
- total_articles_by_year.csv
- total_articles_by_year_and_state.csv
- total_articles_by_newspaper.csv
- total_articles_by_category.csv
By retrieving all versions of these files from the commit history, you can analyse changes in Trove over time.
The dataset is saved as a CSV file containing the following columns:
year: year of original publication of newspaper articletotal: total number of articles from that year available in Trove
The dataset is saved as a CSV file containing the following columns:
state: state in which newspaper article was originally publishedyear: year of original publication of newspaper articletotal: total number of articles from that year and state available in Trove
Trove uses the following values for state:
- ACT
- International
- National
- New South Wales
- Northern Territory
- Queensland
- South Australia
- Tasmania
- Victoria
- Western Australia
This harvest uses the title facet to get the number of articles for each title. This information is supplemented with title data from the newspapers API endpoint. At the moment (20 April 2022), eight titles are missing from the newspapers endpoint, so the title, state, issn, start_date, and end_date fields will be blank for these records. Once this bug is fixed, the harvest should pick up the missing title data automatically.
The dataset is saved as a CSV file containing the following columns:
title_id: the Trove newspaper title identifiertotal: total number of articles from this newspapertitle: title of the newspaperstate: state in which the newspaper was publishedissn: title ISSN (where there is no ISSN an internal identifier has been assigned))start_date: date of earliest issue in Troveend_date: date of latest issue in Trove
The dataset is saved as a CSV file containing the following columns:
category: article category (eg: 'Advertising', 'Article')total: total number of articles assigned to this category
The harvesting code is available in get_article_totals.py. It uses facets to fetch the number of articles available from Trove. For more information and examples see Visualise the total number of newspaper articles in Trove by year and state in the GLAM Workbench.
I've also created a repository of historical harvests, created between 2011 and 2022.
Created by Tim Sherratt, April 2022. If you think this is useful, you can become a GitHub sponsor.