# Parliamentary Papers

The NLA has digitised a large collection of reports and papers presented to the Commonwealth Parliament. 

In [1]:
import os
import time

import altair as alt
import pandas as pd
import requests
from dotenv import load_dotenv
from myst_nb import glue

load_dotenv()

True

In [2]:
YOUR_API_KEY = os.getenv("TROVE_API_KEY")

## Finding Parliamentary Papers

As documented in [](/understanding-search/finding-digitised-content), you can find NLA digitised resources by searching for `"nla.obj"` and selecting the 'Online' facet (if you're using the API, set `l-availability` to `y`). To further limit the results to digitised Parliamentary Papers, the best option seems to be adding `series:"Parliamentary paper (Australia. Parliament)"` to your search query.

[![Try it!](https://troveconsole.herokuapp.com/static/img/try-trove-api-console.svg)](https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fresult%3Fq%3D%22nla.obj%22+series%3A%22Parliamentary+paper+%28Australia.+Parliament%29%22%26category%3Dall%26l-availability%3Dy%26encoding%3Djson%26bulkHarvest%3Dtrue&comment=)

The `series` index is generated from the `isPartOf` field. This approach seems to return more Parliamentary Papers and much less noise than other options, such as setting `format` to `Government publication`.

Using this query, you can find the total number of work-level records describing digitised Parliamentary Papers in Trove.

In [27]:
params = {
    "q": '"nla.obj" series:"Parliamentary paper (Australia. Parliament)"',
    "category": "all",  # Get results from all categories
    "l-availability": "y",
    "encoding": "json",
    "n": 0,
    "bulkHarvest": "true",  # This will combine the results from all categories
}

headers = {"X-API-KEY": YOUR_API_KEY}

response = requests.get(
    "https://api.trove.nla.gov.au/v3/result", params=params, headers=headers
)
data = response.json()

print(f'There are {data["category"][0]["records"]["total"]:,} work records!')

There are 229,410 work records!


That's a lot of records! But before you take that number at face value, it's worth examining how those records are distributed across categories and formats.

Here's the number of records in each category. Remember that records can be duplicated across categories, so if you add up the category totals it'll be more than the total number calculated above.

In [23]:
params = {
    "q": '"nla.obj" series:"Parliamentary paper (Australia. Parliament)"',
    "category": "all",
    "l-availability": "y",
    "encoding": "json",
    "n": 0
}

headers = {"X-API-KEY": YOUR_API_KEY}

response = requests.get(
    "https://api.trove.nla.gov.au/v3/result", params=params, headers=headers
)
data = response.json()

totals = [{"category": c["code"], "total": c["records"]["total"]} for c in data["category"]]
pd.DataFrame(totals).style.format(thousands=",").hide()

category,total
book,13667
diary,15
image,12
list,0
magazine,206382
music,7
newspaper,0
people,0
research,189206


And here's the number of records by format. Remember that digitised resources can be [merged with other versions into works](/what-is-trove/works-and-versions), resulting in an odd mix of formats.

In [22]:
params = {
    "q": '"nla.obj" series:"Parliamentary paper (Australia. Parliament)"',
    "category": "all",
    "l-availability": "y",
    "encoding": "json",
    "n": 0,
    "facet": "format",
    "bulkHarvest": "true"
}

headers = {"X-API-KEY": YOUR_API_KEY}

response = requests.get(
    "https://api.trove.nla.gov.au/v3/result", params=params, headers=headers
)
data = response.json()

facets = [{"format": f["search"], "total": f["count"]} for f in data["category"][0]["facets"]["facet"][0]["term"]]

pd.DataFrame(facets).style.format(thousands=",").hide()

format,total
Article,217655
Book,13146
Government publication,11350
Periodical,606
Microform,102
Conference Proceedings,73
Archived website,60
Published,12
Map,11
Audio book,6


Looking at the tables above, you can see that most of the records relating to Parliamentary Papers have been assigned the `Article` format and can be found in the **Magazines & Newsletters** category. It also looks like many of these records are duplicated in **Research & Reports**.
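
Some quick arithmetic on the totals supports both observations. This is a minimal sketch that reuses the figures printed above, which will change as Trove is updated. The variable names are only for illustration, and it assumes (as the `bulkHarvest` comment above suggests) that the combined total counts each work record once.

```python
# Figures copied from the outputs above; they will drift as Trove changes
combined_total = 229_410  # all categories combined (bulkHarvest=true)
category_totals = {
    "book": 13_667, "diary": 15, "image": 12, "list": 0, "magazine": 206_382,
    "music": 7, "newspaper": 0, "people": 0, "research": 189_206,
}
article_total = 217_655  # 'Article' count from the format facet

# Share of work records that carry the 'Article' format
article_share = article_total / combined_total

# Category memberships beyond one per work record
extra_memberships = sum(category_totals.values()) - combined_total

# Inclusion-exclusion lower bound on records in BOTH magazine and research
min_overlap = category_totals["magazine"] + category_totals["research"] - combined_total

print(f"{article_share:.1%} of work records have the Article format")
print(f"{extra_memberships:,} category memberships are duplicates")
print(f"At least {min_overlap:,} records are in both magazine and research")
```

The last figure is only a lower bound: it is the smallest overlap that is consistent with the three totals, not a count of actual duplicates.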

You might be wondering why Parliamentary Papers would be described as 'articles'. If you look at [the results in the **Magazines & Newsletters** category](https://trove.nla.gov.au/search/category/magazines?keyword=%22nla.obj%22%20series%3A%22Parliamentary%20paper%20%28Australia.%20Parliament%29%22&l-availability=y), you'll see that the records describe *sections* of Parliamentary Papers, not complete publications. In other words, **the Parliamentary Papers are being treated like issues of a periodical** – the contents of each paper are being split up into sections (like articles in a journal), and a record is being created for each individual section.

This generates some odd 'articles', such as contents pages and appendices. When combined with the grouping of versions into works, it can also have some unfortunate consequences. For example, [here's a record](https://trove.nla.gov.au/work/237938382) where the 'Table of contents' sections of different Parliamentary Papers have been grouped as a single work!

The splitting of Parliamentary Papers into 'articles' also inflates the number of records. As a result, the total number of Parliamentary Papers will be considerably less than the total number of work-level records. 

How then can you limit the search to only show complete Parliamentary Papers and exclude the 'articles'? I don't think you can. If you add `NOT format:Article` to your search you'll exclude reports with the format `Article/Report`, and probably lose other publications that are grouped with `Article` records. You could just ignore the **Magazines & Newsletters** category, but many of the 'articles' are also in **Research & Reports** where they're mixed with other publication formats. There's no way to drop the 'articles' without losing other, more relevant, records.
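
One way to see the trade-off is to request totals for a few variations of the query and compare them. The sketch below only builds the parameter sets. The helper name `build_params` and the labels are made up for illustration, and each set would be sent with `requests.get` in the same way as the cells above.

```python
BASE_QUERY = '"nla.obj" series:"Parliamentary paper (Australia. Parliament)"'


def build_params(extra=""):
    """Return search parameters for the base query plus an optional extra clause."""
    return {
        "q": f"{BASE_QUERY} {extra}".strip(),
        "category": "all",
        "l-availability": "y",
        "encoding": "json",
        "n": 0,  # totals only, no records
        "bulkHarvest": "true",
    }


# Hypothetical labels; the extra clause is the one discussed above
variants = {
    "all records": build_params(),
    "without Article": build_params("NOT format:Article"),
}

for label, params in variants.items():
    print(f"{label}: {params['q']}")
```

The difference between the two totals shows how many work records the exclusion removes, but not which of them were complete publications grouped with 'articles'.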

To create a dataset that only contains details of complete Parliamentary Papers, you need to harvest metadata from the search above and then inspect the details of each record to exclude the ones you don't want. But this is further complicated by the different ways Parliamentary Papers are grouped and described in Trove.
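
A harvest along these lines might look like the sketch below. It assumes the API returns a `nextStart` cursor in each category's `records` and accepts it back through the `s` parameter, and that work records sit under `records["work"]`; check both against the API documentation. `harvest_works`, `fetch_page`, and `keep` are hypothetical names, and the `keep` test is left to the caller because the right filtering rule depends on how the records are grouped and described.

```python
def harvest_works(fetch_page, keep=lambda work: True):
    """Yield work records from every page of results that pass the `keep` test.

    `fetch_page(start)` must return the parsed JSON for one page of results,
    for example by calling requests.get with params that include {"s": start}.
    """
    start = "*"  # assumed cursor value for the first page
    while start:
        records = fetch_page(start)["category"][0]["records"]
        for work in records.get("work", []):
            if keep(work):
                yield work
        # No nextStart means there are no more pages
        start = records.get("nextStart")


# Demonstration with made-up pages standing in for API responses.
# The "section" flag is invented; real records carry no such field.
fake_pages = {
    "*": {"category": [{"records": {"nextStart": "abc", "work": [{"id": "1", "section": True}]}}]},
    "abc": {"category": [{"records": {"work": [{"id": "2", "section": False}]}}]},
}

harvested = list(harvest_works(fake_pages.get, keep=lambda w: not w["section"]))
print(harvested)
```

Passing the page fetcher in as a function keeps the paging logic separate from the HTTP request, so the loop can be tried out on sample data before running a full harvest.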

## Grouping and describing Parliamentary Papers

The way digitised Parliamentary Papers are grouped and described in Trove is inconsistent and sometimes confusing.

As noted above, individual Parliamentary Papers are often treated as issues of a periodical. But not always. Sometimes they are described as standalone works. This [report by the Australian Science and Technology Council on 'Marine sciences and technologies in Australia'](https://trove.nla.gov.au/work/9710970?) is treated like a book, and is linked to a single digitised resource. It might have been grouped with a [follow-up report from the next year](https://trove.nla.gov.au/work/9988298), but it doesn't really matter as they're both easily discoverable.

Sometimes individual Parliamentary Papers are not described at all. While attempting to harvest a full list of Parliamentary Papers, I noticed that I couldn't find the parent publications of some 'articles'. These publications are digitised and accessible, but they don't turn up in Trove's search results. The only way to find them, in either the web interface or API, is to navigate upwards from an 'article'.

Trove represents collections of resources in [a number of different ways](/what-is-trove/collections). Where Parliamentary Papers are grouped together as 'issues' (for example, all the annual reports of an agency), they're generally created as collections within the digitised item viewer. For example, the work record for [Report of the Senate Select Committee on Superannuation](https://trove.nla.gov.au/work/22095680) links to a page with a **Browse this collection** button. Clicking on the button displays details of 28 different reports published between 1992 and 2001.

In this case, both the collection and the individual reports within it have their own separate work records. So the digitised version of the 1993 report on the *Super Complaints Tribunal* can be accessed directly from [this work record](https://trove.nla.gov.au/work/237349942), or by using the **Browse this collection** page.

However, there are other examples where there are only work records for the collection, not the individual reports. This means you can only find and access the reports from the collection page, or in disaggregated form as separate articles.

All of this means that search results for Parliamentary Papers are a mix of different types of things – collections, publications, and articles – and it can be difficult to figure out what it is that you're actually searching.

The quality of the metadata also varies. The report on the *Super Complaints Tribunal*, for example, actually has the title 'PP no. 388 of 1993, Report no. 10', and so won't be returned by a title search for 'super complaints tribunal'. Added to that, there are a large number of duplicate records. 
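
Titles in this form do at least carry structured information that can be pulled out after harvesting. This is a small sketch based on the single example above. It assumes other titles follow the same 'PP no. N of YYYY' pattern, which may not hold, and `parse_pp_title` is a hypothetical helper.

```python
import re

# Matches the paper number and year in titles like the example above
PP_PATTERN = re.compile(r"PP no\. (\d+) of (\d{4})")


def parse_pp_title(title):
    """Return (paper number, year) if the title matches the pattern, else None."""
    if match := PP_PATTERN.search(title):
        return int(match.group(1)), int(match.group(2))
    return None


print(parse_pp_title("PP no. 388 of 1993, Report no. 10"))  # (388, 1993)
print(parse_pp_title("Marine sciences and technologies in Australia"))  # None
```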

## Research implications

Search: you can't assume everything will be in the results, so drill down through collections.

API 


## Overview of Parliamentary Papers

Using harvested dataset -- visualise by year etc.

## Problems


Reports grouped as collections

Duplicates

Records for articles not issues

Lots of articles from reports -- hard to exclude, 'Report' is a child of 'Article'

- most records 'Article'
- most articles in both research and magazine

Are the entries that point to sections only in magazine?

----

Individual publications -- eg: https://trove.nla.gov.au/work/9988298 (format Article, Government publication), similar report not grouped into collection https://trove.nla.gov.au/work/9710970; https://trove.nla.gov.au/work/10010110

Versions of works: https://api.trove.nla.gov.au/v3/work/11860613?encoding=json&include=workversions,links

Collections (ie of annual reports over time)

- some have work records for individual parts eg: collection - https://trove.nla.gov.au/work/22095680 and part - https://trove.nla.gov.au/work/237581960 and duplicate of part - https://trove.nla.gov.au/work/10005796
- some don't have individual records eg: https://trove.nla.gov.au/work/10055730

Sections of individual reports (or issues), treated like journal articles, can look like issues

Difficult to distinguish between these and find way to full digitised publication. GW approach -- testing for pages, testing for text.

Use harvested data set.

In [96]:
df = pd.read_csv("https://github.com/GLAM-Workbench/trove-parliamentary-papers-data/raw/main/trove-parliamentary-papers.csv")

In [97]:
subjects = df["subject"].str.split("|").explode().to_frame()
subjects["subject"] = subjects["subject"].str.strip(".")
subjects["subject"].value_counts().to_frame().reset_index()[:20].style.format(thousands=",").hide()

subject,count
Australia,6560
Australian,6558
"Finance, Public -- Australia -- Accounting -- Periodicals",1568
Tariff--Australia,1563
Administrative agencies -- Australia -- Auditing -- Periodicals,1165
"Finance, Public -- Australia -- Auditing",1139
"Finance, Public -- Auditing",1135
Executive departments -- Australia -- Auditing -- Periodicals,1135
Tariff Australia,1115
Federal issue,1112


In [98]:
import re


def clean_contributor(value):
    # Strip a trailing "<digits> <identifier>" suffix, then any surrounding full stops
    if cleaned := re.search(r"(.*?) [0-9]+ [0-9a-z\-]+$", str(value)):
        return cleaned.group(1).strip(".")
    return str(value).strip(".")


# One row per contributor: split the pipe-delimited values and explode the lists
contributors = df["contributor"].str.split("|").explode().to_frame()
contributors["cleaned"] = contributors["contributor"].apply(clean_contributor)
contributors.dropna()["cleaned"].value_counts().to_frame().reset_index()[:20].style.format(thousands=",").hide()

cleaned,count
Australia. Tariff Board,3799
Australia. Parliament,3275
Australian National Audit Office,3012
Australia. Parliament. Standing Committee on Public Works,2041
Australia. Industries Assistance Commission,1049
Australia. Parliament. Joint Committee of Public Accounts,820
Australia. Parliament. issuing body,787
Australia,417
Australia. Parliament. Senate. Committee of Privileges,388
Australia. Parliament. Joint Standing Committee on Treaties,341


In [99]:
# Pull a four-digit year from the end of each date value
df["year"] = df["date"].str.extract(r"\b(\d{4})$", expand=False)
# Count the records for each year
years = df["year"].value_counts().to_frame().reset_index()

alt.Chart(years).mark_bar().encode(
    x="year:T",
    y="count:Q"
).properties(width="container")
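
The pattern passed to `str.extract` only matches four digits at the very end of a value, so it is worth checking which date forms survive. The sample strings below are invented for illustration and are not taken from the dataset. The sketch uses `re.search`, which finds the first match in each string in the same way as `str.extract`.

```python
import re

YEAR_AT_END = re.compile(r"\b(\d{4})$")

# Invented values standing in for the dataset's date column
samples = ["1993", "1992-1993", "June 1993", "1993-06-30", "199-?"]

# Map each sample to the year the pattern extracts, or None if there is no match
extracted = {
    value: (match.group(1) if (match := YEAR_AT_END.search(value)) else None)
    for value in samples
}

for value, year in extracted.items():
    print(f"{value!r}: {year}")
```

Values that don't match end up with a missing year, and `value_counts` leaves them out of the chart without any warning.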