# Accessing data from periodicals

In [4]:
# Let's import the libraries we need.
import os

from dotenv import load_dotenv
from myst_nb import glue

load_dotenv()

True

In [2]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## What is a periodical?

Periodicals are publications that are issued at regular intervals. They can include newspapers, magazines, and academic journals. Newspapers are managed and delivered separately in Trove, so this section focuses on all the other types of digitised periodicals. 

Sometimes it's not clear whether a publication is a periodical or not. What about annual reports produced by government departments? Or almanacs that are updated each year? 

- records describing periodicals are mostly in the **Books & Libraries** category
- records describing articles extracted from periodicals are in the **Magazines & Newsletters** category
- records describing selected periodicals and articles are in the **Research & Reports** category

## Digitised and born digital

Note about NED publications

## What periodicals have been digitised?

A search for [`"nla.obj" NOT series:"Parliamentary paper (Australia. Parliament)" NOT nuc:"ANL:NED"` with `l-format=Periodical` and `l-availability=y`](https://trove.nla.gov.au/search/category/books?keyword=%22nla.obj%22%20NOT%20series%3A%22Parliamentary%20paper%20%28Australia.%20Parliament%29%22%20NOT%20nuc%3A%22ANL%3ANED%22&l-format=Periodical&l-availability=y) returns 968 results across different categories (mostly **Books & Libraries**).

API provides another option

Limits/problems:

- includes parliamentary papers
- hundreds of duplicates
- titles that have had articles extracted from them
- some missing?
- no links to works (some ISSNs)

Differences between the two sources.

No work links from API but can get extra metadata from collection page -- include link to NLA catalogue and MARC data

- Get a list of titles
- Get a list of issues for a title
- Get extra info about issues (number of pages etc)
- Search for articles
- Get text from articles
- Get articles in an issue (from embedded metadata)
- Get text from issue
- Get text from title
- Get covers of title
- Get pages of issue

## Parliamentary papers

Digitised Parliamentary Papers are often treated as periodicals in Trove. For example, sections extracted from them appear as 'articles' in the **Magazines & Newsletters** category, and some turn up in searches for works with a `format` value of `Periodical`. You can understand why annual reports by government agencies, for example, might be considered as periodicals. However, the treatment of the Parliamentary Papers isn't consistent across Trove, and their numbers tend to overwhelm other periodicals. For this reason, I think it makes sense to consider Parliamentary Papers separately.

## Periodical titles

Finding out which periodicals in Trove have been digitised is not straightforward. There are two basic approaches:

- use the `magazine/titles` API endpoint to retrieve a list of digitised periodical titles
- search for `"nla.obj"` to find digitised items, setting the `format` facet to `Periodical` and the `availability` facet to `y`

These two approaches return similar, but not identical, sets of results. Both have problems and inconsistencies. The best method will probably depend on what you want to do with the data.
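As a rough sketch of the second approach, the same search can be sent to the `/result` endpoint of the API. The helper names `build_search_params` and `get_total` are hypothetical, and the layout of the response (`category` → `records` → `total`) is my assumption about how v3 search results are structured, so check it against a live response before relying on it.

```python
API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_search_params(category="book"):
    """Build parameters for a search for digitised periodicals."""
    return {
        # Digitised items include an nla.obj identifier
        "q": '"nla.obj"',
        "category": category,
        # Facets limiting results to periodicals that are available online
        "l-format": "Periodical",
        "l-availability": "y",
        "encoding": "json",
    }


def get_total(data):
    """Get the total number of results in the first category.

    The nesting used here is an assumption about the response layout.
    """
    return data["category"][0]["records"]["total"]


# Usage (needs `requests` and an API key, as in the other examples):
# response = requests.get(
#     API_URL, params=build_search_params(), headers={"X-API-KEY": API_KEY}
# )
# print(get_total(response.json()))
```

Changing the `category` value lets you repeat the search in other categories, which matters because periodical records aren't all in one place.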

### Get titles using the `magazine/titles` API endpoint

If you send a request to the `magazine/titles` API endpoint you get back the first group of periodical titles. You can set the number of titles returned per request using the `limit` parameter, up to a maximum of 100. To harvest details of *all* titles, you need to work your way through the complete dataset by using the `offset` parameter to move each request forward to the next group of titles. So if `limit` is set to `100`, your first request would have an `offset` value of `0`, and your second request would have an `offset` value of `100`. You'd then keep incrementing the `offset` value by 100 until you reach the end of the dataset. Here's a full example.

In [30]:
import requests

# The maximum number of records at a time is 100
# To page through the complete list use the offset value
params = {
    "encoding": "json",
    "limit": 100,
    "offset": 0
}

headers = {"X-API-KEY": API_KEY}

# Create a list to store the harvested titles in
titles = []
# When we reach the end, this will be set to False
more = True

# Continue until there are no more records
while more:
    # Request the data
    response = requests.get("https://api.trove.nla.gov.au/v3/magazine/titles", params=params, headers=headers)
    data = response.json()
    
    # If there are results there'll be a `magazine` key.
    # We can check this to see if we're at the end of the data.
    if "magazine" in data:
        # Add the titles
        titles += data["magazine"]
        # Update the offset value
        params["offset"] += 100
    # If there are no records we must be at the end of the data
    else:
        more = False

# Display the first two titles
display(titles[0:2])

[{'id': 'nla.obj-2526944948',
  'title': '... Annual report of the Canned Fruits Control Board for the year ...',
  'publisher': 'Printed and published for the Government of the Commonwealth of Australia by L. F. Johnston, Government Printer',
  'place': ['Australia'],
  'troveUrl': 'https://nla.gov.au/nla.obj-2526944948',
  'startDate': '1927-01-01',
  'endDate': '1937-06-30'},
 {'id': 'nla.obj-244631375',
  'title': '... musical cabinet, no. 1-37 by W.H. Glen & Co. [1875-1903]',
  'publisher': 'W.H. Glen & Co.',
  'troveUrl': 'https://nla.gov.au/nla.obj-244631375'}]

By converting the list of titles into a dataframe, you can explore the contents.

How many titles are there?

In [34]:
import pandas as pd

df_titles = pd.DataFrame(titles)
df_titles.shape[0]

2504

The metadata available for each title varies. Every entry has an `id`, `title`, and `troveUrl`, and most have a `publisher`, `startDate` and `endDate`. Here's the percentage of missing values for each column.

In [52]:
(df_titles.isnull().sum() / df_titles.shape[0]).to_frame().style.format({0: "{:.2%}"}).hide(axis=1)

| Field | Missing values |
|---|---|
| id | 0.00% |
| title | 0.00% |
| publisher | 0.59% |
| place | 20.66% |
| troveUrl | 0.00% |
| startDate | 0.50% |
| endDate | 11.40% |
| issn | 63.16% |


#### Removing duplicates

Unfortunately the `magazine/titles` endpoint data includes a significant number of duplicate records. Here are some examples.

In [37]:
df_titles.loc[df_titles.duplicated(["id"], keep=False)].sort_values("id").head(6)

| | id | title | publisher | place | troveUrl | startDate | endDate | issn |
|---|---|---|---|---|---|---|---|---|
| 166 | nla.obj-1006706212 | Annual report | Australian Government Publishing Service | [Australia] | https://nla.gov.au/nla.obj-1006706212 | 1986-01-01 | 1991-06-30 | 0818-6049 |
| 508 | nla.obj-1006706212 | Annual report | Australian Government Publishing Service | [Australia] | https://nla.gov.au/nla.obj-1006706212 | 1986-01-01 | 1991-06-30 | 0818-6049 |
| 165 | nla.obj-1006814935 | Annual report | Australian Govt. Pub. Service | [Australia] | https://nla.gov.au/nla.obj-1006814935 | 1986-01-01 | 1990-06-30 | 0818-4763 |
| 509 | nla.obj-1006814935 | Annual report | Australian Govt. Pub. Service | [Australia] | https://nla.gov.au/nla.obj-1006814935 | 1986-01-01 | 1990-06-30 | 0818-4763 |
| 164 | nla.obj-1006922376 | Annual report | Australian Government Publishing Service | [Australia] | https://nla.gov.au/nla.obj-1006922376 | 1986-01-01 | 1990-06-30 | 0819-5293 |
| 510 | nla.obj-1006922376 | Annual report | Australian Government Publishing Service | [Australia] | https://nla.gov.au/nla.obj-1006922376 | 1986-01-01 | 1990-06-30 | 0819-5293 |


If you know that they're there, the duplicates are easy to remove.

In [41]:
df_titles.drop_duplicates(["id"], inplace=True)

How many titles are there now?

In [42]:
df_titles.shape[0]

2193

#### Removing parliamentary papers

To get a sense of the types of periodicals in the dataset you can look at the titles. You'll see that many of them are just called 'Annual report'.

In [63]:
display(df_titles["title"].value_counts()[:10].to_frame().style.hide(axis=1))

| title | count |
|---|---|
| Annual report | 350 |
| Report | 55 |
| Notes on the science of building. | 26 |
| The Newcastle and Maitland Catholic Sentinel : the official organ of the diocese of Maitland. | 11 |
| Report for the period ... | 9 |
| Annual report for year ended ... | 5 |
| Report for ... | 5 |
| Bulletin | 4 |
| Reports on the examination of annual reports | 4 |
| Civil works program | 4 |


This reflects the fact that many of the periodicals in the `magazine/titles` endpoint data are actually Parliamentary Papers. This may not be what you expect or want. There's no metadata field in the API results that identifies Parliamentary Papers, so there's no easy way to filter them out. The only way to exclude them is to compare the list of periodical titles with a previously-harvested list of Parliamentary Papers. The code below extracts identifiers from a dataset of Parliamentary Papers, and uses them as a filter on the list of titles.

In [21]:
# Load the Parliamentary Papers dataset
df_pp = pd.read_csv("https://media.githubusercontent.com/media/GLAM-Workbench/trove-parliamentary-papers-data/main/trove-parliamentary-papers.csv", keep_default_na=False)

# The PP dataset contains individual publications (issues). The parent of an issue should be the periodical title.
# Extract and dedupe the ids from the parent field.
pp_ids = list(df_pp.loc[df_pp["parent"] != ""]["parent"].str.split("|").explode().reset_index()["parent"].unique())

# Exclude titles whose id is in the list of PP parent ids
df_notpp = df_titles.loc[~df_titles["id"].isin(pp_ids)]

How many titles are left?

In [55]:
df_notpp.shape[0]

964

#### Data problems

Once you start working with the data from the `magazine/titles` endpoint you might notice other problems, such as:

- some of the entries point to periodical issues, rather than titles (that's why there are multiple entries for *The Newcastle and Maitland Catholic Sentinel* in the list above)
- some of the titles point to

### Get details of an individual title

You can use the `magazine/title` endpoint to retrieve information about an individual periodical title. You can supply either an `nla.obj` identifier or a numeric work identifier. For example, to get details about the [*Journal of Soil Conservation*](https://trove.nla.gov.au/work/10411388) you can either use its work identifier `10411388`:

`https://api.trove.nla.gov.au/v3/magazine/title/10411388?encoding=json`

[![Try it!](https://troveconsole.herokuapp.com/static/img/try-trove-api-console.svg)](https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fmagazine%2Ftitle%2F10411388%3Fencoding%3Djson&comment=)

Or you can use its digital object identifier `nla.obj-740911077`:

`https://api.trove.nla.gov.au/v3/magazine/title/nla.obj-740911077?encoding=json`

[![Try it!](https://troveconsole.herokuapp.com/static/img/try-trove-api-console.svg)](https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fmagazine%2Ftitle%2Fnla.obj-740911077%3Fencoding%3Djson&comment=)

Both of these requests return exactly the same data:

```json
{
    "id": "nla.obj-740911077",
    "title": "Journal of the Soil Conservation Service of New South Wales.",
    "issn": "0028-6818",
    "publisher": "Govt. Printer",
    "place": [
        "New South Wales"
    ],
    "troveUrl": "https://nla.gov.au/nla.obj-740911077",
    "startDate": "1945-01-01",
    "endDate": "1982-01-01"
}
```

You might notice that while you can use a periodical's work id to retrieve its API record, the data doesn't actually include a link *back* to the work. This means that there's no simple way to look up additional metadata describing a periodical using this API endpoint. To find a corresponding work record, you have to search for the digital object identifier using the `/result` endpoint. This is not an exact search, and will match the identifier wherever it appears in a record. As a result, it's likely to return multiple results and require some manual checking. Setting the `l-format` parameter to `Periodical` and `l-availability` to `y` should help narrow things down.

[![Try it!](https://troveconsole.herokuapp.com/static/img/try-trove-api-console.svg)](https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fresult%3Fq%3D%22nla.obj-740911077%22%26category%3Dbook%26l-format%3DPeriodical%26l-availability%3Dy%26encoding%3Djson&comment=)
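The lookup described above might be sketched like this. The function names are hypothetical, and I'm assuming that matching works are nested under `category` → `records` → `work` in the search response, with `id`, `title`, and `troveUrl` fields on each work. Check a live response to confirm before building on it.

```python
def build_work_lookup_params(obj_id):
    """Build /result parameters to find works that mention an nla.obj identifier."""
    return {
        # Wrap the identifier in quotes so it's searched as a phrase
        "q": f'"{obj_id}"',
        "category": "book",
        "l-format": "Periodical",
        "l-availability": "y",
        "encoding": "json",
    }


def list_candidate_works(data):
    """Pull basic details of matching works out of a search response.

    The nesting used here is an assumption about the response layout.
    """
    candidates = []
    for category in data.get("category", []):
        for work in category.get("records", {}).get("work", []):
            candidates.append(
                {
                    "id": work.get("id"),
                    "title": work.get("title"),
                    "troveUrl": work.get("troveUrl"),
                }
            )
    return candidates


# Usage (needs `requests` and an API key, as in the other examples):
# response = requests.get(
#     "https://api.trove.nla.gov.au/v3/result",
#     params=build_work_lookup_params("nla.obj-740911077"),
#     headers={"X-API-KEY": API_KEY},
# )
# for work in list_candidate_works(response.json()):
#     print(work)
```

Because the search isn't exact, treat the results as a list of candidates to check by hand rather than a definitive match.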

The data returned about individual periodicals using the `magazine/title` endpoint is the same as the data you get back about multiple titles using `magazine/titles`. However, there are a couple of additional parameters you can add to get information about issues from that periodical. These are described below.

## Issues

You can use the `magazine/title` API endpoint to retrieve information about issues of a periodical that have been digitised and are available through Trove. As described above, you can get details about an individual periodical using either its work identifier or `nla.obj` identifier.

### Find the number of issues per year

You can find the number of digitised issues per year by setting the `include` parameter to `years`. The data returned will include a list of years for which digitised issues are available, and the number of issues available each year. For example, to get issue counts from the *Australasian pocket almanack* (`nla.obj-2967431735`), you'd request:

`https://api.trove.nla.gov.au/v3/magazine/title/nla.obj-2967431735?encoding=json&include=years`

[![Try it!](https://troveconsole.herokuapp.com/static/img/try-trove-api-console.svg)](https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fmagazine%2Ftitle%2Fnla.obj-2967431735%3Fencoding%3Djson%26include%3Dyears&comment=)

Here's the data returned:

```json
{
    "id": "nla.obj-2967431735",
    "title": "Australasian pocket almanack : for the year of Our Lord ...",
    "publisher": "Compiled and printed by Robert Howe",
    "place": [
        "Australasia"
    ],
    "troveUrl": "https://nla.gov.au/nla.obj-2967431735",
    "startDate": "1822-01-01",
    "endDate": "1826-01-01",
    "year": [
        {
            "date": "1822",
            "issuecount": 1
        },
        {
            "date": "1823",
            "issuecount": 1
        },
        {
            "date": "1824",
            "issuecount": 1
        },
        {
            "date": "1825",
            "issuecount": 1
        },
        {
            "date": "1826",
            "issuecount": 1
        }
    ]
}
```

It's important to note that the `startDate` and `endDate` values don't necessarily match the range of years returned. I'm assuming that `startDate` and `endDate` are based on the available bibliographic data, while the list of years is generated from the issues that have actually been digitised. If any issues are undated, the list of years will include a value with the `date` set to `unknown`.

Using this data you can calculate the range of available years and the total number of issues available.

In [73]:
import requests

obj_id = "nla.obj-740911077"

params = {
    "include": "years",
    "encoding": "json"
}

headers = {"X-API-KEY": API_KEY}

response = requests.get(f"https://api.trove.nla.gov.au/v3/magazine/title/{obj_id}", params=params, headers=headers)
data = response.json()
print(data["title"])

# Undated issues have a `date` of "unknown", so leave them out of the range
years = sorted(y["date"] for y in data["year"] if y["date"] != "unknown")
print(f"First year: {years[0]}")
print(f"Last year: {years[-1]}")

issue_count = 0
for year in data["year"]:
    issue_count += year["issuecount"]
print(f"Total issues: {issue_count}")

Journal of the Soil Conservation Service of New South Wales.
First year: 1945
Last year: 1982
Total issues: 183


You can also visualise the distribution of issues over time.

In [79]:
import requests
import altair as alt
import pandas as pd

work_id = "11500235"

params = {
    "include": "years",
    "encoding": "json"
}

headers = {"X-API-KEY": API_KEY}

response = requests.get(f"https://api.trove.nla.gov.au/v3/magazine/title/{work_id}", params=params, headers=headers)
data = response.json()

df_counts = pd.DataFrame(data["year"])

alt.Chart(df_counts).mark_bar().encode(
    x="date:T",
    y="issuecount:Q"
).properties(width=600, title=data["title"])

### Get a list of individual issues

You can get details of individual issues from the `magazine/title` API endpoint by setting the `include` parameter to `years` and adding the `range` parameter to specify a date range. The `range` parameter needs to be in the format `YYYYMMDD-YYYYMMDD`. For example, to retrieve details of issues from 1920 to 1950, you'd use a `range` value of `19200101-19501231`.

If the number of issues is fairly small, you can just use dummy dates to set the range well beyond the limits of the periodical, something like `18000101-21001231` for example. That means you'll get everything at once. 

So to get details of all issues in the *Australasian pocket almanack* (`nla.obj-2967431735`) you might request:

`https://api.trove.nla.gov.au/v3/magazine/title/nla.obj-2967431735?encoding=json&include=years&range=18000101-21001231`

[![Try it!](https://troveconsole.herokuapp.com/static/img/try-trove-api-console.svg)](https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fmagazine%2Ftitle%2Fnla.obj-2967431735%3Fencoding%3Djson%26include%3Dyears%26range%3D18000101-21001231&comment=)

The data will include a list of issues for each year:

```json
{
    "id": "nla.obj-2967431735",
    "title": "Australasian pocket almanack : for the year of Our Lord ...",
    "publisher": "Compiled and printed by Robert Howe",
    "place": [
        "Australasia"
    ],
    "troveUrl": "https://nla.gov.au/nla.obj-2967431735",
    "startDate": "1822-01-01",
    "endDate": "1826-01-01",
    "year": [
        {
            "date": "1822",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2967947168",
                    "date": "1822-01-01",
                    "url": "https://nla.gov.au/nla.obj-2967947168"
                }
            ]
        },
        {
            "date": "1823",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2977982467",
                    "date": "1823-01-01",
                    "url": "https://nla.gov.au/nla.obj-2977982467"
                }
            ]
        },
        {
            "date": "1824",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2969887983",
                    "date": "1824-01-01",
                    "url": "https://nla.gov.au/nla.obj-2969887983"
                }
            ]
        },
        {
            "date": "1825",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2967965114",
                    "date": "1825-01-01",
                    "url": "https://nla.gov.au/nla.obj-2967965114"
                }
            ]
        },
        {
            "date": "1826",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2967986225",
                    "date": "1826-01-01",
                    "url": "https://nla.gov.au/nla.obj-2967986225"
                }
            ]
        }
    ]
}
```

You can see the data for each issue is pretty minimal, really just an identifier/url and a date. 
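Since the issues are nested inside years, it can be handy to flatten them into a simple list before doing anything else. Here's a minimal sketch using a cut-down copy of the response above. `flatten_issues` and the output field names are my own inventions, not part of the API.

```python
def flatten_issues(title_data):
    """Convert year-grouped issue data into a flat list of issues."""
    issues = []
    for year in title_data.get("year", []):
        # Years without an `issue` list contribute nothing
        for issue in year.get("issue", []):
            issues.append(
                {
                    "title_id": title_data["id"],
                    "year": year["date"],
                    "issue_id": issue["id"],
                    "date": issue.get("date"),
                    "url": issue.get("url"),
                }
            )
    return issues


# A cut-down version of the API response shown above
sample = {
    "id": "nla.obj-2967431735",
    "year": [
        {
            "date": "1822",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2967947168",
                    "date": "1822-01-01",
                    "url": "https://nla.gov.au/nla.obj-2967947168",
                }
            ],
        },
        {
            "date": "1823",
            "issuecount": 1,
            "issue": [
                {
                    "id": "nla.obj-2977982467",
                    "date": "1823-01-01",
                    "url": "https://nla.gov.au/nla.obj-2977982467",
                }
            ],
        },
    ],
}

issues = flatten_issues(sample)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame()` if you want to explore the issues as a table.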

Issues with `unknown` dates are missing from this list.

However, some periodicals have thousands of issues, and requesting them all in one hit might cause problems.
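One way to avoid a single huge request would be to get the list of years first (using `include=years` without a `range`), then request the issues in smaller chunks. The sketch below generates a `range` value for each chunk of years. The function name and the default chunk size of 10 are arbitrary choices of mine, not anything the API requires.

```python
def year_ranges(years, chunk_size=10):
    """Generate `range` values covering the given years in chunks.

    `years` is the `year` list from a `magazine/title` response.
    """
    # Undated issues have a date of 'unknown', so they can't be
    # included in a date range and are skipped here
    numeric_years = sorted(int(y["date"]) for y in years if y["date"].isdigit())
    ranges = []
    for i in range(0, len(numeric_years), chunk_size):
        chunk = numeric_years[i : i + chunk_size]
        # From the start of the first year to the end of the last year
        ranges.append(f"{chunk[0]}0101-{chunk[-1]}1231")
    return ranges


# Years from the Australasian pocket almanack example,
# plus an 'unknown' entry to show that it's ignored
sample_years = [
    {"date": "1822", "issuecount": 1},
    {"date": "1823", "issuecount": 1},
    {"date": "1824", "issuecount": 1},
    {"date": "1825", "issuecount": 1},
    {"date": "1826", "issuecount": 1},
    {"date": "unknown", "issuecount": 2},
]

ranges = year_ranges(sample_years, chunk_size=2)
```

Each value in `ranges` can then be used as the `range` parameter in a separate request, and the results combined.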

Problems with API:
- unknown dates
- some issues missing
- some issues are actually sub-collections

### Finding missing issues
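One possible starting point, offered as a sketch rather than a tested method, is to compare the `issuecount` reported for each year with the number of issues actually listed when you supply a `range`. This assumes that `issuecount` reflects the real number of digitised issues, which I haven't confirmed. `find_incomplete_years` is a hypothetical helper name.

```python
def find_incomplete_years(title_data):
    """List years where fewer issues are returned than `issuecount` reports."""
    incomplete = []
    for year in title_data.get("year", []):
        expected = year.get("issuecount", 0)
        # Years such as 'unknown' might have a count but no list of issues
        found = len(year.get("issue", []))
        if found < expected:
            incomplete.append(
                {"year": year["date"], "expected": expected, "found": found}
            )
    return incomplete


# Made-up data for illustration only
sample = {
    "id": "nla.obj-0000000000",
    "year": [
        {"date": "1900", "issuecount": 2, "issue": [{"id": "a"}, {"id": "b"}]},
        {"date": "1901", "issuecount": 3, "issue": [{"id": "c"}]},
        {"date": "unknown", "issuecount": 4},
    ],
}

gaps = find_incomplete_years(sample)
```

Any years flagged this way would still need to be checked by hand, or filled in using some other source of issue data.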

## Articles


```{admonition} Some articles are grouped as works!
:class: warning

Search results for periodical articles in the **Magazines & Newsletters** category are returned as work-level records. Usually the work records will only contain a single 'version' – the article. However, advertisements are treated differently. You will sometimes come across work records, [like this](https://trove.nla.gov.au/work/232859878), that munge together *all* the advertisements in a specific issue as 'versions'. While this might make your search results more manageable, it will have an impact on the discoverability and analysis of periodical content.
```

### Metadata

`/search` in `magazine` category and `/work` endpoints 


`bibliographicCitation` in article records has structured publication metadata

Advertisements on multiple pages in an issue grouped as a single work record for discovery: https://trove.nla.gov.au/work/232859472?keyword=fullTextInd%3Ay

Can access as separate versions via the API: https://troveconsole.herokuapp.com/v3/?url=https%3A%2F%2Fapi.trove.nla.gov.au%2Fv3%2Fwork%2F232859472%3Fencoding%3Djson%26include%3Dall&comment=

Use bibliographicCitation metadata

### Text

Via API

### Images and PDFs

Page images

## Issues

### Metadata

API provides `/magazine/titles` and `/magazine/title/[ID]` endpoints 

Been prodding the new `/magazine/title` endpoint that was added to the Trove API v3. It provides details on periodical titles and issues (other than newspapers). So it's very useful, but also very not...

Of the 2,504 titles, 1,538 point to sets of parliamentary papers. I suppose annual reports count as periodicals, but it would be good to be able to separate them out. In any case I've already got a full harvest of PPs.

Of the 966 left, 114 have no issues. That seems to be either because they're actually issues rather than titles, or they're just brokened. 

Another 124 titles have incomplete lists of issues, either because some of the issues have no date, or they're just brokened.

So as with just about everything involving Trove data, I'll have to develop a series of workarounds to deal with the problems and inconsistencies. This is my life now.

#### Format `periodical` and "nla.obj"

There are 2,500 titles in the title endpoint, but only about 1,000 when you search for `"nla.obj"` & `l-format=periodical`. Is there any way to reconcile? Is it because of PP?

Check by getting ids from title endpoint, then extracting embedded metadata? Will that help?

Get lists of nla.obj ids from both methods and compare -- see what the difference is.
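Once both lists of identifiers have been harvested, the comparison itself is just a matter of set operations. Here's a minimal sketch. The function name is hypothetical and the identifiers in the example are only there to show the mechanics, not to report real differences between the two sources.

```python
def compare_id_sets(endpoint_ids, search_ids):
    """Compare nla.obj ids from the titles endpoint with ids found by searching."""
    endpoint_ids = set(endpoint_ids)
    search_ids = set(search_ids)
    return {
        # Ids found using both methods
        "both": endpoint_ids & search_ids,
        # Ids only in the magazine/titles endpoint data
        "endpoint_only": endpoint_ids - search_ids,
        # Ids only in the search results
        "search_only": search_ids - endpoint_ids,
    }


# Illustrative input only
comparison = compare_id_sets(
    ["nla.obj-8423556", "nla.obj-740911077"],
    ["nla.obj-740911077", "nla.obj-2967431735"],
)

for group, ids in comparison.items():
    print(f"{group}: {len(ids)}")
```

Looking at a sample of the records in `endpoint_only` and `search_only` should help show whether Parliamentary Papers account for the difference.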

#### `/magazine/titles`

- paginated using `limit` and `offset`

Example record:

``` json
{
    "id": "nla.obj-8423556",
    "title": "\"Coo-ee!\" : the journal of the Bishops Knoll Hospital, Bristol.",
    "publisher": "Partridge & Love Ltd.",
    "troveUrl": "https://nla.gov.au/nla.obj-8423556",
    "startDate": "1916-01-01",
    "endDate": "1917-10-20"
}
```

#### `/magazine/title/[ID]`

- `[ID]` can be either an `nla.obj` id or a numeric work id (however, the work ids aren't in the returned records)
- Get a list of issues by using `include=years` and `range=YYYYMMDD-YYYYMMDD`
- issues returned grouped by year

Example issue:

``` json
{
    "id": "nla.obj-8447243",
    "date": "1916-11-10",
    "url": "https://nla.gov.au/nla.obj-8447243"
}
```

Issue id

```{code-cell} ipython3

```

## Periodical titles