# Accessing data from digitised resources

````{card}
On this page

```{contents}
:local:
:backlinks: None
```
````

Trove's digitised resources are delivered in a number of different ways depending on their format and arrangement. See [](whatis:interfaces:digitised) and [](/accessing-data/using-web-interface) for hints on using digitised resources through Trove's web interface. 

Access to machine-readable data is even more complicated. The Trove API provides limited information about digitised resources, necessitating a variety of hacks and workarounds. Nonetheless, there are some methods of accessing data from digitised resources that work reliably across multiple formats. These are described below.

For more specific information relating to particular formats see:

- [Accessing data from oral histories](oral-histories/accessing-data.md)
- [Accessing data from periodicals](periodicals/accessing-data.md)

## Metadata

There are two main sources of machine-readable metadata that describe digitised resources:

- work/version records delivered by the Trove API
- JSON embedded in the digitised resource viewer

### Work/version records delivered by the Trove API



### JSON embedded in the digitised resource viewer

## Collections

````{margin}
```{seealso}
The GLAM Workbench notebook [Download a collection of digitised images](https://glam-workbench.net/trove-images/download-image-collection/) uses this method to find and download all the images in a collection or finding-aid.
```
````

The NLA’s digitised resources are often presented as 'collections'. A collection could be the volumes in a multi-volume work, the issues of a periodical, a map series, an album of photographs, or a manuscript collection. While you can use the `magazine/title` API endpoint to get a list of issues from a periodical, there’s no way to get the contents of other types of collections from the Trove API.

To get machine-readable information about the members of a digitised collection, you need to extract information from the browse window of Trove's digitised collection viewer. This method is fully documented in [](how-to/get-collection-items).

(digitised:accessing-data:text)=
## Text

Digitised publications like books, pamphlets, and periodicals usually make their contents available as plain text, extracted from the digitised pages using Optical Character Recognition (OCR). There are two main ways of accessing OCRd text computationally:

- construct download links for a complete publication or range of pages
- download OCR data for a single page

(digitised:accessing-data:download-text-link)=
### Construct download links for a complete publication or range of pages

**This method is fully documented in [](how-to/get-downloads), but here's a quick summary.**

To download the complete OCRd text of a single publication you need to know the number of pages in the publication. This can be found by [extracting the metadata](/other-digitised-resources/how-to/extract-embedded-metadata) embedded in the digitised book and journal viewer and [getting the length of the `page` list](digitised:howto:embedded:pages).

You can then construct a url to download the OCRd text using the publication's `nla.obj` identifier and the total number of pages:

`https://nla.gov.au/[NLA.OBJ ID]/download?downloadOption=ocr&firstPage=0&lastPage=[TOTAL PAGES - 1]`

Note that the `lastPage` parameter is set to the total number of pages, minus one. This is because page numbering starts at zero. For example, [this issue](https://nla.gov.au/nla.obj-326379450) of *Pacific Islands Monthly* contains 164 pages, so the url to download the complete OCRd text would be:

<a href="https://nla.gov.au/nla.obj-326379450/download?downloadOption=ocr&firstPage=0&lastPage=163">https://nla.gov.au/nla.obj-326379450/download?downloadOption=ocr&firstPage=0&lastPage=163</a>

You can use the same url pattern to download OCRd text from any range of pages. For example, to download text from the first five pages of a publication, you'd set `firstPage` to `0` and `lastPage` to `4`. To download text from page two, you'd set both `firstPage` and `lastPage` to `1`.
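
The url pattern can be sketched as a small Python function. This is only an illustration, not part of any Trove client library: the function name `ocr_download_url` and its parameters are invented for this example, and it assumes you've already found the total number of pages from the embedded metadata.

```python
from urllib.parse import urlencode


def ocr_download_url(obj_id, total_pages, first_page=0, last_page=None):
    """Build a url to download OCRd text from a digitised publication.

    Page numbering starts at zero, so the last page of a publication
    is `total_pages - 1`. (Hypothetical helper, for illustration only.)
    """
    if last_page is None:
        last_page = total_pages - 1
    if not 0 <= first_page <= last_page < total_pages:
        raise ValueError("Page range is outside the publication")
    params = {
        "downloadOption": "ocr",
        "firstPage": first_page,
        "lastPage": last_page,
    }
    return f"https://nla.gov.au/{obj_id}/download?{urlencode(params)}"


# The issue of Pacific Islands Monthly described above has 164 pages
print(ocr_download_url("nla.obj-326379450", 164))

# The first five pages only
print(ocr_download_url("nla.obj-326379450", 164, first_page=0, last_page=4))
```

Checking the page range before building the url is a design choice for this sketch. It catches off-by-one mistakes, such as passing the total number of pages as `last_page`, before any request is made.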

(digitised:accessing-data:ocr)=
### Download OCR data for a single page

**This method is fully documented in [](how-to/get-ocr-layout-data), but here's a quick summary.**

If you know the `nla.obj` identifier of a specific page in a digitised publication, you can access machine-readable information about the OCR process by simply adding `/ocr` to the identifier url. For example, this [page](http://nla.gov.au/nla.obj-326405522) in *Pacific Islands Monthly* has the identifier `nla.obj-326405522`. To retrieve the OCR data you just add `/ocr` to the identifier: 

<a href="http://nla.gov.au/nla.obj-326405522/ocr">http://nla.gov.au/nla.obj-326405522/ocr</a>

To find the `nla.obj` identifiers for all the pages in a publication, you can [access the metadata](/other-digitised-resources/how-to/extract-embedded-metadata) embedded in the digitised book and journal viewer and then [extract the page identifiers from the `page` list](digitised:howto:embedded:pages).
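
Once you have a list of page identifiers, building the OCR urls is a matter of adding the suffix to each one. The sketch below assumes the identifiers have already been extracted from the embedded metadata. The function name `ocr_urls` is invented for this example, and the `https://nla.gov.au/` prefix follows the other urls on this page.

```python
def ocr_urls(page_ids):
    """Add the `/ocr` suffix to each page identifier url.

    `page_ids` is assumed to be a list of `nla.obj` identifiers
    that you've already extracted from the embedded metadata.
    """
    return [f"https://nla.gov.au/{page_id}/ocr" for page_id in page_ids]


# In practice this list would contain every page in the publication
page_ids = ["nla.obj-326405522"]

for url in ocr_urls(page_ids):
    print(url)
```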

The OCR data is quite complex. It contains information about the position of *every word* on the page. To extract just the text you have to find all the text blocks, then loop through each line and word, stitching them back together as a plain text document. If all you want is the text, the method described above is probably more efficient, but if you're interested in the layout as well as the content of a page, this method opens up some new possibilities.

(digitised:accessing-data:images)=
## Images and PDFs

Most digitised resources include images you can download. Images can be digitised versions of visual material such as photographs, maps, or artworks, but they can also be scanned copies of pages in a publication or manuscript collection. There are two main methods for accessing digitised images computationally:

- Construct download links for a range of images
- Constructing image urls using `nla.obj` identifiers

In addition, it's possible to extract illustrations from pages of digitised books and periodicals by using data generated through the OCR process.

(digitised:accessing-data:download-images-link)=
### Construct download links for a range of images

**This method is fully documented in [](how-to/get-downloads), but here's a quick summary.**

This method is basically the same as [the method described above](digitised:accessing-data:download-text-link) to download OCRd text. You just need to set the `downloadOption` parameter in the url to either `zip` for images or `pdf` for a PDF. For example, the [E.J. Brady collection of photographs](https://nla.gov.au/nla.obj-141826952) (nla.obj-141826952) contains 14 images, so the url to download the complete collection in a single zip file would be: 

<a href="https://nla.gov.au/nla.obj-141826952/download?downloadOption=zip&firstPage=0&lastPage=13">https://nla.gov.au/nla.obj-141826952/download?downloadOption=zip&firstPage=0&lastPage=13</a>

Similarly, [The gold finder of Australia : how he went, how he fared, how he made his fortune](https://nla.gov.au/nla.obj-248742150) is a pamphlet with 80 pages, so the url to download it as a PDF would be:

<a href="https://nla.gov.au/nla.obj-248742150/download?downloadOption=pdf&firstPage=0&lastPage=79">https://nla.gov.au/nla.obj-248742150/download?downloadOption=pdf&firstPage=0&lastPage=79</a>

You can also adjust the `firstPage` and `lastPage` to download selected images.

It's important to note that zip files containing multiple images can get very large. If you want to download all the images from publications or collections, you should probably use the method described below to download one image at a time.

(digitised:data:image-urls)=
### Constructing image urls using `nla.obj` identifiers

**This method is fully documented in [](/other-digitised-resources/how-to/download-images), but here's a quick summary.**

If you know the `nla.obj` identifier for a page or image, you can download it simply by adding an `/image` suffix to the identifier url. For example, this [photograph of a group of school children with gardening tools](https://nla.gov.au/nla.obj-141828112) has the identifier `nla.obj-141828112`. To create a direct link to the image, you just add `/image` to the identifier url:

<https://nla.gov.au/nla.obj-141828112/image>
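
If you have a list of identifiers, you could download the images one at a time with something like the following. This is a minimal sketch using only Python's standard library. The function names, the `images` output directory, the `.jpg` file extension, and the one second pause between requests are all assumptions made for this example.

```python
import time
from pathlib import Path
from urllib.request import urlretrieve


def image_url(obj_id):
    """Add the `/image` suffix to an identifier url."""
    return f"https://nla.gov.au/{obj_id}/image"


def download_images(obj_ids, output_dir="images", pause=1):
    """Download images one at a time, pausing between requests.

    Assumes the images are JPEGs. Check the downloaded files
    if you need to be sure of the format.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for obj_id in obj_ids:
        urlretrieve(image_url(obj_id), Path(output_dir, f"{obj_id}.jpg"))
        time.sleep(pause)


print(image_url("nla.obj-141828112"))

# Uncomment to download the photograph described above
# download_images(["nla.obj-141828112"])
```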


### Extract illustrations from pages of digitised books and periodicals

**This method is fully documented in [](other-digitised:ocr-data:crop-images), but here's a quick summary.**

As described above, if you know the `nla.obj` identifier of a specific page in a digitised publication, you can [access machine-readable information](/other-digitised-resources/how-to/get-ocr-layout-data) about the OCR process by simply adding `/ocr` to the identifier url.

Within the OCR data there are `zs` blocks describing the position of each illustration. You can loop through each of these blocks and use the coordinates to crop the illustrations from the full page image. However, the coordinates in the OCR data are sometimes derived from higher resolution versions of the page images than you can download. To work around this, you can [access the metadata](/other-digitised-resources/how-to/extract-embedded-metadata) embedded in the digitised book and journal viewer, [extract the dimensions](digitised:howto:embedded:pages) of the high-resolution version of the page, and then convert the coordinates to work with the downloadable version.
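
The coordinate conversion is simple proportional scaling. The sketch below assumes you've already pulled a bounding box out of a `zs` block as `(left, top, right, bottom)` values, and that you know the dimensions of both the high-resolution page and the downloaded image. The function name, the tuple layout, and the sample numbers are invented for illustration and don't reflect the actual field names in the OCR data.

```python
def scale_box(box, ocr_size, image_size):
    """Convert a bounding box from OCR coordinates to downloaded image coordinates.

    box: (left, top, right, bottom) in the high-resolution coordinate space
    ocr_size: (width, height) of the high-resolution page image
    image_size: (width, height) of the downloaded page image
    """
    x_scale = image_size[0] / ocr_size[0]
    y_scale = image_size[1] / ocr_size[1]
    left, top, right, bottom = box
    return (
        round(left * x_scale),
        round(top * y_scale),
        round(right * x_scale),
        round(bottom * y_scale),
    )


# Invented values: the downloaded image is half the size of the high-resolution page
print(scale_box((1000, 2000, 3000, 4000), (5000, 8000), (2500, 4000)))
```

The resulting tuple is in the `(left, upper, right, lower)` order expected by the `Image.crop()` method in the Pillow imaging library, so it could be used to cut an illustration out of a downloaded page image.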

```{figure} /images/cat-collection.png
:name: cat-collection

Sample from a <a href="https://www.dropbox.com/scl/fo/60imdoyf4ss2b6vh01q1w/h?rlkey=zuwbjaqnmr7qvkuinovdu5ot0&dl=0">collection of cat photos</a> harvested from a search for articles with `cat` or `kitten` in their title [using the GLAM Workbench](https://glam-workbench.net/trove-journals/harvest-illustrations-from-periodicals/)
```
