# Harvest articles published on page one of newspapers

We're going to use the Trove Newspaper Harvester as a library, rather than as a command-line tool. This makes it easier to manage the results of very large harvests. Instead of converting everything into a CSV file, which is the Harvester's default behaviour, we'll just save the API results into an `ndjson` file (newline-delimited JSON, with one JSON object per line). Then, in another notebook, we'll filter and convert the data into a more compact dataset.
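
As a quick illustration of the `ndjson` format, the sketch below writes a couple of records with one JSON object per line, then reads them back one line at a time. The records and the `read_ndjson()` helper are made up for illustration; they are not the Harvester's actual output or part of its API.

```python
import io
import json

# Made-up records, for illustration only (not real Trove API results)
records = [
    {"id": "1", "heading": "First example"},
    {"id": "2", "heading": "Second example"},
]

# Write one JSON object per line (an in-memory buffer stands in for a file)
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")


def read_ndjson(handle):
    """Yield one parsed record per non-empty line (hypothetical helper)."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


# Read the records back; list() is used here only because the sample is tiny
buffer.seek(0)
loaded = list(read_ndjson(buffer))
```

Because each line is a complete JSON object, a very large file can be processed record by record by looping over the generator instead of calling `list()`.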

In [1]:
import os

from trove_newspaper_harvester.core import Harvester, prepare_query

In [7]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [8]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

In [9]:
# This is the web interface URL of a search for articles on page one
query = "https://trove.nla.gov.au/search/category/newspapers?keyword=firstpageseq%3A1"

# Convert the web interface query into a set of parameters for the API
query_params = prepare_query(query=query, api_key=API_KEY)

In [10]:
query_params

{'q': 'firstpageseq:1',
 'zone': 'newspaper',
 'key': 'gq29l1g1h75pimh4',
 'encoding': 'json',
 'reclevel': 'full',
 'bulkHarvest': 'true'}
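
To see roughly where the `q` value comes from, the sketch below decodes the `keyword` parameter of the web interface URL using Python's standard library. It only illustrates the URL decoding step and should not be read as how `prepare_query()` is implemented; the other parameters shown above are added by that function.

```python
from urllib.parse import parse_qs, urlparse

query = "https://trove.nla.gov.au/search/category/newspapers?keyword=firstpageseq%3A1"

# Split the URL and decode the percent-encoded query string
web_params = parse_qs(urlparse(query).query)

# parse_qs returns a list of values for each parameter; take the first
keyword = web_params["keyword"][0]
```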

In [11]:
# Initiate the harvester with the parameters
harvester = Harvester(query_params=query_params)

In [None]:
# Start the harvest
harvester.harvest()

The data will be saved into a `results.ndjson` file, in a sub-directory of `data` named according to the current date and time. If the harvest is interrupted, add a `harvest_dir` parameter to the `Harvester()` initialisation, with its value set to the name of the interrupted harvest's directory, for example `20230722015049`. Then run `harvester.harvest()` again to pick up where the harvest left off.
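
One way to find the directory name to pass as `harvest_dir` is to look for the most recent timestamped sub-directory of `data`. The helper below is a hypothetical sketch, not part of the Harvester. It assumes that harvest directories are named with 14 digits, like `20230722015049`.

```python
from pathlib import Path


def latest_harvest_dir(data_dir="data"):
    """Return the name of the most recent timestamped sub-directory, or None.

    Hypothetical helper, not part of trove_newspaper_harvester.
    Assumes directory names are 14-digit timestamps (YYYYMMDDHHMMSS),
    so the alphabetically largest name is also the most recent.
    """
    data_path = Path(data_dir)
    if not data_path.is_dir():
        return None
    names = [
        p.name
        for p in data_path.iterdir()
        if p.is_dir() and p.name.isdigit() and len(p.name) == 14
    ]
    return max(names, default=None)
```

The result could then be passed on when reinitialising the harvester, for example `Harvester(query_params=query_params, harvest_dir=latest_harvest_dir())`.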