# cli

> command-line interface for the harvester

In [None]:
#| default_exp cli

Before you start harvesting, you'll need a [Trove API key](https://trove.nla.gov.au/about/create-something/using-api).

There are three basic commands:

* **start** -- start a new harvest
* **restart** -- restart a stalled harvest
* **report** -- view harvest details

### Start a harvest

To start a new harvest you can just do:

``` sh
troveharvester start "[Trove query]" [Trove API key]
```

The Trove query can either be a url copied and pasted from a search in the [Trove web interface](http://trove.nla.gov.au/newspaper/), or a Trove API query url constructed using something like the [Trove API Console](https://troveconsole.herokuapp.com/). Enclose the url in double quotes.
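The double quotes matter because search urls usually contain characters such as `&` and `?` that most shells treat specially. As a rough illustration of what a search url carries, the sketch below uses only the Python standard library to split one into its parameters. This is not the harvester's own `prepare_query` function, which does its own translation into API parameters. It just shows the raw values encoded in the url.

```python
from urllib.parse import parse_qs, urlsplit

# Example search url copied from the Trove web interface
url = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia"

# Take the part after the "?" and decode it into a dict of lists
search_params = parse_qs(urlsplit(url).query)

print(search_params)
# → {'keyword': ['wragge'], 'l-state': ['Western Australia']}
```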

Unless you specify otherwise, a `data` directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named using the current datetime. Details of harvested articles are written to a CSV file named `results.csv`. The harvest configuration details are also saved to a `metadata.json` file.
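Once a harvest has finished you might want to load `results.csv` in your own scripts. The sketch below is one way to do that with the standard library. The helper names `latest_harvest` and `read_results` are made up for this example, and it assumes the default datetime-style directory names (such as `20221011113335`), which sort chronologically as strings. If you set your own `--harvest_dir` names that assumption won't hold, and the library's own `get_harvest` function is the safer way to find a harvest.

```python
import csv
from pathlib import Path


def latest_harvest(data_dir="data"):
    """Return the harvest directory whose name sorts last, or None if there are none."""
    data_path = Path(data_dir)
    if not data_path.is_dir():
        return None
    # Assumption: datetime-style names sort chronologically as strings
    harvest_dirs = sorted(p for p in data_path.iterdir() if p.is_dir())
    return harvest_dirs[-1] if harvest_dirs else None


def read_results(harvest_path):
    """Load results.csv from a harvest directory as a list of dicts."""
    results_file = Path(harvest_path, "results.csv")
    with results_file.open(newline="", encoding="utf-8") as csv_file:
        return list(csv.DictReader(csv_file))


harvest = latest_harvest()
if harvest and Path(harvest, "results.csv").exists():
    rows = read_results(harvest)
    print(f"{harvest}: {len(rows)} articles")
```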

The CLI automatically saves the harvested metadata in a CSV file and, by default, deletes the raw results in the `results.ndjson` file. You can change this behaviour with the `--keep_json` option. See [more information about the results generated](core.html#results) by the harvester.

#### Options:


`--data_dir`  

> directory in which your harvests will be stored (default is `data`)

`--harvest_dir`  

> directory in which this harvest will be stored within the data directory (default is the current datetime)

`--text`  

> save the OCRd text of each article into a separate `.txt` file

`--pdf`

> save a copy of each article as a PDF (this makes the harvest a *lot* slower as you have to allow a couple of seconds for each PDF to generate)

`--image`

> save an image of each article into a separate `.jpg` file (if the article is split over more than one page there will be multiple images)

`--include_linebreaks`

> preserve linebreaks in saved text files
    
`--keep_json`

> save harvested data in a `results.ndjson` file (one JSON object per line) as well as `results.csv`
    
`--max` [integer]  

> specify a maximum number of articles to harvest

#### More examples

Basic harvest with no options:

``` sh
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy
```

Specify the data and harvest directories:

``` sh
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --data_dir my_harvests --harvest_dir wragge_search
```

Save the articles as individual text files:

``` sh
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --text
```

Save the articles as images and PDFs (this will be very slow):

``` sh
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --pdf --image
```
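If you want to launch harvests from a Python script rather than typing them, building the command as a list of arguments sidesteps shell quoting problems with the url. The sketch below is only an illustration: `build_start_command` is a hypothetical helper, not part of the harvester, and it uses only options documented above.

```python
import shlex


def build_start_command(query, key, data_dir=None, harvest_dir=None, max_results=None, flags=()):
    """Assemble the argument list for `troveharvester start`."""
    command = ["troveharvester", "start", query, key]
    if data_dir:
        command += ["--data_dir", data_dir]
    if harvest_dir:
        command += ["--harvest_dir", harvest_dir]
    if max_results:
        command += ["--max", str(max_results)]
    # Boolean options such as "text", "pdf", "image" or "keep_json"
    command += [f"--{flag}" for flag in flags]
    return command


command = build_start_command(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    "mySeCReTkEy",
    data_dir="my_harvests",
    max_results=100,
    flags=("text", "keep_json"),
)

# shlex.join quotes the url so the printed command can be pasted into a shell
print(shlex.join(command))
```

To run the command rather than print it, you could pass the list to `subprocess.run(command, check=True)`.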

Keep the raw results in the `results.ndjson` file:

``` sh
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --keep_json
```
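If you do keep `results.ndjson`, each line of the file holds one JSON object, so it can be read a record at a time without loading everything into memory. Here's a minimal sketch using the standard library. The `read_ndjson` helper and the file location are made up for this example, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path


def read_ndjson(path):
    """Yield one decoded record per non-empty line of an ndjson file."""
    with Path(path).open(encoding="utf-8") as ndjson_file:
        for line in ndjson_file:
            if line.strip():
                yield json.loads(line)


# Hypothetical location: adjust to match your own data and harvest directories
ndjson_file = Path("data", "wragge_search", "results.ndjson")
if ndjson_file.exists():
    for record in read_ndjson(ndjson_file):
        # Show the field names of the first record, then stop
        print(sorted(record.keys()))
        break
```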

### Restart a harvest

Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:

``` sh
troveharvester restart
```

By default the script will try to restart the most recent harvest. If you've used the `--data_dir` or `--harvest_dir` parameters, you'll have to supply these again to restart the harvest.

``` sh
troveharvester restart --data_dir my_harvests --harvest_dir my_latest_dataset
```

### Get a summary of a harvest

If you'd like to quickly check the status of a harvest, just try:

``` sh
troveharvester report
```

By default the script will report on the most recent harvest. If you've used the `--data_dir` or `--harvest_dir` parameters, you'll have to supply these again to generate a report.

``` sh
troveharvester report --data_dir my_harvests --harvest_dir my_latest_dataset
```
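The same details can be read in a script from the `metadata.json` file saved with each harvest. The sketch below assumes the file contains the keys that the report prints (such as `date_started` and `max`). `load_metadata` is a hypothetical helper written for this example, not the library's `get_metadata` function.

```python
import json
from pathlib import Path


def load_metadata(harvest_path):
    """Return the contents of metadata.json as a dict, or None if the file is missing."""
    metadata_file = Path(harvest_path, "metadata.json")
    if not metadata_file.exists():
        return None
    return json.loads(metadata_file.read_text(encoding="utf-8"))


# Directory names taken from the example command above
meta = load_metadata(Path("my_harvests", "my_latest_dataset"))
if meta:
    # Assumption: these keys match the ones shown by `troveharvester report`
    print(f"Started: {meta['date_started']}, max results: {meta['max']}")
```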

In [None]:
#| hide
import os
import shutil

from fastcore.test import test_stdout
from nbdev.showdoc import *

# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [None]:
#| export
import argparse
from pathlib import Path
from pprint import pprint

from requests.exceptions import HTTPError

from trove_newspaper_harvester.core import (
    Harvester,
    NoQueryError,
    get_harvest,
    get_metadata,
    prepare_query,
)

In [None]:
#| export


def start_harvest(
    query,
    key,
    data_dir="data",
    harvest_dir=None,
    text=False,
    pdf=False,
    image=False,
    include_linebreaks=False,
    max=None,
    keep_json=False,
):
    """
    Start a harvest.

    Parameters:

    * `query` [required, search url from Trove web interface or API, string]
    * `key` [required, Trove API key, string]
    * `data_dir` [optional, directory for harvests, string]
    * `harvest_dir` [optional, directory for this harvest, string]
    * `text` [optional, save articles as text files, True or False]
    * `pdf` [optional, save articles as PDFs, True or False]
    * `image` [optional, save articles as images, True or False]
    * `include_linebreaks` [optional, include linebreaks in text files, True or False]
    * `max` [optional, maximum number of results, integer]
    * `keep_json` [optional, keep the results.ndjson file, True or False]

    """
    # Turn the query url into a dictionary of parameters
    params = prepare_query(query, key, text=text)
    # Create the harvester
    try:
        harvester = Harvester(
            query_params=params,
            data_dir=data_dir,
            harvest_dir=harvest_dir,
            pdf=pdf,
            text=text,
            image=image,
            include_linebreaks=include_linebreaks,
            max=max,
        )
    except HTTPError as e:
        if e.response.status_code == 403:
            print("The request could not be authorised, check your API key.")
        else:
            raise
    except NoQueryError:
        print("No query parameters found, check your query url.")
    else:
        # Go!
        try:
            harvester.harvest()
        except AttributeError:
            pass
        else:
            if harvester.maximum > 0:
                harvester.save_csv()
                if not keep_json:
                    Path(harvester.harvest_dir, "results.ndjson").unlink()


def restart_harvest(data_dir="data", harvest_dir=None):
    """
    Restart a failed harvest.

    Parameters:

    * `data_dir` [optional, directory for harvests, string]
    * `harvest_dir` [optional, directory for this harvest, string]
    """
    if data_dir and harvest_dir:
        harvest = get_harvest(data_dir=data_dir, harvest_dir=harvest_dir)
    else:
        harvest = get_harvest()
    if Path(f"{'-'.join(harvest.parts)}.sqlite").exists():
        data_dir, harvest_dir = harvest.parts
        meta = get_metadata(harvest)
        if meta:
            harvester = Harvester(
                data_dir=data_dir,
                harvest_dir=harvest_dir,
                query_params=meta["query_parameters"],
                pdf=meta["pdf"],
                text=meta["text"],
                image=meta["image"],
                include_linebreaks=meta["include_linebreaks"],
                max=meta["max"],
            )
            harvester.harvest()


def report_harvest(data_dir="data", harvest_dir=None):
    """
    Provide some details of a harvest.
    If no harvest is specified, show the most recent.

    Parameters:

    * `data_dir` [optional, directory for harvests, string]
    * `harvest_dir` [optional, directory for this harvest, string]
    """
    harvest = get_harvest(data_dir=data_dir, harvest_dir=harvest_dir)
    meta = get_metadata(harvest)
    if meta:
        # results = get_results(data_dir)
        print("")
        print("HARVEST METADATA")
        print("================")
        print(f"Last harvest started: {meta['date_started']}")
        print(f"Harvest id: {meta['harvest_directory']}")
        print("Query parameters:")
        pprint(meta["query_parameters"], indent=2)
        print(f"Max results: {meta['max']}")
        print(f"Include PDFs: {meta['pdf']}")
        print(f"Include text: {meta['text']}")
        print(f"Include images: {meta['image']}")
        print(f"Include linebreaks: {meta['include_linebreaks']}")
        print(f"Harvested with: {meta['harvester']}")


# CLI


def main():
    """
    Sets up the command-line interface
    """
    parser = argparse.ArgumentParser(prog="troveharvester")
    subparsers = parser.add_subparsers(dest="action")
    parser_start = subparsers.add_parser("start", help="start a new harvest")
    parser_start.add_argument("query", help="url of the search you want to harvest")
    parser_start.add_argument("key", help="Your Trove API key")
    parser_start.add_argument(
        "--data_dir", default="data", help="directory for harvests"
    )
    parser_start.add_argument("--harvest_dir", help="directory for this harvest")
    parser_start.add_argument(
        "--max", type=int, default=0, help="maximum number of results to return"
    )
    parser_start.add_argument(
        "--pdf", action="store_true", help="save PDFs of articles"
    )
    parser_start.add_argument(
        "--text", action="store_true", help="save text contents of articles"
    )
    parser_start.add_argument(
        "--image", action="store_true", help="save images of articles"
    )
    parser_start.add_argument(
        "--include_linebreaks",
        action="store_true",
        help="preserve line breaks in text files",
    )
    parser_start.add_argument(
        "--keep_json", action="store_true", help="keep the raw ndjson results file"
    )
    parser_restart = subparsers.add_parser(
        "restart", help="restart an unfinished harvest"
    )
    # Default to "data" so that --harvest_dir can be supplied on its own
    parser_restart.add_argument(
        "--data_dir", default="data", help="directory for harvests"
    )
    parser_restart.add_argument("--harvest_dir", help="directory for this harvest")
    parser_report = subparsers.add_parser("report", help="report on a harvest")
    # Default to "data" so that --harvest_dir can be supplied on its own
    parser_report.add_argument(
        "--data_dir", default="data", help="directory for harvests"
    )
    parser_report.add_argument("--harvest_dir", help="directory for this harvest")
    args = parser.parse_args()
    if args.action == "report":
        report_harvest(
            data_dir=args.data_dir,
            harvest_dir=args.harvest_dir,
        )
    elif args.action == "restart":
        restart_harvest(
            data_dir=args.data_dir,
            harvest_dir=args.harvest_dir,
        )
    elif args.action == "start":
        start_harvest(
            query=args.query,
            key=args.key,
            data_dir=args.data_dir,
            harvest_dir=args.harvest_dir,
            text=args.text,
            pdf=args.pdf,
            image=args.image,
            include_linebreaks=args.include_linebreaks,
            keep_json=args.keep_json,
            max=args.max,
        )

## Functions

The functions below are all called by the command-line interface, so you shouldn't need to call them directly. See the core library for programmatic access to the `Harvester` class.
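The way the CLI hands off to these functions follows a common `argparse` pattern: each command is a subparser, and the name of the chosen command is stored on the parsed arguments so the right function can be called. The cut-down sketch below shows the idea with a made-up `demo` program and only a couple of arguments. It isn't the harvester's full parser.

```python
import argparse


def build_parser():
    """Build a parser with two subcommands, loosely mirroring the layout of `main()`."""
    parser = argparse.ArgumentParser(prog="demo")
    # dest="action" stores the chosen subcommand name on the parsed namespace
    subparsers = parser.add_subparsers(dest="action")
    start = subparsers.add_parser("start", help="start a new harvest")
    start.add_argument("query")
    start.add_argument("--text", action="store_true")
    report = subparsers.add_parser("report", help="report on a harvest")
    report.add_argument("--data_dir", default="data")
    return parser


# Passing a list to parse_args() avoids reading sys.argv, which is handy for testing
args = build_parser().parse_args(["start", "https://example.com/?q=demo", "--text"])
print(args.action, args.query, args.text)
```

Each subparser only accepts its own arguments, which is why `restart` and `report` need their own `--data_dir` and `--harvest_dir` definitions.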

In [None]:
show_doc(start_harvest)

---

[source](https://github.com/wragge/trove-newspaper-harvester/blob/master/trove_newspaper_harvester/cli.py#L20){target="_blank" style="float:right; font-size:smaller"}

### start_harvest

>      start_harvest (query, key, data_dir='data', harvest_dir=None, text=False,
>                     pdf=False, image=False, include_linebreaks=False,
>                     max=None, keep_json=False)

Start a harvest.

Parameters:

* `query` [required, search url from Trove web interface or API, string]
* `key` [required, Trove API key, string]
* `data_dir` [optional, directory for harvests, string]
* `harvest_dir` [optional, directory for this harvest, string]
* `text` [optional, save articles as text files, True or False]
* `pdf` [optional, save articles as PDFs, True or False]
* `image` [optional, save articles as images, True or False]
* `include_linebreaks` [optional, include linebreaks in text files, True or False]
* `max` [optional, maximum number of results, integer]
* `keep_json` [optional, keep the results.ndjson file, True or False]

In [None]:
# Test for missing API key
def test_no_key():
    start_harvest(
        "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge", None
    )


test_stdout(test_no_key, "The request could not be authorised, check your API key.")

In [None]:
# Test for missing query
API_KEY = os.getenv("TROVE_API_KEY")


def test_no_query():
    start_harvest("", API_KEY)


test_stdout(test_no_query, "No query parameters found, check your query url.")

In [None]:
start_harvest(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    API_KEY,
    text=True,
)

this_harvest = get_harvest()

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is False
assert Path(this_harvest, "text").exists() is True

shutil.rmtree(Path("data"))


In [None]:
start_harvest(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    API_KEY,
    text=True,
    keep_json=True,
)

this_harvest = get_harvest()

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is True
assert Path(this_harvest, "text").exists() is True


In [None]:
show_doc(report_harvest)

---

[source](https://github.com/wragge/trove-newspaper-harvester/blob/master/trove_newspaper_harvester/cli.py#L112){target="_blank" style="float:right; font-size:smaller"}

### report_harvest

>      report_harvest (data_dir='data', harvest_dir=None)

Provide some details of a harvest.
If no harvest is specified, show the most recent.

Parameters:

* `data_dir` [optional, directory for harvests, string]
* `harvest_dir` [optional, directory for this harvest, string]

In [None]:
report_harvest()


HARVEST METADATA
================
Last harvest started: 2022-10-11T11:33:35.912988+00:00
Harvest id: data/20221011113335
Query parameters:
{ 'bulkHarvest': 'true',
  'encoding': 'json',
  'include': ['articleText'],
  'key': 'gq29l1g1h75pimh4',
  'l-illtype': ['Photo'],
  'l-illustrated': 'true',
  'l-state': ['Western Australia'],
  'q': 'wragge',
  'reclevel': 'full',
  'zone': 'newspaper'}
Max results: 131
Include PDFs: False
Include text: True
Include images: False
Include linebreaks: False
Harvested with: trove_newspaper_harvester v0.6.4


In [None]:
# TEST REPORT
test_stdout(report_harvest, "^\nHARVEST METADATA.*", regex=True)
test_stdout(
    report_harvest, r".*Harvested with: trove_newspaper_harvester v[0-9.]+$", regex=True
)

shutil.rmtree(Path("data"))

In [None]:
show_doc(restart_harvest)

---

[source](https://github.com/wragge/trove-newspaper-harvester/blob/master/trove_newspaper_harvester/cli.py#L82){target="_blank" style="float:right; font-size:smaller"}

### restart_harvest

>      restart_harvest (data_dir='data', harvest_dir=None)

Restart a failed harvest.

Parameters:

* `data_dir` [optional, directory for harvests, string]
* `harvest_dir` [optional, directory for this harvest, string]
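Before restarting, `restart_harvest` checks for a `.sqlite` cache file whose name is built by joining the parts of the harvest path with hyphens. If the file isn't there, nothing is restarted. The sketch below reproduces just that naming step so you can see what to look for on disk. `cache_path` is a name invented for this example.

```python
from pathlib import Path


def cache_path(harvest_path):
    """Build the cache filename that `restart_harvest` looks for."""
    # Path("data", "20221011113335").parts is ("data", "20221011113335")
    return Path(f"{'-'.join(Path(harvest_path).parts)}.sqlite")


print(cache_path(Path("data", "20221011113335")))
# → data-20221011113335.sqlite
```

Note that `restart_harvest` unpacks the harvest path into exactly two parts (data directory and harvest directory), so this appears to assume a single-level data directory.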

In [None]:
# TEST RESTART
# To test the restart function we'll create a new harvester but not start it
params = prepare_query(
    query="https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)
harvester = Harvester(query_params=params, text=True)

# Should be no data yet
assert harvester.ndjson_file.exists() is False

# The cache should exist, as the harvest hasn't completed
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()

# Now it should run with restart using the settings from above
restart_harvest()

# Should be data now
assert harvester.ndjson_file.exists() is True

# The cache should have been deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False

# Clean up
shutil.rmtree(Path("data"))


In [None]:
show_doc(main)

---

[source](https://github.com/wragge/trove-newspaper-harvester/blob/master/trove_newspaper_harvester/cli.py#L142){target="_blank" style="float:right; font-size:smaller"}

### main

>      main ()

Sets up the command-line interface

In [None]:
#| hide
import nbdev

nbdev.nbdev_export()

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb).