# Downloading Collection Data and RO-Crate Metadata File using the Oni API

The Language Data Commons of Australia (LDaCA) packages its data collections as [RO-Crates](https://www.researchobject.org/ro-crate/). Every data collection comes with a metadata file called `ro-crate-metadata.json`, which describes the research objects in that collection.

This notebook allows you to specify a collection from the <a href='https://data.ldaca.edu.au/'>LDaCA Portal</a>, and the files and metadata associated with that collection will be downloaded and zipped.

Still to do:
- specify a subset of files to download

## Install the basic requirements needed

In [1]:
%%capture
import sys
!{sys.executable} -m pip install -r requirements.txt

## Import libraries

Python requires libraries to be imported before the notebook can use them. We do this with the `import` keyword, as shown below.

In [2]:
from ldaca.ldaca import LDaCA     # Loads the LDaCA REST API wrapper
from dotenv import load_dotenv    # Parses a .env file and then loads all the variables found as environment variables
import os                         # Loads operating system libraries
import shutil                     # Utility functions for copying and archiving files and directory trees

## Variables 

We need to specify the collection to download, the location of the LDaCA API and the user's API token.

In the `vars.env` file, store your `API_KEY` and `COLLECTION`. Both are required for downloading any files.

To get an API key:
- Log in to <a href='https://data.ldaca.edu.au/'>LDaCA Portal</a>
- Under your name, select <i>User Information</i>
- Under API Key, select <i>Generate</i>

To get a COLLECTION ID:
- Go to <a href='https://data.ldaca.edu.au/'>LDaCA Portal</a>
- Select the collection you want to download data from
- Copy the <i>@id</i> URL starting with <i>arcp://</i>

Example vars.env:

API_KEY=12345

COLLECTION='arcp://name,doi10.4225%2F35%2F555d661071c76'

In [3]:
load_dotenv('vars.env')                         # Loads the environment variables located in the vars.env file
API_TOKEN = os.getenv('API_KEY')                # Specifies the user's API token
URL = 'https://data.ldaca.edu.au/api'           # Specifies the location of the LDaCA API
COLLECTION = os.getenv('COLLECTION')            # Specifies the location of the collection
print(f"URL: {URL}, COLLECTION: {COLLECTION}")  # Prints the URL of the LDaCA API and specified collection for download

URL: https://data.ldaca.edu.au/api, COLLECTION: arcp://name,doi10.4225%2F35%2F555d661071c76
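
If `vars.env` is missing or incomplete, `os.getenv` silently returns `None`, and the problem only shows up later in the notebook. The following is a minimal sketch of an early check, assuming the two setting names from the example `vars.env` above. The helper `missing_settings` is a hypothetical name and is not part of the LDaCA wrapper.

```python
import os

# Setting names taken from the example vars.env; adjust if yours differ.
REQUIRED_SETTINGS = ("API_KEY", "COLLECTION")


def missing_settings(env=os.environ, required=REQUIRED_SETTINGS):
    """Return the required setting names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


# Run this after load_dotenv('vars.env') so the file's values are visible.
missing = missing_settings()
if missing:
    print(f"Missing settings in vars.env: {', '.join(missing)}")
else:
    print("All required settings are present.")
```

Passing a plain dictionary as `env` makes it possible to try the helper without touching the real environment.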


## Configure the LDaCA REST API wrapper

We need to configure an instance of the `LDaCA` class by providing it with the API URL, API token and data directory, and then setting the collection and its type.

In [4]:
data_dir = 'oni_data'                                       # Creates the variable for the directory where data will be downloaded
ldaca = LDaCA(url=URL, token=API_TOKEN, data_dir=data_dir)  # Provides the LDaCA class with the API URL, API KEY and download directory
ldaca.set_collection(COLLECTION)                            # Specifies the collection for the LDaCA class
ldaca.set_collection_type('Collection')                     # Specifies the type of collection for the LDaCA class

## Retrieve metadata

The `retrieve_collection` method will fetch the metadata for the specified collection and download the related RO-Crate (`ro-crate-metadata.json`) to the selected directory (`oni_data`).

In [5]:
ldaca.retrieve_collection(
    collection=COLLECTION,            # The ID of the collection
    collection_type='Collection',     # Can be either Collection or Object
    data_dir=data_dir)                # Sets the data directory
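
Once the metadata file is on disk, it can be inspected with the standard library, since an RO-Crate metadata file is JSON-LD with its entities listed under `@graph`. The sketch below counts entities by `@type`, which helps when choosing an entity type to filter on. It assumes the file was saved as `oni_data/ro-crate-metadata.json`; the wrapper may store it elsewhere, so adjust the path if needed. `count_entity_types` is a hypothetical helper, not part of the LDaCA wrapper.

```python
import json
from collections import Counter
from pathlib import Path


def count_entity_types(metadata_path):
    """Count the @type values of the entities in an RO-Crate metadata file."""
    with open(metadata_path, encoding="utf-8") as f:
        crate = json.load(f)
    counts = Counter()
    for entity in crate.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):  # @type may be a single string or a list
            types = [types]
        counts.update(types)
    return counts


# Assumed location of the downloaded metadata file; adjust if it differs.
metadata_file = Path("oni_data") / "ro-crate-metadata.json"
if metadata_file.exists():
    for entity_type, n in count_entity_types(metadata_file).most_common():
        print(f"{entity_type}: {n}")
```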

## Download files

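Now that we have the RO-Crate, we can download the files in the collection to our selected directory, filtering the objects by entity type.

Once the download is complete, a message will be printed confirming this. Note that the recorded run below stopped with a `TypeError` instead, because `store_data()` also requires an `extension` argument; the call needs that argument before it can succeed.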

In [6]:
all_files = ldaca.store_data(entity_type='RepositoryObject')
print("Files downloaded, zipping directory.")

TypeError: store_data() missing 1 required positional argument: 'extension'
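
When a call fails with a missing-argument `TypeError` like the one above, the standard `inspect` module can list what a function expects without reading its source. The sketch below is illustrative only: `required_parameters` is a hypothetical helper, and `store_data_stand_in` is a made-up stand-in that mirrors only the two argument names visible in this notebook. The real `store_data` signature may differ.

```python
import inspect


def required_parameters(func):
    """Return the names of parameters that have no default value."""
    required = []
    for name, param in inspect.signature(func).parameters.items():
        is_variadic = param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD)
        if param.default is param.empty and not is_variadic:
            required.append(name)
    return required


# Hypothetical stand-in; it is NOT the real ldaca.store_data implementation.
def store_data_stand_in(entity_type, extension):
    return []


print(required_parameters(store_data_stand_in))
```

In the notebook itself, `required_parameters(ldaca.store_data)` would report the arguments the real method needs; for a bound method, `self` is left out automatically.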

## Zip the downloaded data

To create an archive file of the downloaded data, we use the `shutil.make_archive` function with the following arguments:

- `base_name` is the name of the archive file to create, without the format-specific extension (for `zip`, the `.zip` suffix is appended automatically).
- `format` is the archive format, e.g. `zip` or `tar`.
- `root_dir` is the directory that will be the root directory of the archive.

In [None]:
shutil.make_archive(base_name=data_dir, format='zip', root_dir=data_dir)
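
To confirm what ended up in the archive, its members can be listed with the standard `zipfile` module. The sketch below builds a throwaway directory so that it runs without any downloaded data; `list_archive` and `demo_dir` are hypothetical names used only for this illustration.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def list_archive(zip_path):
    """Return the sorted names of the members stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


# Demonstration on a temporary directory standing in for the data directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "oni_data"
    demo_dir.mkdir()
    (demo_dir / "ro-crate-metadata.json").write_text("{}", encoding="utf-8")
    # make_archive returns the path of the archive it created.
    zip_path = shutil.make_archive(
        base_name=str(demo_dir), format="zip", root_dir=str(demo_dir)
    )
    members = list_archive(zip_path)
    print(zip_path, members)
```

For the real download, `list_archive('oni_data.zip')` should show the files relative to the `oni_data` directory, because `root_dir` is the root of the archive.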