# Braided Channels

Author: [Yifan Luo](mailto:jeffluoyifan@gmail.com)

Last updated: 7 September 2024

## Announcement

This notebook provides a step-by-step guide on using the [LDaCA](https://www.ldaca.edu.au) API to download the [Braided Channels](https://data.ldaca.edu.au/collection?id=arcp%3A%2F%2Fname%2Chdl10.4225~01~4F8E1281B8E2A&_crateId=arcp%3A%2F%2Fname%2Chdl10.4225~01~4F8E1281B8E2A) dataset.

Please note that this notebook is adapted from the original [Australian-Text-Analytics-Platform/cooee](https://github.com/Australian-Text-Analytics-Platform/cooee/blob/main/cooee.ipynb) repository on GitHub.

For any LDaCA API issues, feel free to reach out to the [Australian Text Analytics Platform (ATAP)](https://www.atap.edu.au) team via their [GitHub page](https://github.com/Australian-Text-Analytics-Platform).

## 1. Install/Load Packages

To install the required packages, uncomment and run the code block below.

In [1]:
# !pip install git+https://github.com/Language-Research-Technology/ldaca-py.git
# !pip install rocrate
# !pip install python-dotenv

In [2]:
import json
import os

import requests
from dotenv import load_dotenv          # Loads environment variables
from ldaca.ldaca import LDaCA           # Loads the LDaCA REST API wrapper
from rocrate_lang.utils import as_list  # A handy utility for converting to list

## 2. Set up Your API Key

Before using the [LDaCA](https://data.ldaca.edu.au/) API to download the dataset, please:

1. Register your account on the [LDaCA website](https://data.ldaca.edu.au/login).

2. After logging in, navigate to your [User Information](https://data.ldaca.edu.au/user).

3. Click the Generate button to register a new API key.

4. In the same directory as this notebook, create a new file named `vars.env`.

5. Save your API key in `vars.env` using the following format:
   
   ```
   API_KEY=<Your_API_Key>
   ```

After finishing the above steps, your `vars.env` file should look like:

```
API_KEY=0463c21c-****-****-b750-************
```

In [3]:
# Load the environment variables located in the "vars.env" file
load_dotenv('vars.env')

# Store your environment variable in this notebook
API_TOKEN = os.getenv('API_KEY')

if not API_TOKEN:
    print("No API key found. Please follow the above steps to set up your API key.")

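The check above only prints a message, so later cells would still run with `API_TOKEN` set to `None`. As a minimal sketch of a stricter variant, the hypothetical helper `require_env` below (not part of the LDaCA wrapper or python-dotenv) raises an error when a variable is unset or empty. The variable name `DEMO_API_KEY` and its value are placeholders used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Check that vars.env exists and contains the key."
        )
    return value


# Demonstration with a placeholder variable, so a real API_KEY is left untouched
os.environ['DEMO_API_KEY'] = 'placeholder-key'
demo_token = require_env('DEMO_API_KEY')
print(demo_token)
```

In the notebook, calling `require_env('API_KEY')` after `load_dotenv('vars.env')` would stop execution early instead of failing later with an authorisation error.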

## 3. Find All Available Resource Types & Metadata

While retrieving the collection, you may encounter authorisation errors. If so, try the following:

1. Generate a new API key.
2. Replace the old API key in `vars.env` with the new one.
3. Restart your notebook kernel.
4. Rerun the notebook.

The following code block downloads the collection metadata and stores it as `ro-crate-metadata.json` in the directory `METADATA_DIR`.

In [None]:
# LDaCA API URL (DO NOT CHANGE)
LDACA_API = 'https://data.ldaca.edu.au/api'

# The ID of the Braided Channels collection (replace it with the "@id" of your collection)
COLLECTION_ID = 'arcp://name,hdl10.4225~01~4F8E1281B8E2A'

# The directory for saving the retrieved metadata
METADATA_DIR = 'metadata'

ldaca = LDaCA(url=LDACA_API, token=API_TOKEN, data_dir=METADATA_DIR)

# Retrieve and store the metadata in a JSON file under the directory METADATA_DIR
ldaca.retrieve_collection(collection=COLLECTION_ID, collection_type='Collection', data_dir=METADATA_DIR)

# Inspect the metadata
with open(os.path.join(METADATA_DIR, 'ro-crate-metadata.json'), 'r') as f:
    metadata = json.load(f)
print(json.dumps(metadata, indent=4))

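The printed metadata can be long. As a rough sketch for getting an overview, the snippet below counts how many entities of each `@type` appear in a crate. It assumes the file follows the usual RO-Crate layout, with entities listed under a top-level `@graph` key. The helper name `count_entity_types` and the sample dictionary are illustrative only and do not come from the dataset.

```python
from collections import Counter


def count_entity_types(metadata):
    """Count how often each @type occurs in an RO-Crate metadata dict."""
    counts = Counter()
    for entity in metadata.get('@graph', []):
        types = entity.get('@type', [])
        # "@type" may be a single string or a list of strings
        if isinstance(types, str):
            types = [types]
        counts.update(types)
    return counts


# Tiny made-up crate, only to show the shape of the result
sample = {
    '@graph': [
        {'@id': './', '@type': ['Dataset', 'RepositoryCollection']},
        {'@id': '#object1', '@type': 'RepositoryObject'},
        {'@id': '#object2', '@type': 'RepositoryObject'},
    ]
}
type_counts = count_entity_types(sample)
print(type_counts['RepositoryObject'])  # 2
```

In the notebook, `count_entity_types(metadata)` could be called on the dictionary loaded from `ro-crate-metadata.json`.
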
## 4. Retrieve URLs for Main Resources

Retrieve the URLs for all main resources.

The variable `resources` is a `dict` that maps the name of each resource to its URL.

In [None]:
# The entity type whose "hasPart" entries are the main resources
PRIMARY_OBJECT = 'RepositoryObject'

# Retrieve the URLs for all main resources
resources = {}  # key: name, value: url
for entity in ldaca.crate.contextual_entities + ldaca.crate.data_entities:
    if PRIMARY_OBJECT in as_list(entity.type):
        items = ldaca.crate.dereference(entity.id).as_jsonld()['hasPart']
        # A single part is returned as a dict rather than a list
        if not isinstance(items, list):
            items = [items]
        for item in items:
            url = item['@id']
            name = url.split('/')[-1]
            # Replace "%20" with "_" and "%26amp%3B" with "&" in file names
            name = name.replace('%20', '_').replace('%26amp%3B', '&')
            resources[name] = url
            print(name, url)

# Print the total number of resources
print(f"Retrieved {len(resources)} resources in total")

## 5. Download Resources from Retrieved URLs

Set `SAVE_DIR` to the path where the downloaded resources should be saved.

The default is a directory named `Braided-Channels` in the same directory as this notebook.

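Each resource is saved under the name derived from its URL in the previous step, which rewrites only `%20` and `%26amp%3B`. Other percent-encoded characters would pass through unchanged. A minimal sketch of a more general clean-up is shown below, using only the standard library. The helper name `url_to_filename` and the example URL are hypothetical, and replacing `/` with `_` is an added assumption to keep the result a single file name.

```python
import html
from urllib.parse import unquote


def url_to_filename(url):
    """Derive a readable file name from the last '/'-separated part of a URL."""
    segment = url.split('/')[-1]
    # Decode percent-escapes first, then HTML entities such as "&amp;"
    name = html.unescape(unquote(segment))
    # Avoid spaces and path separators in the final file name
    return name.replace(' ', '_').replace('/', '_')


demo_name = url_to_filename('https://example.org/files/Interview%20Notes%20A%26amp%3BB.txt')
print(demo_name)  # Interview_Notes_A&B.txt
```

For the two patterns handled in the notebook, this gives the same names as the original `replace` calls.
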
In [None]:
# Specify a path to save the resources
SAVE_DIR = 'Braided-Channels'

count = 0
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Create the save directory if it does not exist yet
if SAVE_DIR is not None:
    os.makedirs(SAVE_DIR, exist_ok=True)

# Download resources by sending requests
for name, url in resources.items():
    # Send a GET request to the URL; skip this resource if the request fails
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error ({http_err}) occurred while requesting {name} from {url}")
        continue
    except requests.exceptions.ConnectionError as conn_err:
        print(f"Connection error ({conn_err}) occurred while requesting {name} from {url}")
        continue
    except requests.exceptions.Timeout as time_err:
        print(f"Timeout error ({time_err}) occurred while requesting {name} from {url}")
        continue
    except requests.exceptions.RequestException as err:
        print(f"Request error ({err}) occurred while requesting {name} from {url}")
        continue
    # Build the output path for the retrieved resource
    full_path = name if SAVE_DIR is None else os.path.join(SAVE_DIR, name)
    # Write the content of the response to a file
    with open(full_path, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded and saved into {full_path}")
    count += 1

print(f"Successfully downloaded {count} out of {len(resources)} resources")
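
After the loop finishes, it can be useful to see which resources are still absent from disk, for example before rerunning the cell. The sketch below is one possible check and is not part of the original notebook. The helper name `find_missing` is hypothetical, and the demonstration uses a temporary directory with made-up file names.

```python
import os
import tempfile


def find_missing(names, save_dir):
    """Return the names that have no non-empty file in save_dir."""
    missing = []
    for name in names:
        path = os.path.join(save_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Self-contained demonstration: one good file, one empty file, one absent file
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'a.txt'), 'wb') as f:
        f.write(b'content')
    open(os.path.join(demo_dir, 'b.txt'), 'wb').close()  # empty file
    demo_missing = find_missing(['a.txt', 'b.txt', 'c.txt'], demo_dir)

print(demo_missing)  # ['b.txt', 'c.txt']
```

In the notebook, `find_missing(resources, SAVE_DIR)` would return the resource names that were not saved or were saved as empty files.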