# A COrpus of Oz Early English (COOEE)

This notebook downloads the COOEE dataset from [LDaCA](https://data.ldaca.edu.au/). The notebook is licensed under [MIT](https://opensource.org/license/mit).

For more information, see the original notebook `cooee.ipynb` at [GitHub - Australian-Text-Analytics-Platform/cooee](https://github.com/Australian-Text-Analytics-Platform/cooee/blob/main/cooee.ipynb).

## 1. Loading Packages

To install the required packages, uncomment the relevant `pip` lines in the following cell.

In [1]:
# To install ldaca
# !pip install git+https://github.com/Language-Research-Technology/ldaca-py.git

# To install rocrate
# !pip install rocrate

# To install dotenv
# !pip install python-dotenv
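
Before uncommenting anything, it can help to check which packages are actually missing. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module list assumes the import names used in the next cell.

```python
from importlib.util import find_spec

# Import names used later in this notebook (assumed to match the installed packages).
REQUIRED_MODULES = ["requests", "ldaca", "rocrate_lang", "dotenv"]


def missing_modules(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


# An empty list means every required module can be imported.
print(missing_modules(REQUIRED_MODULES))
```

Only the packages reported as missing need to be installed.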

In [2]:
import os

import requests
from ldaca.ldaca import LDaCA           # Loads the LDaCA ReST api wrapper
from rocrate_lang.utils import as_list  # A handy utility for converting to list
from dotenv import load_dotenv          # Loads environment variables

## 2. API Key Setting

Before using the [LDaCA](https://data.ldaca.edu.au/) API to download the dataset, please:

1. Register your account at [LDaCA](https://data.ldaca.edu.au).

2. Go to your **User information**, generate and copy your **API key**.

3. Copy your **API key** into the file `vars.env` in the same directory as this notebook.
   
   E.g., API_KEY=1a61****-****-****-****-********d1c5

In [3]:
# WARNING: DO NOT CHANGE
LDACA_API = 'https://data.ldaca.edu.au/api'
COLLECTION_ID = 'arcp://name,doi10.26180%2F23961609'

load_dotenv('vars.env')             # Load the environment variables defined in the vars.env file
API_TOKEN = os.getenv('API_KEY')    # Read the API key from the environment
if not API_TOKEN:
    print("Get a token from the portal, set a variable in the vars.env file named API_KEY, then restart the kernel.")
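
To confirm that the key was loaded without printing it in full, a masked version can be displayed instead. This is a minimal sketch; `mask_key` is a hypothetical helper, not part of `ldaca` or `dotenv`, and the sample key below is made up.

```python
def mask_key(key, visible=4):
    """Return the key with everything except the first and last `visible` characters masked."""
    if not key:
        return "<missing>"
    if len(key) <= 2 * visible:
        # Too short to reveal anything safely: mask the whole key.
        return "*" * len(key)
    return key[:visible] + "*" * (len(key) - 2 * visible) + key[-visible:]


# Made-up key for illustration only.
print(mask_key("abcd1234efgh5678"))  # abcd********5678
```

In the notebook, `mask_key(API_TOKEN)` would show whether a key is present while keeping most of it hidden.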


## 3. Fetching the Metadata

In [4]:
# Get the ro-crate metadata. This will create a JSON file under the directory 'metadata'
ldaca = LDaCA(url=LDACA_API, token=API_TOKEN, data_dir='metadata')
ldaca.retrieve_collection(collection=COLLECTION_ID, collection_type='Collection', data_dir='metadata')

# Inspect the metadata
metadata = ldaca.crate
metadata


<rocrate_lang.rocrate_plus.ROCratePlus at 0x10b9ccb30>

In [5]:
# TYPE values should be lists.
# We define a PRIMARY_OBJECT as a 'RepositoryObject' because that is where the main data is stored
PRIMARY_OBJECT = 'RepositoryObject'

# Find all types and find types that have linked objects
files = set()
types = list()
primary_object_types = list()

# Let's see what we can find in our metadata
for entity in ldaca.crate.contextual_entities + ldaca.crate.data_entities:
    entity_type = as_list(entity.type)  # We make sure that each type is a list
    for e_t in entity_type:
        types.append(e_t)


In [6]:
# Print the variables
# All the types, removing duplicates
list(dict.fromkeys(types))

['Person',
 'Book',
 'OrganizationReuseLicense',
 'RepositoryObject',
 'PropertyValue',
 'website',
 'DefinedTerm',
 'CreativeWork',
 'Language',
 'Geometry',
 'SoftwareSourceCode',
 'CreateAction',
 'File']
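
Beyond listing the distinct types, it is often useful to know how many entities carry each type. The sketch below uses `collections.Counter` on made-up sample data; `as_list_sketch` is a local stand-in that assumes `as_list` wraps a single value in a list and leaves lists unchanged.

```python
from collections import Counter


def as_list_sketch(value):
    """Stand-in for `as_list` (assumed behaviour): always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hypothetical entity types; an entity may have one type or several.
sample_types = [
    "Person",
    ["RepositoryObject", "Book"],
    "File",
    ["RepositoryObject", "Book"],
]

# Flatten every type value and count the occurrences of each type.
type_counts = Counter(t for value in sample_types for t in as_list_sketch(value))
print(type_counts.most_common())
```

With the real metadata, the same counter could be built directly from the `types` list collected above, as `Counter(types)`.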

## 4. Downloading the Dataset

In [7]:
# Collect the JSON-LD of every entity whose types include PRIMARY_OBJECT, i.e. [PRIMARY_OBJECT, X]
for entity in ldaca.crate.contextual_entities + ldaca.crate.data_entities:
    if PRIMARY_OBJECT in as_list(entity.type):
        item = ldaca.crate.dereference(entity.id)
        primary_object_types.append(item.as_jsonld())

In [8]:
# Create a dictionary for storing names and URLs for each document
name_url = {}
for file in primary_object_types:
    url = file['hasPart'][0]['@id']
    name = url.split('/')[-1]
    name_url[name] = url
    
name_url

{'1-001-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-001-plain.txt',
 '1-002-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-002-plain.txt',
 '1-003-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-003-plain.txt',
 '1-004-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-004-plain.txt',
 '1-005-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-005-plain.txt',
 '1-006-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-006-plain.txt',
 '1-007-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-007-plain.txt',
 '1-008-plain.txt': 'https://data.ldaca.edu.au/api/object/arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609/data/1-008-plai
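
The URLs in the output above embed the collection ID in percent-encoded form, which is why the `%2F` already present in `COLLECTION_ID` appears as `%252F`. The sketch below reproduces that encoding with `urllib.parse.quote`; the URL pattern is inferred from the output above rather than from the API documentation, so treat it as an assumption.

```python
from urllib.parse import quote, unquote

COLLECTION_ID = "arcp://name,doi10.26180%2F23961609"

# safe="" encodes every reserved character, including '/' and the existing '%'.
encoded_id = quote(COLLECTION_ID, safe="")
print(encoded_id)  # arcp%3A%2F%2Fname%2Cdoi10.26180%252F23961609

# URL pattern inferred from the output above (an assumption, not a documented API).
url = f"https://data.ldaca.edu.au/api/object/{encoded_id}/data/1-001-plain.txt"

# The file name is the last path segment, as in the cell that builds name_url.
name = url.split("/")[-1]
print(name)  # 1-001-plain.txt

# Decoding once restores the original collection ID.
print(unquote(encoded_id) == COLLECTION_ID)  # True
```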

In [42]:
def download(save_path=None):
    """Download every file listed in name_url.

    Args:
        save_path (str, optional): The root directory to save the files in. If save_path is None,
            the files are saved in the current directory. Defaults to None.
    """
    # Create the target directory once, if it does not exist yet
    if save_path is not None:
        os.makedirs(save_path, exist_ok=True)
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    for name, url in name_url.items():
        # Send an authenticated GET request to the URL
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
        full_path = name if save_path is None else os.path.join(save_path, name)
        # Write the content of the response to a file
        with open(full_path, 'wb') as f:
            f.write(response.content)
        print(f"File downloaded and saved as {full_path}")

In [43]:
download('dataset/COOEE')

File downloaded and saved as dataset/COOEE/1-001-plain.txt
File downloaded and saved as dataset/COOEE/1-002-plain.txt
File downloaded and saved as dataset/COOEE/1-003-plain.txt
File downloaded and saved as dataset/COOEE/1-004-plain.txt
File downloaded and saved as dataset/COOEE/1-005-plain.txt
File downloaded and saved as dataset/COOEE/1-006-plain.txt
File downloaded and saved as dataset/COOEE/1-007-plain.txt
File downloaded and saved as dataset/COOEE/1-008-plain.txt
File downloaded and saved as dataset/COOEE/1-009-plain.txt
File downloaded and saved as dataset/COOEE/1-010-plain.txt
File downloaded and saved as dataset/COOEE/1-011-plain.txt
File downloaded and saved as dataset/COOEE/1-012-plain.txt
File downloaded and saved as dataset/COOEE/1-013-plain.txt
File downloaded and saved as dataset/COOEE/1-014-plain.txt
File downloaded and saved as dataset/COOEE/1-015-plain.txt
File downloaded and saved as dataset/COOEE/1-016-plain.txt
File downloaded and saved as dataset/COOEE/1-017-plain.t
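
After a long download run, it may be worth checking that every expected file actually arrived. This is a minimal sketch using `pathlib`; `missing_files` is a hypothetical helper, and it treats empty files as missing on the assumption that no document in the corpus is empty.

```python
from pathlib import Path


def missing_files(expected_names, save_path):
    """Return the expected file names that are absent or empty under save_path."""
    root = Path(save_path)
    return sorted(
        name
        for name in expected_names
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    )
```

For example, `missing_files(name_url, 'dataset/COOEE')` would return an empty list when every file listed in `name_url` has been saved with some content.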