# Download ORACC JSON Files
This script downloads open data from the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)) in `json` format. The JSON files are made available in a ZIP file. For a description of the various JSON files included in the ZIP see the [open data](http://oracc.museum.upenn.edu/doc/opendata) page on [ORACC](http://oracc.org). 

In [27]:
import pandas as pd   
import requests
import io
import tqdm
import json
import os
import zipfile

# Create Download Directory
Create a directory called `jsonzip`. If the directory already exists, do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
import errno
import os
try:
    os.mkdir('jsonzip')
except OSError as exc:
    if exc.errno !=errno.EEXIST:
        raise
    pass

In [None]:
ROOT_PATH = os.getcwd()

PROJECTS_METADATA_PATH = os.path.join('projects_metadata')
CSV_PROJECTS_DF = os.path.join(PROJECTS_METADATA_PATH, 'projects.csv')
LIST_OF_PROJECTS = os.path.join(PROJECTS_METADATA_PATH, 'projects.txt')

ZIP_PATHS = os.path.join(os.getcwd(), 'jsonzip')
EXTRACT_PATH = os.path.join(os.getcwd(), 'projectsdata')

# Get up-to-date list of existing projects and subprojects

As listed in [The Oracc Project List](https://oracc.museum.upenn.edu/projectlist.html)

In [36]:
projects_url = 'https://oracc.museum.upenn.edu/projectlist.html'
response = requests.get(projects_url, verify=False)

response_data = response.text
print(response_data)



<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xpd="http://oracc.org/ns/xpd/1.0">
  <head>
    <link rel="icon" href="/favicon.ico" type="image/x-icon"/>
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>
    <link rel="stylesheet" type="text/css" href="/css/oracchome.css"/>
    <link rel="stylesheet" type="text/css" href="/css/oracc3home.css"/>
    <meta charset="utf-8"/>
    <title>Oracc Project List</title>
  </head>
  <body class="projlist">
    <div class="o3banner">
      <h1>
        <a href="/">The Oracc Project List</a>
      </h1>
    </div>
    <div class="projects">
      <div class="project-entry">
        <h2 class="proj-head">
          <a target="_blank" href="./adsd">ADsD: Astronomical Diaries Digital</a>
        </h2>
        <p class="proj-img">
          <a target="_blank" href="./adsd">
            <img class="project-float" width="88px" height="66px" src="/agg/adsd.png" alt=""/>
          </a>
        <

In [38]:
lines_in_html = response_data.split('\n')

projects_dict = {}
run_shortcuts = []

idx=0
for line in lines_in_html:
    if 'href="./' in line:
        line_parts = line.split('href="./')
        line_parts_2 = line_parts[1].split('">')
        project_shortcut = line_parts_2[0]
        if project_shortcut in run_shortcuts:
            continue
        else:
            line_parts_3 = line_parts_2[1].split('</a>')
            project_name = line_parts_3[0]
            project_shortcut = project_shortcut.replace('/', '-')
            projects_dict[idx] = {'name': project_name, 'shortcut': project_shortcut, 'project_json_link': f'https://oracc.museum.upenn.edu/json/{project_shortcut}.zip'}
            run_shortcuts.append(project_shortcut)
            idx += 1
        
print(projects_dict)
projects_df = pd.DataFrame.from_dict(projects_dict)
projects_df.to_csv()

{0: {'name': 'ADsD: Astronomical Diaries Digital', 'shortcut': 'adsd', 'project_json_link': 'https://oracc.museum.upenn.edu/json/adsd.zip'}, 1: {'name': 'adsd/adart1: adsd/Astronomical Diaries and Related Texts 1', 'shortcut': 'adsd-adart1', 'project_json_link': 'https://oracc.museum.upenn.edu/json/adsd-adart1.zip'}, 2: {'name': '', 'shortcut': 'adsd-adart1', 'project_json_link': 'https://oracc.museum.upenn.edu/json/adsd-adart1.zip'}, 3: {'name': 'adsd/adart2: adsd/Astronomical Diaries and Related Texts 2', 'shortcut': 'adsd-adart2', 'project_json_link': 'https://oracc.museum.upenn.edu/json/adsd-adart2.zip'}, 4: {'name': '', 'shortcut': 'adsd-adart2', 'project_json_link': 'https://oracc.museum.upenn.edu/json/adsd-adart2.zip'}, 5: {'name': 'adsd/adart3: adsd/Astronomical Diaries and Related Texts 3', 'shortcut': 'adsd-adart3', 'project_json_link': 'https://oracc.museum.upenn.edu/json/adsd-adart3.zip'}, 6: {'name': '', 'shortcut': 'adsd-adart3', 'project_json_link': 'https://oracc.museum

# Input List of Text IDs or a project abbreviation
Identify a list of text IDs (P, Q, and X numbers) in the directory `text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. For example:
* dcclt/P117395
* etcsri/Q001203
* rinap/rinap1/Q003421

The list should be created with a flat text editor such as Textedit or Emacs, and the filename should end in `.txt`. The list may also include start and stop lines (if you need only part of a text). The download module, however, only looks at the project names.

Alternatively, one may enter the name (abbreviation) of a project or sub-project in [ORACC](http://oracc.org) and pull all the lemmatized data from that project. Note that the script will not automatically pull data from subprojects, they have to be requested separately. Examples:
* saao/saa01
* aemw/amarna
* rimanum

In [22]:
filename = 'projects.txt'

# Parse the file with text IDs
The following code reads the file with text ID and pulls out the project names. The code removes accidental spaces at the beginning and the end of each line as well as blank lines. Each line in the file with text IDs is split at the first space - everything after the first space is ignored. The last 8 digits of the text ID are removed, to leave only the project name.

#FV note: I want only projects, not texts --> iggnoring the removal of last 8 characters.

In [29]:
if filename[-4:] == '.txt':
    textids = 'text_ids/' + filename
    with open(textids, 'r') as f:
        pqxnos = f.readlines()
    pqxnos = [x.strip() for x in pqxnos]  # strip spaces left and right
    pqxnos = [x for x in pqxnos if not x == ""] # remove empty lines
    pqxnos = [x.split()[0] for x in pqxnos] # strip everything after first space
    projects = list(set(pqxnos))
else:
    name = filename.strip().lower()
    projects = [name]

In [31]:
for project in projects:
    print(project)

ribo/babylon3


## Download the ZIP files
For each project from which files are to be processed download the entire project (all the json files) from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`). 

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25Mb or more. Downloading may take some time and it may be necessary to chunk the downloading process. The `iter_content()` function in the `requests` library takes care of that.

Although downloading the entire zip file is time consuming, it will make processing the individual files much more efficient and the code is less likely to break due to interruption in connectivity.


In [32]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(projects):
    project = project.replace('/', '-')
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    r = requests.get(url, verify=False)
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                f.write(c)
    else:
        print(url + " does not exist.")

100%|██████████| 1/1 [00:01<00:00,  1.04s/it]

Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon3.zip saving as jsonzip/ribo-babylon3.zip





# Extracting ZIP files

TODO: add function description

In [33]:
def extract_and_delete_zip():
    zipped_projects = os.listdir(ZIP_PATHS)
    for z_file in zipped_projects:
        if z_file[-4:] == '.zip':
            with zipfile.ZipFile(os.path.join(ZIP_PATHS, z_file), 'r') as zip_ref:
                zip_ref.extractall(EXTRACT_PATH)
    
            os.remove(os.path.join(ZIP_PATHS, z_file))
    
            print(f"File {z_file} has been extracted to folder projectsdata and deleted.")

In [34]:
extract_and_delete_zip()

File ribo-babylon3.zip has been extracted to folder projectsdata and deleted.
