# Analyze JSON from PATHs API

This notebook explores the JSON files downloaded from the [PATHs API](https://atlas.paths-erc.eu/api). 

For each collection (atlas, places, manuscripts, works, authors, titles, colophons, persons, collections), I entered general queries like this:

```bash
https://bdus.cloud/db/api/paths/?verb=search&shortsql=@manuscripts&full_records=0&format=json
https://bdus.cloud/db/api/paths/?verb=search&shortsql=@manuscripts&full_records=0&format=json&limit=100&offset=0
```
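
The second URL differs from the first only in its `limit` and `offset` parameters. As a minimal sketch, the same URLs can be assembled in code. The helper name `build_search_url` and its defaults are my own for illustration; the parameter names and values are copied from the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://bdus.cloud/db/api/paths/"


def build_search_url(collection, limit=100, offset=0, full_records=0):
    """Build a search URL for one collection (hypothetical helper)."""
    params = {
        "verb": "search",
        "shortsql": f"@{collection}",
        "full_records": full_records,
        "format": "json",
        "limit": limit,
        "offset": offset,
    }
    # safe="@" keeps the @ prefix readable instead of encoding it as %40
    return f"{BASE_URL}?{urlencode(params, safe='@')}"


url = build_search_url("manuscripts")
print(url)
```

Whether the API honors `limit` and `offset` for every collection is something to check against the actual responses.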

I downloaded the JSON files and saved them to my laptop. Here I need to explore those files to see what I can extract from them.

In [None]:
# Import the necessary libraries
import json
import pandas as pd
import os
import glob
from pprint import pprint

In [None]:
# Get an overview of the JSON files
json_directory = "../api-downloads"

# Load JSON files
json_files = glob.glob(os.path.join(json_directory, "*.json"))

# Loop over the files and print the top-level keys of each
for file in json_files:
    with open(file, "r", encoding="utf-8") as read_file:
        data = json.load(read_file)
    print(f"Data from {file}:")
    # Iterating over a dict yields its keys (here: head, debug, records)
    for item in data:
        print(item)

Data from ../api-downloads/manuscripts.json:
head
debug
records
Data from ../api-downloads/collections.json:
head
debug
records
Data from ../api-downloads/persons.json:
head
debug
records
Data from ../api-downloads/works.json:
head
debug
records
Data from ../api-downloads/titles.json:
head
debug
records
Data from ../api-downloads/colophons.json:
head
debug
records


In [38]:
# Make dataframes from each of the JSON files, with the file name as the name of the dataframe
dataframes = {}
for file in json_files:
    with open(file, "r") as read_file:
        data = json.load(read_file)
        df_name = os.path.splitext(os.path.basename(file))[0]
        dataframes[df_name] = pd.DataFrame(data['records'])
        print(f"DataFrame created for {df_name} with {len(dataframes[df_name])} records.")

DataFrame created for manuscripts with 30 records.
DataFrame created for collections with 30 records.
DataFrame created for persons with 30 records.
DataFrame created for works with 30 records.
DataFrame created for titles with 30 records.
DataFrame created for colophons with 30 records.


Every file contains exactly 30 records, so the API must apply a default limit of 30 per request. I'll see if I can automate the download and get the full sets.
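
One way to automate this is to request fixed-size batches and stop when a batch comes back short. The sketch below assumes the API honors `limit` and `offset`, which I have not confirmed. `fetch_all` and `fake_fetch` are hypothetical names, and the page fetcher is passed in as a function so the loop can be tried without touching the network.

```python
def fetch_all(fetch_page, limit=100, max_pages=1000):
    """Collect records batch by batch until a short batch signals the end.

    fetch_page is any callable taking limit and offset and returning a list.
    max_pages is a safety cap in case the server never returns a short batch.
    """
    records = []
    for page_index in range(max_pages):
        batch = fetch_page(limit=limit, offset=page_index * limit)
        records.extend(batch)
        if len(batch) < limit:
            break  # Last (possibly empty) batch reached
    return records


# Stand-in for the real API: 250 fake records served in slices
fake_records = list(range(250))


def fake_fetch(limit, offset):
    return fake_records[offset:offset + limit]


all_records = fetch_all(fake_fetch, limit=100)
print(len(all_records))
```

In real use, `fetch_page` would wrap a `requests.get` call and return the `records` list from each response.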

The `places` collection has a different format, so I'll fetch it separately and save it as JSON.

In [10]:
import json
import time

import requests

# Base API endpoint
base_url = "https://bdus.cloud/db/api/paths/"
resources = [
    "places"
]

# Pretend to be a browser
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    )
}

for resource in resources:
    # The places response is a dict keyed by subject URI, so merge pages into one dict
    all_subjects = {}
    page = 1
    last_page_data = None

    while True:
        url = (
            f"{base_url}?verb=search"
            f"&shortsql=@{resource}"
            f"&full_records=1"
            f"&format=json"
            f"&page={page}"
        )
        response = requests.get(url, headers=headers)
        try:
            data = response.json()
        except ValueError as e:
            print(f"Error decoding JSON on page {page}: {e}")
            break

        if not data:
            break  # No more pages

        # Stop before storing anything if the API returns the same page again
        if data == last_page_data:
            print("⚠️  Repeating page detected. Stopping to avoid infinite loop.")
            break
        last_page_data = data

        print(f"Page {page} response keys: {list(data.keys())}")
        print(f"Head info: {data.get('head', {})}")
        print(f"First 1 subject(s): {list(data.keys())[:1]}")

        # Keep the full subject -> properties mapping, not just the keys
        all_subjects.update(data)

        print(f"  Page {page}: {len(data)} subjects")
        page += 1
        time.sleep(1)

    # Save to JSON file
    with open(f"{resource}.json", "w", encoding="utf-8") as f:
        json.dump(all_subjects, f, ensure_ascii=False, indent=2)

    print(f"Saved {len(all_subjects)} subjects to {resource}.json\n")

Page 1 response keys: ['http://paths.uniroma1.it/data/places#agents/me', 'http://paths.uniroma1.it/atlas/places/1', '_:genid1', '_:genid2', '_:genid3', '_:genid4', '_:genid5', '_:genid6', '_:genid7', '_:genid8', '_:genid9', '_:genid10', '_:genid11', '_:genid12', '_:genid13', '_:genid14', '_:genid15', '_:genid16', '_:genid17', '_:genid18', '_:genid19', '_:genid20', '_:genid21', '_:genid22', '_:genid23', 'http://paths.uniroma1.it/atlas/places/2', '_:genid24', '_:genid25', '_:genid26', '_:genid27', '_:genid28', '_:genid29', '_:genid30', '_:genid31', '_:genid32', '_:genid33', '_:genid34', '_:genid35', '_:genid36', '_:genid37', '_:genid38', '_:genid39', '_:genid40', 'http://paths.uniroma1.it/atlas/places/3', '_:genid41', '_:genid42', '_:genid43', '_:genid44', '_:genid45', '_:genid46', '_:genid47', '_:genid48', '_:genid49', '_:genid50', '_:genid51', '_:genid52', '_:genid53', 'http://paths.uniroma1.it/atlas/places/4', '_:genid54', '_:genid55', '_:genid56', '_:genid57', '_:genid58', '_:genid59
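
The keys in this output mix place URIs with identifiers such as `_:genid1`. The `_:` prefix is the usual RDF notation for blank nodes, so the response looks like a graph keyed by subject rather than a list of records. A rough sketch for telling the two kinds of key apart follows. `split_subjects` is a hypothetical helper, and the sample reuses only the keys seen above, with empty placeholder values because the output does not show what the values contain.

```python
def split_subjects(data):
    """Separate place URIs from blank-node identifiers in a subject-keyed dict."""
    place_uris = [key for key in data if "/atlas/places/" in key]
    blank_nodes = [key for key in data if key.startswith("_:")]
    return place_uris, blank_nodes


# Miniature stand-in for the response; values are placeholders, not real data
sample = {
    "http://paths.uniroma1.it/data/places#agents/me": {},
    "http://paths.uniroma1.it/atlas/places/1": {},
    "_:genid1": {},
    "_:genid2": {},
    "http://paths.uniroma1.it/atlas/places/2": {},
    "_:genid24": {},
}

place_uris, blank_nodes = split_subjects(sample)
print(place_uris)
print(blank_nodes)
```

Counting `place_uris` rather than all keys should give a better estimate of how many places a page holds.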

In [39]:
import glob
import json
import os

json_directory = "../coptic_metadata_viewer/data"
json_files = glob.glob(os.path.join(json_directory, "*.json"))


def print_keys(d, indent=0):
    """Recursively print every key of a nested dict, with leaf values below it."""
    for key, value in d.items():
        print("  " * indent + f"{key}:")
        if isinstance(value, dict):
            print_keys(value, indent + 1)
        else:
            print("  " * (indent + 1) + f"{value}")


for file in json_files:
    # places.json has a different structure, so skip it here
    if os.path.basename(file) == "places.json":
        continue
    print(f"Processing file: {file}")
    with open(file, "r", encoding="utf-8") as jsonFile:
        data = json.load(jsonFile)
    # Print all keys and subkeys in the first item
    print_keys(data[0])

Processing file: ../coptic_metadata_viewer/data/manuscripts.json
metadata:
  tb_id:
    paths__manuscripts
  rec_id:
    name:
      id
    label:
      Coptic Literary Manuscript (CLM) ID
    val:
      1
  tb_stripped:
    manuscripts
  tb_label:
    Manuscripts
core:
  id:
    name:
      id
    label:
      Coptic Literary Manuscript (CLM) ID
    val:
      1
  creator:
    name:
      creator
    label:
      Record creator
    val:
      1
  cmclid:
    name:
      cmclid
    label:
      CMCL
    val:
      CMCL.AA
  tm:
    name:
      tm
    label:
      TM
    val:
      
  ldab:
    name:
      ldab
    label:
      LDAB
    val:
      None
  lcbm:
    name:
      lcbm
    label:
      LCBM
    val:
      None
  dbmnt:
    name:
      dbmnt
    label:
      DBMNT
    val:
      None
  alias:
    name:
      alias
    label:
      Alias
    val:
      None
  issinglefrag:
    name:
      issinglefrag
    label:
      Preserved in a single fragment
    val:
      None
  isbook
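
Each field under `core` is wrapped in a small dict with `name`, `label`, and `val`. For tabular work, a flat mapping of field name to value is easier to handle. The following is a sketch only: `flatten_core` is a hypothetical helper, and the sample record copies a few fields from the output above (whether `val` for `id` is stored as a number or a string is an assumption).

```python
def flatten_core(record):
    """Map each field name under record['core'] to its 'val' entry."""
    return {
        name: field.get("val")
        for name, field in record.get("core", {}).items()
    }


# Abbreviated record modeled on the printed structure
sample_record = {
    "metadata": {"tb_id": "paths__manuscripts", "tb_stripped": "manuscripts"},
    "core": {
        "id": {"name": "id", "label": "Coptic Literary Manuscript (CLM) ID", "val": 1},
        "cmclid": {"name": "cmclid", "label": "CMCL", "val": "CMCL.AA"},
        "ldab": {"name": "ldab", "label": "LDAB", "val": None},
    },
}

flat = flatten_core(sample_record)
print(flat)
```

A list of such flat dicts, one per record, can be passed straight to `pd.DataFrame`.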

In [2]:
import requests
import time

# Base API endpoint
base_url = "https://bdus.cloud/db/api/paths/"
resources = [
    "manuscripts", "works", "authors",
    "titles", "colophons", "persons", "collections"
]

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    ),
    "Accept": "text/turtle"
}

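# Safety cap (arbitrary value) in case the API never returns an empty page
max_pages = 500

for resource in resources:
    print(f"Fetching (Turtle): {resource}")
    page = 1
    last_page_text = None

    with open(f"{resource}.ttl", "w", encoding="utf-8") as f:
        while page <= max_pages:
            url = (
                f"{base_url}?verb=search"
                f"&shortsql=@{resource}"
                f"&full_records=1"
                f"&format=ttl"
                f"&page={page}"
            )
            response = requests.get(url, headers=headers)

            if response.status_code != 200 or not response.text.strip():
                print(f"  No more pages or error (code {response.status_code}) on page {page}")
                break

            # Stop if the API returns the same content as the previous page
            if response.text == last_page_text:
                print(f"  Page {page} repeats the previous page. Stopping.")
                break
            last_page_text = response.text

            f.write(response.text.strip() + "\n\n")

            print(f"  Page {page}: {len(response.text.splitlines())} lines")
            page += 1
            time.sleep(1)

    print(f"Saved Turtle to {resource}.ttl\n")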


Fetching (Turtle): manuscripts
  Page 1: 1 lines
  Page 2: 1 lines
  Page 3: 1 lines
  Page 4: 1 lines
  Page 5: 1 lines
  Page 6: 1 lines
  Page 7: 1 lines
  Page 8: 1 lines
  Page 9: 1 lines
  Page 10: 1 lines
  Page 11: 1 lines
  Page 12: 1 lines
  Page 13: 1 lines
  Page 14: 1 lines
  Page 15: 1 lines
  Page 16: 1 lines
  Page 17: 1 lines
  Page 18: 1 lines
  Page 19: 1 lines
  Page 20: 1 lines
  Page 21: 1 lines
  Page 22: 1 lines
  Page 23: 1 lines
  Page 24: 1 lines
  Page 25: 1 lines
  Page 26: 1 lines
  Page 27: 1 lines
  Page 28: 1 lines
  Page 29: 1 lines
  Page 30: 1 lines
  Page 31: 1 lines
  Page 32: 1 lines
  Page 33: 1 lines
  Page 34: 1 lines
  Page 35: 1 lines
  Page 36: 1 lines
  Page 37: 1 lines
  Page 38: 1 lines
  Page 39: 1 lines
  Page 40: 1 lines
  Page 41: 1 lines
  Page 42: 1 lines
  Page 43: 1 lines
  Page 44: 1 lines
  Page 45: 1 lines
  Page 46: 1 lines
  Page 47: 1 lines
  Page 48: 1 lines
  Page 49: 1 lines
  Page 50: 1 lines
  Page 51: 1 lines
  Page 52

KeyboardInterrupt: 

In [2]:
def print_keys(d, indent=0):
    """Recursively print every key of a nested dict, with leaf values below it."""
    for key, value in d.items():
        print("  " * indent + f"{key}:")
        if isinstance(value, dict):
            print_keys(value, indent + 1)
        else:
            print("  " * (indent + 1) + f"{value}")


with open("../coptic_metadata_viewer/data/manuscripts.json", "r", encoding="utf-8") as jsonFile:
    data = json.load(jsonFile)

# Print all keys and subkeys in the first item
print_keys(data[0])

metadata:
  tb_id:
    paths__manuscripts
  rec_id:
    name:
      id
    label:
      Coptic Literary Manuscript (CLM) ID
    val:
      1
  tb_stripped:
    manuscripts
  tb_label:
    Manuscripts
core:
  id:
    name:
      id
    label:
      Coptic Literary Manuscript (CLM) ID
    val:
      1
  creator:
    name:
      creator
    label:
      Record creator
    val:
      1
  cmclid:
    name:
      cmclid
    label:
      CMCL
    val:
      CMCL.AA
  tm:
    name:
      tm
    label:
      TM
    val:
      
  ldab:
    name:
      ldab
    label:
      LDAB
    val:
      None
  lcbm:
    name:
      lcbm
    label:
      LCBM
    val:
      None
  dbmnt:
    name:
      dbmnt
    label:
      DBMNT
    val:
      None
  alias:
    name:
      alias
    label:
      Alias
    val:
      None
  issinglefrag:
    name:
      issinglefrag
    label:
      Preserved in a single fragment
    val:
      None
  isbookbinding:
    name:
      isbookbinding
    label:
      Only book
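
The indented printout is long, so a more compact view may help when comparing collections. The sketch below lists each leaf as a dotted path instead. `leaf_paths` is a hypothetical helper, and the sample is a trimmed stand-in for a real record.

```python
def leaf_paths(d, prefix=""):
    """Return dotted paths to every non-dict value in a nested dict."""
    paths = []
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(leaf_paths(value, path))
        else:
            paths.append(path)
    return paths


# Trimmed stand-in for one manuscripts record
sample = {
    "metadata": {
        "tb_id": "paths__manuscripts",
        "rec_id": {"name": "id", "val": 1},
    },
    "core": {
        "tm": {"name": "tm", "label": "TM", "val": ""},
    },
}

paths = leaf_paths(sample)
for path in paths:
    print(path)
```

Comparing the path sets of two collections, for example with `set(a) - set(b)`, would show which fields are specific to each.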