# Loading a document

This notebook demonstrates the use of the TrustGraph triple text load API to submit text for processing.

The text load API optionally accepts arbitrary extra metadata alongside the document.  This metadata is associated with the document and added to the triple store.  In this example, the metadata is structured in line with the schema.org Organization, PublicationEvent and DigitalDocument schemas.

The additional metadata is optional; the metadata element can be omitted.  However, if additional metadata is known, it can be integrated with TrustGraph processing.
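
As a rough illustration of the optional metadata, the sketch below builds a request body that leaves the `metadata` element out entirely. The helper name `build_minimal_payload`, the example identifier and the sample text are hypothetical and exist only for this sketch. The field names mirror the request assembled later in this notebook.

```python
import base64
import json


def build_minimal_payload(doc_id: str, text: bytes) -> dict:
    """Build a load request body that carries no extra metadata."""
    return {
        "id": doc_id,
        "charset": "utf-8",
        # The text is carried as base64, as in the full example in this notebook
        "text": base64.b64encode(text).decode("utf-8"),
    }


payload = build_minimal_payload(
    "https://example.com/doc/hypothetical-id",  # hypothetical identifier
    "Some example text.".encode("utf-8"),       # made-up sample text
)
print(json.dumps(payload, indent=4))
```

Without metadata, the knowledge graph would presumably hold only what TrustGraph derives from the text itself, with nothing describing the source document.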

In [1]:
import requests
import json
import base64

In [2]:
# URL of the TrustGraph text load API
url = "http://localhost:8088/api/v1/load/text"

In [3]:
# Open a text document from this repo.  It is read as bytes so it can be base64-encoded below
with open("../sources/README.cats", "rb") as f:
    text = f.read()

In [4]:
# Some random identifiers.  The doc ID is important, as extracted knowledge is linked back to this identifier
org_id = "https://trustgraph.ai/org/3c35111a-f8ce-54b2-4dd6-c673f8bf0d09"
doc_id = "https://trustgraph.ai/doc/4faa45c1-f91a-a96a-d44f-2e57b9813db8"
pub_id = "https://trustgraph.ai/pubev/a847d950-a281-4099-aaab-c5e35333ff61"
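
The identifiers above are fixed strings. One possible way to produce fresh identifiers of the same shape is sketched below with the standard `uuid` module. The helper `make_id` and the `new_*` variable names are hypothetical. The URL prefix is copied from the identifiers above and is not a requirement stated by this notebook.

```python
import uuid

# Prefix copied from the identifiers used in this notebook
BASE = "https://trustgraph.ai"


def make_id(kind: str) -> str:
    """Return an identifier of the form <BASE>/<kind>/<random UUID>."""
    return f"{BASE}/{kind}/{uuid.uuid4()}"


new_org_id = make_id("org")
new_doc_id = make_id("doc")
new_pub_id = make_id("pubev")
print(new_doc_id)
```

Because `uuid.uuid4()` is random, every run yields a different identifier. If the same document should always map to the same identifier, `uuid.uuid5` with a fixed namespace and name is a deterministic alternative.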

In [5]:
# Organization metadata
org_facts = [
    [org_id, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "https://schema.org/Organization"],
    [org_id, "http://www.w3.org/2000/01/rdf-schema#label", "trustgraph.ai"],
    [org_id, "https://schema.org/name", "trustgraph.ai"]
]

In [6]:
# Publication metadata.  Note how it links to the Organization
pub_facts = [
    [pub_id, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "https://schema.org/PublicationEvent"],
    [pub_id, "https://schema.org/description", "Uploading to Github"],
    [pub_id, "https://schema.org/endDate", "2024-10-23"],
    [pub_id, "https://schema.org/publishedBy", org_id],
    [pub_id, "https://schema.org/startDate", "2024-10-23"]
]

In [7]:
# Document metadata.  Note how it links to the publication event
doc_facts = [
    [doc_id, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "https://schema.org/DigitalDocument"],
    [doc_id, "http://www.w3.org/2000/01/rdf-schema#label", "Mark's cats"],
    [doc_id, "https://schema.org/copyrightHolder", "trustgraph.ai"],
    [doc_id, "https://schema.org/copyrightNotice", "Public domain"],
    [doc_id, "https://schema.org/copyrightYear", "2024"],
    [doc_id, "https://schema.org/description", "This document describes Mark's cats"],
    [doc_id, "https://schema.org/keywords", "animals"],
    [doc_id, "https://schema.org/keywords", "cats"],
    [doc_id, "https://schema.org/keywords", "home-life"],
    [doc_id, "https://schema.org/name", "Mark's cats"],
    [doc_id, "https://schema.org/publication", pub_id],
    [doc_id, "https://schema.org/url", "https://example.com"]
]

In [8]:
# Convert the above triples into a list of subject/predicate/object ("s", "p", "o") dictionaries
metadata = [
    { "s": t[0], "p": t[1], "o": t[2] }
    for t in org_facts + pub_facts + doc_facts
]
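
The conversion above assumes every fact has exactly three elements. A slightly more defensive variant is sketched below. The function name `triples_to_metadata` and the example triple are hypothetical. The output shape matches the `s`/`p`/`o` dictionaries built above.

```python
def triples_to_metadata(triples):
    """Convert [subject, predicate, object] lists into s/p/o dictionaries."""
    metadata = []
    for i, triple in enumerate(triples):
        # Reject malformed facts early instead of failing on an index error
        if len(triple) != 3:
            raise ValueError(
                f"triple {i} has {len(triple)} elements, expected 3"
            )
        s, p, o = triple
        metadata.append({"s": s, "p": p, "o": o})
    return metadata


example = [
    ["https://example.com/doc/1", "https://schema.org/name", "Example"],
]
converted = triples_to_metadata(example)
print(converted)
```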

In [9]:
# The input
input = {

    # Document identifier.  Knowledge derived by TrustGraph is linked to this identifier, so
    # the additional metadata specified above is linked to the derived knowledge, and users of
    # the knowledge graph can see information about the source of the knowledge
    "id": doc_id,

    # Additional metadata in the form of RDF triples
    "metadata": metadata,

    # Text character set.  Default is UTF-8
    "charset": "utf-8",

    # The text document, presented as base64-encoded data
    "text": base64.b64encode(text).decode("utf-8")
    
}
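
Before sending, it can be useful to confirm that the `text` field decodes back to the original content. The sketch below does this for a made-up sample payload. The helper `decode_text_field` is hypothetical, and the fallback to UTF-8 follows the default noted in the comment above.

```python
import base64


def decode_text_field(payload: dict) -> str:
    """Decode the base64 'text' field using the payload's declared charset."""
    charset = payload.get("charset", "utf-8")
    return base64.b64decode(payload["text"]).decode(charset)


original = "Example text about cats."  # made-up sample text
sample = {
    "charset": "utf-8",
    "text": base64.b64encode(original.encode("utf-8")).decode("utf-8"),
}
print(decode_text_field(sample) == original)
```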

In [10]:
# Invoke the API.  The input is passed as JSON
resp = requests.post(url, json=input)

In [11]:
# Should be a 200 status code
resp.status_code

200

In [12]:
# The document load returns no response body.  A 200 status code indicates that the submitted text has been queued to enter the processing flows
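
Since a 200 status is the only success signal, a small wrapper that raises on anything else may be convenient. This is a sketch, not part of the TrustGraph API. The names `submit_text` and `fake_post` and the 30-second timeout are assumptions. The `post` argument is meant to be `requests.post`; a stand-in is used here so that the example runs without a server.

```python
from types import SimpleNamespace


def submit_text(payload, url, post, timeout=30):
    """Post the payload as JSON and raise unless the status code is 200."""
    resp = post(url, json=payload, timeout=timeout)
    if resp.status_code != 200:
        raise RuntimeError(f"load failed with status {resp.status_code}")
    return resp.status_code


def fake_post(url, json=None, timeout=None):
    """Stand-in for requests.post that always reports success."""
    return SimpleNamespace(status_code=200)


status = submit_text(
    {"id": "https://example.com/doc/hypothetical-id", "text": ""},
    "http://localhost:8088/api/v1/load/text",
    fake_post,
)
print(status)
```

Against a running TrustGraph instance, the call would instead be `submit_text(input, url, post=requests.post)`.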