# Loading a document

This notebook demonstrates how to use the TrustGraph text load API to submit text for processing.

The text load API optionally accepts arbitrary extra metadata along with the document. This metadata is associated with the document and added to the triple store. In this example, the metadata is structured in line with the schema.org Organization, PublicationEvent and DigitalDocument schemas.

The additional metadata is optional, so the metadata element can be omitted. However, if additional metadata is known, it can be integrated with TrustGraph processing.

In [1]:
import requests
import json
import base64

In [2]:
text = """

My name is Mark.

I have 2 cats:
- Fred is a big, fat, orange, stripy cat.  He is 12 years old and has 4 legs.
- Hope is a small, black cat.  She is 7 years old and also has 4 legs.

Fred has 4 legs.

Hope has 4 legs.

Fred and Hope are nice animals, but occasionally they fight.

Fred is lazy and sleeps a lot.  Hope is energetic, runs around a lot and
climbs trees.

Both cats have tails and whiskers like all cats do.

Cats have the species name Felis catus.

The cat (Felis catus), also referred to as domestic cat or house cat, is a
small domesticated carnivorous mammal. It is the only domesticated species of
the family Felidae. Advances in archaeology and genetics have shown that the
domestication of the cat occurred in the Near East around 7500 BC. It is
commonly kept as a pet and farm cat, but also ranges freely as a feral cat
avoiding human contact. Valued by humans for companionship and its ability to
kill vermin, the cat's retractable claws are adapted to killing small prey
like mice and rats. It has a strong, flexible body, quick reflexes, and sharp
teeth, and its night vision and sense of smell are well developed. It is a
social species, but a solitary hunter and a crepuscular predator. Cat
communication includes vocalizations—including meowing, purring, trilling,
hissing, growling, and grunting–as well as body language. It can hear sounds
too faint or too high in frequency for human ears, such as those made by small
mammals. It secretes and perceives pheromones.
"""

In [3]:
# ID of flow
flow = "default"
base_url = "http://localhost:8088"

In [4]:
# URL of the TrustGraph text load API
url = f"{base_url}/api/v1/flow/{flow}/service/text-load"

In [5]:
# Some random identifiers.  The doc ID is important, as extracted knowledge is linked back to this identifier
org_id = "https://trustgraph.ai/org/3c35111a-f8ce-54b2-4dd6-c673f8bf0d09"
doc_id = "https://trustgraph.ai/doc/4faa45c1-f91a-a96a-d44f-2e57b9813db8"
pub_id = "https://trustgraph.ai/pubev/a847d950-a281-4099-aaab-c5e35333ff61"

In [6]:
# Organization metadata
org_facts = [
    [org_id, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "https://schema.org/Organization"],
    [org_id, "http://www.w3.org/2000/01/rdf-schema#label", "trustgraph.ai"],
    [org_id, "https://schema.org/name", "trustgraph.ai"]
]

In [7]:
# Publication metadata.  Note how it links to the Organization
pub_facts = [
    [pub_id, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "https://schema.org/PublicationEvent"],
    [pub_id, "https://schema.org/description", "Uploading to Github"],
    [pub_id, "https://schema.org/endDate", "2024-10-23"],
    [pub_id, "https://schema.org/publishedBy", org_id],
    [pub_id, "https://schema.org/startDate", "2024-10-23"]
]

In [8]:
# Document metadata.  Note how it links to the publication event
doc_facts = [
    [doc_id, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "https://schema.org/DigitalDocument"],
    [doc_id, "http://www.w3.org/2000/01/rdf-schema#label", "Mark's cats"],
    [doc_id, "https://schema.org/copyrightHolder", "trustgraph.ai"],
    [doc_id, "https://schema.org/copyrightNotice", "Public domain"],
    [doc_id, "https://schema.org/copyrightYear", "2024"],
    [doc_id, "https://schema.org/description", "This document describes Mark's cats"],
    [doc_id, "https://schema.org/keywords", "animals"],
    [doc_id, "https://schema.org/keywords", "cats"],
    [doc_id, "https://schema.org/keywords", "home-life"],
    [doc_id, "https://schema.org/name", "Mark's cats"],
    [doc_id, "https://schema.org/publication", pub_id],
    [doc_id, "https://schema.org/url", "https://example.com"]
]

In [9]:
# Convert the above metadata into the right form
metadata = [
    {
        "s": {
            "v": t[0],
            "e": True,
        },
        "p": {
            "v": t[1],
            "e": True,
        },
        "o": {
            "v": t[2],
            "e": t[2].startswith("http")
        }
    }
    for t in org_facts + pub_facts + doc_facts
]
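
The conversion above can also be written as a small helper. This is a minimal sketch, not part of the TrustGraph API: the names `is_iri` and `to_triple` are hypothetical, and it assumes (based on the output shown in this notebook) that the `e` flag is `True` for URI values and `False` for plain literals. It uses `urllib.parse` rather than `startswith("http")`, so a literal that merely begins with the letters "http" is not mistaken for a URI.

```python
from urllib.parse import urlparse


def is_iri(value: str) -> bool:
    """Return True if value parses as an absolute http(s) URL."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def to_triple(s: str, p: str, o: str) -> dict:
    """Build one metadata triple in the s/p/o form used in this notebook.

    Assumption: subjects and predicates are always URIs, and the object
    is treated as a URI only when it looks like an absolute http(s) URL.
    """
    return {
        "s": {"v": s, "e": True},
        "p": {"v": p, "e": True},
        "o": {"v": o, "e": is_iri(o)},
    }


example = to_triple(
    "https://trustgraph.ai/doc/4faa45c1-f91a-a96a-d44f-2e57b9813db8",
    "https://schema.org/keywords",
    "cats",
)
print(example["o"])
```

With this helper, the list comprehension in the cell above could be written as `[to_triple(*t) for t in org_facts + pub_facts + doc_facts]`.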

In [10]:
metadata

[{'s': {'v': 'https://trustgraph.ai/org/3c35111a-f8ce-54b2-4dd6-c673f8bf0d09',
   'e': True},
  'p': {'v': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'e': True},
  'o': {'v': 'https://schema.org/Organization', 'e': True}},
 {'s': {'v': 'https://trustgraph.ai/org/3c35111a-f8ce-54b2-4dd6-c673f8bf0d09',
   'e': True},
  'p': {'v': 'http://www.w3.org/2000/01/rdf-schema#label', 'e': True},
  'o': {'v': 'trustgraph.ai', 'e': False}},
 {'s': {'v': 'https://trustgraph.ai/org/3c35111a-f8ce-54b2-4dd6-c673f8bf0d09',
   'e': True},
  'p': {'v': 'https://schema.org/name', 'e': True},
  'o': {'v': 'trustgraph.ai', 'e': False}},
 {'s': {'v': 'https://trustgraph.ai/pubev/a847d950-a281-4099-aaab-c5e35333ff61',
   'e': True},
  'p': {'v': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'e': True},
  'o': {'v': 'https://schema.org/PublicationEvent', 'e': True}},
 {'s': {'v': 'https://trustgraph.ai/pubev/a847d950-a281-4099-aaab-c5e35333ff61',
   'e': True},
  'p': {'v': 'https://schema.org/desc

In [11]:
# The input
input = {

    # Document identifier.  Knowledge derived by TrustGraph is linked to this identifier, so
    # the additional metadata specified above is linked to the derived knowledge, and users of
    # the knowledge graph can see information about the source of the knowledge
    "id": doc_id,

    # Additional metadata in the form of RDF triples
    "metadata": metadata,

    # Text character set.  Default is UTF-8
    "charset": "utf-8",

    # The text, presented as a base64-encoded string.
    "text": base64.b64encode(text.encode("utf-8")).decode("utf-8")
    
}
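
Before submitting, it can be useful to confirm that the encoded text decodes back to the original. The following is a small sketch of that round trip. The helper names `encode_text` and `decode_text` are hypothetical and only wrap the standard `base64` module; they assume the same character set is used on both sides.

```python
import base64


def encode_text(text: str, charset: str = "utf-8") -> str:
    """Encode text to bytes with the given charset, then to a base64 string."""
    return base64.b64encode(text.encode(charset)).decode("ascii")


def decode_text(encoded: str, charset: str = "utf-8") -> str:
    """Reverse encode_text: base64-decode, then decode bytes with the charset."""
    return base64.b64decode(encoded).decode(charset)


sample = "Fred is a big, fat, orange, stripy cat."
encoded = encode_text(sample)

# The round trip should reproduce the original text exactly
assert decode_text(encoded) == sample
```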

In [12]:
# Invoke the API, input is passed as JSON
resp = requests.post(url, json=input)

In [13]:
# Should be a 200 status code
resp.status_code

200

In [14]:
# The document load returns no response body.  A 200 status code indicates that the submitted text has been queued to enter the processing flows
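
Rather than inspecting the status code by hand, the submission can be wrapped in a function that raises on an HTTP error. This is a sketch under stated assumptions: `submit_text` is a hypothetical helper, not a TrustGraph function, and the `post` parameter exists only so that the call can be exercised without a running service. By default it falls back to `requests.post`.

```python
def submit_text(url, payload, post=None, timeout=30):
    """POST the payload as JSON and return the HTTP status code.

    Raises an exception (via raise_for_status) on a 4xx or 5xx response.
    `post` is an injectable callable, assumed to behave like requests.post.
    """
    if post is None:
        # Imported lazily so the helper can be tested without requests installed
        import requests
        post = requests.post

    resp = post(url, json=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.status_code
```

In this notebook, the call would be `submit_text(url, input)`, which is expected to return `200` when the text has been accepted.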