# Loading a document

This notebook demonstrates how to use the TrustGraph librarian API to submit a document for processing.

The API optionally accepts arbitrary extra metadata alongside the document.  This metadata is associated with the document and added to the triple store.  In this example, the metadata follows the schema.org Organization, PublicationEvent and DigitalDocument schemas.

The additional metadata is optional, so the metadata element can be omitted.  However, if additional metadata is known, it can be integrated with TrustGraph processing.

The processing in this notebook uses the following API calls:
- Load a document into the library
- Create a new flow
- Submit the document for processing in that flow

This will fail if the document, flow and flow submission already exist, so don't execute this notebook more than once.
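
If a cell does get re-executed, one hedged way to cope is to wrap each call so that a failure is reported instead of aborting the run. The helper below is a hypothetical illustration, not part of the TrustGraph API. It catches a broad `Exception` because the exception type raised for duplicates is not stated in this notebook, and `create_once` is only a stand-in for a real API call.

```python
def run_step(label, fn, *args, **kwargs):
    """Call fn; return (True, result) on success or (False, exception) on failure."""
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:  # assumption: duplicates surface as exceptions
        print(f"{label}: skipped ({exc})")
        return False, exc
    print(f"{label}: ok")
    return True, result


# Stand-in for an API call that fails when the same name is created twice
_seen = set()


def create_once(name):
    if name in _seen:
        raise RuntimeError(f"{name} already exists")
    _seen.add(name)
    return name


first = run_step("create flow", create_once, "my-flow2")
second = run_step("create flow", create_once, "my-flow2")
```

In this notebook a bound method such as `cli.flow().start` could be passed as `fn` with its keyword arguments. Catching every exception also hides genuine failures, so the `except` clause should be narrowed once the actual exception type is known.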

In [1]:
!pip install trustgraph-base

In [2]:
import trustgraph.api as tg
from trustgraph.knowledge import DigitalDocument, Organization
from trustgraph.knowledge import PublicationEvent, hash, to_uri
from trustgraph.knowledge import PREF_PUBEV, PREF_ORG, PREF_DOC
import time

In [3]:
# Read a PDF document from this repo.  PDFs are binary, so open in "rb" mode
with open("../sources/Challenger-Report-Vol1.pdf", "rb") as f:
    pdf = f.read()

title = "Challenger Report Volume 1"
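
Before uploading, it can be useful to confirm that the bytes read really are a PDF. The sketch below relies only on the fact that PDF files begin with the marker `%PDF-`. The function name `looks_like_pdf` is hypothetical and the check is a sanity test, not a full validation.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Small in-memory samples so the sketch runs without the source file
pdf_sample = b"%PDF-1.4\n%stub content\n"
text_sample = b"plain text, not a PDF"

pdf_ok = looks_like_pdf(pdf_sample)
text_ok = looks_like_pdf(text_sample)
```

In the notebook, the same check could be applied to the `pdf` variable before it is passed to `add_document`.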

In [4]:
cli = tg.Api()

In [5]:
org_id = to_uri(PREF_ORG, "1dd51ece-8bd3-48b8-98ce-1ac9164c5214")
doc_id = to_uri(PREF_DOC, "72ef3374-af7a-40c4-8c7b-45050aef5b90")
pub_id = to_uri(PREF_PUBEV, "59012ae1-65d4-441f-8288-b6f3c6c15333")
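
The identifiers above are fixed UUIDs, so a second run reuses them. A minimal sketch of generating fresh identifiers with the standard `uuid` module follows. `fresh_uri` and `EXAMPLE_PREFIX` are hypothetical names, and plain string concatenation is assumed here for illustration rather than taken from `to_uri`.

```python
import uuid

# Hypothetical prefix for illustration; the notebook uses PREF_DOC and friends
EXAMPLE_PREFIX = "https://example.com/doc/"


def fresh_uri(prefix: str) -> str:
    """Return the prefix followed by a newly generated random (version 4) UUID."""
    return prefix + str(uuid.uuid4())


uri_a = fresh_uri(EXAMPLE_PREFIX)
uri_b = fresh_uri(EXAMPLE_PREFIX)
print(uri_a)
```

In the notebook itself, `str(uuid.uuid4())` could be passed to `to_uri` in place of each literal UUID string. Note that fresh identifiers create new records on every run rather than updating existing ones.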

In [6]:
org = Organization(
    id = org_id,
    name = "NASA",
)

pubev = PublicationEvent(
    id = pub_id,
    description = "Presidential commission publication",
    start_date = "1986-06-06",
    end_date = "1986-06-06",
    organization = org,
)

doc = DigitalDocument(
    id = doc_id,
    name = "Challenger Report Volume 1",
    description = "The findings of the Presidential Commission regarding the circumstances surrounding the Challenger accident are reported and recommendations for corrective action are outlined",
    copyright_holder = "US Government",
    copyright_notice = "Work of the US Gov. Public Use Permitted",
    copyright_year = "1986",
    keywords = ["nasa", "challenger", "space-shuttle", "shuttle", "orbiter"],
    publication = pubev,
    url = "https://ntrs.nasa.gov/citations/19860015255",
)
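
As a rough illustration of the schema.org vocabulary these objects are modelled on, the same facts can be written as a JSON-LD style dictionary. This is a sketch only: it does not show how TrustGraph serialises the metadata or which triples it emits, and mapping the `organization` field to schema.org's `publishedBy` property is an assumption made for this example.

```python
import json

# Illustration only: schema.org property names, not TrustGraph's internal format
organization = {"@type": "Organization", "name": "NASA"}

publication_event = {
    "@type": "PublicationEvent",
    "description": "Presidential commission publication",
    "startDate": "1986-06-06",
    "endDate": "1986-06-06",
    "publishedBy": organization,  # assumed mapping for the organization field
}

document = {
    "@context": "https://schema.org",
    "@type": "DigitalDocument",
    "name": "Challenger Report Volume 1",
    "copyrightYear": "1986",
    "keywords": ["nasa", "challenger", "space-shuttle", "shuttle", "orbiter"],
    "publication": publication_event,
    "url": "https://ntrs.nasa.gov/citations/19860015255",
}

serialized = json.dumps(document, indent=2)
print(serialized)
```

The nesting mirrors the cell above: the document refers to a publication event, which in turn refers to an organization.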

In [7]:
resp = cli.library().add_document(
    document = pdf,
    id = doc_id,
    metadata = doc,
    user = "trustgraph",
    title = title,
    comments = "A test data document",
    kind = "application/pdf",
    tags = [ "nasa", "safety engineering" ]
)

# Start a flow

In [8]:
cli.flow().start(
    class_name = "document-rag+graph-rag", 
    id = "my-flow2",
    description = "My new flow",
)

# Submit document for processing

In [9]:
cli.library().start_processing(
    id = "proc02",
    document_id = doc_id,
    flow = "my-flow2",
    user = "trustgraph",
    collection = "default",
    tags = [ "my document", "processing test" ]
)

{}