# DerivaML Ingest

DerivaML is a class library built on the Deriva Scientific Asset Management system. It is designed to simplify a number of the basic operations associated with building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for test case

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.demo_catalog import create_demo_catalog, DemoML
from deriva_ml import MLVocab, ExecutionConfiguration, Workflow, DerivaSystemColumns, VersionPart, DatasetSpec, FileSpec
from IPython.display import display, Markdown, HTML, JSON

Set the details for the catalog we want and authenticate to the server if needed.

In [3]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

In [4]:
gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


You are already logged in.


Create a test catalog and get an instance of the DemoML class.

In [5]:
test_catalog = create_demo_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id, use_minid=False)



Execution RID: https://dev.eye-ai.org/id/2035/3SC@33C-9JFN-25R0

## Configure DerivaML Datasets

In Deriva-ML, a dataset is used to aggregate instances of entities. However, before we can create any datasets, we must configure Deriva-ML for the specifics of the datasets. The first step is to tell Deriva-ML what types of user-defined objects can be associated with a dataset.

Note that out of the box, Deriva-ML is configured to allow datasets to contain other datasets (i.e., nested datasets), so we don't need to do anything for that specific configuration.
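To make the idea of nested datasets concrete, here is a minimal sketch in plain Python. The `SketchDataset` class and the member names are hypothetical and exist only for illustration; this is not how Deriva-ML represents datasets internally, just one way to picture a dataset that holds both its own members and child datasets.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDataset:
    """Hypothetical stand-in for a dataset; NOT a Deriva-ML class."""

    name: str
    members: list = field(default_factory=list)   # identifiers of member entities
    children: list = field(default_factory=list)  # nested SketchDataset objects

    def all_members(self):
        """Return the members of this dataset and of every nested dataset."""
        collected = list(self.members)
        for child in self.children:
            collected.extend(child.all_members())
        return collected


# Made-up member identifiers, for illustration only.
training = SketchDataset("training", members=["img-1", "img-2"])
testing = SketchDataset("testing", members=["img-3"])
complete = SketchDataset("complete", children=[training, testing])

print(complete.all_members())
```

The parent dataset has no members of its own here; everything it reports comes from walking its children, which is the property that makes nesting useful for grouping training and testing subsets under one umbrella.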

Now that we have configured our datasets, we need to identify the dataset types so we can distinguish between them.

Now create datasets and populate with elements from the test catalogs.

In [6]:
ml_instance.add_term(MLVocab.workflow_type, "Data Ingest Notebook", description="A Workflow that Ingests data into a catalog")

# Now let's create the workflow and the execution for our ingest program.
api_workflow = ml_instance.create_workflow(
    name="Data Ingest",
    workflow_type="Data Ingest Notebook",
    description="An example of how to use the file table"
)

ingest_execution = ml_instance.create_execution(
    ExecutionConfiguration(
        workflow=api_workflow,
        description="Our Sample Workflow instance")
)



Execution RID: https://dev.eye-ai.org/id/2035/3TM@33C-9JGJ-9WA8

2025-05-27 16:20:48,925 - deriva_ml.INFO - Downloading assets ...
2025-05-27 16:20:49,570 - deriva_ml.INFO - Initialize status finished.


In [7]:
with ingest_execution.execute() as exe:
    files = FileSpec.create_filespecs('/Users/carl/Repos/Projects/deriva-ml/src', 'my stuff')
    exe.add_files(files, file_types=[])

2025-05-27 16:20:53,268 - deriva_ml.INFO - Start execution  ...
2025-05-27 16:20:53,410 - deriva_ml.INFO - Start execution  ...
2025-05-27 16:20:54,205 - deriva_ml.INFO - Successfully run Ml.
2025-05-27 16:20:54,330 - deriva_ml.INFO - Algorithm execution ended.
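As a rough picture of what describing a directory tree for ingest can involve, the sketch below walks a directory with the standard library and records a checksum and size for each file. The `describe_files` helper and the record fields are assumptions made for illustration; this is not the implementation of `FileSpec.create_filespecs`, and the real call may record different information.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_files(root, description):
    """Walk `root` and return one record per regular file.

    Hypothetical helper for illustration; it is NOT FileSpec.create_filespecs.
    """
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            records.append(
                {
                    "path": str(path),
                    "description": description,
                    "md5": hashlib.md5(data).hexdigest(),
                    "length": len(data),
                }
            )
    return records


# Build a small throwaway tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.txt").write_text("world!")
    sketch_records = describe_files(tmp, "my stuff")

for record in sketch_records:
    print(Path(record["path"]).name, record["length"], record["md5"])
```

Recording a checksum alongside the path is a common way to detect when the same file is ingested twice or has changed between runs.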


And now that we have defined some datasets, we can add elements of the appropriate type to them.  We can see what is in our new datasets by listing the dataset members.

In [8]:
# List the files that have been recorded in the catalog.
ml_instance.list_files()

[]

For this example, let's partition the data based on the name of the subject. Of course, in real examples we would do a more complex analysis in deciding what subset goes into each dataset.
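A partition by subject name can be sketched in a few lines of plain Python. The records, the `Subject_Name` and `RID` field names, and the `partition_by_subject` helper below are all made up for illustration; in the notebook the records would come from the catalog, and the column names may differ.

```python
from collections import defaultdict

# Made-up records standing in for rows retrieved from the catalog.
images = [
    {"RID": "img-1", "Subject_Name": "Thing1"},
    {"RID": "img-2", "Subject_Name": "Thing2"},
    {"RID": "img-3", "Subject_Name": "Thing1"},
    {"RID": "img-4", "Subject_Name": "Thing2"},
]


def partition_by_subject(records, key="Subject_Name"):
    """Group record identifiers by the value found under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["RID"])
    return dict(groups)


partitions = partition_by_subject(images)
for subject, rids in partitions.items():
    print(subject, rids)
```

Each resulting list of identifiers could then be used as the member list for one dataset.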

In [None]:
display(HTML(f'<a href="{ml_instance.chaise_url("Dataset")}">Browse Datasets</a>'))

In [None]:
test_catalog.delete_ermrest_catalog(really=True)