# Amazon SageMaker Lineage
Amazon SageMaker Lineage records events that happen within SageMaker as a graph so that they can be traced. This data simplifies generating reports, making comparisons, and discovering relationships between events. For example, you can trace both how a model was generated and where it was deployed.

SageMaker creates the lineage graph automatically, and you can also create or modify lineage entities directly.


## Key Concepts

* **Lineage Graph** - A connected graph tracing your machine learning workflow end to end.
* **Artifacts** - Represent a URI-addressable object or data. Artifacts are typically inputs to or outputs of Actions.
* **Actions** - Represent an action taken, such as a computation, transformation, or job.
* **Contexts** - Provide a method to logically group other entities.
* **Associations** - Directed edges in the lineage graph, each linking two entities.
* **Lineage Traversal** - Starting from an arbitrary point, trace the lineage graph to discover and analyze relationships between steps in your workflow.
* **Experiments** - Experiment entities (Experiments, Trials, and Trial Components) are also part of the lineage graph and can be associated with Artifacts, Actions, or Contexts.
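
To make these terms concrete, here is a minimal in-memory sketch of a lineage graph. It does not use the SageMaker SDK: the `Entity` and `Edge` classes and the `downstream` helper are hypothetical names invented for illustration, and the entity names only mirror the ones created later in this notebook.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A node in the toy graph: an Artifact, Action, or Context."""
    name: str
    kind: str


@dataclass(frozen=True)
class Edge:
    """A directed association from `source` to `destination`."""
    source: Entity
    destination: Entity
    association_type: str = "AssociatedWith"


def downstream(start, edges):
    """Return every entity reachable from `start` by following edges forward."""
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge.source].append(edge.destination)

    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in outgoing[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


images = Entity("mnist-test-images", "Artifact")
build = Entity("model-build-step", "Action")
model = Entity("mnist-model", "Artifact")

edges = [
    Edge(images, build, "ContributedTo"),
    Edge(build, model, "Produced"),
]

print([e.name for e in downstream(images, edges)])
```

Starting from the input artifact, the traversal reaches the build action first and then the model artifact it produced, which is the kind of question lineage traversal answers on the real graph.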


## Notebook Overview

This notebook demonstrates how to:
* Understand the basics of lineage entities.
* Create and associate lineage entities to track your workflow.
* Traverse the associations between lineage entities.

## Prerequisites

Select the `Python 3 (Data Science)` kernel in SageMaker Studio.

In [None]:
import logging
import os

import boto3
import pandas as pd
import sagemaker
from botocore.exceptions import ClientError

# SageMaker session, default S3 bucket, and execution role for this notebook
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# low-level SageMaker client
sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [17]:
from datetime import datetime
from sagemaker.lineage.context import Context
from sagemaker.lineage.action import Action
from sagemaker.lineage.association import Association
from sagemaker.lineage.artifact import Artifact

In [None]:
from sagemaker.lineage.visualizer import LineageTableVisualizer

In [None]:
unique_id = str(int(datetime.now().replace(microsecond=0).timestamp()))

print(f'Unique id is {unique_id}')

In [3]:
# create an example context

# the name must be unique across all other contexts
context_name = f'machine-learning-workflow-{unique_id}' 

ml_workflow_context = Context.create(
    context_name=context_name, 
    context_type='MLWorkflow',    
    source_uri=unique_id,
    # properties store metadata on lineage entities, in addition to Tags
    properties={"example": "true"})

In [4]:
# list all the contexts

contexts = Context.list(sort_by='CreationTime', sort_order='Descending')

for ctx in contexts:
    print(ctx.context_name)

machine-learning-workflow-1609278631


In [5]:
# create an example action and associate it with the context

model_build_action = Action.create(
    action_name=f"model-build-step-{unique_id}",
    action_type="ModelBuild",
    source_uri=unique_id,
    properties={"Example": "Metadata"},
)

In [6]:
# Association Type can be Produced|DerivedFrom|AssociatedWith|ContributedTo
context_action_association = Association.create(
    source_arn=ml_workflow_context.context_arn,
    destination_arn=model_build_action.action_arn,
    association_type='AssociatedWith'
)

In [7]:
# now the Action and Context are associated:
incoming_associations_to_action = Association.list(destination_arn=model_build_action.action_arn)
for association in incoming_associations_to_action:
    print(f'{model_build_action.action_name} has an incoming association from {association.source_name}')

outgoing_associations_from_context = Association.list(source_arn=ml_workflow_context.context_arn)
for association in outgoing_associations_from_context:
    print(f'{ml_workflow_context.context_name} has an outgoing association to {association.destination_name}')

model-build-step-1609278631 has an incoming association from machine-learning-workflow-1609278631
machine-learning-workflow-1609278631 has an outgoing association to model-build-step-1609278631


In [8]:
# create an artifact representing inputs to the model building action
input_test_images = Artifact.create(
    artifact_name='mnist-test-images',
    artifact_type='TestData',
    source_types=[{"SourceIdType": "Custom", "Value": unique_id}],
    source_uri='https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz')

input_test_labels = Artifact.create(
    artifact_name='mnist-test-labels',
    artifact_type='TestLabels',
    source_types=[{"SourceIdType": "Custom", "Value": unique_id}],
    source_uri='https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz')

In [9]:
# create an artifact representing a trained model
output_model = Artifact.create(
    artifact_name='mnist-model',
    artifact_type='Model',
    source_types=[{"SourceIdType": "Custom", "Value": unique_id}],
    source_uri='s3://sagemaker-sample-files/datasets/image/MNIST/model/tensorflow-training-2020-11-20-23-57-13-077/model.tar.gz'
)

In [10]:
# associate the data set artifact with an incoming association to the example action
Association.create(source_arn=input_test_images.artifact_arn, destination_arn=model_build_action.action_arn)
Association.create(source_arn=input_test_labels.artifact_arn, destination_arn=model_build_action.action_arn)

Association(sagemaker_session=<sagemaker.session.Session object at 0x7f5bdfa67d10>,source_arn='arn:aws:sagemaker:us-east-1:231218423789:artifact/8eb66641d3961643ec91e3e7e3f49b28',destination_arn='arn:aws:sagemaker:us-east-1:231218423789:action/model-build-step-1609278631',association_type=None,response_metadata={'RequestId': '94b3957d-fe88-4a33-8a90-16c55444c408', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '94b3957d-fe88-4a33-8a90-16c55444c408', 'content-type': 'application/x-amz-json-1.1', 'content-length': '193', 'date': 'Tue, 29 Dec 2020 21:50:51 GMT'}, 'RetryAttempts': 0})

In [11]:
# associate the example action with an outgoing association to the model artifact
Association.create(source_arn=model_build_action.action_arn, destination_arn=output_model.artifact_arn)

Association(sagemaker_session=<sagemaker.session.Session object at 0x7f5bde901590>,source_arn='arn:aws:sagemaker:us-east-1:231218423789:action/model-build-step-1609278631',destination_arn='arn:aws:sagemaker:us-east-1:231218423789:artifact/7bf4d7516ddfd81e9acfec3e20f0f71a',association_type=None,response_metadata={'RequestId': 'ec9f26f6-7153-4547-be1e-3b3fe30fc022', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'ec9f26f6-7153-4547-be1e-3b3fe30fc022', 'content-type': 'application/x-amz-json-1.1', 'content-length': '193', 'date': 'Tue, 29 Dec 2020 21:50:52 GMT'}, 'RetryAttempts': 0})
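
With both associations in place, the graph can be walked from the model back to everything that fed into it. The sketch below is one possible way to do that, not an SDK feature: `trace_upstream` and `list_incoming` are hypothetical names, and the listing call is passed in as a parameter so the function can be exercised without AWS access. It assumes each returned summary exposes a `source_arn` attribute, as the cleanup code later in this notebook does. The ARNs in the stand-in data are shortened placeholders.

```python
from collections import deque
from types import SimpleNamespace


def trace_upstream(start_arn, list_incoming):
    """Breadth-first walk over incoming associations, starting at `start_arn`.

    `list_incoming(arn)` must return an iterable of objects that have a
    `source_arn` attribute. Returns the ARNs found, nearest first.
    """
    seen = {start_arn}
    queue = deque([start_arn])
    found = []
    while queue:
        arn = queue.popleft()
        for summary in list_incoming(arn):
            if summary.source_arn not in seen:
                seen.add(summary.source_arn)
                found.append(summary.source_arn)
                queue.append(summary.source_arn)
    return found


# Stand-in for the associations created above (placeholder ARNs, not real ones).
fake_incoming = {
    "artifact/model": ["action/model-build-step"],
    "action/model-build-step": [
        "artifact/test-images",
        "artifact/test-labels",
        "context/workflow",
    ],
}


def fake_list_incoming(arn):
    return [SimpleNamespace(source_arn=s) for s in fake_incoming.get(arn, [])]


print(trace_upstream("artifact/model", fake_list_incoming))
```

In the notebook, the same function could be called as `trace_upstream(output_model.artifact_arn, lambda arn: Association.list(destination_arn=arn))`, which issues one list call per entity visited.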

## List All Artifacts With SageMaker ArtifactAnalytics

In [14]:
from sagemaker.analytics import ArtifactAnalytics
analytics = ArtifactAnalytics()
df = analytics.dataframe()

In [15]:
df

| | ArtifactName | ArtifactArn | ArtifactType | ArtifactSourceUri | CreationTime | LastModifiedTime |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | Model | arn:aws:sagemaker:us-east-1:231218423789:model... | 2020-12-22 17:43:43.494000+00:00 | 2020-12-22 19:37:07.535000+00:00 |
| 1 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | Model | s3://sagemaker-us-east-1-231218423789/BERT/out... | 2020-12-22 17:43:40.963000+00:00 | 2020-12-22 17:43:40.963000+00:00 |
| 2 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | DataSet | s3://sagemaker-us-east-1-231218423789/sagemake... | 2020-12-22 17:29:01.770000+00:00 | 2020-12-22 17:29:01.770000+00:00 |
| 3 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | DataSet | s3://sagemaker-us-east-1-231218423789/sagemake... | 2020-12-22 17:29:01.720000+00:00 | 2020-12-22 17:29:01.720000+00:00 |
| 4 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | DataSet | s3://sagemaker-us-east-1-231218423789/sagemake... | 2020-12-22 17:29:01.656000+00:00 | 2020-12-22 17:29:01.656000+00:00 |
| ... | ... | ... | ... | ... | ... | ... |
| 124 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | DataSet | s3://sagemaker-us-east-1-231218423789/tensorfl... | 2020-12-18 17:32:41.003000+00:00 | 2020-12-18 17:32:41.003000+00:00 |
| 125 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | DataSet | s3://sagemaker-us-east-1-231218423789/tensorfl... | 2020-12-18 17:32:40.988000+00:00 | 2020-12-18 17:32:40.988000+00:00 |
| 126 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | DataSet | s3://sagemaker-us-east-1-231218423789/tensorfl... | 2020-12-18 17:32:40.922000+00:00 | 2020-12-18 17:32:40.922000+00:00 |
| 127 | | arn:aws:sagemaker:us-east-1:231218423789:artif... | Image | 503895931360.dkr.ecr.us-east-1.amazonaws.com/s... | 2020-12-18 17:32:40.877000+00:00 | 2020-12-18 17:32:40.877000+00:00 |


In [None]:
# tuner = HyperparameterTuningJobAnalytics("my-tuning-job", sagemaker_session=session)
# df = tuner.dataframe()

In [None]:
# trainer = TrainingJobAnalytics("my-training-job", ["train:acc"], sagemaker_session=session)
# df = trainer.dataframe()

## Visualize Lineage Graph as DataFrame

### Artifacts for Processing Job

In [38]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer

viz = LineageTableVisualizer(sagemaker.session.Session())

df = viz.show(processing_job_name='pipelines-jvk06bcbvsxx-processing-v21c4wmlos')
df

| | Name/Source | Direction | Type | Association Type | Lineage Type |
| --- | --- | --- | --- | --- | --- |
| 0 | preprocess-scikit-text-to-bert.py | Input | DataSet | ContributedTo | artifact |
| 1 | s3://...2020-12-29-22-19-21-922/output/bert-test | Output | DataSet | Produced | artifact |
| 2 | s3://...2-29-22-19-21-922/output/bert-validation | Output | DataSet | Produced | artifact |
| 3 | s3://...020-12-29-22-19-21-922/output/bert-train | Output | DataSet | Produced | artifact |


### Artifacts for Training Job

In [24]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer

viz = LineageTableVisualizer(sagemaker.session.Session())

df = viz.show(training_job_name='pipelines-jvk06bcbvsxx-Train-ja5pJvXyWE')
df

| | Name/Source | Direction | Type | Association Type | Lineage Type |
| --- | --- | --- | --- | --- | --- |
| 0 | s3://...2020-12-29-22-19-21-922/output/bert-test | Input | DataSet | ContributedTo | artifact |
| 1 | s3://...2-29-22-19-21-922/output/bert-validation | Input | DataSet | ContributedTo | artifact |
| 2 | s3://...020-12-29-22-19-21-922/output/bert-train | Input | DataSet | ContributedTo | artifact |
| 3 | 76310...ws.com/tensorflow-training:2.1.0-cpu-py3 | Input | Image | ContributedTo | artifact |
| 4 | model.tar.gz | Output | Model | Produced | artifact |
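
Comparing the processing and training tables shows that the training job consumed the data sets the processing job produced. The sketch below makes that comparison explicit using plain dictionaries copied from the rows above (only the two columns needed). `handoffs` is a hypothetical helper, and because the visualizer abbreviates long names, the match is on the displayed strings only.

```python
def handoffs(producer_rows, consumer_rows):
    """Names that the producer lists as Output and the consumer lists as Input."""
    produced = {
        row["Name/Source"] for row in producer_rows if row["Direction"] == "Output"
    }
    return [
        row["Name/Source"]
        for row in consumer_rows
        if row["Direction"] == "Input" and row["Name/Source"] in produced
    ]


# Rows copied from the two tables above.
processing_rows = [
    {"Name/Source": "preprocess-scikit-text-to-bert.py", "Direction": "Input"},
    {"Name/Source": "s3://...2020-12-29-22-19-21-922/output/bert-test", "Direction": "Output"},
    {"Name/Source": "s3://...2-29-22-19-21-922/output/bert-validation", "Direction": "Output"},
    {"Name/Source": "s3://...020-12-29-22-19-21-922/output/bert-train", "Direction": "Output"},
]
training_rows = [
    {"Name/Source": "s3://...2020-12-29-22-19-21-922/output/bert-test", "Direction": "Input"},
    {"Name/Source": "s3://...2-29-22-19-21-922/output/bert-validation", "Direction": "Input"},
    {"Name/Source": "s3://...020-12-29-22-19-21-922/output/bert-train", "Direction": "Input"},
    {"Name/Source": "76310...ws.com/tensorflow-training:2.1.0-cpu-py3", "Direction": "Input"},
    {"Name/Source": "model.tar.gz", "Direction": "Output"},
]

for name in handoffs(processing_rows, training_rows):
    print(name)
```

With live results, rows of this shape can be obtained from a DataFrame with `df.to_dict("records")`.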


### Artifacts for Model Package

In [34]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer

viz = LineageTableVisualizer(sagemaker.session.Session())

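# NOTE: the ARN below identifies a model package group; `model_package_arn` is
# meant to receive the ARN of a specific model package, so substitute one here.
df = viz.show(model_package_arn='arn:aws:sagemaker:us-east-1:231218423789:model-package-group/bert-reviews-16092802964020664')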
df

### Artifacts for Trial Component

In [47]:
!aws sagemaker list-trial-components \
    --experiment-name 'Amazon-Customer-Reviews-BERT-Experiment-1608317665'

{
    "TrialComponentSummaries": [
        {
            "TrialComponentName": "tensorflow-training-2020-12-18-18-56-45-812-aws-training-job",
            "TrialComponentArn": "arn:aws:sagemaker:us-east-1:231218423789:experiment-trial-component/tensorflow-training-2020-12-18-18-56-45-812-aws-training-job",
            "DisplayName": "train",
            "TrialComponentSource": {
                "SourceArn": "arn:aws:sagemaker:us-east-1:231218423789:training-job/tensorflow-training-2020-12-18-18-56-45-812",
                "SourceType": "SageMakerTrainingJob"
            },
            "Status": {
                "PrimaryStatus": "Stopped",
                "Message": "Status: Stopped, secondary status: Stopped, failure reason: ."
            },
            "CreationTime": 1608317808.144,
            "CreatedBy": {},
            "LastModifiedTime": 1608325164.65,
            "LastModifiedBy": {}
        },
        {
            "TrialComponentName": "TrialComponent-2020-12-18-185428-cgwf

In [48]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer

viz = LineageTableVisualizer(sagemaker.session.Session())

df = viz.show(trial_component_name='tensorflow-training-2020-12-18-18-56-45-812-aws-training-job')
df

Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type


### Artifacts for All Pipeline Steps

In [None]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer


viz = LineageTableVisualizer(sagemaker.session.Session())
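# `execution` is assumed to be a pipeline execution object from an earlier
# pipeline run; it is not defined in this notebook.
for execution_step in reversed(execution.list_steps()):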
    print(execution_step)
    display(viz.show(pipeline_execution_step=execution_step))
    time.sleep(5)


## Cleanup

In [None]:
import time

# session used by the delete calls below
sagemaker_session = sagemaker.Session()


def delete_associations(arn):
    # delete incoming associations
    incoming_associations = Association.list(destination_arn=arn)
    for summary in incoming_associations:
        assct = Association(
            source_arn=summary.source_arn,
            destination_arn=summary.destination_arn,
            sagemaker_session=sagemaker_session)
        assct.delete()
        time.sleep(3)  # pause between API calls

    # delete outgoing associations
    outgoing_associations = Association.list(source_arn=arn)
    for summary in outgoing_associations:
        assct = Association(
            source_arn=summary.source_arn,
            destination_arn=summary.destination_arn,
            sagemaker_session=sagemaker_session)
        assct.delete()
        time.sleep(3)


def delete_lineage_data():
    # WARNING: this deletes every context, action, and artifact returned by
    # the list calls, not only the entities created in this notebook.
    # Associations are removed before the entity they are attached to.
    for summary in Context.list():
        print(f'Deleting context {summary.context_name}')
        delete_associations(summary.context_arn)
        ctx = Context(context_name=summary.context_name, sagemaker_session=sagemaker_session)
        ctx.delete()
        time.sleep(3)

    for summary in Action.list():
        print(f'Deleting action {summary.action_name}')
        delete_associations(summary.action_arn)
        actn = Action(action_name=summary.action_name, sagemaker_session=sagemaker_session)
        actn.delete()
        time.sleep(3)

    for summary in Artifact.list():
        print(f'Deleting artifact {summary.artifact_arn} {summary.artifact_name}')
        delete_associations(summary.artifact_arn)
        artfct = Artifact(artifact_arn=summary.artifact_arn, sagemaker_session=sagemaker_session)
        artfct.delete()
        time.sleep(3)


delete_lineage_data()
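
Because `delete_lineage_data()` iterates over everything the list calls return, it can remove entities that this notebook did not create. One way to narrow the scope is to keep only the summaries whose name ends with the `unique_id` suffix used when naming the context and action above. This is a sketch under that naming assumption: `created_in_this_run` is a hypothetical helper, and the stand-in objects below replace the real results of `Context.list()`.

```python
from types import SimpleNamespace


def created_in_this_run(summaries, name_attr, unique_id):
    """Keep only summaries whose name ends with this notebook's unique id."""
    suffix = f"-{unique_id}"
    return [
        s for s in summaries
        if (getattr(s, name_attr, None) or "").endswith(suffix)
    ]


# Stand-in summaries; in the notebook these would come from Context.list().
fake_contexts = [
    SimpleNamespace(context_name="machine-learning-workflow-1609278631"),
    SimpleNamespace(context_name="someone-elses-context"),
    SimpleNamespace(context_name=None),
]

mine = created_in_this_run(fake_contexts, "context_name", "1609278631")
print([c.context_name for c in mine])
```

The artifacts in this notebook are named without the unique id (for example `mnist-model`), so they would need a different criterion, such as an explicit list of the ARNs returned by `Artifact.create`.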

## Caveats

* Associations cannot be created between two experiment entities, for example between an Experiment and a Trial.
* Associations can only be created between the following resources: Experiment, Trial, Trial Component, Action, Artifact, and Context.
* The maximum numbers of manually created lineage entities are:
  * Artifacts: 6000
  * Contexts: 500
  * Actions: 3000
  * Associations: 6000
* There is no limit on the number of lineage entities created automatically by SageMaker.
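
The first two caveats can be checked before calling `Association.create`. The sketch below infers the resource type from the ARN. The `action`, `artifact`, and `experiment-trial-component` segments appear in outputs earlier in this notebook; the `context`, `experiment`, and `experiment-trial` segments are assumptions, as are the helper names `resource_type` and `can_associate`. The account id is a placeholder.

```python
# Resource-type segments; see the lead-in for which ones are assumptions.
EXPERIMENT_TYPES = {"experiment", "experiment-trial", "experiment-trial-component"}
LINEAGE_TYPES = {"action", "artifact", "context"}


def resource_type(arn):
    """Extract the resource type from an ARN such as
    arn:aws:sagemaker:us-east-1:123456789012:artifact/abc."""
    resource = arn.split(":", 5)[5]
    return resource.split("/", 1)[0]


def can_associate(source_arn, destination_arn):
    """Apply the first two caveats to a proposed association."""
    src = resource_type(source_arn)
    dst = resource_type(destination_arn)
    allowed = EXPERIMENT_TYPES | LINEAGE_TYPES
    if src not in allowed or dst not in allowed:
        return False
    # Two experiment entities cannot be linked to each other.
    return not (src in EXPERIMENT_TYPES and dst in EXPERIMENT_TYPES)


prefix = "arn:aws:sagemaker:us-east-1:123456789012:"  # placeholder account
print(can_associate(prefix + "artifact/a", prefix + "action/b"))
print(can_associate(prefix + "experiment/e", prefix + "experiment-trial/t"))
```

A check like this only catches the structural rules; the per-account entity limits listed above are still enforced by the service.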

## Contact

Submit any questions or issues to https://github.com/aws/sagemaker-experiments/issues or mention @aws/sagemakerexperimentsadmin