
# Train the model
- Write a script that reads the train and test data and trains an XGBoost model.
- Create a training job using the above script

References:
 - https://github.com/aws/amazon-sagemaker-examples/blob/4534bff4b5b5062af5789d98c4ddca01b0cb5d1f/end_to_end/fraud_detection/2-lineage-train-assess-bias-tune-registry-e2e.ipynb

In [1]:
import sagemaker
import boto3
from sagemaker.xgboost.estimator import XGBoost



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
sagemaker_session = sagemaker.Session()
REGION = sagemaker_session.boto_region_name
BUCKET = sagemaker_session.default_bucket()
ROLE = sagemaker.get_execution_role()
PREFIX = "FraudDetection_AutoInsurance"
print(REGION)
print(BUCKET)
print(ROLE)

print(sagemaker_session.account_id())
#print(sagemaker_session.list_s3_files(bucket=BUCKET, key_prefix=PREFIX))
#sagemaker_session.list_feature_groups()

us-east-1
sagemaker-us-east-1-205930620783
arn:aws:iam::205930620783:role/service-role/AmazonSageMaker-ExecutionRole-20250401T145997
205930620783


In [3]:
# Setup a default session
# The following code is likely to be needed when calling AWS services from a local system.

boto3.setup_default_session(region_name=REGION)
boto_session = boto3.Session(region_name=REGION)
s3_client = boto_session.client('s3')
sagemaker_client = boto3.client('sagemaker', region_name=REGION)

sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sagemaker_client)
account_id = sagemaker_session.account_id()
print(account_id)

205930620783


In [4]:
ESTIMATOR_OUTPUT_PATH = f"s3://{BUCKET}/{PREFIX}/training_jobs"
train_instance_count = 1
train_instance_type = "ml.c5.xlarge"

train_data_uri = f"s3://{BUCKET}/{PREFIX}/data/train.csv"
test_data_uri = f"s3://{BUCKET}/{PREFIX}/data/test.csv"
#data_uri = f"s3://{BUCKET}/{PREFIX}/data/"
model_out_uri = f"s3://{BUCKET}/{PREFIX}/model/"
target_var = "fraud"
#print(data_uri)
#s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data/

### First, write a training script
The script trains a model on the input train data, validates it on the test data, and outputs the model and the model results.
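
A minimal sketch of the argument parsing such a script might use follows. The argument names and most defaults mirror the `Namespace` printed in the local test run in this notebook. The `train.csv`/`test.csv` file-name defaults and the fallbacks to the `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, `SM_MODEL_DIR`, and `SM_OUTPUT_DATA_DIR` environment variables are assumptions, not the actual contents of `xgboost_model_script.py`.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose path defaults fall back to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # Data locations: inside a training job SageMaker sets SM_CHANNEL_<NAME>
    # for each channel passed to fit(). Locally these are unset, so the
    # defaults become None (assumed behaviour, not the author's actual script).
    parser.add_argument("--train-data-path", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--train-file", default="train.csv")  # assumed default
    parser.add_argument("--test-data-path", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--test-file", default="test.csv")  # assumed default
    # Output locations
    parser.add_argument("--model-out-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--model-data-out-dir", default=os.environ.get("SM_OUTPUT_DATA_DIR"))
    # Model settings (defaults copied from the Namespace printed in the local run)
    parser.add_argument("--target-var", default="fraud")
    parser.add_argument("--max-depth", type=int, default=6)
    parser.add_argument("--eta", type=float, default=0.3)
    parser.add_argument("--objective", default="binary:logistic")
    parser.add_argument("--num-boost-round", type=int, default=1)
    parser.add_argument("--nfold", type=int, default=2)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--train-file", "train.csv", "--max-depth", "4"])
print(args.train_file, args.max_depth)
```

argparse turns hyphens into underscores, so `--max-depth` is read back as `args.max_depth`.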

### Now test it locally

In [5]:
! python xgboost_model_script.py --train-file "s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data/train.csv" \
--test-file "s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data/test.csv" \
--model-out-dir "model/" \
--model-data-out-dir "model_results/"

Parsing arguments...
Namespace(train_data_path=None, train_file='s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data/train.csv', test_data_path=None, test_file='s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data/test.csv', model_out_dir='model/', model_data_out_dir='model_results/', target_var='fraud', features=None, max_depth=6, eta=0.3, objective='binary:logistic', num_boost_round=1, nfold=2)
s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data/train.csv
Traceback (most recent call last):
  File "/home/sagemaker-user/Code/EndToEnd_FraudDetection/xgboost_model_script.py", line 32, in <module>
    print("Reading Train Data From", os.path.join(args.train_data_path, args.train_file))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 76, in join
TypeError: expected str, bytes or os.PathLike object, not NoneType
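
The traceback above arises because `args.train_data_path` is `None` when the script runs outside a training job, and `os.path.join` does not accept `None`. One possible guard is sketched here. `resolve_data_path` is a hypothetical helper, not part of the original script, and `my-bucket` is a placeholder; the sketch also assumes the reader used by the script can open an S3 URI directly.

```python
import os
from typing import Optional


def resolve_data_path(data_dir: Optional[str], file_name: str) -> str:
    """Return the location to read from, tolerating a missing channel directory."""
    # A missing directory or a full S3 URI means file_name is already complete.
    if data_dir is None or file_name.startswith("s3://"):
        return file_name
    return os.path.join(data_dir, file_name)


# Local run: no channel directory, full S3 URI passed as --train-file
local_path = resolve_data_path(None, "s3://my-bucket/data/train.csv")
# Training job: channel directory plus a bare file name
job_path = resolve_data_path("/opt/ml/input/data/train", "train.csv")
print(local_path)
print(job_path)
```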


### How do we run this training script on an EC2 instance?
### Using SageMaker Estimators

In [6]:
xgb_estimator = XGBoost(
    entry_point="xgboost_model_script.py",
    output_path=ESTIMATOR_OUTPUT_PATH,
    code_location=ESTIMATOR_OUTPUT_PATH,
    hyperparameters={'target-var':'fraud', 'max-depth':6, 'eta':0.3, 'objective':'binary:logistic', 'num-boost-round':100, 'nfold':5},
    role=ROLE,
    instance_count=train_instance_count,
    instance_type=train_instance_type,
    framework_version="1.0-1"
)

#### The key 'train' passed to `fit()` corresponds to the channel SM_CHANNEL_TRAIN, which the training script can read as an environment variable; the script's train-data-path argument defaults to SM_CHANNEL_TRAIN
What Happens at Runtime?
- SageMaker downloads the training data to /opt/ml/input/data/train
- Sets the env var SM_CHANNEL_TRAIN=/opt/ml/input/data/train
- Passes the hyperparameters as CLI arguments (--max-depth, --eta, etc.)
- Your script uses argparse to read these values
- You save the model to the model directory (/opt/ml/model, exposed as SM_MODEL_DIR)
- SageMaker uploads that directory to S3
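
To make the list above concrete, the sketch here mimics how a hyperparameter dictionary and a channel name map onto command-line arguments and an environment variable name. Both helpers are hypothetical illustrations written for this note, not SageMaker SDK functions; the real conversion happens inside the training container.

```python
from typing import Dict, List


def hyperparameters_to_argv(hyperparameters: Dict[str, object]) -> List[str]:
    """Flatten {'max-depth': 6} into ['--max-depth', '6'] style arguments."""
    argv: List[str] = []
    for key, value in hyperparameters.items():
        argv.extend([f"--{key}", str(value)])
    return argv


def channel_env_var(channel_name: str) -> str:
    """Name of the environment variable that holds a channel's local directory."""
    return f"SM_CHANNEL_{channel_name.upper()}"


argv = hyperparameters_to_argv({"max-depth": 6, "eta": 0.3})
print(argv)                      # ['--max-depth', '6', '--eta', '0.3']
print(channel_env_var("train"))  # SM_CHANNEL_TRAIN
```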

In [7]:
xgb_estimator.fit(inputs={'train':train_data_uri, 'test':test_data_uri})

2025-05-30 09:09:36 Starting - Starting the training job...
2025-05-30 09:09:51 Starting - Preparing the instances for training...
2025-05-30 09:10:38 Downloading - Downloading the training image......
2025-05-30 09:11:39 Training - Training image download completed. Training in progress.
2025-05-30 09:11:39 Uploading - Uploading generated training model
[2025-05-30 09:11:28.356 ip-10-2-66-37.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Invoking user training script.
INFO:sagemaker-containers:Module xgboost_model_script does not provide a setup.py.
Generating setup.py
INFO:sagemaker-containers:Generating setup.cfg
INFO:sagemaker-containers:Generating MANIFEST.in
INFO:sagemaker-containers:Insta

## Amazon SageMaker Lineage Tracking
Amazon SageMaker Lineage Tracking helps you track the origin and movement of data, code, and models throughout your machine learning workflow. It's useful for debugging, auditing, compliance, and understanding how different ML components relate to each other.

🧩 What is Tracked in SageMaker Lineage?
SageMaker lineage tracking captures metadata for:

- Artifacts: Input/output data, models, feature groups, etc.
- Actions: Training jobs, processing jobs, transform jobs, etc.
- Contexts: Grouping of related actions/artifacts (e.g., a training experiment).
- Associations: Links between artifacts and actions (e.g., this dataset was used in this training job).

**To learn more in detail: the following should be reflected under "Experiments and Trials", but I couldn't find that page. The "jobs/training/" page does list all jobs along with their lineage tracking, even though I haven't set that up manually as done below.**

In [26]:
training_job_1_name = xgb_estimator.latest_training_job.job_name
print(training_job_1_name)
training_job_1_info = sagemaker_client.describe_training_job(TrainingJobName=training_job_1_name)
training_job_1_info

sagemaker-xgboost-2025-05-30-09-09-33-714


{'TrainingJobName': 'sagemaker-xgboost-2025-05-30-09-09-33-714',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:205930620783:training-job/sagemaker-xgboost-2025-05-30-09-09-33-714',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/training_jobs/sagemaker-xgboost-2025-05-30-09-09-33-714/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'eta': '0.3',
  'max-depth': '6',
  'nfold': '5',
  'num-boost-round': '100',
  'objective': '"binary:logistic"',
  'sagemaker_container_log_level': '20',
  'sagemaker_job_name': '"sagemaker-xgboost-2025-05-30-09-09-33-714"',
  'sagemaker_program': '"xgboost_model_script.py"',
  'sagemaker_region': '"us-east-1"',
  'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/training_jobs/sagemaker-xgboost-2025-05-30-09-09-33-714/source/sourcedir.tar.gz"',
  'target-var': '"fraud"'},
 'AlgorithmSpe

### Code Artifact

In [18]:
code_s3_uri = training_job_1_info["HyperParameters"]["sagemaker_submit_directory"]
print(code_s3_uri)

matching_artifacts = list(
    sagemaker.lineage.artifact.Artifact.list(
        source_uri=code_s3_uri, sagemaker_session=sagemaker_session
    )
)
print(matching_artifacts)

# Use the existing artifact if it has already been created; otherwise create a new one
if matching_artifacts:
    code_artifact = matching_artifacts[0]
    print(f"Using existing artifact: {code_artifact.artifact_arn}")
else:
    code_artifact = sagemaker.lineage.artifact.Artifact.create(
        artifact_name="TrainingScript",
        source_uri=code_s3_uri,
        artifact_type="Code",
        sagemaker_session=sagemaker_session,
    )
    print(f"Create artifact {code_artifact.artifact_arn}: SUCCESSFUL")


"s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/training_jobs/sagemaker-xgboost-2025-05-30-09-09-33-714/source/sourcedir.tar.gz"
[ArtifactSummary(artifact_arn='arn:aws:sagemaker:us-east-1:205930620783:artifact/2b4d84011d43116551397437a07a1679',artifact_name='TrainingScript',source=ArtifactSource(source_uri='"s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/training_jobs/sagemaker-xgboost-2025-05-30-09-09-33-714/source/sourcedir.tar.gz"',source_types=[]),artifact_type='Code',creation_time=datetime.datetime(2025, 5, 30, 9, 30, 23, 646000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2025, 5, 30, 9, 30, 23, 646000, tzinfo=tzlocal()))]
Using existing artifact: arn:aws:sagemaker:us-east-1:205930620783:artifact/2b4d84011d43116551397437a07a1679
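
Note that the printed URI is wrapped in double quotes: the hyperparameter values returned by `describe_training_job` appear to be JSON-encoded strings, so the artifact above is registered under the quoted `source_uri`. If a plain S3 URI is preferred, one option is to decode the value first, as sketched here. `example_hyperparameters` is an illustrative stand-in that copies one value from the output above.

```python
import json

# Stand-in for training_job_1_info["HyperParameters"]; the value keeps its
# surrounding double quotes exactly as printed in the cell output.
example_hyperparameters = {
    "sagemaker_submit_directory": (
        '"s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/'
        'training_jobs/sagemaker-xgboost-2025-05-30-09-09-33-714/source/sourcedir.tar.gz"'
    ),
}

raw_value = example_hyperparameters["sagemaker_submit_directory"]
plain_uri = json.loads(raw_value)  # decodes the JSON string, dropping the quotes
print(plain_uri)
```

An artifact looked up with the decoded URI will not match one that was registered with the quoted form, so it is worth picking one convention and keeping it.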


### Training Data Artifact

In [19]:
training_data_s3_uri = training_job_1_info["InputDataConfig"][0]["DataSource"]["S3DataSource"][
    "S3Uri"
]

matching_artifacts = list(
    sagemaker.lineage.artifact.Artifact.list(
        source_uri=training_data_s3_uri, sagemaker_session=sagemaker_session
    )
)

if matching_artifacts:
    training_data_artifact = matching_artifacts[0]
    print(f"Using existing artifact: {training_data_artifact.artifact_arn}")
else:
    training_data_artifact = sagemaker.lineage.artifact.Artifact.create(
        artifact_name="TrainingData",
        source_uri=training_data_s3_uri,
        artifact_type="Dataset",
        sagemaker_session=sagemaker_session,
    )
    print(f"Create artifact {training_data_artifact.artifact_arn}: SUCCESSFUL")

Using existing artifact: arn:aws:sagemaker:us-east-1:205930620783:artifact/bcdfe1446f26e7658d562fc414702760


### Model Artifact

In [21]:
trained_model_s3_uri = training_job_1_info["ModelArtifacts"]["S3ModelArtifacts"]

matching_artifacts = list(
    sagemaker.lineage.artifact.Artifact.list(
        source_uri=trained_model_s3_uri, sagemaker_session=sagemaker_session
    )
)

if matching_artifacts:
    model_artifact = matching_artifacts[0]
    print(f"Using existing artifact: {model_artifact.artifact_arn}")
else:
    model_artifact = sagemaker.lineage.artifact.Artifact.create(
        artifact_name="TrainedModel",
        source_uri=trained_model_s3_uri,
        artifact_type="Model",
        sagemaker_session=sagemaker_session,
    )
    print(f"Create artifact {model_artifact.artifact_arn}: SUCCESSFUL")

Using existing artifact: arn:aws:sagemaker:us-east-1:205930620783:artifact/fdc919bc7d76ed906418703c767fa11f
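
The three artifact cells above repeat the same list-then-create pattern, which could be folded into one helper. The sketch here is one way to do that. `get_or_create_artifact` is a hypothetical name; the artifact class is passed in as a parameter (in this notebook it would be `sagemaker.lineage.artifact.Artifact`), and the helper relies only on the `list` and `create` calls already used above.

```python
def get_or_create_artifact(artifact_cls, name, source_uri, artifact_type, sagemaker_session):
    """Return (artifact, created): reuse an artifact with this source_uri or create one."""
    matches = list(
        artifact_cls.list(source_uri=source_uri, sagemaker_session=sagemaker_session)
    )
    if matches:
        # Reuse the first artifact already registered for this URI.
        return matches[0], False
    artifact = artifact_cls.create(
        artifact_name=name,
        source_uri=source_uri,
        artifact_type=artifact_type,
        sagemaker_session=sagemaker_session,
    )
    return artifact, True


# Example call in this notebook (requires AWS credentials):
# model_artifact, created = get_or_create_artifact(
#     sagemaker.lineage.artifact.Artifact, "TrainedModel",
#     trained_model_s3_uri, "Model", sagemaker_session,
# )
```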


## Set artifact associations

In [28]:
trial_component = sagemaker_client.describe_trial_component(
    TrialComponentName=training_job_1_name + "-aws-training-job"
)
trial_component_arn = trial_component["TrialComponentArn"]
print(trial_component_arn)

arn:aws:sagemaker:us-east-1:205930620783:experiment-trial-component/sagemaker-xgboost-2025-05-30-09-09-33-714-aws-training-job


In [29]:
input_artifacts = [code_artifact, training_data_artifact]

for a in input_artifacts:
    try:
        sagemaker.lineage.association.Association.create(
            source_arn=a.artifact_arn,
            destination_arn=trial_component_arn,
            association_type="ContributedTo",
            sagemaker_session=sagemaker_session,
        )
        print(f"Association with {a.artifact_type}: SUCCESSFUL")
    except Exception:
        print(f"Association already exists with {a.artifact_type}")

Association with Code: SUCCESSFUL
Association already exists with DataSet


In [30]:
output_artifacts = [model_artifact]

for a in output_artifacts:
    try:
        sagemaker.lineage.association.Association.create(
            source_arn=a.artifact_arn,
            destination_arn=trial_component_arn,
            association_type="Produced",
            sagemaker_session=sagemaker_session,
        )
        print(f"Association with {a.artifact_type}: SUCCESSFUL")
    except Exception:
        print(f"Association already exists with {a.artifact_type}")

Association with Model: SUCCESSFUL
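
To confirm that the associations were recorded, the lineage graph can be queried. The sketch here uses the boto3 SageMaker client's `list_associations` call filtered by `DestinationArn`. `list_incoming_associations` is a hypothetical helper name, and it reads only the first page of results (pagination via `NextToken` is omitted for brevity).

```python
def list_incoming_associations(sagemaker_client, destination_arn):
    """Return (source ARN, association type) pairs that point at destination_arn."""
    response = sagemaker_client.list_associations(DestinationArn=destination_arn)
    return [
        (summary["SourceArn"], summary.get("AssociationType"))
        for summary in response.get("AssociationSummaries", [])
    ]


# Example call in this notebook (requires AWS credentials):
# for source_arn, kind in list_incoming_associations(sagemaker_client, trial_component_arn):
#     print(kind, source_arn)
```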
