# CMS Open Payments XGBOOST Model setup and deployment

**Project:** AAI-540 Machine Learning Operations - Final Team Project  
**Purpose:** Setup AWS S3 Datalake for CMS Open Payments Data  
**Dataset:** CMS Open Payments Program Year 2024

---

## Table of Contents
1. [Environment Setup](#setup)
2. [AWS Configuration & S3 Bucket Creation](#aws-config)
3. [Download CMS Open Payments Data](#download)
4. [Upload Data to S3](#upload)
5. [Create Athena Database](#athena)
6. [Register Data with Athena](#register)
7. [Convert CSV to Parquet](#parquet)
8. [Query Data with AWS Data Wrangler](#query)
9. [Validation & Verification](#validation)


In [28]:
# retrieve the path variables from Notebook 01
%store -r bucket
%store -r database_name
%store -r table_name_parquet

In [29]:
%pip uninstall -y sagemaker sagemaker-core sagemaker-mlops sagemaker-serve sagemaker-train
%pip install "sagemaker<3" "boto3>=1.17.21"

Found existing installation: sagemaker 2.257.0
Uninstalling sagemaker-2.257.0:
  Successfully uninstalled sagemaker-2.257.0
Found existing installation: sagemaker-core 1.0.75
Uninstalling sagemaker-core-1.0.75:
  Successfully uninstalled sagemaker-core-1.0.75
[0mNote: you may need to restart the kernel to use updated packages.
Collecting sagemaker<3
  Using cached sagemaker-2.257.0-py3-none-any.whl.metadata (17 kB)
Collecting sagemaker-core<2.0.0,>=1.0.71 (from sagemaker<3)
  Using cached sagemaker_core-1.0.75-py3-none-any.whl.metadata (4.9 kB)
Using cached sagemaker-2.257.0-py3-none-any.whl (1.7 MB)
Using cached sagemaker_core-1.0.75-py3-none-any.whl (439 kB)
Installing collected packages: sagemaker-core, sagemaker
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [sagemaker]/2[0m [sagemaker]
[1A[2KSuccessfully installed sagemaker-2.257.0 sagemaker-core-1.0.75
Note: you may need to restart the kernel to use updated packages.


In [30]:
%pip install awswrangler pyathena

Note: you may need to restart the kernel to use updated packages.


In [31]:
import logging
logging.getLogger("sagemaker").setLevel(logging.ERROR)

In [32]:
import os
import boto3
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name

bucket = sess.default_bucket()
prefix = "CMS payments XGBOOST"



In [33]:
print(role)
print(sess)
print(region)

arn:aws:iam::996351798934:role/LabRole
<sagemaker.session.Session object at 0x7f6a7ffac640>
us-east-1


---
## Data sources

> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

> Breast Cancer Wisconsin (Diagnostic) Data Set [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)].

> _Also see:_ Breast Cancer Wisconsin (Diagnostic) Data Set [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data].

## Data preparation


Let's download the data and save it in the local folder with the name data.csv and take a look at it.

In [34]:
import pandas as pd
import numpy as np
import awswrangler as wr
from datetime import datetime
#from sklearn.preprocessing import RobustScaler
#from sklearn.ensemble import IsolationForest

# retrieve the path variables from Notebook 01
%store -r bucket
%store -r database_name
%store -r table_name_parquet

# reload the cleaned dataset from S3
# This ensures 'df' is defined in this notebook's memory
print("Loading processed data from S3...")
df = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {database_name}.{table_name_parquet} LIMIT 100000",
    database=database_name,
)

print(f"Environment ready. Dataframe shape: {df.shape}")

Loading processed data from S3...
Environment ready. Dataframe shape: (100000, 91)


In [35]:
df.head()

Unnamed: 0,change_type,covered_recipient_type,teaching_hospital_ccn,teaching_hospital_id,teaching_hospital_name,covered_recipient_profile_id,covered_recipient_npi,covered_recipient_first_name,covered_recipient_middle_name,covered_recipient_last_name,...,associated_drug_or_biological_ndc_4,associated_device_or_medical_supply_pdi_4,covered_or_noncovered_indicator_5,indicate_drug_or_biological_or_device_or_medical_supply_5,product_category_or_therapeutic_area_5,name_of_drug_or_biological_or_device_or_medical_supply_5,associated_drug_or_biological_ndc_5,associated_device_or_medical_supply_pdi_5,payment_publication_date,program_year
0,NEW,Covered Recipient Physician,,,,11317595.0,1669107000.0,,,,...,,,,,,,,,06/30/2025,2024
1,NEW,Covered Recipient Non-Physician Practitioner,,,,10839254.0,1396997000.0,,,,...,,,,,,,,,,2024
2,NEW,Covered Recipient Physician,,,,525474.0,1699742000.0,,,,...,,,,,,,,,,2024
3,NEW,Covered Recipient Physician,,,,11395953.0,1477277000.0,,,,...,,,,,,,,,06/30/2025,2024
4,NEW,Covered Recipient Non-Physician Practitioner,,,,11250232.0,1740552000.0,,,,...,,,,,,,,,,2024


In [36]:
df.head(3).T

Unnamed: 0,0,1,2
change_type,NEW,NEW,NEW
covered_recipient_type,Covered Recipient Physician,Covered Recipient Non-Physician Practitioner,Covered Recipient Physician
teaching_hospital_ccn,,,
teaching_hospital_id,,,
teaching_hospital_name,,,
...,...,...,...
name_of_drug_or_biological_or_device_or_medical_supply_5,,,
associated_drug_or_biological_ndc_5,,,
associated_device_or_medical_supply_pdi_5,,,
payment_publication_date,06/30/2025,,


In [37]:
#select only feature which has 50% or more non-null values
features = df.columns
print(f"Selected features with at least 50% non-null values: {features}")

Selected features with at least 50% non-null values: Index(['change_type', 'covered_recipient_type', 'teaching_hospital_ccn',
       'teaching_hospital_id', 'teaching_hospital_name',
       'covered_recipient_profile_id', 'covered_recipient_npi',
       'covered_recipient_first_name', 'covered_recipient_middle_name',
       'covered_recipient_last_name', 'covered_recipient_name_suffix',
       'recipient_primary_business_street_address_line1',
       'recipient_primary_business_street_address_line2', 'recipient_city',
       'recipient_state', 'recipient_zip_code', 'recipient_country',
       'recipient_province', 'recipient_postal_code',
       'covered_recipient_primary_type_1', 'covered_recipient_primary_type_2',
       'covered_recipient_primary_type_3', 'covered_recipient_primary_type_4',
       'covered_recipient_primary_type_5', 'covered_recipient_primary_type_6',
       'covered_recipient_specialty_1', 'covered_recipient_specialty_2',
       'covered_recipient_specialty_3', 'co

In [38]:
# restore feature and dataset splits

# turn non-date strings into NaT 
df['date_of_payment'] = pd.to_datetime(df['date_of_payment'], errors='coerce')

# check if we have too many NaTs (indicating a major schema shift)
nan_dates = df['date_of_payment'].isna().sum()
if nan_dates > 0:
    print(f"Warning: {nan_dates} rows had invalid date formats and were set to NaT.")

# fill NaT with a placeholder
df['date_of_payment'] = df['date_of_payment'].ffill().bfill()

df['payment_month'] = df['date_of_payment'].dt.month
df['is_weekend'] = (df['date_of_payment'].dt.dayofweek >= 5).astype(int)

print(f"Success: Features restored. New shape: {df.shape}")

Success: Features restored. New shape: (100000, 93)


In [39]:
top_manu = df["applicable_manufacturer_or_applicable_gpo_making_payment_name"] \
              .value_counts() \
              .nlargest(100) \
              .index

df["manufacturer_clean"] = df[
    "applicable_manufacturer_or_applicable_gpo_making_payment_name"
].where(
    df["applicable_manufacturer_or_applicable_gpo_making_payment_name"].isin(top_manu),
    "OTHER"
)

In [40]:
df["manufacturer_avg_payment"] = df.groupby("manufacturer_clean")[
    "total_amount_of_payment_usdollars"
].transform("mean")

In [41]:
df.head()

Unnamed: 0,change_type,covered_recipient_type,teaching_hospital_ccn,teaching_hospital_id,teaching_hospital_name,covered_recipient_profile_id,covered_recipient_npi,covered_recipient_first_name,covered_recipient_middle_name,covered_recipient_last_name,...,product_category_or_therapeutic_area_5,name_of_drug_or_biological_or_device_or_medical_supply_5,associated_drug_or_biological_ndc_5,associated_device_or_medical_supply_pdi_5,payment_publication_date,program_year,payment_month,is_weekend,manufacturer_clean,manufacturer_avg_payment
0,NEW,Covered Recipient Physician,,,,11317595.0,1669107000.0,,,,...,,,,,06/30/2025,2024,10,0,ZIMVIE INC.,140.629599
1,NEW,Covered Recipient Non-Physician Practitioner,,,,10839254.0,1396997000.0,,,,...,,,,,,2024,10,0,100001351935,
2,NEW,Covered Recipient Physician,,,,525474.0,1699742000.0,,,,...,,,,,,2024,10,0,100000005397,
3,NEW,Covered Recipient Physician,,,,11395953.0,1477277000.0,,,,...,,,,,06/30/2025,2024,8,0,CooperVision Inc.,65.372609
4,NEW,Covered Recipient Non-Physician Practitioner,,,,11250232.0,1740552000.0,,,,...,,,,,,2024,8,0,100000000234,


In [42]:
import numpy as np
import pandas as pd

# ---- Required columns ----
ID_COL = "covered_recipient_profile_id"          # change to covered_recipient_npi if you prefer
DATE_COL = "date_of_payment"
AMT_COL  = "total_amount_of_payment_usdollars"

# ---- Ensure types ----
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df[AMT_COL] = pd.to_numeric(df[AMT_COL], errors="coerce")

# ---- Sort for leakage-safe historical features ----
df = df.sort_values([ID_COL, DATE_COL]).reset_index(drop=True)

# ---- is_new_recipient: 1 if first payment for that recipient ----
df["is_new_recipient"] = (df.groupby(ID_COL).cumcount() == 0).astype(int)

# ---- hist_pay_avg: expanding mean of *prior* payments (shift(1) prevents leakage) ----
df["hist_pay_avg"] = (
    df.groupby(ID_COL)[AMT_COL]
      .expanding()
      .mean()
      .shift(1)
      .reset_index(level=0, drop=True)
)

# Fill first-payment NaNs with global median (robust default)
global_median = df[AMT_COL].median(skipna=True)
df["hist_pay_avg"] = df["hist_pay_avg"].fillna(global_median)

# ---- amt_to_avg_ratio: current amount divided by historical avg ----
df["amt_to_avg_ratio"] = df[AMT_COL] / df["hist_pay_avg"]
df["amt_to_avg_ratio"] = df["amt_to_avg_ratio"].replace([np.inf, -np.inf], np.nan).fillna(1.0)

# Optional: cap extreme ratios to keep things stable
df["amt_to_avg_ratio"] = df["amt_to_avg_ratio"].clip(lower=0, upper=20)

# Quick sanity peek
df[[ID_COL, DATE_COL, AMT_COL, "is_new_recipient", "hist_pay_avg", "amt_to_avg_ratio"]].head(10)

Unnamed: 0,covered_recipient_profile_id,date_of_payment,total_amount_of_payment_usdollars,is_new_recipient,hist_pay_avg,amt_to_avg_ratio
0,24.0,2024-04-09,20.07,1,20.04,1.001497
1,36.0,2024-11-04,125.0,1,20.07,6.228201
2,44.0,2024-04-22,28.26,1,125.0,0.22608
3,48.0,2024-05-29,,1,28.26,1.0
4,48.0,2024-11-14,14.79,0,20.04,0.738024
5,49.0,2024-04-29,9.39,1,14.79,0.634888
6,49.0,2024-06-07,120.86,0,9.39,12.87114
7,69.0,2024-01-11,16.98,1,65.125,0.260729
8,107.0,2024-03-05,7.98,1,16.98,0.469965
9,107.0,2024-10-19,26.05,0,7.98,3.264411


In [43]:
df.head()

Unnamed: 0,change_type,covered_recipient_type,teaching_hospital_ccn,teaching_hospital_id,teaching_hospital_name,covered_recipient_profile_id,covered_recipient_npi,covered_recipient_first_name,covered_recipient_middle_name,covered_recipient_last_name,...,associated_device_or_medical_supply_pdi_5,payment_publication_date,program_year,payment_month,is_weekend,manufacturer_clean,manufacturer_avg_payment,is_new_recipient,hist_pay_avg,amt_to_avg_ratio
0,NEW,Covered Recipient Physician,,,,24.0,1003015000.0,,,,...,,06/30/2025,2024,4,0,Amgen Inc.,85.38559,1,20.04,1.001497
1,NEW,Covered Recipient Physician,,,,36.0,1003037000.0,,,,...,,06/30/2025,2024,11,0,ABBVIE INC.,69.86714,1,20.07,6.228201
2,NEW,Covered Recipient Physician,,,,44.0,1003045000.0,,,,...,,,2024,4,0,OTHER,93085540.0,1,125.0,0.22608
3,NEW,Covered Recipient Physician,,,,48.0,1003049000.0,,,,...,,,2024,5,0,100000000234,,1,28.26,1.0
4,NEW,Covered Recipient Physician,,,,48.0,1003049000.0,,,,...,,06/30/2025,2024,11,0,Merck Sharp & Dohme LLC,84.25986,0,20.04,0.738024


In [46]:
df["physician_ownership_indicator"] = (
    df["physician_ownership_indicator"]
    .map({"Y":1, "N":0})
    .fillna(0)
)

df["third_party_payment_recipient_indicator"] = (
    df["third_party_payment_recipient_indicator"]
    .map({"Y":1, "N":0})
    .fillna(0)
)

df["high_risk"] = (
    (df["amt_to_avg_ratio"] > 4) |
    (df["total_amount_of_payment_usdollars"] > 10000)
).astype(int)

In [53]:
model_features = [

    # Core Financial Signal
    'total_amount_of_payment_usdollars',       

    # Behavioral Features
    'hist_pay_avg',
    'amt_to_avg_ratio',
     

    # Manufacturer Intelligence
    'manufacturer_avg_payment',   

    # Governance / Risk Indicators
    'physician_ownership_indicator',
    'third_party_payment_recipient_indicator',

    # Temporal Behavior
    'payment_month',
    'is_weekend',

    # Novelty / Pattern Break
    'is_new_recipient',
    
    # Target Variable
    'high_risk'
]

In [54]:
df_model = df[model_features]

In [55]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   total_amount_of_payment_usdollars        55244 non-null   float64
 1   hist_pay_avg                             100000 non-null  float64
 2   amt_to_avg_ratio                         100000 non-null  float64
 3   manufacturer_avg_payment                 66516 non-null   float64
 4   physician_ownership_indicator            100000 non-null  float64
 5   third_party_payment_recipient_indicator  100000 non-null  float64
 6   payment_month                            100000 non-null  int32  
 7   is_weekend                               100000 non-null  int64  
 8   is_new_recipient                         100000 non-null  int64  
 9   high_risk                                100000 non-null  int64  
dtypes: float64(6), int32(1), int64(3)

In [57]:
%pip install "sklearn>=0.24.0"

[31mERROR: Ignored the following yanked versions: 0.0.post2[0m[31m
[0m[31mERROR: Could not find a version that satisfies the requirement sklearn>=0.24.0 (from versions: 0.0, 0.0.post1, 0.0.post4, 0.0.post5, 0.0.post7, 0.0.post9, 0.0.post10, 0.0.post11, 0.0.post12)[0m[31m
[0m[31mERROR: No matching distribution found for sklearn>=0.24.0[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


In [58]:
from sklearn.model_selection import train_test_split

X = df_model.drop("high_risk", axis=1)
y = df_model["high_risk"]

# First split train vs temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.30,
    stratify=y,
    random_state=42
)

# Split temp into validation + test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    stratify=y_temp,
    random_state=42
)


#### Key observations:
* The data has 569 observations and 32 columns.
* The first field is the 'id' attribute that we will want to drop before batch inference and add to the final inference output next to the probability of malignancy.
* Second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).
* There are 30 other numeric features that we will use for training and inferencing.

Let's replace the M/B diagnosis with a 1/0 boolean value. 

In [None]:
data["diagnosis"] = data["diagnosis"].apply(lambda x: ((x == "M")) + 0)
data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
102,862965,0,12.18,20.52,77.22,458.7,0.08013,0.04038,0.02383,0.0177,...,13.34,32.84,84.58,547.8,0.1123,0.08862,0.1145,0.07431,0.2694,0.06878
438,909231,0,13.85,19.6,88.68,592.6,0.08684,0.0633,0.01342,0.02293,...,15.63,28.01,100.9,749.1,0.1118,0.1141,0.04753,0.0589,0.2513,0.06911
63,859196,0,9.173,13.86,59.2,260.9,0.07721,0.08751,0.05988,0.0218,...,10.01,19.23,65.59,310.1,0.09836,0.1678,0.1397,0.05087,0.3282,0.0849
341,898143,0,9.606,16.84,61.64,280.5,0.08481,0.09228,0.08422,0.02292,...,10.75,23.07,71.25,353.6,0.1233,0.3416,0.4341,0.0812,0.2982,0.09825
194,87556202,1,14.86,23.21,100.4,671.4,0.1044,0.198,0.1697,0.08878,...,16.08,27.78,118.6,784.7,0.1316,0.4648,0.4589,0.1727,0.3,0.08701
323,895100,1,20.34,21.51,135.9,1264.0,0.117,0.1875,0.2565,0.1504,...,25.3,31.86,171.1,1938.0,0.1592,0.4492,0.5344,0.2685,0.5558,0.1024
346,898678,0,12.06,18.9,76.66,445.3,0.08386,0.05794,0.00751,0.008488,...,13.64,27.06,86.54,562.6,0.1289,0.1352,0.04506,0.05093,0.288,0.08083
117,864729,1,14.87,16.67,98.64,682.5,0.1162,0.1649,0.169,0.08923,...,18.81,27.37,127.1,1095.0,0.1878,0.448,0.4704,0.2027,0.3585,0.1065


Let's split the data as follows: 80% for training, 10% for validation and let's set 10% aside for our batch inference job. In addition, let's drop the 'id' field on the training set and validation set as 'id' is not a training feature. For our batch set however, we keep the 'id' feature. We'll want to filter it out prior to running our inferences so that the input data features match the ones of training set and then ultimately, we'll want to join it with inference result. We are however dropping the diagnosis attribute for the batch set since this is what we'll try to predict.

In [None]:
# data split in three sets, training, validation and batch inference
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
batch_list = rand_split >= 0.9

data_train = data[train_list].drop(["id"], axis=1)
data_val = data[val_list].drop(["id"], axis=1)
data_batch = data[batch_list].drop(["diagnosis"], axis=1)
data_batch_noID = data_batch.drop(["id"], axis=1)

Let's upload those data sets in S3

In [None]:
train_file = "train_data.csv"
data_train.to_csv(train_file, index=False, header=False)
sess.upload_data(train_file, key_prefix="{}/train".format(prefix))

validation_file = "validation_data.csv"
data_val.to_csv(validation_file, index=False, header=False)
sess.upload_data(validation_file, key_prefix="{}/validation".format(prefix))

batch_file = "batch_data.csv"
data_batch.to_csv(batch_file, index=False, header=False)
sess.upload_data(batch_file, key_prefix="{}/batch".format(prefix))

batch_file_noID = "batch_data_noID.csv"
data_batch_noID.to_csv(batch_file_noID, index=False, header=False)
sess.upload_data(batch_file_noID, key_prefix="{}/batch".format(prefix))

's3://sagemaker-us-east-1-996351798934/DEMO-breast-cancer-prediction-xgboost-highlevel/batch/batch_data_noID.csv'

---

## Training job and model creation

The below cell uses the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off the training job using both our training set and validation set. Not that the objective is set to 'binary:logistic' which trains a model to output a probability between 0 and 1 (here the probability of a tumor being malignant).

In [None]:
%%time
from time import gmtime, strftime

job_name = "xgb-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, job_name)
image = sagemaker.image_uris.retrieve(
    framework="xgboost", region=boto3.Session().region_name, version="1.7-1"
)

sm_estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size=50,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sess,
)

sm_estimator.set_hyperparameters(
    objective="binary:logistic",
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    num_round=100,
)

train_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validation".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

# Start training by calling the fit method in the estimator
sm_estimator.fit(inputs=data_channels, job_name=job_name, logs=True)

2026-02-02 21:50:34 Starting - Starting the training job...
2026-02-02 21:50:48 Starting - Preparing the instances for training...
2026-02-02 21:51:35 Downloading - Downloading the training image......
  import pkg_resources
[2026-02-02 21:52:38.074 ip-10-0-185-45.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2026-02-02 21:52:38.144 ip-10-0-185-45.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2026-02-02:21:52:38:INFO] Imported framework sagemaker_xgboost_container.training
[2026-02-02:21:52:38:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2026-02-02:21:52:38:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:21:52:38:INFO] Running XGBoost Sagemaker in algorithm mode
[2026-02-02:21:52:38:INFO] Determined 0 GPU(s) available on the instance.
[2026-02-02:21:52:38:INFO] Determined delimiter of CSV input is ','
[2026-02-02:21:52:38:INFO] Determined delimiter of 

---

## Batch Transform

In SageMaker Batch Transform, we introduced 3 new attributes - __input_filter__, __join_source__ and __output_filter__. In the below cell, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick-off several Batch Transform jobs using different configurations of these 3 new attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.




#### 1. Create a transform job with the default configurations
Let's first skip these 3 new attributes and inspect the inference results. We'll use it as a baseline to compare to the results with data processing.

In [None]:
%%time

sm_transformer = sm_estimator.transformer(1, "ml.m5.xlarge")

# start a transform job
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file_noID
)  # use input data without ID column
sm_transformer.transform(input_location, content_type="text/csv", split_type="Line")
sm_transformer.wait()


.................................
  import pkg_resources
[2026-02-02:21:58:50:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:21:58:50:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:21:58:50:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location / {
      return 404 "{}";
    }
  }
}
[2026-02-02 21:58:50 +0000]

Let's inspect the output of the Batch Transform job in S3. It should show the list probabilities of tumors being malignant.

In [None]:
import re


def get_csv_output_from_s3(s3uri, batch_file):
    file_name = "{}.out".format(batch_file)
    match = re.match("s3://([^/]+)/(.*)", "{}/{}".format(s3uri, file_name))
    output_bucket, output_prefix = match.group(1), match.group(2)
    s3.download_file(output_bucket, output_prefix, file_name)
    return pd.read_csv(file_name, sep=",", header=None)

In [None]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Unnamed: 0,0
0,0.991745
1,0.993318
2,0.748455
3,0.854386
4,0.010544
5,0.990231
6,0.00642
7,0.249007


#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use the __input_filter__ to exclude the ID column easily and there's no need to have a separate file in S3.

* Set __input_filter__ to "$[1:]": indicates that we are excluding column 0 (the 'ID') before processing the inferences and keeping everything from column 1 to the last column (all the features or predictors)  
  
  
* Set __join_source__ to "Input": indicates our desire to join the input data with the inference results  

* Leave __output_filter__ to default ('$'), indicating that the joined input and inference results be will saved as output.

In [None]:
# content_type / accept and split_type / assemble_with are required to use IO joining feature
sm_transformer.assemble_with = "Line"
sm_transformer.accept = "text/csv"

# start a transform job
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file
)  # use input data with ID column cause InputFilter will filter it out
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    input_filter="$[1:]",
    join_source="Input",
)
sm_transformer.wait()

  import pkg_resources
[2026-02-02:22:04:12:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:22:04:12:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:22:04:12:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location / {
      return 404 "{}";
    }
  }
}
[2026-02-02 22:04:12 +0000] [18] [INFO] Starting gunicorn 23.

Let's inspect the output of the Batch Transform job in S3. It should show the list of tumors identified by their original feature columns and their corresponding probabilities of being malignant.

In [None]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file)
output_df.head(8)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,846226,19.17,24.8,132.4,1123.0,0.0974,0.2458,0.2065,0.1118,0.2397,...,29.94,151.7,1332.0,0.1037,0.3903,0.3639,0.1767,0.3176,0.1023,0.991745
1,84862001,16.13,20.68,108.1,798.8,0.117,0.2022,0.1722,0.1028,0.2164,...,31.48,136.8,1315.0,0.1789,0.4233,0.4784,0.2073,0.3706,0.1142,0.993318
2,855138,13.48,20.82,88.4,559.2,0.1016,0.1255,0.1063,0.05439,0.172,...,26.02,107.3,740.4,0.161,0.4225,0.503,0.2258,0.2807,0.1071,0.748455
3,85715,13.17,18.66,85.98,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,...,27.95,102.8,759.4,0.1786,0.4166,0.5006,0.2088,0.39,0.1179,0.854386
4,857810,13.05,19.31,82.61,527.2,0.0806,0.03789,0.000692,0.004167,0.1819,...,22.25,90.24,624.1,0.1021,0.06191,0.001845,0.01111,0.2439,0.06289,0.010544
5,858986,14.25,22.15,96.42,645.7,0.1049,0.2008,0.2135,0.08653,0.1949,...,29.51,119.1,959.5,0.164,0.6247,0.6922,0.1785,0.2844,0.1132,0.990231
6,859465,11.31,19.04,71.8,394.1,0.08139,0.04701,0.03709,0.0223,0.1516,...,23.84,78.0,466.7,0.129,0.09148,0.1444,0.06961,0.24,0.06641,0.00642
7,861648,14.62,24.02,94.57,662.7,0.08974,0.08606,0.03102,0.02957,0.1685,...,29.11,102.9,803.7,0.1115,0.1766,0.09189,0.06946,0.2522,0.07246,0.249007


#### 3. Update the output filter to keep only ID and prediction results
Let's change __output_filter__ to "$[0,-1]", indicating that when presenting the output, we only want to keep column 0 (the 'ID') and the last column (the inference result i.e. the probability of a given tumor to be malignant)

In [None]:
# start another transform job
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    input_filter="$[1:]",
    join_source="Input",
    output_filter="$[0,-1]",
)
sm_transformer.wait()

..............................
  import pkg_resources
[2026-02-02:22:09:53:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:22:09:53:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:22:09:53:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location / {
      return 404 "{}";
    }
  }
}
  import pkg_resources
[2026-0

Now, let's inspect the output of the Batch Transform job in S3 again. It should show 2 columns: the ID and their corresponding probabilities of being malignant.

In [None]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file)
output_df.head(8)

Unnamed: 0,0,1
0,846226,0.991745
1,84862001,0.993318
2,855138,0.748455
3,85715,0.854386
4,857810,0.010544
5,858986,0.990231
6,859465,0.00642
7,861648,0.249007


create_model(role=role, image_uri=XGBOOST_IMAGE)In summary, we can use newly introduced 3 attributes - __input_filter__, __join_source__, __output_filter__ to 
1. Filter / select useful features from the input dataset. e.g. exclude ID columns.
2. Associate the prediction results with their corresponding input records.
3. Filter the original or joined results before saving to S3. e.g. keep ID and probability columns only.

## Upload the Sagemaker Model created during our training job to the Sagemaker Model Registry

In [None]:
sagemaker = boto3.client("sagemaker")

model_name = job_name
print(model_name)


info = sagemaker.describe_training_job(TrainingJobName=model_name)
model_data = info["ModelArtifacts"]["S3ModelArtifacts"]

primary_container = {"Image": image, "ModelDataUrl": model_data}

# Save our model to the Sagemaker Model Registry
create_model_response = sagemaker.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print(create_model_response["ModelArn"])

xgb-2026-02-02-21-50-32
arn:aws:sagemaker:us-east-1:996351798934:model/xgb-2026-02-02-21-50-32


In [None]:
# Inspect Training Job Details
info

{'TrainingJobName': 'xgb-2026-02-02-21-50-32',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:996351798934:training-job/xgb-2026-02-02-21-50-32',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-996351798934/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-02-21-50-32/xgb-2026-02-02-21-50-32/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'eta': '0.2',
  'gamma': '4',
  'max_depth': '5',
  'min_child_weight': '6',
  'num_round': '100',
  'objective': 'binary:logistic',
  'subsample': '0.8',
  'verbosity': '0'},
 'AlgorithmSpecification': {'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:mae',
    'Regex': '.*\\[[0-9]+\\].*#011train-mae:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:aucpr',
    'Regex': '.*\\[[0-9]+\\].*#011validation-aucpr:([-+]?[0-9]*\

In [None]:
# Create Endpoint Configuration


# Create an endpoint config name. Here we create one based on the date  
# so it we can search endpoints based on creation time.
endpoint_config_name = 'lab4-1-endpoint-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())                            
                            
instance_type = 'ml.m5.xlarge'

endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name, # You will specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": model_name, 
            "InstanceType": instance_type, # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ]
)

print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")


Created EndpointConfig: arn:aws:sagemaker:us-east-1:996351798934:endpoint-config/lab4-1-endpoint-config2026-02-02-22-10-33


In [None]:
# Deploy our model to real-time endpoint

endpoint_name = 'lab4-1-endpoint' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())                            


create_endpoint_response = sagemaker.create_endpoint(
                                            EndpointName=endpoint_name, 
                                            EndpointConfigName=endpoint_config_name) 

In [None]:
# Wait for endpoint to spin up
from time import sleep
sagemaker.describe_endpoint(EndpointName=endpoint_name)

while True:
    print("Getting Job Status")
    res = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    state = res["EndpointStatus"]
    
    if state == "InService":
        print("Endpoint in Service")
        break
    elif state == "Creating":
        print("Endpoint still creating...")
        sleep(60)
    else:
        print("Endpoint Creation Error - Check Sagemaker Console")
        break

Getting Job Status
Endpoint still creating...
Getting Job Status
Endpoint in Service


In [None]:
# Invoke Endpoint

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region)

response = sagemaker_runtime.invoke_endpoint(
                            EndpointName=endpoint_name,
                            ContentType='text/csv',
                            Body=data_batch_noID.to_csv(header=None, index=False).strip('\n').split('\n')[0]
                            )
print(response['Body'].read().decode('utf-8'))

0.9917450547218323



In [None]:
# Examine Response Body

response

{'ResponseMetadata': {'RequestId': '589c67d8-fd38-45df-adf3-8065df9425b4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '589c67d8-fd38-45df-adf3-8065df9425b4',
   'x-amzn-invoked-production-variant': 'variant1',
   'date': 'Mon, 02 Feb 2026 22:14:22 GMT',
   'content-type': 'text/csv; charset=utf-8',
   'content-length': '19',
   'connection': 'keep-alive'},
  'RetryAttempts': 0},
 'ContentType': 'text/csv; charset=utf-8',
 'InvokedProductionVariant': 'variant1',
 'Body': <botocore.response.StreamingBody at 0x7f5e19cc4be0>}

In [None]:
# Delete Endpoint

sagemaker.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '3e679ba7-f814-491c-af11-fe2f06330849',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3e679ba7-f814-491c-af11-fe2f06330849',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-options': 'DENY',
   'content-security-policy': "frame-ancestors 'none'",
   'cache-control': 'no-cache, no-store, must-revalidate',
   'x-content-type-options': 'nosniff',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 02 Feb 2026 22:16:22 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

## Part 1: Set Up Model Group

In [None]:
sm_client = boto3.client('sagemaker', region_name=region)
import time
model_package_group_name = "breast-cancer-prediction-" + str(round(time.time()))
model_package_group_input_dict = {
 "ModelPackageGroupName" : model_package_group_name,
 "ModelPackageGroupDescription" : "breast-cancer-prediction model group"
}

create_model_package_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)
print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))

ModelPackageGroup Arn : arn:aws:sagemaker:us-east-1:996351798934:model-package-group/breast-cancer-prediction-1770070737


In [None]:
sm_client.list_model_packages(ModelPackageGroupName=model_package_group_name)

{'ModelPackageSummaryList': [],
 'ResponseMetadata': {'RequestId': 'a571dadf-0b5b-4b74-8ff5-895620bf2cba',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a571dadf-0b5b-4b74-8ff5-895620bf2cba',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-options': 'DENY',
   'content-security-policy': "frame-ancestors 'none'",
   'cache-control': 'no-cache, no-store, must-revalidate',
   'x-content-type-options': 'nosniff',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '30',
   'date': 'Mon, 02 Feb 2026 22:25:24 GMT'},
  'RetryAttempts': 0}}

In [None]:
response = sm_client.describe_model_package_group(
    ModelPackageGroupName=model_package_group_name
)
response

{'ModelPackageGroupName': 'breast-cancer-prediction-1770070737',
 'ModelPackageGroupArn': 'arn:aws:sagemaker:us-east-1:996351798934:model-package-group/breast-cancer-prediction-1770070737',
 'ModelPackageGroupDescription': 'breast-cancer-prediction model group',
 'CreationTime': datetime.datetime(2026, 2, 2, 22, 18, 56, 939000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:996351798934:user-profile/d-yz44u9kobd5z/default-1768850168304',
  'UserProfileName': 'default-1768850168304',
  'DomainId': 'd-yz44u9kobd5z',
  'IamIdentity': {'Arn': 'arn:aws:sts::996351798934:assumed-role/LabRole/SageMaker',
   'PrincipalId': 'AROA6P6ZR52LH3BHX4XRM:SageMaker'}},
 'ModelPackageGroupStatus': 'Completed',
 'ResponseMetadata': {'RequestId': '9b84abee-51bf-4e7f-93f4-12d4e751c2ce',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9b84abee-51bf-4e7f-93f4-12d4e751c2ce',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-opti

## Part 2: Set Up Model Package

In [None]:
from sagemaker.image_uris import retrieve

image_uri = retrieve(
    framework="xgboost",
    region="us-east-1",
    version="1.7-1",
    instance_type="ml.m5.large"
)

print(image_uri)

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1


In [None]:
import boto3

s3 = boto3.client("s3")

s3.head_object(
    Bucket="sagemaker-us-east-1-996351798934",
    Key="DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-02-21-50-32/xgb-2026-02-02-21-50-32/output/model.tar.gz"
)

print("✅ Model exists")

✅ Model exists


In [None]:
# Specify the model source
model_url = "s3://sagemaker-us-east-1-996351798934/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-02-21-50-32/xgb-2026-02-02-21-50-32/output/model.tar.gz"

modelpackage_inference_specification = {
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_url
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],

        # ⭐ ADD THIS
        "SupportedRealtimeInferenceInstanceTypes": ["ml.m5.large"],
    }
}

create_model_package_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageDescription": "Model to predict Breast Cancer",
    "ModelApprovalStatus": "PendingManualApproval"
}

create_model_package_input_dict.update(modelpackage_inference_specification)

In [None]:
create_model_package_response = sm_client.create_model_package(**create_model_package_input_dict)
model_package_arn = create_model_package_response["ModelPackageArn"]
print('ModelPackage Version ARN : {}'.format(model_package_arn))

ModelPackage Version ARN : arn:aws:sagemaker:us-east-1:996351798934:model-package/breast-cancer-prediction-1770070737/1


In [None]:
import boto3
sm = boto3.client("sagemaker")

model_package_arn = "arn:aws:sagemaker:us-east-1:996351798934:model-package/breast-cancer-prediction-1770070737/1"

sm.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved"
)

print("✅ Approved model package")

In [None]:
import boto3

client = boto3.client("sagemaker")

response = client.describe_model_package(
    ModelPackageName="arn:aws:sagemaker:us-east-1:996351798934:model-package/breast-cancer-prediction-1770070737/1"
)

print(response)

{'ModelPackageGroupName': 'breast-cancer-prediction-1770070737', 'ModelPackageVersion': 1, 'ModelPackageRegistrationType': 'Registered', 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:996351798934:model-package/breast-cancer-prediction-1770070737/1', 'ModelPackageDescription': 'Model to predict Breast Cancer', 'CreationTime': datetime.datetime(2026, 2, 2, 22, 50, 41, 244000, tzinfo=tzlocal()), 'InferenceSpecification': {'Containers': [{'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1', 'ImageDigest': 'sha256:b4f13edb198529c460692015797fa1ca6a8ff1ed64a149297174d922121b8fc4', 'ModelDataUrl': 's3://sagemaker-us-east-1-996351798934/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-02-21-50-32/xgb-2026-02-02-21-50-32/output/model.tar.gz', 'ModelDataETag': '09379bb66633ef9350e8d2286fc9ce3e', 'IsCheckpoint': False}], 'SupportedRealtimeInferenceInstanceTypes': ['ml.m5.large'], 'SupportedContentTypes': ['text/csv'], 'SupportedResponseMIMETypes': [

In [None]:
response

{'ModelPackageGroupName': 'breast-cancer-prediction-1770070737',
 'ModelPackageVersion': 1,
 'ModelPackageRegistrationType': 'Registered',
 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:996351798934:model-package/breast-cancer-prediction-1770070737/1',
 'ModelPackageDescription': 'Model to predict Breast Cancer',
 'CreationTime': datetime.datetime(2026, 2, 2, 22, 50, 41, 244000, tzinfo=tzlocal()),
 'InferenceSpecification': {'Containers': [{'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
    'ImageDigest': 'sha256:b4f13edb198529c460692015797fa1ca6a8ff1ed64a149297174d922121b8fc4',
    'ModelDataUrl': 's3://sagemaker-us-east-1-996351798934/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-02-21-50-32/xgb-2026-02-02-21-50-32/output/model.tar.gz',
    'ModelDataETag': '09379bb66633ef9350e8d2286fc9ce3e',
    'IsCheckpoint': False}],
  'SupportedRealtimeInferenceInstanceTypes': ['ml.m5.large'],
  'SupportedContentTypes': ['text/csv'],
  'Su

## Part 3: Write the Model Card

In [None]:
import boto3
import json

client = boto3.client("sagemaker")

model_card_content = {
    "model_overview": {
        "model_description": "Predicts whether a tumor is malignant or benign using XGBoost.",
        "algorithm_type": "XGBoost",
        "problem_type": "BinaryClassification",
    },
    "intended_uses": {
        "purpose_of_model": "Educational / lab use for ML prediction.",
        "intended_uses": (
            "Intended for demonstrating binary classification with tabular data. "
            "Not intended for real medical diagnosis or clinical decision-making."
        ),
        "factors_affecting_model_efficiency": (
            "Performance depends on the input feature distribution matching training data; "
            "data quality and preprocessing consistency are critical."
        ),
        "risk_rating": "Low",
        "explanations_for_risk_rating": (
            "Low risk because it is used for coursework/demo only and not for real clinical use."
        ),
    },
}

response = client.create_model_card(
    ModelCardName="breast-cancer-model-card",
    Content=json.dumps(model_card_content),
    ModelCardStatus="Draft",
    Tags=[
        {"Key": "project", "Value": "breast-cancer"},
        {"Key": "team", "Value": "ml-lab"},
    ],
)

print("✅ Model Card Created")
print(response["ModelCardArn"])

✅ Model Card Created
arn:aws:sagemaker:us-east-1:996351798934:model-card/breast-cancer-model-card


In [None]:
import boto3
sm = boto3.client("sagemaker")

mp_arn = "arn:aws:sagemaker:us-east-1:996351798934:model-package/breast-cancer-prediction-1770070737/1"
resp = sm.describe_model_package(ModelPackageName=mp_arn)

print("ModelPackageStatus:", resp["ModelPackageStatus"])
print("ModelApprovalStatus:", resp["ModelApprovalStatus"])

ModelPackageStatus: Completed
ModelApprovalStatus: PendingManualApproval


In [None]:
import boto3

client = boto3.client("sagemaker")

response = client.describe_model_card(
    ModelCardName="breast-cancer-model-card"
)

response

{'ModelCardArn': 'arn:aws:sagemaker:us-east-1:996351798934:model-card/breast-cancer-model-card',
 'ModelCardName': 'breast-cancer-model-card',
 'ModelCardVersion': 1,
 'Content': '{"model_overview": {"model_description": "Predicts whether a tumor is malignant or benign using XGBoost.", "algorithm_type": "XGBoost", "problem_type": "BinaryClassification"}, "intended_uses": {"purpose_of_model": "Educational / lab use for ML prediction.", "intended_uses": "Intended for demonstrating binary classification with tabular data. Not intended for real medical diagnosis or clinical decision-making.", "factors_affecting_model_efficiency": "Performance depends on the input feature distribution matching training data; data quality and preprocessing consistency are critical.", "risk_rating": "Low", "explanations_for_risk_rating": "Low risk because it is used for coursework/demo only and not for real clinical use."}}',
 'ModelCardStatus': 'Draft',
 'CreationTime': datetime.datetime(2026, 2, 2, 22, 58, 