# Sample of SageMaker-byoc-tf2-byos-deepctr-deepfm-on-notebookInstance

NOTE: Run this notebook on a SageMaker notebook instance, not in SageMaker Studio.

Core requirements:
1. TensorFlow 2.8 with Python 3.8
2. BYOC (bring your own container)
3. SageMaker with DeepCTR DeepFM
4. BYOS (bring your own script) stored in S3, for use from Airflow
5. FastFile mode to access dataset files in S3
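
As a rough illustration of item 5: with FastFile mode the objects under the channel's S3 prefix appear as ordinary files under `/opt/ml/input/data/<channel>`, and the training toolkit exposes that directory through the `SM_CHANNEL_<NAME>` environment variable. The sketch below resolves the directory and lists the files in it. The helper names `channel_dir` and `list_channel_files` are hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def channel_dir(channel: str = "train") -> Path:
    """Return the local directory for a SageMaker input channel.

    Falls back to the conventional /opt/ml/input/data/<channel> path when
    the SM_CHANNEL_<NAME> environment variable is not set.
    """
    env_name = f"SM_CHANNEL_{channel.upper()}"
    return Path(os.environ.get(env_name, f"/opt/ml/input/data/{channel}"))


def list_channel_files(channel: str = "train") -> list:
    """List regular files directly inside the channel directory, sorted by name."""
    directory = channel_dir(channel)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    print(channel_dir("train"))
    print(list_channel_files("train"))
```

Outside a training container the directory usually does not exist, in which case `list_channel_files` simply returns an empty list.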

Steps:
1. Generate Dockerfile
2. Generate requirements.txt
3. Generate the training script (train_fastfile.py)
4. Build the Docker image
5. SageMaker settings
6. Tar the training scripts and requirements.txt, then upload to S3
7. Test the image locally
8. Upload the container image to Amazon ECR
9. Start the SageMaker training job
10. [TBD] Build a SageMaker pipeline

## 1. Generate Dockerfile

In [1]:
%%writefile Dockerfile
FROM tensorflow/tensorflow:2.8.4

# Install sagemaker-training toolkit that contains the common functionality necessary to create a container compatible with SageMaker and the Python SDK.
RUN /usr/bin/python3 -m pip install --upgrade pip
RUN pip3 install sagemaker-training && pip3 install scikit-learn && pip3 install pandas
#RUN pip3 install -U scikit-learn

# Copies the training code inside the container
#COPY train.py /opt/ml/code/train.py

# Defines train.py as script entrypoint
#ENV SAGEMAKER_PROGRAM train.py
WORKDIR /opt/ml/code

Overwriting Dockerfile


## 2. Generate requirements.txt

In [2]:
%%writefile requirements.txt
deepctr

Overwriting requirements.txt


## 3. Generate the training script (train_fastfile.py)

In [3]:
%%writefile train_fastfile.py
import pandas as pd
import tensorflow as tf
import os
import argparse
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, DenseFeat, get_feature_names

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

#     parser.add_argument("--learning-rate", type=float, default=0.01)
#     parser.add_argument("--batch-size", type=int, default=128)
#     parser.add_argument("--batch-norm", type=bool, default=False)
#     parser.add_argument("--dnn-hidden-units", type=str, default="128,64,32")
#     parser.add_argument("--dropout-rate", type=float, default=0.0)

#     parser.add_argument("--checkpoint", type=str, default=None)
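    # Fall back to the standard SageMaker paths when the SM_* variables are not set
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))

    args, _ = parser.parse_known_args()

    # Read the dataset from the train channel directory instead of a hard-coded path
    data = pd.read_csv(os.path.join(args.train, 'criteo_sample.txt'))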

    sparse_features = ['C' + str(i) for i in range(1, 27)]
    dense_features = ['I' + str(i) for i in range(1, 14)]

    data[sparse_features] = data[sparse_features].fillna('-1', )
    data[dense_features] = data[dense_features].fillna(0, )
    target = ['label']

    # 1. Label-encode the sparse features and min-max scale the dense features
    for feat in sparse_features:
        lbe = LabelEncoder()
        data[feat] = lbe.fit_transform(data[feat])
    mms = MinMaxScaler(feature_range=(0, 1))
    data[dense_features] = mms.fit_transform(data[dense_features])

    # 2. Count the unique values of each sparse field and record the dense feature field names

    fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=data[feat].max() + 1, embedding_dim=4)
                              for feat in sparse_features] + [DenseFeat(feat, 1)
                                                              for feat in dense_features]

    dnn_feature_columns = fixlen_feature_columns
    linear_feature_columns = fixlen_feature_columns

    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

    # 3.generate input data for model

    train, test = train_test_split(data, test_size=0.2, random_state=2020)
    train_model_input = {name: train[name] for name in feature_names}
    test_model_input = {name: test[name] for name in feature_names}

    # 4.Define Model,train,predict and evaluate
    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
    model.compile("adam", "binary_crossentropy",
                  metrics=['binary_crossentropy'], )

    history = model.fit(train_model_input, train[target].values,
                        batch_size=256, epochs=10, verbose=2, validation_split=0.2, )
    model.summary()
    model.save('/opt/ml/model/deepctr-deepfm')
    pred_ans = model.predict(test_model_input, batch_size=256)
    print("test LogLoss", round(log_loss(test[target].values, pred_ans), 4))
    print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))


Overwriting train_fastfile.py
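
The two preprocessing steps in `train_fastfile.py` can be illustrated without any dependencies. The pure-Python sketch below mimics what `LabelEncoder` and `MinMaxScaler` compute on a toy column: sorted distinct values are mapped to 0..n-1, and numbers are rescaled to [0, 1]. `label_encode` and `min_max_scale` are hypothetical helpers, not scikit-learn or DeepCTR APIs. It also shows why `vocabulary_size` is set to `max + 1`: the encoded ids start at 0.

```python
def label_encode(values):
    """Map each distinct value to an integer id in sorted order, like LabelEncoder."""
    classes = sorted(set(values))
    index = {value: i for i, value in enumerate(classes)}
    return [index[v] for v in values], classes


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


if __name__ == "__main__":
    ids, classes = label_encode(["b", "a", "-1", "b"])
    vocabulary_size = max(ids) + 1  # ids are 0-based, so the size is max + 1
    print(ids, classes, vocabulary_size)
    print(min_max_scale([0, 5, 10]))
```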


## 4. Build docker image

In [4]:
%%sh
algorithm_name='byoc1'
docker build -t ${algorithm_name} .

Sending build context to Docker daemon  286.7kB
Step 1/4 : FROM tensorflow/tensorflow:2.8.4
 ---> a5a47af37160
Step 2/4 : RUN /usr/bin/python3 -m pip install --upgrade pip
 ---> Using cache
 ---> c466f0a5f1d3
Step 3/4 : RUN pip3 install sagemaker-training && pip3 install scikit-learn && pip3 install pandas
 ---> Using cache
 ---> 220c83868125
Step 4/4 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> 3b3f55984bef
Successfully built 3b3f55984bef
Successfully tagged byoc1:latest


## 5. SageMaker settings

In [5]:
%%time
#! python3 -m pip install --upgrade sagemaker
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = "byoc1"

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
role_name = role.split("/")[-1]  # last path segment of the role ARN
print(f"SageMaker Execution Role: {role}")
print(f"The name of the Execution role: {role_name}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

SageMaker Execution Role: arn:aws:iam::432088571089:role/AmazonSageMaker-ExecutionRole-20210324T123126
The name of the Execution role: AmazonSageMaker-ExecutionRole-20210324T123126
AWS account: 432088571089
AWS region: us-east-1
CPU times: user 1.13 s, sys: 111 ms, total: 1.24 s
Wall time: 1.46 s
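
The role-name extraction above relies on the shape of an IAM role ARN, `arn:aws:iam::<account-id>:role/<role-name>`. A small stand-alone sketch of that parsing follows. `parse_role_arn` is a hypothetical helper, and the ARN used with it below is a placeholder, not a real role.

```python
def parse_role_arn(arn: str) -> dict:
    """Split an IAM role ARN into partition, account id, and role name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"not an IAM ARN: {arn!r}")
    resource = parts[5]
    if not resource.startswith("role/"):
        raise ValueError(f"not a role ARN: {arn!r}")
    # The role name is the last path segment; an optional path may precede it.
    role_name = resource.rsplit("/", 1)[-1]
    return {"partition": parts[1], "account": parts[4], "role_name": role_name}


if __name__ == "__main__":
    print(parse_role_arn("arn:aws:iam::123456789012:role/service-role/ExampleRole"))
```

The account id recovered this way should match the value returned by `get_caller_identity()` when the role belongs to the calling account.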


## 6. Tar the training scripts and requirements.txt and upload to S3

In [6]:
!tar zcvf train.tar.gz train_fastfile.py train_tf_data.py requirements.txt
source_dir_s3=sagemaker_session.upload_data(path='train.tar.gz', bucket=bucket, key_prefix=prefix)
print(source_dir_s3)

train_fastfile.py
train_tf_data.py
requirements.txt
s3://sagemaker-us-east-1-432088571089/byoc1/train.tar.gz
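
The same archive can be produced from Python with the standard-library `tarfile` module, which also makes it easy to check that every file sits at the archive root. That layout is assumed here because the entry point is looked up relative to the extracted `source_dir`. `make_source_archive` is a hypothetical helper, not part of the SageMaker SDK.

```python
import os
import tarfile


def make_source_archive(archive_path, files):
    """Write the given files into a gzip-compressed tar, flattened to the archive root.

    Returns the sorted member names so the layout can be checked before upload.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            # arcname drops any leading directories so each file lands at the root
            tar.add(path, arcname=os.path.basename(path))
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Usage (assumes the files exist in the working directory):
#   make_source_archive("train.tar.gz", ["train_fastfile.py", "requirements.txt"])
```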


## 7. Test the image locally

The dataset file can be downloaded from https://github.com/shenweichen/DeepCTR/blob/master/examples/criteo_sample.txt. Upload it to the `datasets/deepctr/` prefix of your S3 bucket, which is the location the cell below reads from.

In [7]:
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
import time

algorithm_name='byoc1'
dataset_path="datasets/deepctr/"

train_channel=TrainingInput(
        s3_data=f's3://{bucket}/{dataset_path}',
        input_mode='FastFile'  # Available options: File | Pipe | FastFile
    )

estimator = Estimator(image_uri=f'{algorithm_name}:latest',
                      role=role,
                      entry_point='train_fastfile.py',
                      source_dir=source_dir_s3,#'.',
                      instance_count=1,
                      instance_type='local')

estimator.fit(inputs={"train":train_channel},
              job_name="hstong-"+time.strftime("%Y%m%d%H%M%S", time.localtime()))

INFO:sagemaker:Creating training-job with name: hstong-20230116111407
INFO:sagemaker.local.local_session:Starting training job
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-fj7sc:
    command: train
    container_name: zvxjfcjair-algo-1-fj7sc
    environment:
    - '[Masked]'
    - '[Masked]'
    image: byoc1:latest
    networks:
      sagemaker-local:
        aliases:
        - algo-1-fj7sc
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpmwymrw4a/algo-1-fj7sc/input:/opt/ml/input
    - /tmp/tmpmwymrw4a/algo-1-fj7sc/output:/opt/ml/output
    - /tmp/tmpmwymrw4a/algo-1-fj7sc/output/data:/opt/ml/output/data
    - /tmp/tmpmwymrw4a/model:/opt/ml/model
    - /opt/ml/metadata:/opt/ml/metadata


Creating zvxjfcjair-algo-1-fj7sc ... 
Creating zvxjfcjair-algo-1-fj7sc ... done
Attaching to zvxjfcjair-algo-1-fj7sc
zvxjfcjair-algo-1-fj7sc | 2023-01-16 11:14:09,642 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
zvxjfcjair-algo-1-fj7sc | 2023-01-16 11:14:09,791 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
zvxjfcjair-algo-1-fj7sc | /usr/bin/python3 -m pip install -r requirements.txt
zvxjfcjair-algo-1-fj7sc | Collecting deepctr
zvxjfcjair-algo-1-fj7sc |   Downloading deepctr-0.9.3-py3-none-any.whl (141 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 141.2/141.2 kB 17.3 MB/s eta 0:00:00
zvxjfcjair-algo-1-fj7sc | Collecting h5py==2.10.0
zvxjfcjair-algo-1-fj7sc |   Downloading h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━

zvxjfcjair-algo-1-fj7sc | 2023-01-16 11:14:16.124914: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
zvxjfcjair-algo-1-fj7sc | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
zvxjfcjair-algo-1-fj7sc |   return t[start:end]
zvxjfcjair-algo-1-fj7sc | Epoch 1/10
zvxjfcjair-algo-1-fj7sc | 1/1 - 5s - loss: 0.7288 - binary_crossentropy: 0.7288 - val_loss: 0.7394 - val_binary_crossentropy: 0.7394
zvxjfcjair-algo-1-fj7sc | Epoch 2/10
zvxjfcjair-algo-1-fj7sc | 1/1 - 0s - loss: 0.7055 - binary_crossentropy: 0.7055 - val_loss: 0.7250 - val_binary_crossentropy: 0.7250
zvxjfcjair-algo-1-fj7sc | Epoch 3/10
zvxjfcjair-algo-1-fj7sc | 1/1 - 0s - loss: 0.6848 - binary_crossentropy: 0.6848 - val_loss: 

zvxjfcjair-algo-1-fj7sc | 2023-01-16 11:14:28.223497: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
zvxjfcjair-algo-1-fj7sc | test LogLoss 0.5757
zvxjfcjair-algo-1-fj7sc | test AUC 0.5556
zvxjfcjair-algo-1-fj7sc | 2023-01-16 11:14:36,047 sagemaker-training-toolkit INFO     Reporting training SUCCESS
zvxjfcjair-algo-1-fj7sc exited with code 0
Aborting on container exit...
===== Job Complete =====
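
Training job names must be unique within an account and region, which is why the cell above appends a timestamp to the prefix. A minimal sketch of that naming scheme follows. `make_job_name` is a hypothetical helper, and the 63-character limit and allowed character set are assumed from the `CreateTrainingJob` API constraints, so check the API reference before relying on them.

```python
import re
import time

# Assumed constraint: alphanumerics and hyphens, at most 63 characters,
# starting and ending with an alphanumeric character.
_JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix, now=None):
    """Return '<prefix>-YYYYmmddHHMMSS' and validate it against the assumed pattern."""
    stamp = time.strftime("%Y%m%d%H%M%S", now if now is not None else time.localtime())
    name = f"{prefix}-{stamp}"
    if not _JOB_NAME_RE.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


if __name__ == "__main__":
    print(make_job_name("hstong"))
```

Validating the name locally surfaces problems such as underscores in the prefix before the job request is sent.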


## 8. Upload container image to Amazon ECR

In [8]:
%%sh

# Specify an algorithm name
algorithm_name='byoc1'

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly

aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}



Login Succeeded
Sending build context to Docker daemon  286.7kB
Step 1/4 : FROM tensorflow/tensorflow:2.8.4
 ---> a5a47af37160
Step 2/4 : RUN /usr/bin/python3 -m pip install --upgrade pip
 ---> Using cache
 ---> c466f0a5f1d3
Step 3/4 : RUN pip3 install sagemaker-training && pip3 install scikit-learn && pip3 install pandas
 ---> Using cache
 ---> 220c83868125
Step 4/4 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> 3b3f55984bef
Successfully built 3b3f55984bef
Successfully tagged byoc1:latest
The push refers to repository [432088571089.dkr.ecr.us-east-1.amazonaws.com/byoc1]
58697834ab2f: Preparing
3e1368f8b4b9: Preparing
43868a7864ec: Preparing
27cf45bd0fda: Preparing
a9bf518b515d: Preparing
fea392c39757: Preparing
956a9add2009: Preparing
554867544514: Preparing
fb960b14dacd: Preparing
03cb4b1e2dc9: Preparing
f4462d5b2da2: Preparing
554867544514: Waiting
fb960b14dacd: Waiting
03cb4b1e2dc9: Waiting
f4462d5b2da2: Waiting
fea392c39757: Waiting
3e1368f8b4b9: Layer already exists
58697834ab2

https://docs.docker.com/engine/reference/commandline/login/#credentials-store
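
The create-if-missing step in the shell cell can also be written against the ECR API. The sketch below takes an already constructed client (for example `boto3.client("ecr")`) so that it can be exercised without AWS access. `ensure_repository` is a hypothetical helper and not part of this notebook.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository if it does not exist; return True when it was created."""
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True


# Usage (assumes AWS credentials and the boto3 package are available):
#   import boto3
#   created = ensure_repository(boto3.client("ecr"), "byoc1")
```

Catching only `RepositoryNotFoundException` keeps permission or throttling errors visible, whereas the shell version treats any non-zero exit code as "repository missing".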



## 9. Start SageMaker training job

In [9]:
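import time
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
# URI of the image pushed in step 8; `account` and `region` come from step 5
# and `algorithm_name` from step 7. Adjust it if the image lives elsewhere.
byoc_image_uri = f'{account}.dkr.ecr.{region}.amazonaws.com/{algorithm_name}:latest'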
train_channel=TrainingInput(
        s3_data=f's3://{bucket}/{dataset_path}',
        input_mode='FastFile'  # Available options: File | Pipe | FastFile
    )

estimator = Estimator(image_uri=byoc_image_uri,
                      role=role,
                      entry_point='train_fastfile.py',
                      source_dir=source_dir_s3,#'.',
                      instance_count=1,
                      instance_type='ml.c5.xlarge')

estimator.fit(inputs={"train":train_channel},
              job_name="hstong-"+time.strftime("%Y%m%d%H%M%S", time.localtime()))

INFO:sagemaker:Creating training-job with name: hstong-20230116111439


2023-01-16 11:14:40 Starting - Starting the training job...
2023-01-16 11:14:54 Starting - Preparing the instances for training......
2023-01-16 11:16:09 Downloading - Downloading input data
2023-01-16 11:16:09 Training - Training image download completed. Training in progress...
2023-01-16 11:16:14,499 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/bin/python3 -m pip install -r requirements.txt
Collecting deepctr
  Downloading deepctr-0.9.3-py3-none-any.whl (141 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 141.2/141.2 kB 3.4 MB/s eta 0:00:00
Collecting h5py==2.10.0
  Downloading h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.9/2.9 MB 108.7 MB/s eta 0:00:00
Installing collected packages: h5py, deepctr
  Attempting uninstall: h5py
    Found existing installation: h5py 3.7.0
    Uninstalling h5py-3.7.0:
      Successfully uninstalled h5py-3.7.0


                                                                 linear0sparse_emb_C17[0][0]      
                                                                 linear0sparse_emb_C18[0][0]      
                                                                 linear0sparse_emb_C19[0][0]      
                                                                 linear0sparse_emb_C20[0][0]      
                                                                 linear0sparse_emb_C21[0][0]      
                                                                 linear0sparse_emb_C22[0][0]      
                                                                 linear0sparse_emb_C23[0][0]      
                                                                 linear0sparse_emb_C24[0][0]      
                                                                 linear0sparse_emb_C25[0][0]      
                                                                 linear0sparse_emb_C26[0][0]


2023-01-16 11:16:45 Completed - Training job completed
Training seconds: 62
Billable seconds: 62
