In [None]:
!python train_tabnet.py --train_path s3://fireguarddata/data/preprocessed_data/train.csv --val_path s3://fireguarddata/data/preprocessed_data/val.csv --epochs 2 --batch_size 8192

Steps to Train Your Model Using SageMaker Estimator:
## Keep Your Training Script in a Local Directory
Make sure your training scripts are available in a local directory. The estimators below point `source_dir` at a local folder, and the SageMaker Python SDK packages that folder and uploads it to S3 when the job starts.
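
A quick way to confirm the scripts are where the estimator will look for them is a small existence check. This is a minimal sketch: the helper name `missing_entry_points` is hypothetical, and the file names are simply the ones used in this notebook.

```python
from pathlib import Path


def missing_entry_points(source_dir, entry_points):
    """Return the entry-point file names that are not present in source_dir."""
    root = Path(source_dir)
    return [name for name in entry_points if not (root / name).is_file()]


# Script names taken from this notebook; adjust to the jobs you plan to run.
wanted = ["train_tabnet.py", "train_tabnet_balanced.py", "train_tabnet_focal.py"]
missing = missing_entry_points("./scripts", wanted)
print("Missing scripts:", missing or "none")
```

Running this before creating an estimator catches a misplaced script locally instead of after the training job has been submitted.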

### Setting up Training Job
### Upload TabNet Python script to S3

**Uploading script to S3**
Note: the training jobs below read the scripts from the local `./scripts` directory, not from these S3 copies.

In [19]:
!aws s3 cp train_tabnet_balanced.py s3://fireguarddata/scripts/train_tabnet_balanced.py

upload: ./train_tabnet_balanced.py to s3://fireguarddata/scripts/train_tabnet_balanced.py


In [None]:
!aws s3 cp train_tabnet_focal.py s3://fireguarddata/scripts/train_tabnet_focal.py

In [20]:
#verify upload
!aws s3 ls s3://fireguarddata/scripts/

2025-03-30 05:04:51       2999 train_tabnet.py
2025-03-30 05:57:32       3803 train_tabnet_balanced.py


In [1]:
#list files in local directory
!ls -l


total 104
drwxr-xr-x 2 vanel vanel  4096 Jan 26 13:20 Data
drwxr-xr-x 2 vanel vanel  4096 Mar 24 02:42 cb_2022_us_state_20m
drwxr-xr-x 2 vanel vanel  4096 Apr  2 20:00 files-moved
drwxr-xr-x 3 vanel vanel  4096 Mar 24 00:35 mtbs_fod_pts_data
drwxr-xr-x 2 vanel vanel  4096 Apr  2 20:00 scripts
-rw-r--r-- 1 vanel vanel 85940 Apr  2 19:40 sdk_tabnet_train.ipynb


In [3]:
#Create the scripts directory if it doesn't already exist:
!mkdir -p scripts

In [30]:
#move script to the scripts folder
!mv train_tabnet_balanced.py ./scripts/

**Create the Training Job Using SageMaker Python SDK**
Let's set up the SageMaker training job.

In [4]:
ls -l ./scripts

total 4
-rw-r--r-- 1 vanel vanel 2999 Apr  2 19:40 train_tabnet.py


Run the Cell and Monitor
- The cell will initiate the SageMaker training job.
- You can monitor the training job in the SageMaker Console under Training > Training jobs.
- Once the job completes, your trained model will be stored in: `s3://fireguarddata/models/tabnet_balanced/`


## 1. SageMaker Python SDK Training Job for train_tabnet.py
Create the Training Job Using SageMaker Python SDK

1. Import necessary libraries.
2. Set up an Estimator object with your training script and S3 paths.
3. Launch the training job.
    - Image URI: Uses a GPU-based PyTorch training image (PyTorch 1.13.1, Python 3.9).
    - Instance Type: The cell below uses ml.g4dn.8xlarge as the cheaper option; ml.g5.12xlarge offers more GPU resources and is used for the focal-loss job later in this notebook.
    - Region: The image URI is region-specific; this notebook uses the US East (N. Virginia) region.
    - Hyperparameters: The cell below sets epochs to 20 and batch size to 2048.
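
The hyperparameters set on the estimator reach the entry-point script as command-line flags (`--epochs 20 --batch_size 2048 ...`), in the same form as the local test run at the top of this notebook. The training scripts themselves are not shown here, so the parser below is only a sketch of how such a script might read them; the function name and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the four hyperparameters used in this notebook from CLI-style args."""
    parser = argparse.ArgumentParser()
    # Defaults are illustrative assumptions, not values from the real scripts.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=2048)
    parser.add_argument("--train_path", type=str, required=True)
    parser.add_argument("--val_path", type=str, required=True)
    # parse_known_args ignores any extra arguments passed alongside these.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Same flags as the local test run at the top of this notebook.
args = parse_hyperparameters([
    "--epochs", "2",
    "--batch_size", "8192",
    "--train_path", "s3://fireguarddata/data/preprocessed_data/train.csv",
    "--val_path", "s3://fireguarddata/data/preprocessed_data/val.csv",
])
print(args.epochs, args.batch_size)
```

Because the values arrive as strings on the command line, declaring `type=int` for `epochs` and `batch_size` keeps the script from silently treating them as text.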

In [6]:
!pip install sagemaker

Collecting sagemaker
  Downloading sagemaker-2.243.0-py3-none-any.whl.metadata (16 kB)
Collecting attrs<24,>=23.1.0 (from sagemaker)
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting docker (from sagemaker)
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting fastapi (from sagemaker)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting google-pasta (from sagemaker)
  Using cached google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting importlib-metadata<7.0,>=1.4.0 (from sagemaker)
  Downloading importlib_metadata-6.11.0-py3-none-any.whl.metadata (4.9 kB)
Collecting numpy<2.0,>=1.9.0 (from sagemaker)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting omegaconf<=2.3,>=2.2 (from sagemaker)
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting pathos (from sagemaker)
  Using cached pathos-0.3.3-py3-none-any.whl.metadata (11 kB)


In [None]:
# Import Required Libraries and Set Up
import sagemaker
from sagemaker.pytorch import PyTorch
import boto3

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
# role = sagemaker.get_execution_role()  # use only when running inside SageMaker
role = "AmazonSageMaker-ExecutionRole"
region = boto3.Session().region_name

# Image URI for PyTorch training (GPU version)
image_uri = "placeholderdkr.amazonaws.com/pytorch-training:1.13.1-gpu-py39"
bucket = "fireguarddata"
scripts_path = f"s3://{bucket}/scripts/"
output_path = f"s3://{bucket}/models/tabnet-new/"

print(f"Region: {region}")
print(f"Role: {role}")
print(f"Image URI: {image_uri}")

# Set Hyperparameters
hyperparameters = {
    "epochs": 20,
    "batch_size": 2048,
    "train_path": "s3://fireguarddata/data/preprocessed_data/train.csv",
    "val_path": "s3://fireguarddata/data/preprocessed_data/val.csv",
}

# Configure the Estimator for Training Job
estimator = PyTorch(
    entry_point="train_tabnet.py",  # Updated script name
    source_dir="./scripts",        # Directory containing the script
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.8xlarge", # Updated to the cheaper instance
    image_uri=image_uri,
    framework_version="1.13.1",
    py_version="py39",
    output_path=output_path,       # Updated output path
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker_session
)

# Launch the Training Job
estimator.fit(job_name="pytorch-training-tabnet4")
print("Training job launched with ml.g4dn.8xlarge!")


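Step 5: Track the Training Job

By default, `estimator.fit()` streams the logs and blocks until the job finishes. If you launch with `fit(wait=False)` instead, you can monitor the job's progress either through the AWS Console under SageMaker > Training Jobs or from this notebook by running:

estimator.latest_training_job.wait(logs="All")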
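
If you only need the job status rather than the full log stream, polling the SageMaker API is another option. The sketch below assumes a boto3 SageMaker client is passed in; the helper name `wait_for_training_job` and the 60-second polling interval are illustrative choices, not part of the SDK.

```python
import time

# Terminal values of TrainingJobStatus; InProgress and Stopping are not terminal.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    while True:
        description = sm_client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# wait_for_training_job(boto3.client("sagemaker"), "pytorch-training-tabnet4")
```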

## 2. SageMaker Python SDK Training Job for train_tabnet_balanced.py

In [None]:
# -- Step 1: Import Required Libraries and Set Up
import sagemaker
from sagemaker.pytorch import PyTorch
import boto3

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # Get the current SageMaker role
region = boto3.Session().region_name

# Image URI for PyTorch training (GPU version as per your setup)
image_uri = "placeholder/pytorch-training:1.13.1-gpu-py39"
bucket = "fireguarddata"  # Your S3 bucket name
scripts_path = f"s3://{bucket}/scripts/"
output_path = f"s3://{bucket}/models/tabnet/"

print(f"Region: {region}")
print(f"Role: {role}")
print(f"Image URI: {image_uri}")

# -- Step 2: Set Hyperparameters and Configuration -- 
# Hyperparameters for training
hyperparameters = {
    "epochs": 10,
    "batch_size": 2048,
    "train_path": "s3://fireguarddata/data/preprocessed_data/train.csv",
    "val_path": "s3://fireguarddata/data/preprocessed_data/val.csv"
}

# --Step 3: Configure the Estimator for Training Job --
# Initialize the PyTorch estimator for the balanced model
estimator_balanced = PyTorch(
    entry_point="train_tabnet_balanced.py",  # Your balanced training script
    source_dir="./scripts",  # The local directory where your script is present 
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.8xlarge",
    image_uri=image_uri,
    framework_version="1.13.1",
    py_version="py39",
    output_path=output_path,
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker_session
)


#--Step 4: Launch the Training Job -- 
# Start the training job - job_name="pytorch-training-balanced"
estimator_balanced.fit()
print("Balanced training job launched!")

## 3. SageMaker Python SDK Training Job for train_tabnet_focal.py

In [None]:
# -- Step 1: Import Required Libraries and Set Up
import sagemaker
from sagemaker.pytorch import PyTorch
import boto3

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # Get the current SageMaker role
region = boto3.Session().region_name

# Image URI for PyTorch training (GPU version as per your setup)
image_uri = "placeholder/ amazonaws.com/pytorch-training:1.13.1-gpu-py39"
bucket = "fireguarddata"  # Your S3 bucket name
scripts_path = f"s3://{bucket}/scripts/"
output_path = f"s3://{bucket}/models/tabnet/"

print(f"Region: {region}")
print(f"Role: {role}")
print(f"Image URI: {image_uri}")

# -- Step 2: Set Hyperparameters and Configuration -- 
# Hyperparameters for training
hyperparameters = {
    "epochs": 5,
    "batch_size": 8192,
    "train_path": "s3://fireguarddata/data/preprocessed_data/train.csv",
    "val_path": "s3://fireguarddata/data/preprocessed_data/val.csv"
}

# Step 3: Configure the Estimator for the Focal Loss Model
estimator_focal = PyTorch(
    entry_point="train_tabnet_focal.py",  # Updated to the focal loss script
    source_dir="./scripts",  # Local path where the script is stored
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    image_uri=image_uri,
    framework_version="1.13.1",
    py_version="py39",
    output_path=output_path,
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker_session
)


# Step 4: Launch the Focal Loss Training Job
estimator_focal.fit(job_name="pytorch-tabnet-training-focal")
print("Focal loss training job launched!")

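**Step 6: Access the Trained Model**
After training, the model artifact (`model.tar.gz`) is saved in your S3 bucket under the estimator's `output_path`, for example under `s3://fireguarddata/models/tabnet/` for the balanced and focal-loss jobs above.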
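
SageMaker usually writes the artifact to `<output_path>/<job-name>/output/model.tar.gz`. The helper below is a sketch of that layout only (the function name is hypothetical); after a job has run, `estimator.model_data` gives the location reported by SageMaker and should be preferred.

```python
def model_artifact_uri(output_path, job_name):
    """Build the S3 URI where SageMaker usually writes model.tar.gz for a job."""
    # Strip a trailing slash so the joined URI has no empty path segment.
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Output path and job name taken from the first training cell in this notebook.
print(model_artifact_uri("s3://fireguarddata/models/tabnet-new/", "pytorch-training-tabnet4"))
```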

**Key Concepts:**
- Training Script (train.py): Contains your training logic.
- SageMaker Training Job: Runs your script on AWS infrastructure.
- Pre-built AWS Container: We used the AWS PyTorch container to simplify dependency management.
- Output: Your trained model is saved to an S3 bucket.

## Finding the PyTorch Image URI
AWS SageMaker provides pre-built Docker images for popular frameworks like PyTorch. The image URI depends on the following: framework (pytorch), framework version, Python version (e.g. py39), processor type (CPU or GPU, which follows from the instance type), and region (e.g. us-east-1).

Use the SageMaker SDK to get the correct image URI. Run the following snippet in your Jupyter notebook to retrieve it automatically:

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch

# Get the AWS region
region = sagemaker.Session().boto_region_name

# Specify the framework, version, Python version, and instance type
framework_version = "1.13.1"
py_version = "py39"
instance_type = "ml.g4dn.8xlarge"  # GPU instance

# Get the correct image URI
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    version=framework_version,
    py_version=py_version,
    instance_type=instance_type,
    image_scope="training"  # Important for training jobs
)

print(f"Image URI: {image_uri}")
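
To see how those inputs map onto the URI string, here is a rough sketch that composes a URI in the same shape as the ones used in the training cells. It is illustrative only: both helper names are hypothetical, the GPU check is a simple heuristic on the instance family, and the registry account ID is left as a caller-supplied placeholder. For real jobs, rely on `sagemaker.image_uris.retrieve`.

```python
def processor_for(instance_type):
    """Guess 'gpu' or 'cpu' from the instance family (heuristic: ml.g* and ml.p* are GPU)."""
    family = instance_type.split(".")[1]
    return "gpu" if family.startswith(("g", "p")) else "cpu"


def compose_image_uri(account_id, region, version, py_version, instance_type):
    """Assemble <account>.dkr.ecr.<region>.amazonaws.com/pytorch-training:<tag>."""
    processor = processor_for(instance_type)
    return (
        f"{account_id}.dkr.ecr.{region}.amazonaws.com/"
        f"pytorch-training:{version}-{processor}-{py_version}"
    )


# "<account-id>" is a placeholder; the real registry account differs per region.
print(compose_image_uri("<account-id>", "us-east-1", "1.13.1", "py39", "ml.g4dn.8xlarge"))
```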
