## 1. Model Selection:
Choose Algorithm: Select an appropriate machine learning algorithm based on the nature of the problem (classification, regression, clustering, etc.), the size of the dataset, and other factors.

## 2. Model Building:
Instantiate Model: Create an instance of the chosen machine learning algorithm.

Fit Model: Train the model on the training data by calling the fit() method. During training, the model learns the patterns and relationships present in the data.

## XGBoost (Extreme Gradient Boosting) 

XGBoost is an optimized implementation of the gradient boosting algorithm, designed for efficiency, flexibility, and scalability. It is widely used in machine learning competitions and real-world applications because of its strong predictive performance and robustness. The sections below describe it in more detail:

## 1. Gradient Boosting Algorithm:
Boosting Ensemble Method: XGBoost belongs to the family of boosting ensemble methods, where multiple weak learners (usually decision trees) are trained sequentially, and each subsequent model corrects the errors made by the previous models.

Gradient Boosting: XGBoost employs the gradient boosting framework, which optimizes a differentiable loss function by iteratively fitting weak learners to the negative gradient of the loss function.
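
The idea of fitting weak learners to the negative gradient can be sketched in plain Python. This is a minimal illustration for squared loss with one-split trees (stumps), not XGBoost's actual algorithm: the helper names (`fit_stump`, `boost`), the toy data, and the learning rate are all assumptions made for the example.

```python
# Minimal gradient boosting for squared loss with depth-1 trees (stumps).
# Illustrative only: XGBoost also uses second-order gradient information
# and regularization, which this sketch omits.

def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Skip the largest value so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum(
            (t - right_mean) ** 2 for t in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially to the negative gradient of 0.5 * (y - f)^2."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual y - f(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy one-dimensional data with a step between x = 3 and x = 4.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, predictions = boost(xs, ys)
print(f"MSE before boosting: {mse(ys, [base] * len(ys)):.4f}")
print(f"MSE after boosting:  {mse(ys, predictions):.4f}")
```

Each round corrects part of the error left by the previous rounds, which is why the training error shrinks as stumps are added.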

## 2. Key Features of XGBoost:
Tree Ensemble Method: XGBoost builds an ensemble of decision trees, known as a gradient boosted decision tree (GBDT), to make predictions. Each tree is added sequentially to the ensemble, and subsequent trees learn from the residuals (errors) of the previous trees.

Regularization Techniques: XGBoost integrates various regularization techniques to prevent overfitting, including L1 (Lasso) and L2 (Ridge) regularization on leaf weights, and tree pruning to control tree depth and complexity.
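
To make the effect of these penalties concrete, the sketch below computes a regularized leaf weight and split gain from summed gradients `G` and Hessians `H`, following the formulas in the XGBoost paper. Handling the L1 term by soft-thresholding is an assumption about how the penalty is applied, and the function names are hypothetical, not part of the library.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight for summed gradients G and Hessians H."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Loss reduction contributed by one leaf (larger is better)."""
    return soft_threshold(grad_sum, reg_alpha) ** 2 / (hess_sum + reg_lambda)


def split_gain(left, right, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Gain of splitting a node into left and right, each given as (G, H)."""
    (gl, hl), (gr, hr) = left, right
    parent = leaf_score(gl + gr, hl + hr, reg_lambda, reg_alpha)
    children = leaf_score(gl, hl, reg_lambda, reg_alpha) + leaf_score(
        gr, hr, reg_lambda, reg_alpha
    )
    # gamma is the minimum gain a split must reach to be worth keeping.
    return 0.5 * (children - parent) - gamma


print(leaf_weight(-4.0, 3.0))                 # L2 only
print(leaf_weight(-4.0, 3.0, reg_alpha=2.0))  # L1 shrinks the weight further
print(split_gain((-4.0, 2.0), (4.0, 2.0)))             # positive: keep split
print(split_gain((-4.0, 2.0), (4.0, 2.0), gamma=6.0))  # negative: prune split
```

Larger `reg_lambda` and `reg_alpha` pull leaf weights toward zero, while `gamma` removes splits whose gain does not justify the added complexity.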

Customizable Loss Functions: XGBoost supports customizable loss functions for both regression and classification tasks, allowing users to define their own objectives or use predefined objectives like logistic loss, squared loss, etc.

Parallel and Distributed Computing: XGBoost is highly optimized for parallel and distributed computing, leveraging multiple CPU cores and supporting distributed computing frameworks like Apache Hadoop and Apache Spark.

Optimized Tree Construction: XGBoost employs a number of optimization techniques to speed up tree construction, including approximate tree learning, column block for parallelization, and out-of-core computing for handling large datasets.
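
As a rough illustration of approximate tree learning, the sketch below proposes split thresholds only at evenly spaced quantiles of a feature instead of testing every distinct value. It is a simplification: XGBoost uses a weighted quantile sketch, whereas this hypothetical `candidate_thresholds` helper treats all rows equally.

```python
def candidate_thresholds(values, n_bins=4):
    """Propose split thresholds at evenly spaced quantiles of a feature."""
    ordered = sorted(values)
    candidates = []
    for k in range(1, n_bins):
        index = (k * len(ordered)) // n_bins
        threshold = ordered[index]
        # Skip duplicates that appear when many rows share one value.
        if not candidates or threshold != candidates[-1]:
            candidates.append(threshold)
    return candidates


feature = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 12, 11]
print(candidate_thresholds(feature))
```

With twelve distinct values an exact search would evaluate eleven split positions; the quantile proposal reduces that to three.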

## 3. Advantages of XGBoost:
High Performance: XGBoost is known for its high prediction accuracy and efficiency, making it suitable for both small and large-scale datasets.

Flexibility: XGBoost can handle various types of data and tasks, including classification, regression, and ranking, and supports custom loss functions and evaluation metrics.

Feature Importance: XGBoost provides built-in feature importance scores, which help in feature selection and understanding the relative importance of input features in making predictions.

Robustness: XGBoost's regularization techniques help limit overfitting on noisy data, and missing values are handled during tree construction, where each split learns a default direction for entries with no value.

## 4. Limitations of XGBoost:
Parameter Tuning: XGBoost requires careful parameter tuning, especially for hyperparameters like learning rate, tree depth, and regularization parameters, to achieve optimal performance.

Computationally Intensive: Training an XGBoost model can be computationally intensive, especially for large datasets or deep trees, requiring substantial computational resources.

Interpretability: While XGBoost provides feature importance scores, the resulting models may not be as interpretable as simpler models like decision trees or linear models.

Overall, XGBoost is a powerful and versatile algorithm that excels in a wide range of machine learning tasks. With its robustness, efficiency, and flexibility, XGBoost has become a popular choice for both practitioners and researchers in the field of machine learning and data science.

In [34]:
# Writing the train and test dataset to S3 bucket
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()
print(bucket)
print(session)

sagemaker-us-east-1-851725479967
<sagemaker.session.Session object at 0x7f2c6257eec0>


In [2]:
# Upload train dataset to S3
X_train_data_location = session.upload_data(path="../datasets/resnet_X_train.pkl",bucket=bucket, key_prefix="datasets")
y_train_data_location = session.upload_data(path="../datasets/resnety_train.pkl",bucket=bucket, key_prefix="datasets")


# Upload test dataset to S3
X_test_data_location = session.upload_data(path="../datasets/resnetX_test.pkl", bucket=bucket, key_prefix="datasets")
y_test_data_location = session.upload_data(path="../datasets/resnety_test.pkl",bucket=bucket, key_prefix="datasets")

# Print the S3 locations
print("X Train data location:", X_train_data_location)
print("y train data location:", y_train_data_location)
print("X test data location:", X_test_data_location)
print("y Test data location:", y_test_data_location)


X Train data location: s3://sagemaker-us-east-1-851725479967/datasets/resnet_X_train.pkl
y train data location: s3://sagemaker-us-east-1-851725479967/datasets/resnety_train.pkl
X test data location: s3://sagemaker-us-east-1-851725479967/datasets/resnetX_test.pkl
y Test data location: s3://sagemaker-us-east-1-851725479967/datasets/resnety_test.pkl


In [3]:
# Import the required libraries
import xgboost as xgb
import os
import pickle
import boto3
from io import BytesIO

In [4]:
# Get the current directory
current_dir = os.getcwd()

# Get the parent directory (one level up)
parent_dir = os.path.dirname(current_dir)

# Print the parent directory
print("Parent Directory:", parent_dir)

Parent Directory: /home/sagemaker-user/faultFinding_aws_sagemaker


In [5]:
preprocessed_data_dir = parent_dir+'/datasets/'
model_dir = parent_dir+'/models/'

In [6]:
# Initialize the S3 client
s3 = boto3.client('s3')

# Specify the bucket name and key (path) of the pickle file
bucket_name = bucket
X_train_file_key = "datasets/resnet_X_train.pkl"
y_train_file_key = "datasets/resnety_train.pkl"
X_test_file_key = "datasets/resnetX_test.pkl"
y_test_file_key = "datasets/resnety_test.pkl"

def loadData(file_key):
    # Read the pickle file from S3
    response = s3.get_object(Bucket=bucket_name, Key=file_key)
    pickle_bytes = response['Body'].read()
    # Load the pickle file from bytes
    data = pickle.loads(pickle_bytes)

    return data

X_train = loadData(X_train_file_key)
y_train =  loadData(y_train_file_key)
X_test =  loadData(X_test_file_key)
y_test =  loadData(y_test_file_key)

In [7]:
# Verifying the shape
X_train.shape , y_train.shape

((1484, 2048), (1484,))

In [8]:
# Creating a copy of labels
y_train_1 = y_train.copy()

In [9]:
# Encode the labels: 'defective' becomes 0, everything else ('good') becomes 1
for i in range(len(y_train)):
    if y_train[i]=='defective' : 
        y_train[i] = 0
    else:
        y_train[i] = 1

In [10]:
# Convert the labels from string to int
y_train=y_train.astype(int)
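
The loop above maps every label other than 'defective' to 1, so a misspelled or unexpected label would silently count as 'good'. A stricter variant could look like the sketch below; `LABEL_TO_INT` and `encode_labels` are hypothetical names, and the example assumes the only valid labels are 'defective' and 'good'.

```python
# Explicit mapping, assuming 'defective' and 'good' are the only valid labels.
LABEL_TO_INT = {"defective": 0, "good": 1}


def encode_labels(labels):
    """Map string labels to integers, failing loudly on unknown values."""
    unknown = sorted(set(labels) - set(LABEL_TO_INT))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_INT[label] for label in labels]


print(encode_labels(["good", "defective", "good"]))  # [1, 0, 1]
```

The same function can be applied to the test labels, which keeps the train and test encodings consistent.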

In [11]:
# Model selection
xgb_model = xgb.XGBClassifier()

In [12]:
# Train the model
xgb_model.fit(X_train, y_train)

In [13]:
train_accuracy = xgb_model.score(X_train,y_train)
print(f"Training accuracy: {train_accuracy: .4f}")

Training accuracy:  1.0000


In [14]:
model_dir = parent_dir+'/models/'
model_dir

'/home/sagemaker-user/faultFinding_aws_sagemaker/models/'

In [15]:
# Save the trained model to a file
with open(os.path.join(model_dir,'RESNET50_xgbClassifier_model.pkl'),'wb') as f:
    pickle.dump(xgb_model, f)
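
Saving with `open(..., 'wb')` fails if the target directory does not exist yet, and the saved file is only useful if it can be loaded back. The sketch below shows the full round trip with a stand-in dictionary instead of the trained classifier, writing to a temporary directory so it can run anywhere; the file name `model.pkl` is a placeholder.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier; any picklable object behaves the same.
model = {"name": "stand-in model", "n_features": 2048}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "models")
    # Create the directory first so that open() does not fail.
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.pkl")

    # Save the object to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back and confirm nothing was lost.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.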

In [41]:
%%writefile train.py


# Import the required libraries
import xgboost as xgb
import os
import pickle
import boto3
import argparse
import sagemaker

model_file_name = "pipeline_model"



# Main Function
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    args, _ = parser.parse_known_args()

    # Specify the AWS region
    region_name = 'us-east-1'  # Change this to your desired region
    
    # Create a SageMaker session with the specified region
    session = boto3.Session(region_name=region_name)
    sagemaker_session = sagemaker.Session(boto_session=session)
    bucket = sagemaker_session.default_bucket()
    
    # Initialize the S3 client
    s3 = boto3.client('s3')
    
    # Specify the bucket name and key (path) of the pickle file
    bucket_name = bucket
    X_train_file_key = "datasets/resnet_X_train.pkl"
    y_train_file_key = "datasets/resnety_train.pkl"
    X_test_file_key = "datasets/resnetX_test.pkl"
    y_test_file_key = "datasets/resnety_test.pkl"

    def loadData(file_key):
        # Read the pickle file from S3
        response = s3.get_object(Bucket=bucket_name, Key=file_key)
        pickle_bytes = response['Body'].read()
        # Load the pickle file from bytes
        data = pickle.loads(pickle_bytes)
    
        return data

    X_train = loadData(X_train_file_key)
    y_train =  loadData(y_train_file_key)
    X_test =  loadData(X_test_file_key)
    y_test =  loadData(y_test_file_key)
    
    # Encode the training labels: 'defective' becomes 0, everything else ('good') becomes 1
    for i in range(len(y_train)):
        if y_train[i] == 'defective': 
            y_train[i] = 0
        else:
            y_train[i] = 1
    # Convert the labels from string to int
    y_train = y_train.astype(int)
    # Model selection
    xgb_model = xgb.XGBClassifier()
    # Train the model
    xgb_model.fit(X_train, y_train)

    # Encode the test labels the same way: 'defective' becomes 0, everything else ('good') becomes 1
    for i in range(len(y_test)):
        if y_test[i] == 'defective':
            y_test[i] = 0
        else:
            y_test[i] = 1
    # Convert the labels from string to int
    y_test = y_test.astype(int)
    # train accuracy
    train_accuracy = xgb_model.score(X_train, y_train)
    # test accuracy
    test_accuracy = xgb_model.score(X_test, y_test)
    # Save Model
    model_save_path = os.path.join(args.model_dir, model_file_name)
    with open(model_save_path,'wb') as f:
        pickle.dump(xgb_model, f)
    print(f"Model saved at path: {model_save_path}")
    print(f"Training accuracy: {train_accuracy: .4f}")
    print(f"Testing accuracy: {test_accuracy: .4f}")


# Check if the script is being executed as the main module
if __name__ == "__main__":
    # Call the main function
    main()


Overwriting train.py
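
Note that `train.py` only parses `--model_dir` and builds `xgb.XGBClassifier()` with default settings, so any hyperparameters set on an estimator would be ignored. In script mode, SageMaker passes hyperparameters to the entry point as command-line arguments, so one way to pick them up is sketched below. The helper names `parse_args` and `to_model_params` are hypothetical, and the defaults are XGBoost's documented defaults.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the model directory plus the hyperparameters passed to the job."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--eta", type=float, default=0.3)
    parser.add_argument("--gamma", type=float, default=0.0)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--min_child_weight", type=float, default=1.0)
    parser.add_argument("--subsample", type=float, default=1.0)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    # Ignore any extra arguments the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args


def to_model_params(args):
    """Translate parsed arguments into XGBClassifier keyword arguments."""
    return {
        "learning_rate": args.eta,  # eta is an alias for learning_rate
        "gamma": args.gamma,
        "max_depth": args.max_depth,
        "min_child_weight": args.min_child_weight,
        "subsample": args.subsample,
        "objective": args.objective,
    }


# Simulate the command line a training job might receive.
example_argv = [
    "--eta", "0.2", "--gamma", "4", "--max_depth", "5",
    "--min_child_weight", "6", "--subsample", "0.8",
    "--objective", "binary:logistic",
]
params = to_model_params(parse_args(example_argv))
print(params)
# In train.py this could become: xgb_model = xgb.XGBClassifier(**params)
```

Without a change along these lines, every training job trains the same default model regardless of the values it receives.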


In [42]:
%%writefile requirements.txt
xgboost
boto3
sagemaker
fsspec
s3fs

Overwriting requirements.txt


In [43]:
# Train with the help of sagemaker estimator
# Choose framework

from sagemaker import get_execution_role
from sagemaker.xgboost.estimator import XGBoost

xgb_estimator = XGBoost(
                          role=get_execution_role(),
                          base_job_name="xgb-pipeline-run",
                          entry_point="train.py",
                          framework_version='1.7-1',
                          dependencies=['requirements.txt'],
                          instance_count=1,
                          instance_type='ml.m5.xlarge',
                          hyperparameters={'eta': 0.2,
                                           'gamma': 4,
                                           'max_depth': 5,
                                           'min_child_weight': 6,
                                           'objective': 'binary:logistic',
                                           'subsample': 0.8,
                                           },
                          use_spot_instances=True,
                          max_wait=600,
                          max_run=600,
                          )

xgb_estimator.fit()

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.xlarge.
INFO:sagemaker:Creating training-job with name: xgb-pipeline-run-2024-05-23-19-35-24-098


2024-05-23 19:35:24 Starting - Starting the training job...
2024-05-23 19:35:39 Starting - Preparing the instances for training...
2024-05-23 19:36:07 Downloading - Downloading input data...
2024-05-23 19:36:37 Downloading - Downloading the training image...
2024-05-23 19:37:13 Training - Training image download completed. Training in progress..
[2024-05-23 19:37:23.467 ip-10-0-242-240.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2024-05-23 19:37:23.490 ip-10-0-242-240.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2024-05-23:19:37:23:INFO] Imported framework sagemaker_xgboost_container.training
[2024-05-23:19:37:23:INFO] No GPUs detected (normal if no gpus installed)
[2024-05-23:19:37:23:INFO] Invoking user training script.
[2024-05-23:19:37:24:INFO] Module train does not provide a setup.py.
Generating setup.py
[2024-05-23:19:37:24:INFO] Generating setup.cfg

In [54]:
import boto3
import sagemaker

# Initialize SageMaker client
sm_client = boto3.client("sagemaker")
# Get the training job name
training_job_name = xgb_estimator.latest_training_job.name

# Use the training job name to describe the training job
model_artifact = sm_client.describe_training_job(TrainingJobName=training_job_name)["ModelArtifacts"]["S3ModelArtifacts"]

print(f"Model storage location: {model_artifact}")
print(f"Training Job Name: {training_job_name}")

Model storage location: s3://sagemaker-us-east-1-851725479967/xgb-pipeline-run-2024-05-23-19-35-24-098/output/model.tar.gz
Training Job Name: xgb-pipeline-run-2024-05-23-19-35-24-098


In [56]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

# Define hyperparameters to tune
hyperparameter_ranges = {
    'max_depth': IntegerParameter(3, 10),
    'eta': ContinuousParameter(0.01, 0.2),
    'min_child_weight': IntegerParameter(1, 6),
    'subsample': ContinuousParameter(0.5, 0.9),
    'gamma': ContinuousParameter(0, 10)
}

# Create Optimizer

optimizer = HyperparameterTuner(
    base_tuning_job_name="xgb-pipeline-run",
    estimator=xgb_estimator,
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    objective_metric_name="validation:auc",
    max_jobs=10,
    max_parallel_jobs=2,
)
# Launch optimizer to fit

optimizer.fit()


INFO:sagemaker:Creating hyperparameter tuning job with name: xgb-pipeline-run-240523-2007


..................................................................................................................*


UnexpectedStatusException: Error for HyperParameterTuning job xgb-pipeline-run-240523-2007: Failed. Reason: No objective metrics found after running 5 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided.
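
This failure is consistent with `train.py`, which prints training and testing accuracy but never a `validation:auc` value, so the tuner has nothing to optimize. A likely fix has two parts: have the script print the metric, and give the tuner a regular expression that captures it. The sketch below checks that a printed line and a regex agree. The `roc_auc` helper, the toy labels and scores, and the exact log format are assumptions for illustration.

```python
import re


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical metric definition; the regex must capture the number printed below.
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation:auc ([0-9\.]+)"}
]

# Toy labels and scores standing in for y_test and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# The line the training script would need to print to its log.
log_line = f"validation:auc {roc_auc(labels, scores):.4f}"
print(log_line)

# Confirm that the regex extracts the metric value from that line.
match = re.search(metric_definitions[0]["Regex"], log_line)
print(float(match.group(1)))
```

In the notebook, the list would be passed as `metric_definitions=metric_definitions` when constructing `HyperparameterTuner`, and the scores in `train.py` could come from something like `xgb_model.predict_proba(X_test)[:, 1]`. The tuned hyperparameters only have an effect if the script also reads them from its command-line arguments.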

In [None]:
# Analyze the tuning results
results = optimizer.analytics().dataframe()
results.sort_values("FinalObjectiveValue", ascending=False).head()
