## 1. Model Selection:
Choose Algorithm: Select an appropriate machine learning algorithm based on the nature of the problem (classification, regression, clustering, etc.), the size of the dataset, and other factors.

## 2. Model Building:
Instantiate Model: Create an instance of the chosen machine learning algorithm.

Fit Model: Train the model on the training data by calling the fit() method. During training, the model learns the patterns and relationships present in the data.

## XGBoost (Extreme Gradient Boosting) 

XGBoost is an advanced implementation of the gradient boosting algorithm, designed for efficiency, flexibility, and scalability. It is widely used in machine learning competitions and real-world applications because of its strong predictive performance and robustness. The sections below describe it in more detail:

## 1. Gradient Boosting Algorithm:
Boosting Ensemble Method: XGBoost belongs to the family of boosting ensemble methods, where multiple weak learners (usually decision trees) are trained sequentially, and each subsequent model corrects the errors made by the previous models.

Gradient Boosting: XGBoost employs the gradient boosting framework, which optimizes a differentiable loss function by iteratively fitting weak learners to the negative gradient of the loss function.
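
To make the idea of fitting weak learners to the negative gradient concrete, here is a minimal sketch in plain Python. It assumes squared-error loss (whose negative gradient is simply the residual), a single numeric feature, and depth-1 trees (stumps). The helper names `fit_stump`, `predict_stump`, and `boost`, the toy data, and the learning rate are illustrative assumptions, not part of XGBoost's API; XGBoost itself additionally uses second-order gradients, regularization, and much faster split finding.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, targets):
    """Find the single threshold split that minimizes squared error.

    Returns (threshold, left_value, right_value), where each value is the
    mean target on that side of the split.
    """
    best = None
    # Exclude the largest value so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For 0.5 * (y - pred)^2 the negative gradient w.r.t. pred is y - pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history


# Toy data (made up for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

base, stumps, history = boost(xs, ys)
print(f"MSE after round 1: {history[0]:.4f}")
print(f"MSE after round {len(history)}: {history[-1]:.4f}")
```

Each round corrects part of what the previous rounds got wrong, so the training error shrinks step by step; the learning rate controls how much of each correction is applied.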

## 2. Key Features of XGBoost:
Tree Ensemble Method: XGBoost builds an ensemble of decision trees, known as a gradient boosted decision tree (GBDT), to make predictions. Each tree is added sequentially to the ensemble, and subsequent trees learn from the residuals (errors) of the previous trees.

Regularization Techniques: XGBoost integrates several regularization techniques to prevent overfitting, including L1 (Lasso) and L2 (Ridge) penalties on leaf weights, a minimum loss reduction required to make a split, and limits on tree depth that keep trees from growing overly complex.
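
The effect of the L2 penalty and of a split-gain threshold can be illustrated with the leaf-weight and gain formulas from the XGBoost paper. The sketch below is a plain-Python illustration, not XGBoost's internal code: `grad_sum` and `hess_sum` stand for the sums of the first and second derivatives of the loss over the samples in a node, the function names are hypothetical, and the L1 term is omitted for brevity.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf weight under an L2 penalty: w = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Quality term G^2 / (H + lambda) used when comparing tree structures."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node, minus the per-split penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Made-up gradient statistics for a candidate split
print(split_gain(-4.0, 2.0, 3.0, 2.0))             # about 4.0667
print(split_gain(-4.0, 2.0, 3.0, 2.0, gamma=5.0))  # negative: split not worthwhile
```

A larger `reg_lambda` shrinks leaf weights toward zero, and a larger `gamma` makes marginal splits unprofitable, which is how pruning keeps trees small.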

Customizable Loss Functions: XGBoost supports customizable loss functions for both regression and classification tasks, allowing users to define their own objectives or use predefined objectives like logistic loss, squared loss, etc.
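
A custom objective boils down to supplying, for every sample, the first and second derivative of the loss with respect to the raw prediction. As an illustration, the sketch below computes that pair for logistic loss in plain Python. The function name `logistic_grad_hess` is hypothetical, and the exact callable signature XGBoost expects for a custom objective should be checked against its documentation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(raw_scores, labels):
    """Per-sample gradient and hessian of logistic loss w.r.t. the raw score.

    With p = sigmoid(score): gradient = p - y, hessian = p * (1 - p).
    """
    grads, hessians = [], []
    for score, y in zip(raw_scores, labels):
        p = sigmoid(score)
        grads.append(p - y)
        hessians.append(p * (1.0 - p))
    return grads, hessians


grads, hessians = logistic_grad_hess([0.0, 2.0, -1.0], [1, 1, 0])
print(grads[0], hessians[0])  # -0.5 0.25
```

A confident, correct prediction (score 2.0 for label 1) yields a small gradient, so later trees spend little effort on it.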

Parallel and Distributed Computing: XGBoost is highly optimized for parallel and distributed computing, leveraging multiple CPU cores and supporting distributed computing frameworks like Apache Hadoop and Apache Spark.

Optimized Tree Construction: XGBoost employs a number of optimization techniques to speed up tree construction, including approximate tree learning, column block for parallelization, and out-of-core computing for handling large datasets.
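
The idea behind approximate tree learning is to evaluate only a handful of candidate thresholds per feature instead of every distinct value. The sketch below picks candidates at equal-frequency quantiles in plain Python. It is a simplification under stated assumptions: XGBoost uses a weighted quantile sketch rather than exact sorting, and the function name and bin count here are illustrative.

```python
def candidate_thresholds(values, n_bins=4):
    """Return up to n_bins - 1 split candidates at equal-frequency quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    thresholds = []
    for k in range(1, n_bins):
        idx = min((k * n) // n_bins, n - 1)
        threshold = ordered[idx]
        # Skip duplicates that arise when many samples share a value.
        if not thresholds or threshold != thresholds[-1]:
            thresholds.append(threshold)
    return thresholds


values = list(range(1, 101))  # 100 distinct feature values
print(candidate_thresholds(values, n_bins=4))  # [26, 51, 76]
```

With 100 distinct values, an exact search would test 99 thresholds; the quantile approach tests 3, trading a little precision for a large speed-up.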

## 3. Advantages of XGBoost:
High Performance: XGBoost is known for its high prediction accuracy and efficiency, making it suitable for both small and large-scale datasets.

Flexibility: XGBoost can handle various types of data and tasks, including classification, regression, and ranking, and supports custom loss functions and evaluation metrics.

Feature Importance: XGBoost provides built-in feature importance scores, which help in feature selection and understanding the relative importance of input features in making predictions.

Robustness: XGBoost's regularization helps limit overfitting, although it does not eliminate it, and the algorithm handles missing values natively: during tree construction each split learns a default direction for samples whose feature value is missing.
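
One way missing values can be handled during tree construction is to try sending them to each side of a candidate split and keep the direction with the higher gain. The sketch below shows that choice in plain Python. The gain formula follows the XGBoost paper, while the function names and the gradient statistics are made up for illustration.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda=1.0):
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def gain(left, right, reg_lambda=1.0):
    """Split gain for (grad_sum, hess_sum) pairs on each side."""
    parent = leaf_score(left[0] + right[0], left[1] + right[1], reg_lambda)
    children = leaf_score(*left, reg_lambda) + leaf_score(*right, reg_lambda)
    return 0.5 * (children - parent)


def add_stats(a, b):
    return (a[0] + b[0], a[1] + b[1])


def best_default_direction(left, right, missing, reg_lambda=1.0):
    """Choose where samples with a missing feature value should go."""
    gain_if_left = gain(add_stats(left, missing), right, reg_lambda)
    gain_if_right = gain(left, add_stats(right, missing), reg_lambda)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right


# (grad_sum, hess_sum) for samples with known values on each side,
# and for samples whose value for this feature is missing
direction, best_gain = best_default_direction(
    left=(-4.0, 2.0), right=(3.0, 2.0), missing=(-2.0, 1.0)
)
print(direction, best_gain)  # left 5.25
```

Here the missing-value samples have gradients that resemble the left group, so routing them left gives the better split.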

## 4. Limitations of XGBoost:
Parameter Tuning: XGBoost requires careful parameter tuning, especially for hyperparameters like learning rate, tree depth, and regularization parameters, to achieve optimal performance.
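
A simple starting point for tuning is to enumerate a small grid of candidate settings and evaluate each one on held-out data. The sketch below only generates the combinations, using the standard library. The values in the grid are arbitrary examples, not recommendations, and `iter_param_combinations` is a hypothetical helper.

```python
from itertools import product

# Example search space (values chosen only for illustration)
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5],
    "reg_lambda": [1.0, 5.0],
}


def iter_param_combinations(grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combos = list(iter_param_combinations(param_grid))
print(len(combos))  # 12
print(combos[0])    # {'learning_rate': 0.05, 'max_depth': 3, 'reg_lambda': 1.0}
```

Each dictionary could then be unpacked into a model constructor, for example `xgb.XGBClassifier(**params)`, and scored with cross-validation.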

Computationally Intensive: Training an XGBoost model can be computationally intensive, especially for large datasets or deep trees, requiring substantial computational resources.

Interpretability: While XGBoost provides feature importance scores, the resulting models may not be as interpretable as simpler models like decision trees or linear models.

Overall, XGBoost is a powerful and versatile algorithm that excels in a wide range of machine learning tasks. With its robustness, efficiency, and flexibility, XGBoost has become a popular choice for both practitioners and researchers in the field of machine learning and data science.

In [1]:
# Get the default S3 bucket for this SageMaker session
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()
print(bucket)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
sagemaker-us-east-1-851725479967


In [2]:
# Import the required libraries
import xgboost as xgb
import os
import pickle

In [3]:
# Get the current directory
current_dir = os.getcwd()

# Get the parent directory (one level up)
parent_dir = os.path.dirname(current_dir)

# Print the parent directory
print("Parent Directory:", parent_dir)

Parent Directory: /home/sagemaker-user/faultFinding_aws_sagemaker


In [4]:
preprocessed_data_dir = parent_dir+'/datasets/'
model_dir = parent_dir+'/models/'

In [5]:
# Load X_train (ResNet feature vectors) from file
with open(os.path.join(preprocessed_data_dir,'resnet_X_train.pkl'), 'rb') as f:
    X_train = pickle.load(f)
    
# Load y_train (labels) from file
with open(os.path.join(preprocessed_data_dir,'resnety_train.pkl'), 'rb') as f:
    y_train = pickle.load(f)

In [6]:
# Verifying the shape
X_train.shape , y_train.shape

((1484, 2048), (1484,))

In [7]:
# Creating a copy of labels
y_train_1 = y_train.copy()

In [8]:
# Encode the labels: 'defective' becomes 0, every other label ('good') becomes 1
for i in range(len(y_train)):
    if y_train[i]=='defective' : 
        y_train[i] = 0
    else:
        y_train[i] = 1

In [9]:
# Convert the encoded labels to an integer dtype
y_train=y_train.astype(int)
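
The loop above maps every label other than 'defective' to 1, so a misspelled or unexpected label would silently be treated as 'good'. A stricter variant, sketched below with an explicit mapping, fails loudly instead. `LABEL_TO_INT` and `encode_labels` are hypothetical names, and the sketch assumes the only valid labels are 'defective' and 'good'.

```python
# Assumed label vocabulary, taken from the comment in the cell above
LABEL_TO_INT = {"defective": 0, "good": 1}


def encode_labels(labels):
    """Map string labels to integers, rejecting anything unexpected."""
    unknown = sorted({label for label in labels if label not in LABEL_TO_INT})
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_INT[label] for label in labels]


print(encode_labels(["good", "defective", "good"]))  # [1, 0, 1]
```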

In [10]:
# Model selection
xgb_model = xgb.XGBClassifier()

In [11]:
# Train the model
xgb_model.fit(X_train, y_train)

In [12]:
train_accuracy = xgb_model.score(X_train,y_train)
print(f"Training accuracy: {train_accuracy: .4f}")

Training accuracy:  1.0000


In [13]:
model_dir = parent_dir+'/models/'
model_dir

'/home/sagemaker-user/faultFinding_aws_sagemaker/models/'

In [14]:
# Save the trained model to a file
with open(os.path.join(model_dir,'RESNET50_xgbClassifier_model.pkl'),'wb') as f:
    pickle.dump(xgb_model, f)

In [15]:
%%writefile train.py

# Import the required libraries
import argparse
import os
import pickle

import xgboost as xgb

model_file_name = "pipeline_model"

# Main function
def main():
    parser = argparse.ArgumentParser()

    # SageMaker exposes the model output directory through SM_MODEL_DIR
    parser.add_argument("--model_dir",type = str , default = os.environ.get("SM_MODEL_DIR"))

    # Ignore any additional arguments passed to the script
    args, _ = parser.parse_known_args()
    
    # Get the current directory
    current_dir = os.getcwd()

    # Get the parent directory (one level up)
    parent_dir = os.path.dirname(current_dir)
    
    # Print the parent directory
    print("Parent Directory:", parent_dir)
    
    preprocessed_data_dir = parent_dir+'/datasets/'
    model_dir = parent_dir+'/models/'
    
    # Load X_train from file
    with open(os.path.join(preprocessed_data_dir,'resnet_X_train.pkl'), 'rb') as f:
        X_train = pickle.load(f)
        
    # Load y_train from file
    with open(os.path.join(preprocessed_data_dir,'resnety_train.pkl'), 'rb') as f:
        y_train = pickle.load(f)
    
    # Creating a copy of labels
    y_train_1 = y_train.copy()
    
    # Encode the labels: 'defective' becomes 0, every other label ('good') becomes 1
    for i in range(len(y_train)):
        if y_train[i]=='defective' : 
            y_train[i] = 0
        else:
            y_train[i] = 1
    
    # Convert the encoded labels to an integer dtype
    y_train=y_train.astype(int)
    
    # Model selection
    xgb_model = xgb.XGBClassifier()
    
    # Train the model
    xgb_model.fit(X_train, y_train)

    # Load X_test from file
    with open(os.path.join(preprocessed_data_dir,'resnetX_test.pkl'), 'rb') as f:
        X_test = pickle.load(f)
        
    # Load y_test from file
    with open(os.path.join(preprocessed_data_dir,'resnety_test.pkl'), 'rb') as f:
        y_test = pickle.load(f)
    
    # Creating a copy of labels
    y_test_1 = y_test.copy()
    
    # Encode the labels: 'defective' becomes 0, every other label ('good') becomes 1
    for i in range(len(y_test)):
        if y_test[i]=='defective' : 
            y_test[i] = 0
        else:
            y_test[i] = 1
    
    # Convert the encoded labels to an integer dtype
    y_test=y_test.astype(int)

    # Train accuracy
    train_accuracy = xgb_model.score(X_train,y_train)
    print(f"Training accuracy: {train_accuracy: .4f}")
    
    # Test accuracy
    test_accuracy = xgb_model.score(X_test,y_test)
    print(f"Testing accuracy: {test_accuracy: .4f}")

    # Save the trained model; pickle.dump needs an open binary file, not a path
    model_save_path = os.path.join(args.model_dir,model_file_name)
    with open(model_save_path, 'wb') as f:
        pickle.dump(xgb_model, f)
    print(f"Model saved at path: {model_save_path}")

if __name__=="__main__":
    main()

Writing train.py
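
The script above reads only `--model_dir`. In SageMaker script mode, hyperparameters set on the estimator reach the entry point as command-line arguments, so a script that wants to use them needs matching `argparse` options. The sketch below shows one way this could look. The chosen subset of options is an assumption for illustration, and the defaults shown are XGBoost's documented defaults for those parameters.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the model directory and a few XGBoost hyperparameters."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--eta", type=float, default=0.3)
    parser.add_argument("--subsample", type=float, default=1.0)
    # parse_known_args tolerates options this script does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulated command line, as a training job might pass it
args = parse_args(["--max_depth", "5", "--eta", "0.2",
                   "--subsample", "0.8", "--silent", "0"])
print(args.max_depth, args.eta, args.subsample)  # 5 0.2 0.8
```

The parsed values could then be forwarded to the model, for example `xgb.XGBClassifier(max_depth=args.max_depth, learning_rate=args.eta, subsample=args.subsample)`.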


In [16]:
%%writefile requirments.txt
xgboost
fsspec
s3fs

Writing requirments.txt


In [17]:
# Train with the SageMaker XGBoost framework estimator (script mode)

from sagemaker import get_execution_role
from sagemaker.xgboost.estimator import XGBoost

xgb_estimator = XGBoost(
    role=get_execution_role(),
    base_job_name="xgb-pipeline-run",
    entry_point="train.py",
    framework_version='1.7-1',
    dependencies=['requirments.txt'],
    instance_count=1,
    instance_type='ml.m5.xlarge',
    # Passed to train.py as command-line arguments
    hyperparameters={
        'objective': 'binary:logistic',
        'max_depth': '5',
        'eta': '0.2',
        'gamma': '4',
        'min_child_weight': '6',
        'subsample': '0.8',
        'silent': '0',
        'eval_metric': 'auc',
    },
    use_spot_instances=True,
    max_wait=600,  # seconds; must be at least max_run when using spot instances
    max_run=600,   # seconds
)

xgb_estimator.fit()

INFO:sagemaker:Creating training-job with name: xgb-pipeline-run-2024-05-22-21-06-36-214


2024-05-22 21:06:36 Starting - Starting the training job...
2024-05-22 21:06:53 Starting - Preparing the instances for training...
2024-05-22 21:07:22 Downloading - Downloading input data...
2024-05-22 21:07:42 Downloading - Downloading the training image...
2024-05-22 21:08:23 Training - Training image download completed. Training in progress..
[2024-05-22 21:08:33.954 ip-10-0-159-108.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2024-05-22 21:08:33.975 ip-10-0-159-108.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2024-05-22:21:08:34:INFO] Imported framework sagemaker_xgboost_container.training
[2024-05-22:21:08:34:INFO] No GPUs detected (normal if no gpus installed)
[2024-05-22:21:08:34:INFO] Invoking user training script.
[2024-05-22:21:08:34:INFO] Module train does not provide a setup.py.
Generating setup.py
[2024-05-22:21:08:34:INFO] Generating setup.cfg

In [20]:
import boto3
sm_client = boto3.client("sagemaker")
# latest_training_job is a job object; its .name attribute holds the job name string
training_job_name = xgb_estimator.latest_training_job.name
# model_artifact = sm_client.describe_training_job(TrainingJobName=training_job_name)["ModelArtifacts"]["S3ModelArtifacts"]
print(f"Training Job Name: {training_job_name}")
# print(f"Model storage location: {model_artifact}")

Training Job Name: xgb-pipeline-run-2024-05-22-21-06-36-214
