## 1. Model Selection:
Choose Algorithm: Select an appropriate machine learning algorithm based on the nature of the problem (classification, regression, clustering, etc.), the size of the dataset, and other factors.

## 2. Model Building:
Instantiate Model: Create an instance of the chosen machine learning algorithm.

Fit Model: Train the model on the training data by calling the fit() method. During training, the model learns the patterns and relationships present in the data.

## XGBoost (Extreme Gradient Boosting) 

XGBoost is an advanced implementation of the gradient boosting algorithm, designed for efficiency, flexibility, and scalability. It is widely used in machine learning competitions and real-world applications because of its strong predictive performance and robustness. The sections below describe it in more detail:

## 1. Gradient Boosting Algorithm:
Boosting Ensemble Method: XGBoost belongs to the family of boosting ensemble methods, where multiple weak learners (usually decision trees) are trained sequentially, and each subsequent model corrects the errors made by the previous models.

Gradient Boosting: XGBoost employs the gradient boosting framework, which optimizes a differentiable loss function by iteratively fitting weak learners to the negative gradient of the loss function.
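
The idea of fitting weak learners to the negative gradient can be sketched in a few lines of plain Python. This is a simplified illustration, not XGBoost's actual implementation: it assumes squared loss (where the negative gradient is simply the residual), one-dimensional inputs, and one-split decision stumps. The helper names `fit_stump`, `boost`, and `mse`, as well as the toy data, are made up for this example.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Try every distinct value except the largest, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Run gradient boosting and return the training MSE after each round."""
    base = sum(ys) / len(ys)          # round 0: constant prediction
    preds = [base] * len(ys)
    history = [mse(ys, preds)]
    for _ in range(n_rounds):
        # For squared loss (y - f)^2 / 2, the negative gradient is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Add a shrunken copy of the new weak learner to the ensemble.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return history


# Toy data (hypothetical): y = x^2 on the integers 0..9
xs = list(range(10))
ys = [x ** 2 for x in xs]
history = boost(xs, ys)
```

Each round lowers the training error a little, which is the "each subsequent model corrects the errors of the previous ones" behaviour described above.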

## 2. Key Features of XGBoost:
Tree Ensemble Method: XGBoost builds an ensemble of decision trees, known as a gradient boosted decision tree (GBDT), to make predictions. Each tree is added sequentially to the ensemble, and subsequent trees learn from the residuals (errors) of the previous trees.

Regularization Techniques: XGBoost integrates several regularization techniques to prevent overfitting, including L1 (Lasso) and L2 (Ridge) penalties on leaf weights, a minimum loss reduction required to keep a split (`gamma`), and a limit on tree depth (`max_depth`).
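
To make the effect of these penalties concrete, the sketch below computes the optimal leaf weight and the gain of a candidate split from summed gradients `G` and Hessians `H`, following the formulas in the XGBoost paper (Chen and Guestrin, 2016). The function names are made up for this example, and the L1 term is applied as a soft threshold on `G`, which should be read as an illustration rather than a line-by-line copy of the library's code.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight given the summed gradient G and Hessian H."""
    # L1 penalty: shrink G towards zero by reg_alpha (soft thresholding).
    if G > reg_alpha:
        g = G - reg_alpha
    elif G < -reg_alpha:
        g = G + reg_alpha
    else:
        g = 0.0
    # L2 penalty: reg_lambda in the denominator shrinks the weight.
    return -g / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split; the split is only worth keeping if > 0."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma


w_l2_only = leaf_weight(-4.0, 3.0, reg_lambda=1.0)                    # 1.0
w_with_l1 = leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=2.0)     # 0.5
gain_no_penalty = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0)     # 8.0
gain_with_gamma = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=10.0)  # -2.0
```

A larger `reg_lambda` or `reg_alpha` pulls leaf weights towards zero, and a larger `gamma` turns otherwise useful splits into negative-gain splits that are pruned.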

Customizable Loss Functions: XGBoost supports customizable loss functions for both regression and classification tasks, allowing users to define their own objectives or use predefined objectives like logistic loss, squared loss, etc.
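
In practice, a custom objective supplies the first and second derivatives (gradient and Hessian) of the loss with respect to the raw prediction for every example. The sketch below does this for the logistic loss in plain Python so the arithmetic is easy to follow. The function name is made up for this example, and a function passed to XGBoost's native `xgb.train(..., obj=...)` would take `(preds, dtrain)` and return arrays rather than lists.

```python
import math


def logistic_grad_hess(raw_scores, labels):
    """Gradient and Hessian of the logistic loss w.r.t. the raw scores."""
    grads, hessians = [], []
    for score, label in zip(raw_scores, labels):
        prob = 1.0 / (1.0 + math.exp(-score))   # sigmoid of the raw score
        grads.append(prob - label)              # first derivative
        hessians.append(prob * (1.0 - prob))    # second derivative
    return grads, hessians


# A raw score of 0 means probability 0.5; for a positive example the
# gradient is negative, so the next tree pushes the score upwards.
grads, hessians = logistic_grad_hess([0.0], [1])
```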

Parallel and Distributed Computing: XGBoost is highly optimized for parallel and distributed computing, leveraging multiple CPU cores and supporting distributed computing frameworks like Apache Hadoop and Apache Spark.

Optimized Tree Construction: XGBoost employs a number of optimization techniques to speed up tree construction, including approximate tree learning, column block for parallelization, and out-of-core computing for handling large datasets.

## 3. Advantages of XGBoost:
High Performance: XGBoost is known for its high prediction accuracy and efficiency, making it suitable for both small and large-scale datasets.

Flexibility: XGBoost can handle various types of data and tasks, including classification, regression, and ranking, and supports custom loss functions and evaluation metrics.

Feature Importance: XGBoost provides built-in feature importance scores, which help in feature selection and understanding the relative importance of input features in making predictions.

Robustness: XGBoost's regularization helps limit overfitting on noisy data, and it handles missing values natively by learning a default split direction for them during tree construction.

## 4. Limitations of XGBoost:
Parameter Tuning: XGBoost requires careful parameter tuning, especially for hyperparameters like learning rate, tree depth, and regularization parameters, to achieve optimal performance.

Computationally Intensive: Training an XGBoost model can be computationally intensive, especially for large datasets or deep trees, requiring substantial computational resources.

Interpretability: While XGBoost provides feature importance scores, the resulting models may not be as interpretable as simpler models like decision trees or linear models.

Overall, XGBoost is a powerful and versatile algorithm that excels in a wide range of machine learning tasks. With its robustness, efficiency, and flexibility, XGBoost has become a popular choice for both practitioners and researchers in the field of machine learning and data science.

In [1]:
# Import required libraries (details are available in the README.md file)
import xgboost as xgb
import os
import matplotlib.pyplot as plt
import random
import cv2
import pickle


In [2]:
# Get the current working directory
current_dir = os.getcwd()

# Move up one level
current_dir = os.path.dirname(current_dir)

# Move up a second level
current_dir = os.path.dirname(current_dir)

# Move up a third level to reach the project root
parent_dir = os.path.dirname(current_dir)

# Print the project root directory
print("Parent Directory:", parent_dir)

Parent Directory: E:\upgrade_capston_project-main
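
The three `os.path.dirname` calls above can also be written with `pathlib`, where `parents[2]` is the directory three levels up. The sketch below checks that equivalence on a made-up POSIX path; the helper name `ancestor` and the example path are hypothetical. In the notebook itself, the equivalent would be `Path.cwd().parents[2]`, assuming the working directory is at least three levels deep.

```python
import posixpath
from pathlib import PurePosixPath


def ancestor(path, levels):
    """Return the directory `levels` steps above `path`."""
    return PurePosixPath(path).parents[levels - 1]


# Hypothetical notebook location, used only to illustrate the indexing
example = "/home/user/project/notebooks/models/xgboost"
root = ancestor(example, 3)

# The same result using repeated dirname calls
via_dirname = posixpath.dirname(posixpath.dirname(posixpath.dirname(example)))
```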


In [3]:
# Build the path to the preprocessed dataset in a platform-independent way
preprocessed_data_dir = os.path.join(parent_dir, 'datasets', 'processed_dataset')

In [4]:
#Load the preprocessed data
with open(os.path.join(preprocessed_data_dir,'X_train.pkl'), 'rb') as f:
    X_train = pickle.load(f)

# Load y_train from file
with open(os.path.join(preprocessed_data_dir,'y_train.pkl'), 'rb') as f:
    y_train = pickle.load(f)


In [5]:
# Define a function to train an XGBoost classifier
def train_xgb_model(X_train, y_train):
    # Use the default XGBClassifier hyperparameters
    xgb_classifier = xgb.XGBClassifier()
    xgb_classifier.fit(X_train, y_train)
    return xgb_classifier

In [6]:
# Encode the labels: 'defective' becomes 0, every other label (i.e. good) becomes 1
for i in range(len(y_train)):
    if y_train[i] == 'defective':
        y_train[i] = 0
    else:
        y_train[i] = 1
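
The loop above maps every label other than `'defective'` to 1, so a misspelled or unexpected label would silently be treated as good. A stricter variant, sketched below, uses an explicit mapping and raises an error on anything unknown. The names `LABEL_TO_INT` and `encode_labels` are made up for this example, and the label string `'good'` is assumed from the comment in the cell above.

```python
# Explicit mapping from label strings to integer classes
LABEL_TO_INT = {"defective": 0, "good": 1}


def encode_labels(labels):
    """Return integer-encoded labels, failing loudly on unknown values."""
    unknown = sorted({label for label in labels if label not in LABEL_TO_INT})
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_INT[label] for label in labels]


encoded = encode_labels(["good", "defective", "good"])
```

This version returns a new list instead of modifying the input in place, which also avoids surprises when the labels are stored in a fixed-type array.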

In [7]:
# Train the XGBoost classifier
xgb_classifier = train_xgb_model(X_train, y_train)

In [8]:
# Build the path to the directory where trained models are stored
model_dir = os.path.join(parent_dir, 'models')

In [9]:
# Save the trained model to a file
with open(os.path.join(model_dir,'xgbClassifier.pkl'), 'wb') as f:
    pickle.dump(xgb_classifier, f)
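
Writing the file fails if the target directory does not exist yet, and a saved model is only useful if it can be loaded back. The sketch below wraps both concerns in a small helper. The helper name and the stand-in dictionary are hypothetical; in the notebook the object would be `xgb_classifier` and the path would be built from `model_dir`.

```python
import os
import pickle
import tempfile


def save_and_reload(obj, path):
    """Pickle `obj` to `path`, creating the directory if needed, then load it back."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object (hypothetical) so the example runs without a trained model
stand_in_model = {"name": "xgbClassifier", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "models", "xgbClassifier.pkl")
    restored = save_and_reload(stand_in_model, target)
```

XGBoost models also provide a `save_model` method, which may be worth considering if the model needs to be loaded with a different library version later.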