# XGBoost

XGBoost, short for eXtreme Gradient Boosting, is a popular machine learning library that implements the gradient boosting framework. It is designed to be highly efficient, scalable, and flexible, making it one of the most widely used algorithms for supervised learning tasks, particularly in structured/tabular data settings.

### What is XGBoost?
XGBoost belongs to the ensemble learning family, specifically boosting algorithms. Boosting is a sequential technique in which each new model is trained to correct the errors of the models before it. In gradient boosting, each new tree is fit to the gradient of the loss of the current ensemble (for squared error, the residuals), rather than to reweighted misclassified points as in AdaBoost. XGBoost extends traditional gradient boosting with explicit regularization and parallelized tree construction, which typically improves both generalization and training speed.
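
To make the sequential idea concrete, here is a minimal sketch of gradient boosting for squared error, written with scikit-learn decision trees rather than XGBoost itself. The synthetic data, the tree depth, and the variable names are assumptions chosen for illustration; XGBoost's actual training procedure adds regularization and second-order gradient information on top of this.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data, not the Melbourne dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_per_round = []

for _ in range(50):
    residuals = y - prediction  # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add a shrunken correction
    mse_per_round.append(float(np.mean((y - prediction) ** 2)))

print(f"Training MSE after 1 round:   {mse_per_round[0]:.4f}")
print(f"Training MSE after 50 rounds: {mse_per_round[-1]:.4f}")
```

Each tree only sees what the previous trees got wrong, which is why the training error keeps falling as rounds are added.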

### Why is XGBoost Used?
1. <b>Performance</b>: XGBoost often outperforms other algorithms in terms of predictive accuracy. It is particularly effective for structured/tabular data where there are a large number of features.
2. <b>Scalability</b>: XGBoost is highly scalable and can handle large datasets efficiently. It supports parallel and distributed computing, making it suitable for big data applications.
3. <b>Flexibility</b>: It supports a variety of objective functions and evaluation metrics, allowing users to customize the model according to their specific needs.
4. <b>Robustness</b>: XGBoost includes built-in regularization techniques to prevent overfitting, such as L1 and L2 regularization, which improve model generalization.

### Advantages of XGBoost:
1. <b>Highly Accurate</b>: XGBoost has often produced top results in machine learning competitions and on real-world tabular datasets.
2. <b>Handles Missing Values</b>: XGBoost handles missing data natively by learning, at each split, a default direction for missing values, which reduces the need for imputation.
3. <b>Feature Importance</b>: It provides feature importance scores, allowing users to interpret the model and identify important predictors.
4. <b>Wide Range of Applications</b>: XGBoost can be applied to various machine learning tasks, including classification, regression, ranking, and recommendation systems.

### Disadvantages of XGBoost:
1. <b>Complexity</b>: The algorithm can be more complex to understand and tune compared to simpler models like linear regression or decision trees.
2. <b>Computationally Intensive</b>: Training XGBoost models can be computationally intensive, especially for large datasets or complex models.
3. <b>Hyperparameter Tuning</b>: While XGBoost provides many hyperparameters for tuning, finding the optimal combination can require extensive experimentation.

# XGBoost Algorithm Implementation

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

## Disable Warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

## Loading Data

In [None]:
data = pd.read_csv("/kaggle/input/melbourne-house-dataset/melb_house_data.csv")

In [None]:
data

## Data Statistics

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# Count missing values per column (isnull() is an alias of isna())
data.isna().sum()

In [None]:
# Drop every row that contains at least one missing value
data = data.dropna(axis=0)

In [None]:
data.columns

## Data Splitting

In [None]:
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
x = data[cols_to_use]

In [None]:
y = data.Price

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(x,
                                               y,
                                               test_size = 0.1)

## Model Initialization

In [None]:
xgb1 = XGBRegressor()

In [None]:
xgb1.fit(xtrain, ytrain)
# R^2 on the training data; this is optimistic and does not measure generalization
xgb1.score(xtrain, ytrain)

In [None]:
predictions = xgb1.predict(xtest)
# scikit-learn metrics take the true values first, then the predictions
print("Mean Absolute Error: " + str(mean_absolute_error(ytest, predictions)))
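
For reference, mean absolute error is the average of the absolute differences between true and predicted values, in the same units as the target (here, price). A small worked example with made-up numbers, compared against scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = float(np.mean(np.abs(y_true - y_pred)))
sklearn_mae = float(mean_absolute_error(y_true, y_pred))

print(f"Manual MAE:  {manual_mae:.4f}")
print(f"sklearn MAE: {sklearn_mae:.4f}")
```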

## Parameter Tuning

XGBoost has a few parameters that can drastically affect the accuracy and speed of training. The first ones to understand are:

### n_estimators

   - <b>Definition</b>: `n_estimators` is the number of boosting rounds, that is, the number of decision trees built in sequence.
   - <b>Purpose</b>: Adding estimators gives the boosting process more iterations, so the fit to the training data improves. Performance on unseen data usually improves as well, but only up to a point.
   - <b>Effect</b>: Too many estimators increases the risk of overfitting, especially if the dataset is small or noisy, and training time grows with every added tree. The goal is a value large enough to capture the signal without fitting the noise.

In [None]:
xgb2 = XGBRegressor(n_estimators=500)
xgb2.fit(xtrain, ytrain)
xgb2.score(xtrain, ytrain)

### early_stopping_rounds

   - `early_stopping_rounds` sets how many consecutive boosting rounds may pass without improvement in the validation metric before training stops. It is used to prevent overfitting and to avoid wasted training time.
   - <b>Purpose</b>: Instead of training for a fixed number of iterations, early stopping lets the model determine the number of boosting rounds from the validation performance.
   - <b>Effect</b>: This helps prevent overfitting and saves computation by skipping unnecessary iterations. It requires an evaluation set (`eval_set`); a common pattern is to set `n_estimators` high and let early stopping choose the actual number of rounds.

In [None]:
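# In XGBoost >= 1.6, early_stopping_rounds is set on the estimator (it was removed from fit() in 2.0)
xgb3 = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
# Note: early stopping on the test set makes test metrics optimistic; a separate validation split is preferable
xgb3.fit(xtrain, ytrain,
         eval_set=[(xtest, ytest)],
         verbose=False)
xgb3.score(xtrain, ytrain)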

### learning_rate

   - `learning_rate` (also known as shrinkage or eta) scales the contribution of each new tree before it is added to the ensemble, so it controls how large a correction each boosting round makes.
   - <b>Purpose</b>: A lower learning rate makes the model training more conservative by taking smaller steps towards the optimal solution, which helps in achieving better generalization and robustness.
   - <b>Effect</b>: However, a very low learning rate requires more boosting rounds to converge, increasing the computational cost. On the other hand, a higher learning rate may lead to faster convergence but can also cause overshooting and instability in training.

In [None]:
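# In XGBoost >= 1.6, early_stopping_rounds is set on the estimator (it was removed from fit() in 2.0)
xgb4 = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
xgb4.fit(xtrain, ytrain,
         eval_set=[(xtest, ytest)],
         verbose=False)
xgb4.score(xtrain, ytrain)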

### n_jobs

   - `n_jobs` specifies the number of parallel jobs to run during model training and prediction.
   - <b>Purpose</b>: Setting `n_jobs` to a value greater than 1 enables parallelism, allowing XGBoost to utilize multiple CPU cores for faster computation.
   - <b>Effect</b>: Increasing `n_jobs` can reduce training and prediction time, especially for large datasets and complex models. The benefit levels off once `n_jobs` reaches the number of available CPU cores, and on small datasets the speed-up may be negligible. More threads may also use somewhat more memory, so choose a value that fits the available resources.

In [None]:
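# In XGBoost >= 1.6, early_stopping_rounds is set on the estimator (it was removed from fit() in 2.0)
xgb5 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                    early_stopping_rounds=5)
xgb5.fit(xtrain, ytrain,
         eval_set=[(xtest, ytest)],
         verbose=False)
xgb5.score(xtrain, ytrain)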