
### Gradient Boosting Machine (GBM)

`GBM` is an ensemble machine learning technique that builds models sequentially. Each new model is fit to the residual errors (more generally, the negative gradient of the loss) of the ensemble built so far, so it corrects the mistakes of the models before it.

- Key features:
  - Combines weak learners (typically decision trees) sequentially
  - Optimizes a loss function using gradient descent
  - Effective for regression and classification tasks
  - Often outperforms a single decision tree and is frequently competitive with or better than random forests on tabular data
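
The sequential idea in the list above can be sketched by hand for squared-error loss, where the negative gradient is simply the residual. This is a minimal illustration on synthetic data, not scikit-learn's internal implementation; the names `X_demo`, `y_demo`, `learning_rate`, and `n_rounds` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption: illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction; the mean minimizes squared error
prediction = np.full_like(y_demo, y_demo.mean())
trees = []
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For the loss 0.5 * (y - f)^2 the negative gradient is the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each round shrinks the training error a little; the learning rate controls how much of each tree's correction is applied.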

Below is a GBM implementation using scikit-learn's `GradientBoostingRegressor`.

1. Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

2. Load the dataset

In [2]:
data = pd.read_csv('/content/preprocessed_earthquake_data (2).csv')
data.head()

Unnamed: 0,Latitude,Longitude,Type,Depth,Magnitude,Magnitude Type,Root Mean Square,Source,Status,Year,...,Source_ISCGEM,Source_ISCGEMSUP,Source_NC,Source_NN,Source_OFFICIAL,Source_PR,Source_SE,Source_US,Source_UW,Status_Reviewed
0,0.583377,0.844368,Earthquake,0.495984,0.277668,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.006109,0.698849,Earthquake,0.075272,-0.195082,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.739162,-1.701962,Earthquake,-0.413928,0.750418,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-2.017599,-0.503524,Earthquake,-0.454694,-0.195082,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.340688,0.691479,Earthquake,-0.454694,-0.195082,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
# Define the target and features
target = 'Magnitude'
categorical_cols = ['Type', 'Magnitude Type', 'Source', 'Status']

X = data.drop(columns=[target] + categorical_cols)
y = data[target]

In [4]:
# Train-test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

Unnamed: 0,Latitude,Longitude,Depth,Root Mean Square,Year,Day,Month_sin,Month_cos,Hour_sin,Hour_cos,...,Source_ISCGEM,Source_ISCGEMSUP,Source_NC,Source_NN,Source_OFFICIAL,Source_PR,Source_SE,Source_US,Source_UW,Status_Reviewed
16953,-0.078008,0.755211,-0.495461,1.003024,0.787991,-1.116502,-1.215716,0.705254,-0.713894,1.234637,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
15800,-0.471466,1.014252,1.194724,-1.087719,0.649349,0.844142,1.218537,-0.717954,-1.006043,-0.995938,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
9014,0.213764,-0.62189,-0.495461,0.511085,-0.321143,0.498146,0.704119,-1.238884,-0.713894,-1.221272,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
15516,-1.274852,0.310949,-0.495461,-0.534287,0.580028,0.036818,-1.215716,-0.717954,-1.419204,0.006682,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
17837,0.01046,0.456978,-0.405774,-0.718764,0.926632,1.074806,1.218537,-0.717954,-1.419204,0.006682,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


3. Initialize the GBM model

In [5]:
gbm = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

4. Train the model

In [6]:
gbm.fit(X_train, y_train)
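
Once a model is fitted, `staged_predict` can show how the validation error changes as trees are added, which helps judge whether `n_estimators` is too low or too high. The sketch below uses synthetic data from `make_regression` rather than the earthquake file, and the variable names (`X_syn`, `model_syn`, `best_n_trees`) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption)
X_syn, y_syn = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

model_syn = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
model_syn.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
val_mse = [mean_squared_error(y_val, pred) for pred in model_syn.staged_predict(X_val)]
best_n_trees = int(np.argmin(val_mse)) + 1
print(f"Lowest validation MSE {min(val_mse):.2f} at {best_n_trees} trees")
```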

5. Predict on test data

In [7]:
y_pred = gbm.predict(X_test)

6. Evaluate the performance

In [8]:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")


Mean Squared Error: 0.8940879959182769
Mean Absolute Error: 0.681031579768013
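
Error values like these are easier to interpret next to a trivial baseline. If `Magnitude` was standardized during preprocessing (the scaled values in the table suggest this, but it is an assumption), a predictor that always returns the training mean would score an MSE close to 1. The sketch below shows the comparison on synthetic data; `DummyRegressor` is scikit-learn's mean-only baseline, and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption)
X_syn, y_syn = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Baseline that always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
booster = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
booster_mse = mean_squared_error(y_te, booster.predict(X_te))
# RMSE is the square root of MSE and is in the same units as the target
booster_rmse = np.sqrt(booster_mse)

print(f"Baseline MSE: {baseline_mse:.2f}")
print(f"GBM MSE:      {booster_mse:.2f}")
print(f"GBM RMSE:     {booster_rmse:.2f}")
```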


7. Cross-Validation

In [9]:
# The scorer returns negative MSE per fold, so negate the mean before reporting
cv_scores = cross_val_score(gbm, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Mean cross-validated MSE: {-np.mean(cv_scores):.4f}")

Mean cross-validated MSE: 0.9400
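
With an integer `cv`, `cross_val_score` splits a regression dataset into consecutive, unshuffled folds. If the rows are ordered (the first rows of the earthquake file share the same `Year` value, which hints at this but is not confirmed), shuffled folds may give a more representative estimate. A minimal sketch on synthetic data, also reporting the spread across folds:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real dataset (assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Shuffle rows before splitting so each fold samples the whole dataset
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    X_syn, y_syn, cv=cv, scoring="neg_mean_squared_error",
)
fold_mse = -neg_mse  # flip the sign back to a regular MSE

print("Per-fold MSE:", np.round(fold_mse, 2))
print(f"Mean MSE: {fold_mse.mean():.2f} (std {fold_mse.std():.2f})")
```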


- **Task:** Apply a Gradient Boosting Machine to predict a categorical target variable, evaluate its performance, interpret the results, and note down your observations.
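
One possible starting point for the task is sketched below with `GradientBoostingClassifier` on scikit-learn's built-in breast cancer dataset. The dataset choice and variable names are assumptions for illustration; the same steps would apply to any table with a categorical target.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Built-in binary classification dataset (assumption: stand-in for your own data)
cancer = load_breast_cancer(as_frame=True)
X_c, y_c = cancer.data, cancer.target

# Stratify so both splits keep the same class proportions
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_c, y_c, test_size=0.2, random_state=42, stratify=y_c
)

clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
clf.fit(Xc_train, yc_train)

yc_pred = clf.predict(Xc_test)
clf_accuracy = accuracy_score(yc_test, yc_pred)
print(f"Accuracy: {clf_accuracy:.3f}")
print(classification_report(yc_test, yc_pred, target_names=cancer.target_names))

# Impurity-based importances: a rough guide to which features the trees used most
importances = pd.Series(clf.feature_importances_, index=X_c.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

Accuracy alone can hide class imbalance, so the per-class precision and recall in the report are worth noting in your observations.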

# **LightGBM (Light Gradient Boosting Machine)**

### What is LightGBM?

LightGBM is an open-source gradient boosting framework developed by Microsoft. It grows decision trees leaf-wise rather than level-wise: at each step it splits the leaf with the largest loss reduction, which often reaches a lower loss than level-wise growth for the same number of leaves but can overfit small datasets unless `num_leaves` or `max_depth` is constrained. It finds splits with a histogram-based algorithm and adds two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to speed up training on large datasets. LightGBM supports both classification and regression tasks.

- Key features:
  - Leaf-wise tree growth, which often improves accuracy
  - Histogram-based split finding for speed and memory efficiency
  - GOSS, which keeps the samples with large gradients and subsamples the rest
  - EFB, which reduces the effective number of features by bundling features that are rarely nonzero at the same time
  - GPU acceleration support
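
To make the histogram idea concrete, the sketch below bins one feature into 32 quantile buckets and evaluates candidate splits only at bucket boundaries, using cumulative sums over the per-bin histograms. It is a simplified NumPy illustration for squared error, not LightGBM's actual implementation (which accumulates gradient and Hessian histograms); all names are made up for the example.

```python
import numpy as np

# Synthetic feature whose target jumps at 0.5 (assumption: illustrative data)
rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
target = np.where(feature > 0.5, 2.0, -1.0) + rng.normal(scale=0.3, size=1000)

n_bins = 32
# Interior quantile edges; every value maps to an integer bin in 0..n_bins-1
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, feature)

# Per-bin histograms of the target sum and the sample count
sum_hist = np.bincount(bin_idx, weights=target, minlength=n_bins)
cnt_hist = np.bincount(bin_idx, minlength=n_bins)

# Cumulative sums give the left-side statistics for a split after each bin
left_sum = np.cumsum(sum_hist)[:-1]
left_cnt = np.cumsum(cnt_hist)[:-1]
right_sum = sum_hist.sum() - left_sum
right_cnt = cnt_hist.sum() - left_cnt

# Squared-error gain up to a constant: S_L^2 / n_L + S_R^2 / n_R
gain = left_sum ** 2 / left_cnt + right_sum ** 2 / right_cnt
best_bin = int(np.argmax(gain))
best_threshold = edges[best_bin]
print(f"Best split: feature <= {best_threshold:.3f}")
```

Only 31 candidate thresholds are scored instead of up to 999, which is where the speed and memory savings come from.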

Example usage (regression with Python LightGBM):

In [10]:
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Create LightGBM dataset format
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Set parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Train model
num_rounds = 100
model = lgb.train(params, train_data, num_rounds, valid_sets=[test_data])

# Predict and evaluate
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# mean_squared_error returns the MSE; take its square root if RMSE is needed
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse}')

MSE: 0.23285525605518215


# **CatBoost (Categorical Boosting)**

### What is CatBoost?

CatBoost is a gradient boosting framework developed by Yandex that handles categorical features natively, without preprocessing such as one-hot encoding. It uses ordered boosting, which reduces prediction shift (a form of target leakage) and the overfitting it causes. This makes it well suited to datasets that mix categorical and numerical features. CatBoost is competitive in speed and accuracy and supports GPU acceleration.

- Key features:
  - Native categorical feature support
  - Ordered boosting to reduce overfitting
  - Built-in handling of missing values in numerical features
  - Sensible defaults, so it often needs little hyperparameter tuning
  - Supports regression, binary and multi-class classification, and ranking
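
The idea behind encoding categories without leaking the target can be sketched with an ordered target statistic: each row's category is encoded using only the labels of earlier rows with the same category, smoothed toward a prior. This is a simplified pandas illustration that treats the existing row order as the ordering (CatBoost itself uses random permutations); the `prior` and `smoothing` values and the toy table are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: one categorical feature and a binary label (assumption)
toy = pd.DataFrame({
    "city": ["A", "B", "A", "A", "B", "C"],
    "label": [1, 0, 1, 0, 1, 1],
})

prior = 0.5      # value used when a category has no history yet
smoothing = 1.0  # weight of the prior relative to observed rows

grouped = toy.groupby("city")["label"]
# Sum and count of labels over earlier rows of the same category only
prev_sum = grouped.cumsum() - toy["label"]
prev_cnt = grouped.cumcount()

toy["city_ordered_ts"] = (prev_sum + smoothing * prior) / (prev_cnt + smoothing)
print(toy)
```

Because a row never sees its own label, the encoded value cannot memorize the target, unlike a plain per-category mean computed on the full training set.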

In [12]:
%pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [13]:
from catboost import CatBoostClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset (example: Adult dataset with categorical features)
data = fetch_openml(name='adult', version=2, as_frame=True)
X = data.data.copy()  # work on a copy so the fillna below does not trigger SettingWithCopyWarning
y = data.target

# Specify categorical feature names (categorical columns of the Adult dataset)
cat_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

# Fill NaN values in specified categorical columns with a string placeholder
for col in cat_features:
    if X[col].isnull().any():
        X[col] = X[col].astype('object').fillna('Unknown')

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost model
model = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6, verbose=100)

# Train model with categorical features specified
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


0:	learn: 0.6447736	test: 0.6439227	best: 0.6439227 (0)	total: 276ms	remaining: 4m 35s
100:	learn: 0.2907311	test: 0.2855474	best: 0.2855474 (100)	total: 19.2s	remaining: 2m 50s
200:	learn: 0.2786445	test: 0.2766110	best: 0.2766110 (200)	total: 30.6s	remaining: 2m 1s
300:	learn: 0.2693688	test: 0.2704353	best: 0.2704353 (300)	total: 39.6s	remaining: 1m 31s
400:	learn: 0.2652878	test: 0.2690773	best: 0.2690773 (400)	total: 47.2s	remaining: 1m 10s
500:	learn: 0.2614034	test: 0.2682083	best: 0.2681948 (498)	total: 56.3s	remaining: 56.1s
600:	learn: 0.2577463	test: 0.2673236	best: 0.2673236 (600)	total: 1m 5s	remaining: 43.5s
700:	learn: 0.2547281	test: 0.2670082	best: 0.2670082 (700)	total: 1m 13s	remaining: 31.1s
800:	learn: 0.2516728	test: 0.2667126	best: 0.2666970 (798)	total: 1m 22s	remaining: 20.4s
900:	learn: 0.2489046	test: 0.2665288	best: 0.2665009 (874)	total: 1m 31s	remaining: 10s
999:	learn: 0.2464334	test: 0.2662221	best: 0.2662221 (999)	total: 1m 40s	remaining: 0us

bestTest 