### LightGBM (Light Gradient Boosting Machine)
- What is LightGBM?

`LightGBM` is an open-source, high-performance gradient boosting framework developed by Microsoft. It grows decision trees leaf-wise (always splitting the leaf with the largest loss reduction) rather than level-wise, which often reaches a lower loss for the same number of leaves but can overfit small datasets unless `num_leaves` or `max_depth` is constrained. It uses histogram-based split finding, together with Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to speed up training on large datasets. LightGBM supports regression, classification, and ranking tasks.
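To make the histogram idea concrete, here is a minimal pure-Python sketch of split finding on a single feature. It is not LightGBM's implementation: the function name `best_histogram_split`, the equal-width binning, and the simplified gain formula (gradient sums only, no Hessians or regularization) are all assumptions made for illustration. The point is that candidate thresholds are bin boundaries, so the scan costs one pass over the bins rather than one pass per distinct feature value.

```python
from typing import List, Tuple


def best_histogram_split(
    values: List[float],
    gradients: List[float],
    num_bins: int = 4,
) -> Tuple[float, float]:
    """Return (threshold, gain) of the best split over equal-width bins.

    Gain is the simplified score
    G_left^2 / n_left + G_right^2 / n_right - G_total^2 / n_total,
    an illustrative assumption rather than LightGBM's exact formula.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 for constant features

    # Pass 1: accumulate gradient sums and counts per bin (the "histogram").
    grad_sum = [0.0] * num_bins
    count = [0] * num_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), num_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), len(values)
    parent_score = total_g ** 2 / total_n

    # Pass 2: scan bin boundaries instead of every distinct feature value.
    best_threshold, best_gain = lo, 0.0
    left_g, left_n = 0.0, 0
    for b in range(num_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_g = total_g - left_g
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - parent_score
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


# Toy data: gradients flip sign between small and large feature values.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
threshold, gain = best_histogram_split(values, gradients)
print(threshold, gain)
```

On this toy input the best boundary separates the negative-gradient rows from the positive-gradient rows, which is the split an exact search would also choose.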

**Key features:**

  - Leaf-wise (best-first) tree growth, which often improves accuracy for a given number of leaves

  - Histogram-based splits for speed and memory efficiency

  - GOSS to focus training on samples with large gradients

  - EFB to reduce the effective number of features by bundling mutually exclusive (rarely simultaneously non-zero) features

  - Supports GPU acceleration
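The GOSS idea from the list above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not LightGBM's code: the function `goss_sample` is hypothetical, and the argument names `top_rate` and `other_rate` are chosen to mirror the fractions of large-gradient and small-gradient rows that are kept. Rows with the largest absolute gradients are always retained, a random subset of the remaining rows is drawn, and the drawn rows are up-weighted by `(1 - top_rate) / other_rate` so the small-gradient population is not under-represented.

```python
import random
from typing import List, Tuple


def goss_sample(
    gradients: List[float],
    top_rate: float = 0.2,
    other_rate: float = 0.1,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (row_index, weight) pairs chosen by a GOSS-style rule."""
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]

    # Keep every large-gradient row; draw a random subset of the rest.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient rows to compensate for the
    # rows that were dropped from the small-gradient group.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


gradients = [0.1 * i for i in range(10)]
picked = goss_sample(gradients)
print(picked)
```

With ten rows and the default rates, two large-gradient rows are kept at weight 1 and one randomly chosen small-gradient row is kept at weight 8.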

Example usage (regression with Python LightGBM):

In [3]:
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Create LightGBM dataset format
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Set parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Train model
num_rounds = 100
model = lgb.train(params, train_data, num_rounds, valid_sets=[test_data])

# Predict and evaluate
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
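mse = mean_squared_error(y_test, y_pred)  # returns the mean squared error
rmse = mse ** 0.5                         # RMSE is its square root
print(f'MSE: {mse}')
print(f'RMSE: {rmse:.4f}')


MSE: 0.23285525605518215
RMSE: 0.4826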
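A note on the metric: RMSE is the square root of the mean squared error, and scikit-learn's `mean_squared_error` returns the MSE by default, so the square root has to be taken explicitly. The relationship can be checked by hand with a small standard-library sketch; the helper name `rmse` and the numbers below are made up for illustration.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: the square root of the mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
# Squared errors are 1, 0, 4, 1, so the MSE is 6 / 4 = 1.5.
result = rmse(y_true, y_pred)
print(result)
```

Because RMSE is in the same units as the target, it is usually easier to interpret than MSE.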


### CatBoost (Categorical Boosting)
- What is CatBoost?

`CatBoost` is a gradient boosting framework developed by Yandex that handles categorical features natively, without preprocessing such as one-hot encoding. It uses ordered boosting, which reduces the prediction shift caused by target leakage and thereby limits overfitting, and it encodes categorical variables automatically, which makes it well suited to datasets that mix categorical and numerical features. CatBoost is competitive in speed and accuracy and supports GPU acceleration.

**Key features:**

  - Native categorical feature support

  - Ordered boosting to reduce overfitting

  - Built-in handling of missing values in numerical features

  - Good results with default hyperparameters, so little tuning is usually needed

  - Supports regression, binary and multi-class classification, and ranking
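The "ordered" principle behind CatBoost's categorical encoding can be illustrated with a simplified sketch of ordered target statistics. This is an assumption-laden illustration, not CatBoost's implementation: the function `ordered_target_stats`, its `prior` and `prior_weight` arguments, and the use of a single random permutation are all hypothetical simplifications. Each row's category is replaced by a smoothed mean of the targets of rows that come earlier in a random order, so a row's own label never leaks into its encoding.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def ordered_target_stats(
    categories: Sequence[str],
    targets: Sequence[float],
    prior: float = 0.5,
    prior_weight: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Encode each row using only rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    target_sum: Dict[str, float] = defaultdict(float)
    count: Dict[str, int] = defaultdict(int)
    encoded = [0.0] * len(categories)

    for i in order:
        cat = categories[i]
        # Statistics come from the "history" only; the row's own target
        # is added to the running totals after its encoding is computed.
        encoded[i] = (target_sum[cat] + prior_weight * prior) / (
            count[cat] + prior_weight
        )
        target_sum[cat] += targets[i]
        count[cat] += 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1.0, 0.0, 1.0, 0.0, 1.0]
encoded = ordered_target_stats(categories, targets)
print(encoded)
```

The first row seen for each category has no history, so it receives the prior; later rows move toward the running mean of their category's targets.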

Example usage (classification with Python CatBoost):

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset (example: Adult dataset with categorical features)
data = fetch_openml(name='adult', version=2, as_frame=True)
X = data.data.copy()  # work on a copy so the assignments below do not trigger SettingWithCopyWarning
y = data.target

# Specify categorical features by column name (categorical columns of the Adult dataset)
cat_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

# Fill NaN values in specified categorical columns with a string placeholder
for col in cat_features:
    if X[col].isnull().any():
        X[col] = X[col].astype('object').fillna('Unknown')

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost model
model = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6, verbose=100)

# Train model with categorical features specified
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


0:	learn: 0.6447736	test: 0.6439227	best: 0.6439227 (0)	total: 117ms	remaining: 1m 57s
100:	learn: 0.2907311	test: 0.2855474	best: 0.2855474 (100)	total: 6.37s	remaining: 56.7s
200:	learn: 0.2774430	test: 0.2757754	best: 0.2757754 (200)	total: 12.5s	remaining: 49.6s
300:	learn: 0.2689627	test: 0.2705039	best: 0.2704879 (299)	total: 18.5s	remaining: 42.9s
400:	learn: 0.2645073	test: 0.2687909	best: 0.2687909 (400)	total: 25.3s	remaining: 37.9s
500:	learn: 0.2607042	test: 0.2679836	best: 0.2679836 (500)	total: 33.9s	remaining: 33.8s
600:	learn: 0.2570087	test: 0.2671981	best: 0.2671981 (600)	total: 43.1s	remaining: 28.6s
700:	learn: 0.2538711	test: 0.2670646	best: 0.2669976 (666)	total: 50.9s	remaining: 21.7s
800:	learn: 0.2509253	test: 0.2665801	best: 0.2665433 (788)	total: 59.4s	remaining: 14.7s
900:	learn: 0.2482663	test: 0.2664843	best: 0.2664843 (900)	total: 1m 7s	remaining: 7.38s
999:	learn: 0.2456063	test: 0.2661571	best: 0.2661421 (995)	total: 1m 14s	remaining: 0us

bestTest = 0.