# What is LightGBM?

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms. It is designed for distributed and efficient training, making it suitable for large datasets.
 
### Pros:
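 - High efficiency and speed: LightGBM is often faster than other gradient boosting implementations because it buckets continuous feature values into histogram bins before searching for splits, which reduces both computation and memory use.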
 - Scalability: It can handle large datasets and is capable of training on distributed systems.
 - Support for categorical features: LightGBM can split directly on categorical features without one-hot encoding, provided they are integer-coded or stored with a pandas `category` dtype.
 - Flexibility: It exposes many hyperparameters (for example `num_leaves`, `learning_rate`, and `feature_fraction`) that can be tuned to trade off accuracy, training speed, and overfitting.
 
### Cons:
 - Complexity: The model can be complex to tune, requiring careful selection of hyperparameters.
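 - Overfitting: Its leaf-wise tree growth can overfit small datasets unless it is constrained, for example through `num_leaves`, `max_depth`, or `min_data_in_leaf`.
 - Less interpretability: Like many ensemble methods, the model is harder to interpret than simpler models such as a single decision tree or a linear model.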


In [1]:
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Wrap the training data in LightGBM's Dataset format.
# The test set is passed to predict() as a plain array, so it needs no Dataset.
train_data = lgb.Dataset(X_train, label=y_train)

# Set parameters for LightGBM
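params = {
    'objective': 'binary',          # binary classification
    'metric': 'binary_logloss',     # metric reported on validation sets
    'boosting_type': 'gbdt',        # standard gradient-boosted decision trees
    'num_leaves': 31,               # maximum leaves per tree (the library default)
    'learning_rate': 0.05,          # shrinkage applied to each tree's contribution
    'feature_fraction': 0.9,        # fraction of features sampled for each tree
}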

# Train the model
model = lgb.train(params, train_data, num_boost_round=100)

# Make predictions: predict() returns the probability of the positive class,
# so threshold at 0.5 to obtain class labels
y_pred = model.predict(X_test)
y_pred_binary = (y_pred >= 0.5).astype(int)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_binary)
print(f'Accuracy: {accuracy:.2f}')


[LightGBM] [Info] Number of positive: 405, number of negative: 395
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001802 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.506250 -> initscore=0.025001
[LightGBM] [Info] Start training from score 0.025001
Accuracy: 0.93
