# S4E4: Putting a LightGBM Regressor to Work
After some experimentation, a plain LightGBM regressor with light hyperparameter tuning performed best. This notebook contains only the code needed to generate the submission file.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from lightgbm import LGBMRegressor

In [2]:
# Load the original abalone dataset that the competition data is based on
original_df = pd.read_csv('/kaggle/input/abalone-dataset/abalone.csv')
# Replace spaces in column names with underscores
original_df.columns = [x.replace(' ', '_') for x in original_df.columns]

In [3]:
# Load the dataset
train_df = pd.read_csv('/kaggle/input/playground-series-s4e4/train.csv', index_col='id')
test_df = pd.read_csv('/kaggle/input/playground-series-s4e4/test.csv', index_col='id')

# Match original column names
column_name_mapping = {
    'Whole weight': 'Whole_weight',
    'Whole weight.1': 'Shucked_weight',
    'Whole weight.2': 'Viscera_weight',
    'Shell weight': 'Shell_weight'
}

train_df = train_df.rename(columns=column_name_mapping)
test_df = test_df.rename(columns=column_name_mapping)

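# Augment the training data with the original dataset
# (the combined index is no longer unique, which is fine because it is not used later)
train_df = pd.concat([train_df, original_df])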
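
The rename step matters because `pd.concat` aligns on column names: a column that exists in only one frame is kept and padded with NaN rather than raising an error. A minimal sketch with toy frames (the values and the two-column schema are made up for illustration) shows the difference:

```python
import pandas as pd

# Toy stand-ins for the competition and original data (values are made up)
competition = pd.DataFrame({'Whole weight': [0.5, 0.7], 'Rings': [9, 11]})
original = pd.DataFrame({'Whole_weight': [0.6], 'Rings': [10]})

# Without renaming, the mismatched names produce two half-empty columns
misaligned = pd.concat([competition, original])

# After renaming, the frames stack cleanly
aligned = pd.concat([
    competition.rename(columns={'Whole weight': 'Whole_weight'}),
    original,
])

print(misaligned.isna().sum().sum())  # number of NaN cells introduced
print(aligned.isna().sum().sum())
```

Running `train_df.isna().sum()` after the real concat is a cheap way to confirm that the actual frames lined up.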

In [4]:
# Separate features and target variable
X_train = train_df.drop(columns=['Rings'])  # Features
y_train = train_df['Rings']  # Target variable
X_test = test_df  # Test features

In [5]:
# Define numerical and categorical features
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

In [6]:
# Define preprocessing steps for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())  # Scale numerical features
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

In [7]:
# Combine preprocessing steps for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [8]:
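# Create custom scorer for this competition's scoring metric
def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (RMSLE)."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# greater_is_better=False makes the scorer return the negated RMSLE,
# so that a higher score still means a better model during model selection
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)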
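
For reference, RMSLE is the root mean squared difference between `log(1 + prediction)` and `log(1 + target)`, so it penalizes relative rather than absolute errors. The following sketch recomputes it with plain NumPy on made-up values; `rmsle_numpy` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def rmsle_numpy(y_true, y_pred):
    """RMSLE computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same absolute error (2 rings), very different relative error
small_target = rmsle_numpy([3], [5])
large_target = rmsle_numpy([20], [22])
print(small_target, large_target)
```

Being off by two rings on a young abalone costs far more than the same miss on an old one. Note also that `mean_squared_log_error` rejects negative inputs, so predictions may need to be clipped at zero before scoring if a model can output negative values.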

In [9]:
# Result of randomized search CV (code omitted)
lgbm_best_params = {
    'subsample': 0.8,  # Note: LightGBM only subsamples rows when subsample_freq > 0 (default is 0)
    'reg_lambda': 0.5,
    'reg_alpha': 0,
    'num_leaves': 40,
    'n_estimators': 500,
    'min_child_weight': 0.01,
    'max_depth': 30,
    'learning_rate': 0.1,
    'colsample_bytree': 0.8
}

In [10]:
# Build the final pipeline with the tuned parameters
final_lgbm_pipeline = Pipeline(steps=[
    ('column_transforms', preprocessor),
    ('model', LGBMRegressor(**lgbm_best_params, random_state=13))
])

# Fit on the combined training data and predict on the test set
final_lgbm_pipeline.fit(X_train, y_train)
lgbm_preds = final_lgbm_pipeline.predict(X_test)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.010479 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1339
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 10
[LightGBM] [Info] Start training from score 9.707233


In [11]:
# Generate submission file
submission_df = pd.DataFrame({'id': test_df.index, 'Rings': lgbm_preds})
submission_df.to_csv('submission.csv', index=False)