<a href="https://colab.research.google.com/github/xingweird/WIDS-Google/blob/main/notebooks/wids_datathon_2023_code_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Women in Data Science Hackathon


This notebook is a tutorial for the WiDS Hackathon. In it, we will walk through the end-to-end process of getting the data, exploring it, feature engineering, modeling, evaluation, and submission.

This notebook is only meant to be a starting point. There are multiple areas that we will not cover. When we deal with issues such as data cleaning or model choice, we will explore only one or two options.

We will use 2 Machine Learning models: a simple linear regression, and gradient boosting trees using the LightGBM implementation.

Other approaches that can be explored are time series methods such as [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average)/SARIMAX and deep learning methods for sequence modeling and tabular data.
This notebook should be used as a benchmark, upon which you can improve your model.

\
**Reference Links**:

[Registration Form](https://airtable.com/shrSmOC8mMDjc4dFl) for Participating;

[Kaggle Datathon Challenge Page](https://www.kaggle.com/competitions/widsdatathon2023/overview);



### Download Data from Kaggle
1. First, log in or sign up to [Kaggle](https://www.kaggle.com/)
2. Go to "Account"
3. Click on "Create New API Token" under the 'API' section
4. Step 3 should trigger the download of the "kaggle.json" credential file (it will likely be in your Downloads/ folder)
5. Upload the "kaggle.json" file to this Colab:

In [None]:
from google.colab import files
files.upload()

In [None]:
#@title Download the WiDS datasets
#@markdown Make sure your credentials are up-to-date and you have accepted the competition's terms and conditions

# setups
! pip install -q kaggle
! mkdir -p ~/.kaggle  # -p avoids an error if the folder already exists
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list

# "! cd" runs in a throwaway subshell and does not persist; the %cd magic does
%cd /content
! kaggle competitions download -c widsdatathon2023
! unzip /content/widsdatathon2023.zip

# Setup

### Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import FunctionTransformer

# Models
from sklearn.linear_model import LinearRegression # Linear regression
import lightgbm as lgb # Gradient Boosting Trees

import time

### Load Data

Please find the data dictionary [here](https://www.kaggle.com/competitions/widsdatathon2023/data).

In [None]:
train_df = pd.read_csv('/content/train_data.csv')
display(train_df.head().style.set_caption('Train data'))

test_df = pd.read_csv('/content/test_data.csv')
display(test_df.head().style.set_caption('Test data'))

# Exploratory Data Analysis (EDA)

EDA is one of the most important parts of beginning an ML engagement. Understanding the data allows the modelers to find any discrepancies, such as outliers or missing values, and learn the shape and distribution of values. Understanding this is critical for the model's performance and helps inform future feature engineering and design decisions. The main objectives of EDA are:

1. **Examine the data and Missing Value Analysis**\
Understand and resolve any potential issues with the data, such as redistributing outliers or imputing missing values.
2. **Univariate Analysis: checking one variable**\
Understand the schema of the available data, which will drive the model's metadata to help during future Continuous Training (CT) cycles to detect potential data skew.
3. **Multivariate Analysis: Checking correlation**\
Inform which type(s) of models will perform best, given the shape of the data, sparsity of features, and relationship between existing fields.

### 1. Examine the data and Missing Value Analysis
- Missing values can have multiple causes.
- Sometimes missing values are caused by mistakes in pulling or delivering the data. Examining the data, especially the missing values, helps the modeler validate the dataset and check with the data provider as early as possible.
- Missing values (either nulls or zeroes) may also be a known scenario in a dataset, e.g. survey questions that people choose not to answer. The modelers need to analyze these cases and handle them accordingly.
- However, there are times when missing values are caused by improper data collection and thus affect model performance.

#### Dimensions
Let's look at the data. First, let's check the shape: how many rows and columns do we have?

The train and test shapes differ for two reasons:
* The test data has no target column
* The test data has fewer rows than the train data

In [None]:
train_df.shape

In [None]:
test_df.shape

#### Time Range

Let's convert the time feature to datetime and check the time range:


In [None]:
# convert to datetime
train_df.startdate = pd.to_datetime(train_df.startdate)
test_df.startdate = pd.to_datetime(test_df.startdate)

In [None]:
# check data time range
print('Max startdate - train_df:', train_df.startdate.max())
print('Min startdate - train_df:', train_df.startdate.min())

print('Max startdate - test_df:', test_df.startdate.max())
print('Min startdate - test_df:', test_df.startdate.min())

#### Column Types & Missing Values

In [None]:
train_df.dtypes

In [None]:
# check whether there are any missing values
print(train_df.isnull().values.any())
print(test_df.isnull().values.any())

### 2. Univariate Analysis: checking one variable
- For analyzing the data, it's important to go through each feature individually and look at the distribution.
- This includes analyzing common metrics including minimum, maximum, mean, and *frequency*. This can help detect potential outliers that may affect model performance.
- After conducting this analysis, you may then engage in data preprocessing to remove any outliers in the data.

#### Continuous Variables
- When dealing with continuous variables, it’s important to know the variable’s central tendency and spread. Summary statistics and visualization methods such as box plots and histograms/distribution plots are used to measure this.
- When the continuous variables are time series, it's important to analyze the variables according to time. Plotting over time usually helps the modelers recognize the seasonality and the trend.
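
As a small illustration of the outlier check described above, here is a minimal sketch of the box-plot (IQR) rule. The helper `iqr_outlier_mask` and the toy temperatures are made up for this example and are not part of the competition data; the multiplier `k=1.5` is the usual box-plot convention, assumed here.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy temperatures with one implausible reading (made-up numbers).
temps = pd.Series([10.2, 11.0, 9.8, 10.5, 10.9, 45.0])
summary = temps.describe()                 # count, mean, std, min, quartiles, max
outliers = temps[iqr_outlier_mask(temps)]  # rows flagged by the rule
print(summary)
print(outliers)
```

Whether a flagged value should be removed, capped, or kept is a judgment call that depends on the data; the rule only points at candidates.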

**Target**\
Target column should be `contest-tmp2m-14d__tmp2m`, which appears in the training data, but it doesn't appear in the test data:

In [None]:
target = [c for c in train_df.columns if c not in test_df.columns][0]
print(target)

In [None]:
train_df[target].describe()

Let's plot the target variable over time:

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(train_df.startdate,
         train_df[target],
         'o',
         alpha=0.03)
plt.title('Temperature over time')
plt.ylabel('Temperature')
plt.xlabel('Date')

Let's also look at the distribution of temperatures:

In [None]:
plt.figure(figsize=(12, 8))
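# sns.distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(train_df[target], kde=True)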

#### Categorical Variables
We’ll utilize a frequency table to study the distribution of categorical variables. Count and Count percent against each category are two metrics that can be used to assess it. As a visualization, a count-plot or a bar chart can be employed.
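
A frequency table like the one described above could be built as follows. This is only a sketch on a toy column; the region codes are placeholders, not the actual values in the dataset.

```python
import pandas as pd

# Toy categorical column (placeholder region codes, not the real data).
regions = pd.Series(['BSk', 'BSk', 'Dfb', 'Cfa', 'BSk', 'Dfb'], name='region')

freq_table = pd.DataFrame({
    'count': regions.value_counts(),                          # rows per category
    'percent': 100 * regions.value_counts(normalize=True),    # share of all rows
})
print(freq_table)
```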

Let's look at our categorical features:

In [None]:
train_df.dtypes.unique()

In [None]:
train_df.dtypes.sort_values()

We have only one categorical feature, and that is "climateregions__climateregion"

In [None]:
train_df.groupby('climateregions__climateregion')['climateregions__climateregion'].size()

In [None]:
plt.figure(figsize=(12, 8))
plt.bar(x = train_df.groupby('climateregions__climateregion').size().index,
        height = 100 * train_df.groupby('climateregions__climateregion')['climateregions__climateregion'].size() / train_df.shape[0]
        )
plt.title('Data size percentage of each region')
plt.xlabel('region');

### 3. Multivariate Analysis: checking correlation
- Correlation analysis measures the statistical relationship between two different variables. The result will show how the change in one parameter would impact the other parameter.
- Correlation analysis is a very important concept, popular in the field of predictive analytics.
- Though correlation analysis helps us in understanding the association between two variables in a dataset, it can't explain, or measure, the cause.

#### Continuous variables and target


---


Let's start with a very simple and naive correlation plot and see which features correlate (linearly) with our target variable. We should keep in mind that correlations are a very basic tool. They can't capture non-linear relations, and are not ideal for categorical and some raw features (like coordinates).
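
To make the non-linearity caveat concrete, here is a minimal synthetic sketch (not based on the competition data): a target that is fully determined by `x`, but through a quadratic relation, has a Pearson correlation of essentially zero with `x`.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y_linear = 2.0 * x + 1.0          # perfectly linear in x
y_quadratic = x ** 2              # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
print(f'linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}')
```

A low correlation therefore does not prove a feature is useless; tree-based models can still pick up such relations.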

In [None]:
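target = 'contest-tmp2m-14d__tmp2m'
# numeric_only=True skips non-numeric columns such as the climate region (requires pandas >= 1.5)
train_df.corr(numeric_only=True)[target].sort_values()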

We can see that some features are very highly correlated (both negative and positive are very informative). We can use the most informative features for a simple benchmark model

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(train_df['nmme-tmp2m-56w__cfsv2'],
            train_df[target],
            alpha=0.01)
plt.ylabel('Temperature')
plt.xlabel('Most correlated feature (nmme-tmp2m-56w__cfsv2)')

#### Categorical variables and target


---

Let's quickly check the mean of the target for each category.

In [None]:
# select the target before aggregating, so only one column is averaged
train_df.groupby('climateregions__climateregion')[target].mean()

We can see that the 'climateregions__climateregion' feature is also highly informative, as the average temperature varies significantly between regions.

We have multiple regions, so let's create a new feature called "loc_group", based on lat-lon coordinates:

In [None]:
train_df['loc_group'] = train_df.groupby(['lat','lon']).ngroup()
train_df['loc_group'].nunique()

We have 514 different regions.

Let's plot the temperature for the different location groups:

In [None]:
ax = sns.relplot(data=train_df,
            x='startdate',
            y='contest-tmp2m-14d__tmp2m',
            hue='loc_group')
ax.fig.set_figwidth(12)
ax.fig.set_figheight(8)

#### Continuous variables and Continuous variables


---


1. Observe by visualization (usually a scatter plot)
<img src="https://www.mathsisfun.com/data/images/correlation-examples.svg">
2. Correlation coefficient \
<img src="https://vitalflux.com/wp-content/uploads/2020/09/Screenshot-2020-09-29-at-11.19.40-AM.png" width="500">
3. Variance inflation factor (VIF) to flag multicollinearity \
<img src="https://www.reneshbedre.com/assets/posts/reg/multicol.webp?ezimgfmt=ng%3Awebp%2Fngcb2%2Frs%3Adevice%2Frscb2-1" width="600">
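
The VIF from item 3 can be computed by hand. The sketch below assumes the standard definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The function name `variance_inflation_factors` and the synthetic columns are made up for illustration.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r2 = 1.0 - residuals.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent of a
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two nearly identical columns get a large VIF while the independent one stays close to 1. A commonly quoted rule of thumb treats values above roughly 5 to 10 as a sign of multicollinearity, but the threshold is a convention rather than a hard rule.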





# Preprocessing & Feature Engineering

Preprocessing should take into account:
1. Missing and invalid values
2. Categorical values
3. Time Variables

### Missing Values

We have missing values in a few columns, though not in our target column. A few imperfect ways of dealing with missing values are shown below:

<img src='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*_RA3mCS30Pr0vUxbp25Yxw.png' width="600">

We will use mean imputation, which is far from perfect:
* it replaces each missing value with the column mean, which is not the best approach
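
Here is a minimal sketch of mean imputation on toy frames, assuming the means are computed on the training data only and then reused for the test data. The frame and column names (`toy_train`, `toy_test`, `wind`, `rain`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train/test tables (hypothetical columns).
toy_train = pd.DataFrame({'wind': [1.0, np.nan, 3.0], 'rain': [0.5, 0.7, np.nan]})
toy_test = pd.DataFrame({'wind': [np.nan, 4.0], 'rain': [0.2, np.nan]})

train_means = toy_train.mean()                   # statistics come from train only
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)   # reuse the train means on test
print(toy_train_filled)
print(toy_test_filled)
```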

Let's check for null values in the training *data*:

In [None]:
train_df.isnull().values.any()

In [None]:
for col in train_df.columns:
    if train_df[col].isnull().values.any():
        print(col, train_df[col].isnull().values.sum())

In [None]:
for col in train_df.columns:
    if train_df[col].isnull().values.any():
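        # assign back instead of inplace=True, which is unreliable on a column selection in recent pandas
        train_df[col] = train_df[col].fillna(train_df[col].mean())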

### Categorical Variables

Next we will deal with the categorical feature. Below are some common encoding techniques for converting categorical variables into numerical values.

<img src="https://ai-ml-analytics.com/wp-content/uploads/2021/02/Encoding-1.png">

We only have one categorical feature. We can encode it in several ways:
1. One-hot encoding - turn the feature into 15 binary columns (because there are 15 distinct values) where the "hot" value is 1 and all the others are 0. That's a very good method when there is a small number of unique values, and gets worse the more unique values we have. What is "large" heavily depends on the data distribution and amount of information stored in the categorical feature. 15 is somewhat borderline.
2. Label-encoding - replace the categorical value with an integer index. This does not increase the dimensionality of the data (we don't have 15 new columns now). However, the integer index is somewhat arbitrary and it implies relations between the categories that are not necessarily true.
3. Target-encoding - replace the categorical value with the average target value for this category. This method is very good but has the risk of overfitting.
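
The three options can be compared side by side on a toy frame. This is only a sketch: the `region` and `target` columns are made up, and the label encoding uses pandas category codes rather than scikit-learn's `LabelEncoder` to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['A', 'B', 'A', 'C', 'B', 'A'],
    'target': [10.0, 20.0, 12.0, 30.0, 22.0, 14.0],
})

# 1. One-hot: one binary column per distinct value.
one_hot = pd.get_dummies(toy['region'], prefix='region')

# 2. Label encoding: an arbitrary integer per category.
label_encoded = toy['region'].astype('category').cat.codes

# 3. Target encoding: replace each category by its mean target.
category_means = toy.groupby('region')['target'].mean()
target_encoded = toy['region'].map(category_means)

print(one_hot.shape, label_encoded.tolist(), target_encoded.tolist())
```

To limit the overfitting risk of target encoding, the category means are usually computed on training rows only (or out-of-fold) and then applied to validation and test rows.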




Below is an example to use LabelEncoder to transform a categorical variable:

In [None]:
le = preprocessing.LabelEncoder()
train_df['climateregions__climateregion'] = le.fit_transform(train_df['climateregions__climateregion'])

In tabular problems, feature engineering is often the most important part. In feature engineering we create new features that capture the relationship between the target variable and our features best, based on our domain knowledge or from the EDA.

In [None]:
train_df.climateregions__climateregion.unique()

### Time Features
1. dummy variables
2. (optional) cyclical encoding with sine/cosine transformation

In [None]:
# extract year, month, day of year
def create_time_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['year'] = df.startdate.dt.year
    df['month'] = df.startdate.dt.month
    df['dayofyear'] = df.startdate.dt.day_of_year  # use the df passed in, not the global train_df
    return df

train_df = create_time_features(train_df)
train_df.head()

In [None]:
# Copied from https://colab.research.google.com/drive/10r73mOp1R7cORfeuP97V65a-rgwGyfWr?usp=sharing#scrollTo=c9ZkVb2aU-S7
def add_season(df: pd.DataFrame) -> None:
    # 0 = Dec-Feb, 1 = Mar-May, 2 = Jun-Aug, 3 = Sep-Nov
    month_to_season = {
        1: 0,
        2: 0,
        3: 1,
        4: 1,
        5: 1,
        6: 2,
        7: 2,
        8: 2,
        9: 3,
        10: 3,
        11: 3,
        12: 0,
    }
    df['season'] = df['month'].apply(lambda x: month_to_season[x])

add_season(train_df)

(Optional) Since time is cyclical, let's add features that express the seasonality and cyclical nature of our data (that's a common transformation for time features):
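
To see why a sine/cosine pair helps, here is a small standalone sketch. With raw month numbers, December (12) and January (1) look far apart, while on the unit circle they are neighbors. The helpers `cyclical_encode` and `circle_distance` are illustrative names, not part of any library.

```python
import numpy as np

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two values after cyclical encoding."""
    sin_a, cos_a = cyclical_encode(a, period)
    sin_b, cos_b = cyclical_encode(b, period)
    return float(np.hypot(sin_a - sin_b, cos_a - cos_b))

dec_to_jan = circle_distance(12, 1, period=12)   # adjacent months
jan_to_jul = circle_distance(1, 7, period=12)    # half a year apart
print(dec_to_jan, jan_to_jul)
```

Both the sine and the cosine are needed: a single sine value is shared by two different points in the cycle, and the cosine disambiguates them.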

In [None]:
# Copied from https://colab.research.google.com/drive/10r73mOp1R7cORfeuP97V65a-rgwGyfWr?usp=sharing#scrollTo=c9ZkVb2aU-S7

def sin_transformer(period):
    return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))


def cos_transformer(period):
    return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))

def encode_cyclical(df):
    # encode the day with a period of 365
    df['day_of_year_sin'] = sin_transformer(365).fit_transform(df['dayofyear'])
    df['day_of_year_cos'] = cos_transformer(365).fit_transform(df['dayofyear'])

    # encode the month with a period of 12
    df['month_sin'] = sin_transformer(12).fit_transform(df['month'])
    df['month_cos'] = cos_transformer(12).fit_transform(df['month'])

    # encode the season with a period of 4
    df['season_sin'] = sin_transformer(4).fit_transform(df['season'])
    df['season_cos'] = cos_transformer(4).fit_transform(df['season'])

encode_cyclical(train_df)

In [None]:
train_df.head()

In [None]:
cyc_df_eg = train_df[train_df.loc_group == 3]
fig, ax = plt.subplots(figsize = (12,8))
ax.plot(cyc_df_eg.startdate, cyc_df_eg['day_of_year_sin'], label='day sin')
ax.plot(cyc_df_eg.startdate, cyc_df_eg['month_sin'], label='month sin')
ax.plot(cyc_df_eg.startdate, cyc_df_eg['season_sin'], label='season sin')
ax.plot(cyc_df_eg.startdate, (cyc_df_eg['day_of_year_sin'] + cyc_df_eg['day_of_year_cos'])/2,
        label='(day sin + day cos) / 2')
plt.legend()

^ Only one of these encodings is needed; pick whichever captures the seasonality best.

# Modeling
We will use 2 models:
1. Linear Regression
2. LightGBM


### Train - Validate Split

Feature selection


In [None]:
exclude_cols = ['index', 'startdate']
features = [c for c in train_df.columns if c != target and c not in exclude_cols]


# train_df.sort_values(by='startdate', inplace=True) # Verify the data is sorted by time
# train_df.reset_index(inplace=True)
split_point = 0.98 # 98 % training, 2% validation, because we have a lot of data, 2% validation can be enough
train = train_df[:int(split_point*len(train_df))]
val  = train_df[int(split_point*len(train_df)):]

# Alternative - split by time:
# train = train_df[train_df['startdate'] <= '2016-08-17']
# val  = train_df[train_df['startdate'] > '2016-08-17']

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]
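
The time-based alternative mentioned in the comments of the cell above can be sketched on toy data as follows. The helper `split_by_time` and the toy frame are hypothetical; the cutoff date simply mirrors the one in the commented-out lines.

```python
import pandas as pd

def split_by_time(df: pd.DataFrame, date_col: str, cutoff: str):
    """Rows on or before the cutoff go to train, later rows to validation."""
    cutoff_ts = pd.Timestamp(cutoff)
    is_train = df[date_col] <= cutoff_ts
    return df[is_train], df[~is_train]

# Toy data: 10 consecutive days for each of two locations (made-up values).
toy = pd.DataFrame({
    'startdate': list(pd.date_range('2016-08-10', periods=10)) * 2,
    'loc_group': [0] * 10 + [1] * 10,
})
toy_train, toy_val = split_by_time(toy, 'startdate', '2016-08-17')
print(len(toy_train), len(toy_val))
```

Splitting on a date keeps every validation row strictly later than every training row for all locations, which is closer to how a forecast is evaluated than a positional slice of the table.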

## Linear Model

In [None]:
model = LinearRegression()

model.fit(X_train, y_train)

## Evaluation

When we evaluate our model we need to choose the right metric. The right metric would fit the data distribution as well as the final business KPI we actually care about.

This is a regression model, and therefore we should choose an evaluation metric for a regression problem.
There are a few possibilities:
1. [Root Mean Squared Error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
2. [R2](https://en.wikipedia.org/wiki/Coefficient_of_determination)
3. [Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error)

And others. We will use the R2 score (`r2_score`), which is the most intuitive since it runs from 0 to 1 in most cases, as well as RMSE, since this is the metric used in the Kaggle competition.
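
For reference, the three metrics can be written out in a few lines of NumPy. This is a sketch of the standard definitions on made-up numbers; the helper name `regression_metrics` is hypothetical, and in the notebook itself the scikit-learn functions are used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
    mae = np.mean(np.abs(errors))                    # average absolute error
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                       # 1 is perfect, 0 matches predicting the mean
    return {'rmse': rmse, 'mae': mae, 'r2': r2}

metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(metrics)
```

In this toy case a single error of 2 gives RMSE = 1.0, MAE = 0.5 and R2 = 0.2. R2 can also drop below 0 when a model does worse than always predicting the mean, which is why it runs from 0 to 1 only "in most cases".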

In [None]:
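# np.sqrt of the MSE works across scikit-learn versions (the squared=False argument is deprecated in newer releases)
print(f'Training RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train)))}')
print(f'Validation RMSE: {np.sqrt(mean_squared_error(y_val, model.predict(X_val)))}')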

In [None]:
plt.figure(figsize=(14, 10))

plt.subplot(211)
plt.scatter(model.predict(X_train), y_train, alpha=0.01)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title(f'Training RMSE for Linear model is {np.sqrt(mean_squared_error(y_train, model.predict(X_train)))}')

plt.subplot(212)
plt.scatter(model.predict(X_val), y_val, alpha=0.01)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title(f'Validation RMSE for Linear model is {np.sqrt(mean_squared_error(y_val, model.predict(X_val)))}')

## LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. According to its documentation, it is designed to be distributed and efficient, with the following advantages:

* Faster training speed and higher efficiency.
* Lower memory usage.
* Better accuracy.
* Support of parallel, distributed, and GPU learning.
* Capable of handling large-scale data.

[XGBoost vs. LightGBM](https://neptune.ai/blog/xgboost-vs-lightgbm)



In [None]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train)

# specify your configurations as a dict
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 5
}

print('Starting training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=200,
                valid_sets=lgb_eval)

We can see that our model fits the training data well (r2=0.98), while on the validation set we get a slightly lower r2=0.979. This small gap suggests some overfitting.

In [None]:
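# np.sqrt of the MSE works across scikit-learn versions (the squared=False argument is deprecated in newer releases)
print(f'Training RMSE: {np.sqrt(mean_squared_error(y_train, gbm.predict(X_train)))}')
print(f'Validation RMSE: {np.sqrt(mean_squared_error(y_val, gbm.predict(X_val)))}')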

In [None]:
plt.figure(figsize=(14, 10))

plt.subplot(211)
plt.scatter(gbm.predict(X_train), y_train, alpha=0.01)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title(f'Training RMSE for LightGBM model is {np.sqrt(mean_squared_error(y_train, gbm.predict(X_train)))}')

plt.subplot(212)
plt.scatter(gbm.predict(X_val), y_val, alpha=0.01)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title(f'Validation RMSE for LightGBM model is {np.sqrt(mean_squared_error(y_val, gbm.predict(X_val)))}')

# Feature Importance

Which features does the model rely on most? By default, `lgb.plot_importance` ranks features by the number of times they are used to split a tree.

In [None]:
lgb.plot_importance(gbm, max_num_features=20, figsize=(8,15))

# Hyperparameter Tuning
Hyperparameters are parameters that are not learned from the data. Each model has different hyperparameters. For a linear regression, this can be the strength of the regularization. For tree-based models such as gradient boosting trees, these can be constraints on the tree structure such as maximal depth, minimal number of samples per leaf, number of trees, etc.

Hyperparameters are often used to optimize the trade-off between fitting the training data and generalization (in other words, the bias-variance or underfitting-overfitting trade-off).

Hyperparameter tuning can be done automatically with grid search, Bayesian optimization, or any other way of sampling and optimizing the hyperparameter space. However, when we do that we should be careful not to overfit to our validation set. We should also take into account that searching over this multi-dimensional space of possible values may take a lot of time.
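
As a minimal sketch of a grid search, the example below tunes the regularization strength of a ridge (regularized linear) regression on synthetic data and keeps the value with the lowest validation RMSE. Everything here is illustrative: the data is random, `fit_ridge` is a bare closed-form implementation without an intercept, and the grid values are arbitrary.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data: few rows and many features, so regularization matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
true_coef = np.zeros(30)
true_coef[:3] = [2.0, -1.0, 0.5]
y = X @ true_coef + rng.normal(scale=1.0, size=60)
X_tr, y_tr, X_va, y_va = X[:40], y[:40], X[40:], y[40:]

# Evaluate every candidate on the validation set and keep the best one.
alpha_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = {a: rmse(y_va, X_va @ fit_ridge(X_tr, y_tr, a)) for a in alpha_grid}
best_alpha = min(val_scores, key=val_scores.get)
print(val_scores, best_alpha)
```

The same loop structure applies to LightGBM hyperparameters such as `num_leaves` or `learning_rate`. Because the winner is chosen on the validation set, its validation score is an optimistic estimate, which is the overfitting risk mentioned above.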

We will try to tweak some of the LightGBM model hyperparameters.

Our validation score is still lower than the training score. This means that either our model is too simple and more signal can be captured, or that our model is too complex and therefore doesn't generalize well anymore and is overfitting.

Let's try making our model a bit more complex by increasing the number of boosting rounds from 200 to 300, and see how the results change:

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 5
}

print('Starting training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=300,
                valid_sets=lgb_eval)

In [None]:
print(f'MSE for training data {mean_squared_error(y_train, gbm.predict(X_train))}')
print(f'MSE for validation data {mean_squared_error(y_val, gbm.predict(X_val))}')

In [None]:
plt.figure(figsize=(14, 10))

plt.subplot(211)
plt.scatter(gbm.predict(X_train), y_train, alpha=0.01)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title(f'Training r2 is {r2_score(y_train, gbm.predict(X_train))}')

plt.subplot(212)
plt.scatter(gbm.predict(X_val), y_val, alpha=0.01)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title(f'Validation r2 is {r2_score(y_val, gbm.predict(X_val))}')

We can see that the validation r2 improved from 0.99 to 0.992 and, more importantly, the RMSE went down from 1.24 to 1.01. It may still be that we are overfitting to our validation set. Only after submission can we see how we did on the test dataset.

# Submission

Before submission, we need to transform the test dataset using the same transformations we used for the training dataset, *but use only transformations and data from the training dataset, to avoid overfitting*.

Deal with missing values. Notice that we use the training dataset values to avoid overfitting. Since the test data is actually available, one might argue that it is OK to use it, and for the sake of the competition it may be worth trying the test dataset values for missing value imputation. But in general this is not a good practice:

In [None]:
for col in test_df.columns:
    if test_df[col].isnull().values.any():
        # assign back instead of inplace=True, which is unreliable on a column selection in recent pandas
        test_df[col] = test_df[col].fillna(train_df[col].mean())

Categorical data. Here we also have to use the label encoder trained on the training data:

In [None]:
test_df['climateregions__climateregion'] = le.transform(test_df['climateregions__climateregion'])

Time features:

In [None]:
test_df = create_time_features(test_df)

add_season(test_df)
encode_cyclical(test_df)

Encode loc_group based on the loc groups of the training data:

In [None]:
# Look up the loc_group assigned to each (lat, lon) pair in the training data
temp = train_df.groupby(['lat','lon'])['loc_group'].first().reset_index()
regions_dict = dict()
for row in temp.iterrows():
  key = str(row[1].lat) + '_' + str(row[1].lon)
  regions_dict[key] = row[1].loc_group

# Build the same key for the test rows; locations not seen in training get -1
test_df['regions_key'] = test_df.apply(lambda x: str(x.lat) + '_' + str(x.lon),
                                       axis=1)
test_df['loc_group'] = test_df.regions_key.apply(lambda x: regions_dict.get(x, -1))

In [None]:
submission = pd.read_csv('sample_solution.csv')
display(submission)
submission[target] = gbm.predict(test_df[features])

submission.to_csv('submission.csv',
                  index = False) # Set index to false to avoid issues in evaluation

Upload this file to the competition page. This submission should give an RMSE of 1.41, which as of January 29th would put you in 146th place :)

# Next Steps
Ways you can consider in order to improve your prediction:

#### Explore different modeling strategies
- **Time Series** models such as SARIMAX (which takes into account both seasonality and exogenous variables)
- **Different ML models**: You can try neural networks. Look for either networks for tabular data (such as TabNet and TabTransformer) or neural networks for sequence modeling (like RNNs, LSTM and transformer-based models)
- **Ensembling** different models to obtain a weighted average prediction
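
A weighted-average ensemble can be as simple as the sketch below. The prediction arrays and the weights are made up; in practice the weights would typically be chosen on a validation set.

```python
import numpy as np

def weighted_average(predictions, weights):
    """Blend several prediction arrays; weights are normalized to sum to 1."""
    predictions = np.asarray(predictions, dtype=float)   # shape: (n_models, n_rows)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ predictions

linear_pred = np.array([10.0, 12.0, 14.0])   # made-up predictions from model 1
gbm_pred = np.array([11.0, 12.0, 13.0])      # made-up predictions from model 2
blend = weighted_average([linear_pred, gbm_pred], weights=[1, 3])
print(blend)
```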

#### Improve data preprocessing and feature engineering
- Change the categorical features representation to one-hot encoding, target encoding or embedding (you can also use other models that are designed to work with categorical features such as CatBoost)
- Change the missing values strategy. For example, denote the missing values with a special value, or try to predict the missing value based on other features
- Think of new features that you can engineer from the existing features that would better represent the data using your knowledge of the problem
- Try splitting the train-validation data based on a different logic (such as seasons)

#### Hyperparameter Tuning
- Try tweaking different hyperparameters
- Grid search with more granularity

# References
- [How to Handle Missing Data](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)
- [Different types of Encoding](https://ai-ml-analytics.com/encoding/)
- [Three Approaches to Encoding Time Information as Features for ML Models](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)
- [LightGBM](https://lightgbm.readthedocs.io/en/v3.3.2/)
- [XGBoost vs LightGBM: How Are They Different](https://neptune.ai/blog/xgboost-vs-lightgbm)
