
Out of bound access when dataset in continual training has fewer features than in the loaded model #5156

Open
shiyu1994 opened this issue Apr 17, 2022 · 0 comments · May be fixed by #5157
Description

When the dataset used for continual training has fewer features than the dataset the loaded model was trained on, an out-of-bound access can happen in at least one place:

feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;

Here feature_importances is sized to the number of features in the continual-training dataset, while the feature indices stored in the trees come from the loaded model and can therefore exceed that size, causing the out-of-bound access.
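The failure mode can be sketched in Python (a hypothetical mimic of the C++ loop around the quoted line, not LightGBM's actual code; Python's IndexError stands in for the unchecked C++ write):

```python
def feature_importance_sketch(split_features_from_loaded_model,
                              num_features_in_new_dataset):
    """Count splits per feature, mimicking the loop around the quoted line."""
    feature_importances = [0.0] * num_features_in_new_dataset
    for feat in split_features_from_loaded_model:
        # The C++ code performs this write unchecked; in this sketch Python
        # raises IndexError when the loaded model's feature index is out of
        # range for the new, smaller dataset.
        feature_importances[feat] += 1.0
    return feature_importances

# Loaded model was trained with 5 features and one tree splits on feature 4;
# the continual-training dataset has only 3 features.
try:
    feature_importance_sketch([0, 2, 4], num_features_in_new_dataset=3)
except IndexError:
    print("out-of-bound feature index from loaded model")
```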

Reproducible example

A reproducible example will be added in the PR fixing this bug, also serving as a test case.

Environment info

LightGBM version or commit hash:
LightGBM master branch

Additional Comments

When the input dataset comes from a LibSVM file, this can be a genuine bug: it is possible that all values of a feature are missing in the continual-training dataset, so that when the dataset is loaded, the inferred number of features is smaller than in the original training data.
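A small sketch of why sparse LibSVM input shrinks the feature count (hypothetical loader logic, shown only to illustrate the mechanism): the format stores only non-missing values as index:value pairs, so the feature count must be inferred from the largest index actually observed.

```python
def infer_num_features(libsvm_lines):
    """Infer the feature count from the largest 1-based feature index that
    actually appears in the sparse index:value pairs."""
    max_idx = 0
    for line in libsvm_lines:
        for token in line.split()[1:]:  # skip the label
            idx = int(token.split(":")[0])
            max_idx = max(max_idx, idx)
    return max_idx

original = ["1 1:0.5 5:1.2", "0 2:0.3 5:0.7"]   # model sees 5 features
continual = ["1 1:0.5 3:1.2", "0 2:0.3"]        # feature 5 entirely missing
print(infer_num_features(original), infer_num_features(continual))  # → 5 3
```

Because feature 5 never appears in the continual-training file, the loader sees only 3 features, while the loaded model's trees may still split on feature 5.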

For other input formats, however, this problem should be classified as misuse, and a proper warning or fatal message should be provided.
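Such a check could look like the following sketch (a hypothetical helper, not the actual fix; ValueError and warnings.warn stand in for LightGBM's fatal and warning messages):

```python
import warnings

def check_model_compatibility(num_features_loaded_model,
                              num_features_new_dataset,
                              strict=True):
    """Fail fast (or warn) before any tree indexes past the new dataset's
    feature count."""
    if num_features_new_dataset < num_features_loaded_model:
        msg = ("continual-training dataset has %d features, but the loaded "
               "model was trained with %d"
               % (num_features_new_dataset, num_features_loaded_model))
        if strict:
            raise ValueError(msg)  # stands in for a fatal message
        warnings.warn(msg)         # stands in for a warning

# Misuse case: loaded model expects 5 features, new dataset has only 3.
check_model_compatibility(5, 3, strict=False)  # warns instead of crashing
```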
