I'm working with an imputer that uses random forests as its estimator. The saved file size for the model is 100GB, and it takes 15 min to load back into memory on my VM. Looking at the in-memory sizes of each of the object's attributes, it looks like the culprit (as expected) is the stored random forest estimators. I understand that random forests may lead to bigger files than linear models, but 100GB for 82 features seems excessive. Is there any way to streamline the imputer object to a more manageable size? Or am I stuck with this as long as I use random forests as the estimator?
Replies: 1 comment 1 reply
I am not that surprised: you have 150 trees x 82 features stored. Each tree is grown unpruned, which means that in the worst case each of the ~24k samples you are splitting can end up in its own leaf. You can limit the size of the forest by reducing the number of trees and capping the number of leaves per tree.
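Concretely, if this is scikit-learn's IterativeImputer with a RandomForestRegressor estimator, a minimal sketch of shrinking the forest might look like the following (the parameter values are illustrative assumptions, not tuned recommendations):

```python
# Minimal sketch: constrain the random forest so the fitted imputer stays small.
# Assumes scikit-learn's IterativeImputer + RandomForestRegressor; values are illustrative.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

estimator = RandomForestRegressor(
    n_estimators=30,      # fewer trees than 150
    max_leaf_nodes=256,   # cap tree size instead of growing to purity
    min_samples_leaf=20,  # larger leaves -> shallower, smaller trees
    n_jobs=-1,
    random_state=0,
)
imputer = IterativeImputer(estimator=estimator, random_state=0)
```

Saving with joblib compression (e.g. `joblib.dump(imputer, path, compress=3)`) can also shrink the on-disk file, though it won't reduce the size of the object once it is loaded back into memory.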