The oob score #56

Open
wjj5881005 opened this issue Jun 8, 2021 · 0 comments

I think the OOB score computed in the fit function is wrong.

The authors get the OOB sample indices with "mask = ~samples" and then use X[mask, :] to select the OOB samples. But samples appears to be an array of integer indices, so ~samples is a bitwise NOT (~i == -(i + 1)), not a set complement.
I tested this case and found many identical elements between samples and X[mask, :], and the "OOB" set has the same length as the training set. For example, if we have 100 samples in total and 80 indices are drawn to train the model, then the OOB set should contain the 100 - 80 = 20 undrawn samples (without considering replacement), not 80.
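A minimal numpy sketch of the problem (the bootstrap draw and stand-in data here are hypothetical, just to show what ~ does to an integer index array):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 100
samples = rng.randint(0, n_samples, 80)  # hypothetical bootstrap indices

X = np.arange(n_samples)  # stand-in data: row i holds the value i

mask = ~samples            # bitwise NOT on integers: ~i == -(i + 1)
wrong_oob = X[mask]        # selects rows n_samples - 1 - i, not the complement

print(len(wrong_oob))      # 80: same length as the bootstrap draw
print(set(wrong_oob) == {n_samples - 1 - i for i in samples})  # True
```

So X[~samples] is just a reflected copy of the drawn rows; it says nothing about which rows are actually out of bag.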

I also looked at how RandomForest implements OOB sampling, and found the following code:

import numpy as np
from sklearn.utils import check_random_state

random_instance = check_random_state(random_state)
sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)  # indices of the training (bootstrap) samples
sample_counts = np.bincount(sample_indices, minlength=n_samples)
unsampled_mask = sample_counts == 0
indices_range = np.arange(n_samples)
unsampled_indices = indices_range[unsampled_mask]  # indices of the OOB samples

Then unsampled_indices holds the truly out-of-bag sample indices.
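For comparison, here is a self-contained sketch of that computation (plain numpy seeding in place of check_random_state; the function name and parameters are illustrative):

```python
import numpy as np

def generate_unsampled_indices(seed, n_samples, n_samples_bootstrap):
    """Return (oob_indices, sample_indices) for one bootstrap draw."""
    rng = np.random.RandomState(seed)
    # Indices of the training samples, drawn with replacement.
    sample_indices = rng.randint(0, n_samples, n_samples_bootstrap)
    # Count how often each index was drawn; rows never drawn are OOB.
    sample_counts = np.bincount(sample_indices, minlength=n_samples)
    unsampled_indices = np.arange(n_samples)[sample_counts == 0]
    return unsampled_indices, sample_indices

oob, drawn = generate_unsampled_indices(42, n_samples=100, n_samples_bootstrap=80)
print(set(oob) == set(range(100)) - set(drawn))  # True: exact complement of the draw
```

Unlike the ~samples version, the OOB set here is guaranteed to be disjoint from the training indices.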
