forked from elastic/ml-cpp
Commit
[ML] Improve regression and classification QoR for small data sets (elastic#1960)

This makes two changes to deal better with small data sets, highlighted by a failure in our QA suite as a result of elastic#1941. In particular:
1. We could miss rare classes altogether from our validation set for small data sets.
2. We can lose a lot of accuracy by over-restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small, we could choose never to sample it in the validation set because it could constitute fewer than one example per fold. In this case, the fraction of each class changes significantly in the remaining unsampled set for each fold we sample, but we compute the desired class counts once upfront based on their overall frequency. We simply need to recompute the desired counts per class, based on the frequencies in the remainder, in the loop which samples each new fold.

Problem 2 requires that we allow ourselves to use more features than are implied by our default constraint of having n examples per feature for small data sets. Since we automatically remove nuisance features based on their MICe with the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, for small data sets runtime is never problematic.

For the multi-class classification problem which exposed this issue, accuracy increases from around 0.2 to 0.9 as a result of this change.
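To illustrate Problem 1, here is a minimal Python sketch of per-fold class counts computed once upfront versus recomputed from the remainder. It is not the ml-cpp implementation: the function names `allocate` and `fold_class_counts`, the largest-remainder rounding, and the tie-break in favour of the rarer class are all assumptions made for the example.

```python
def allocate(remaining, fold_size):
    """Split fold_size across classes in proportion to their remaining counts.

    Hypothetical helper: uses largest-remainder rounding, with ties going to
    the rarer class. The real rounding rule may differ.
    """
    total = sum(remaining.values())
    if total == 0:
        return {c: 0 for c in remaining}
    fold_size = min(fold_size, total)
    quotas = {c: fold_size * n / total for c, n in remaining.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = fold_size - sum(counts.values())
    # Largest fractional part first; on a tie prefer the class with fewer rows left.
    order = sorted(
        remaining,
        key=lambda c: (quotas[c] - counts[c], -remaining[c]),
        reverse=True,
    )
    for c in order[:leftover]:
        counts[c] += 1
    return counts


def fold_class_counts(class_counts, num_folds, recompute):
    """Return the number of rows of each class assigned to each fold."""
    remaining = dict(class_counts)
    fold_size = sum(remaining.values()) // num_folds
    # Desired counts based on overall class frequencies, computed once.
    upfront = allocate(remaining, fold_size)
    folds = []
    for _ in range(num_folds):
        # Either reuse the upfront counts or recompute them from what is left.
        desired = allocate(remaining, fold_size) if recompute else upfront
        fold = {c: min(desired[c], remaining[c]) for c in remaining}
        for c, n in fold.items():
            remaining[c] -= n
        folds.append(fold)
    return folds


counts = {"common": 18, "rare": 2}
fixed = fold_class_counts(counts, num_folds=5, recompute=False)
adaptive = fold_class_counts(counts, num_folds=5, recompute=True)
print([f["rare"] for f in fixed])     # [0, 0, 0, 0, 0]
print([f["rare"] for f in adaptive])  # [0, 1, 0, 1, 0]
```

With 20 rows and 5 folds the rare class is entitled to 0.4 rows per fold, so the upfront counts round it to zero every time. Recomputing from the remainder lets its share grow as the common class is drawn down, until it earns a row.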
Showing 8 changed files with 76 additions and 24 deletions.