Commit
Add categorical detection to be coverage based in addition to unique count based (#473)

Related issues

Currently SmartTextVectorizer and SmartTextMapVectorizer count the number of unique entries in a text field (up to a threshold, currently 50) and treat the feature as categorical if it has < 50 unique entries. You can still run into features that are effectively categorical but have a long tail of low-frequency entries. We would get better signal extraction if we treated these as categorical instead of hashing them.

Describe the proposed solution

Add an extra check that lets Text(Map) features become categoricals. It only applies to features whose cardinality is higher than the threshold and that would therefore be hashed.

A better approach to detecting text features that are really categorical is to use a coverage criterion. For example, if the topK entries with minimum support cover at least 90% of the entries, then this is a good feature to pivot by entry instead of hashing by token. The 90% value can be tuned by the user through a param.

Extra checks need to be passed:

* Cardinality must be greater than maxCard (already mentioned above).
* Cardinality must also be greater than topK.
* Finally, the computed coverage of the topK entries with minimum support must be > 0.

If only m < topK elements have the required minimum support, then we look at the coverage of these m elements.

Describe alternatives you've considered

I considered using the Algebird Count Min Sketch to compute the current TextStats. However, I ran into multiple issues:

* Lack of transparency: TopNCMS only returns the "HeavyHitters", but you need much more than that (e.g. cardinality) in order to use the coverage method.
* Serialization issues when writing to JSON.

A branch still exists, mw/coverage, but it is in shambles.

Additional context

Some criticism regarding TextStats: it does not seem to be a semigroup, as its combine operation is not associative. Was that intended?
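The coverage check described in the proposed solution can be sketched as follows. This is a minimal Python illustration, not the Scala code added by this commit: the function name, the parameter names (`max_cardinality`, `top_k`, `min_support`, `coverage_pct`) and the default values for `top_k` and `min_support` are assumptions made for the example. Only the 50 threshold and the 90% coverage figure come from the description above.

```python
from collections import Counter


def coverage_makes_categorical(value_counts, max_cardinality=50, top_k=20,
                               min_support=10, coverage_pct=0.90):
    """Return True if a high-cardinality text feature passes the coverage check.

    value_counts maps each distinct entry to its number of occurrences.
    All parameter names and the top_k / min_support defaults are hypothetical.
    """
    cardinality = len(value_counts)
    total = sum(value_counts.values())

    # The check only applies to features that would otherwise be hashed:
    # cardinality must exceed both the unique-count threshold and top_k.
    if total == 0 or cardinality <= max_cardinality or cardinality <= top_k:
        return False

    # Keep entries with the required minimum support, then take the top_k
    # most frequent ones. If only m < top_k entries qualify, use those m.
    supported = sorted(
        (count for count in value_counts.values() if count >= min_support),
        reverse=True,
    )[:top_k]

    coverage = sum(supported) / total
    # Coverage must be non-zero and reach the user-tunable threshold.
    return coverage > 0 and coverage >= coverage_pct


# A feature with three dominant values and a long tail of rare entries.
long_tail = Counter({"red": 1000, "green": 1000, "blue": 1000})
long_tail.update({f"rare_{i}": 1 for i in range(60)})

# A feature whose entries are all rare: nothing reaches the minimum support.
all_rare = Counter({f"token_{i}": 5 for i in range(100)})

print(coverage_makes_categorical(long_tail))
print(coverage_makes_categorical(all_rare))
```

In the first feature only m = 3 entries reach the minimum support, so the coverage is computed over those 3 entries (3000 of 3060 occurrences). In the second feature no entry qualifies, the coverage is 0, and the feature would still be hashed.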
* First Logic
* First tests
* Second Tests
* Removing prints
* fix test
* Fix 2
* Adding comments
* Line change
* Adding more comments
* Removing useless condition
* Fixing tests

Co-authored-by: Michael Weil <mweil@salesforce.com>
1 parent 7d0c33e, commit 24cdbc4
Showing 5 changed files with 299 additions and 8 deletions.