Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow TextStats length distribution to be token-based and refactor for testability #464
Allow TextStats length distribution to be token-based and refactor for testability #464
Changes from 8 commits
1c2cbc2
83881c0
ca2e122
c190d97
75b0770
305c57e
ff53dc4
13b6076
fc6cd07
1434128
e117fbd
87878e0
fe36ec5
cd31e32
a322363
32ad893
ab9d2a7
4cb5c88
e463685
e44ca68
ce5663e
b866bb3
e72af36
307a014
49127d9
95bc3e7
aad13ba
d78868d
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we reach
RawFeatureFilter.MaxCardinality
forvalueCounts
, willlengthCounts
also stop accumulating ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, this is taken care of by
val newLengthCounts = additionHelper(l.lengthCounts, r.lengthCounts, maxCardinality)
, pls disregard this comment :DThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now this function can explode with NullPointerException is textString is null, while before it could not have happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I'm not sure if it's possible for the
textString
argument to benull
in practice though. When this function used for tokenizing the map entries, a value that was originally null there will just not show up as an entry in the map. When it's used for tokenizing a normal Text entry, then we should have already safely converted any nulls or missing elements into anOption[String]
, right?The actual tokenize call during vectorization is still
tokenize(v.toText)
wherev
is the value in a text map. I'd actually argue that that should be changed totokenizeString(v)
to save time converting it to Text and back again.I agree it's technically less safe, but I don't think it's necessary to have null checking at this point in the flow. We should make sure the data gets created in a safe way, which I think we already do. Are there some specific edge cases I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the simplest is to add a null check
tokenizeString
and returnThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it more, I'm pretty sure the old tokenize function would also give a NPE if you fed it a sneaky null value. The
SomeValue.unapply
function explicitly callsv.isEmpty
which would also fail if v wasnull
.I put back the old tokenize function as
oldTokenize
and triedwhich did indeed throw a NPE.
We have tests all over the place (eg. our vectorizer tests and FeatureTypeSparkConverterTest) that make sure we can handle
null
values in dataframes and safely convert them into our types. I'm not aware of any explicit null checks in our functions elsewhere, so it just feels weird to put one here.@leahmcguire any opinions on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SomeValue.unapply operates on value which is Option[String]. Null check is done during the construction of Text when the values are extracted from Dataframe / RDD. NullPointerException is indeed unlikely to be thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your example you provided is not currently possible and also not a fair one :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jauntbox is this only called from the Option[String] version below? if so make it private and it is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact make them both private please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we only keeping tokens with length >
minTokenLength
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was existing behavior. It's a configurable parameter (defaulting to 1), so is not required.