Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate name detection into SmartTextVectorizer #456

Closed
wants to merge 142 commits into from

Conversation

MWYang
Copy link
Contributor

@MWYang MWYang commented Jan 14, 2020

Describe the proposed solution
Incorporates the changes in #445 and #457 into SmartTextVectorizer and SmartTextMapVectorizer.

Additional context
Merge #457 before merging this PR. Compare the diff between this PR and that one on my forked repo.

Changes from #455 needs to be merged before this PR is ready.

… Still need to fix HLL serialization in Spark issue
…stead of moments of text length; Still need to fix no moments higher than the 1st being calculated
@MWYang MWYang marked this pull request as ready for review January 22, 2020 00:24
@codecov
Copy link

codecov bot commented Jan 24, 2020

Codecov Report

Merging #456 into master will decrease coverage by 12.27%.
The diff coverage is 84.05%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #456       +/-   ##
===========================================
- Coverage      87%   74.72%   -12.28%     
===========================================
  Files         341      341               
  Lines       11485    11532       +47     
  Branches      378      597      +219     
===========================================
- Hits         9992     8617     -1375     
- Misses       1493     2915     +1422
Impacted Files Coverage Δ
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 72.22% <ø> (-9.73%) ⬇️
...main/scala/com/salesforce/op/test/TestCommon.scala 40.9% <0%> (-9.1%) ⬇️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.79% <100%> (+0.18%) ⬆️
...m/salesforce/op/utils/stages/NameDetectUtils.scala 86.11% <100%> (-1.94%) ⬇️
...s/impl/feature/OPCollectionHashingVectorizer.scala 93.87% <66.66%> (-2.68%) ⬇️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 93.33% <73.91%> (-6.67%) ⬇️
...sforce/op/stages/base/binary/BinaryEstimator.scala 0% <0%> (-100%) ⬇️
...la/com/salesforce/op/aggregators/Geolocation.scala 0% <0%> (-100%) ⬇️
.../salesforce/op/aggregators/FeatureAggregator.scala 0% <0%> (-100%) ⬇️
...stages/base/sequence/BinarySequenceEstimator.scala 0% <0%> (-100%) ⬇️
... and 98 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c0f67b...9344e5f. Read the comment docs.

@MWYang MWYang closed this Jan 24, 2020
@MWYang MWYang reopened this Jan 24, 2020
dataset.map(_.map(computeTextStats(_, shouldCleanText)).toArray).reduce(_ + _),
Array.fill[NameDetectStats](inN.length)(NameDetectStats.empty)
)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the formatting of both if branches look the same since they're doing nearly the same thing?

// In which case create SensitiveFeatureInformation for all features
case ((feature: String, key: Option[String]), stats: NameDetectStats)
if log.isDebugEnabled || computeTreatAsName(stats) =>
val N = stats.dictCheckResult.count.toDouble
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more descriptive name, please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants