Information-Gain & TF-IDF support in Contextual classification #1125

Closed
19 tasks done
etiennedi opened this issue Apr 20, 2020 · 0 comments · Fixed by #1131

Comments

etiennedi commented Apr 20, 2020

This is the implementation for the POC findings of #1118, originally proposed in #1115.

Todos

  • finalize design in POC (POC for dynamic vector noise recognition #1118)
  • extend current classifier so it can run preparation steps (required for tf-idf and caching of target objects)
  • extend parameters for new configuration (as designed in API Discussion: New Parameters for Contextual classification #1124)
    • validate/set defaults
    • use (safely) in classification to replace hard-coded values
  • extend interfaces (and repositories if required) with new db capabilities
  • aggregate information before run
    • fetch target vectors
    • calculate tf-idf score
  • update run
    • calculate IG
    • strip by IG and TF-IDF
    • boost by IG if configured
  • fix/update broken/outdated/imprecise tests
    • unit
    • journey
  • check for unresolved TODOs in classification package
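
For readers unfamiliar with the "calculate tf-idf score" and "strip by IG and TF-IDF" steps, here is a minimal sketch of the tf-idf half only. It uses the textbook definition (term frequency times log inverse document frequency) and a simple keep-the-top-fraction cut. The function names, the `keep_fraction` parameter and the sample documents are assumptions made for illustration; the actual design is the one in #1118 and #1124, which may use a different formula and different parameters.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return one {word: score} dict per document (textbook tf-idf).

    tf  = occurrences of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document

    n_docs = len(docs)
    scores = []
    for words in docs:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def strip_lowest(word_scores, keep_fraction):
    """Keep the top `keep_fraction` of words by score (hypothetical cut-off)."""
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:keep]


# Made-up sample documents, already tokenized into words.
docs = [
    "the food was great and the service was great".split(),
    "the flight was delayed and the seat was broken".split(),
    "the hotel was clean".split(),
]
scores = tf_idf_scores(docs)
kept = strip_lowest(scores[0], keep_fraction=0.5)
```

Words that occur in every document (here "the" and "was") get a score of zero and are the first to be stripped, which is the noise-removal effect the todo list is after.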
etiennedi added a commit that referenced this issue Apr 21, 2020
A prepare run is a run done before the individual per-item
classifications to provide an additional context. As part of this
prepare run we plan to fetch information that stays valid for the entire
run, such as target vectors. Additionally this is the place to calculate
tf-idf scores over all documents.
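
The prepare-run idea described in this commit message can be sketched as follows. This is an illustration of the control flow only, not the code from the commit: `PreparedContext`, `prepare_run`, `run` and the fake callables are hypothetical names, and the real classifier is not written in Python.

```python
from dataclasses import dataclass, field


@dataclass
class PreparedContext:
    """Data computed once and shared by every per-item classification."""
    target_vectors: dict = field(default_factory=dict)
    tf_idf: list = field(default_factory=list)


def prepare_run(items, fetch_target_vectors, calculate_tf_idf):
    """Run once, before any item is classified."""
    return PreparedContext(
        target_vectors=fetch_target_vectors(),
        tf_idf=calculate_tf_idf([item["words"] for item in items]),
    )


def run(items, fetch_target_vectors, calculate_tf_idf, classify_item):
    """Prepare once, then classify each item against the shared context."""
    ctx = prepare_run(items, fetch_target_vectors, calculate_tf_idf)
    return [classify_item(index, item, ctx) for index, item in enumerate(items)]


# Fakes standing in for the database and the scoring step (assumptions).
calls = {"fetch": 0}


def fake_fetch():
    calls["fetch"] += 1
    return {"politics": [0.1, 0.9], "sports": [0.8, 0.2]}


def fake_tf_idf(docs):
    return [{word: 1.0 for word in words} for words in docs]


def fake_classify(index, item, ctx):
    # A real classifier would compare vectors; this only shows ctx access.
    return (item["id"], sorted(ctx.target_vectors)[0], len(ctx.tf_idf[index]))


items = [
    {"id": "a", "words": ["election", "vote"]},
    {"id": "b", "words": ["goal"]},
]
results = run(items, fake_fetch, fake_tf_idf, fake_classify)
```

The point of the split is that the target vectors are fetched once per run rather than once per item, and that tf-idf, which needs to see all documents, has a natural place to be computed.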
etiennedi added a commit that referenced this issue Apr 22, 2020
etiennedi added a commit that referenced this issue Apr 22, 2020
This should be an actual batch method in the c11y. This is just a
temporary shortcut to move a bit faster. However, this is very
problematic as we still have a lot of request overhead and have no clean
way to determine word-not-found errors at the moment.

This will be addressed before completing #1125
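
To make the word-not-found concern concrete, here is a sketch of the result shape a batch lookup could expose: found vectors and missing words are returned separately, so the caller never has to infer a miss from an error. Everything here is hypothetical, including the function name and the in-memory `vocabulary`; it is not the c11y API, and a real batch method would resolve all words in a single request instead of looping.

```python
def vectors_for_words(words, lookup):
    """Resolve many words in one call and report the missing ones explicitly.

    `lookup` stands in for the vector source: it maps a word to a vector
    and returns None for unknown words.
    """
    found, missing = {}, []
    for word in dict.fromkeys(words):  # de-duplicate, keep first-seen order
        vector = lookup(word)
        if vector is None:
            missing.append(word)
        else:
            found[word] = vector
    return found, missing


# Made-up vocabulary for illustration.
vocabulary = {"car": [0.1, 0.2], "train": [0.3, 0.4]}
found, missing = vectors_for_words(["car", "zzxq", "train", "car"], vocabulary.get)
```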
etiennedi added a commit that referenced this issue Apr 23, 2020
NOTE: The parameters are currently still hard-coded, as we haven't added
validation/default setting for them yet.
etiennedi added a commit that referenced this issue Apr 23, 2020
NOTE: The parameters are all still hard-coded atm.
etiennedi added a commit that referenced this issue Apr 27, 2020
This is a rather elaborate fake for "only" a unit test, but it provides
us with a lot of confidence that the elaborate implementation for the
contextual classification is working on a unit test level as well and
not just on an e2e level.
etiennedi added a commit that referenced this issue Apr 29, 2020
etiennedi added a commit that referenced this issue Apr 29, 2020
etiennedi added a commit that referenced this issue Apr 29, 2020