Information-Gain & TF-IDF support in Contextual classification #1125

etiennedi · 2020-04-20T12:31:09Z

A prepare run is executed once, before the individual per-item classifications, to provide additional context. As part of this prepare run we plan to fetch information that stays valid for the entire run, such as target vectors. It is also the place to calculate tf-idf scores over all documents.
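To illustrate the tf-idf part of the prepare run, here is a minimal sketch in Go. It is not the implementation from the commits referenced in this issue: the function name `TfIdfScores`, the whitespace tokenization, and the exact formula (relative term frequency multiplied by `log(N / df)`) are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// TfIdfScores is a hypothetical helper (not taken from the Weaviate code base).
// For every document it returns a map from term to tf-idf score, where
//   tf  = occurrences of the term in the document / number of terms in the document
//   idf = ln(number of documents / number of documents containing the term)
func TfIdfScores(docs []string) []map[string]float64 {
	tokenized := make([][]string, len(docs))
	docFreq := map[string]int{} // in how many documents each term occurs

	for i, doc := range docs {
		// Assumption: lowercasing and splitting on whitespace is good enough here.
		tokenized[i] = strings.Fields(strings.ToLower(doc))
		seen := map[string]bool{}
		for _, term := range tokenized[i] {
			if !seen[term] {
				seen[term] = true
				docFreq[term]++
			}
		}
	}

	scores := make([]map[string]float64, len(docs))
	for i, terms := range tokenized {
		counts := map[string]int{}
		for _, term := range terms {
			counts[term]++
		}
		scores[i] = make(map[string]float64, len(counts))
		for term, count := range counts {
			tf := float64(count) / float64(len(terms))
			idf := math.Log(float64(len(docs)) / float64(docFreq[term]))
			scores[i][term] = tf * idf
		}
	}
	return scores
}

func main() {
	docs := []string{"the car is fast", "the plane is faster", "the car is red"}
	for i, s := range TfIdfScores(docs) {
		fmt.Printf("doc %d: %v\n", i, s)
	}
}
```

Because the scores depend on the whole corpus (the document frequencies), they have to be computed once over all documents before any single item is classified, which is why the prepare run is the natural place for this step. Terms that occur in every document get a score of zero under this formula.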

This should be an actual batch method in the contextionary (c11y). The current approach is just a temporary shortcut to move a bit faster. However, it is very problematic, as we still have a lot of request overhead and no clean way to determine word-not-found errors at the moment. This will be addressed before completing #1125.

NOTE: The parameters are currently still hard-coded, as we haven't added validation or default settings for them yet.

NOTE: The parameters are all still hard-coded atm.

This is a rather elaborate fake for "only" a unit test, but it gives us a lot of confidence that the complex implementation of the contextual classification works on a unit test level as well, and not just on an e2e level.

This was referenced Apr 20, 2020

DISCUSSION: In vectorization ignore words too far from specific concept #1115

Closed

Priorization Spring 2020 #1107

Closed

etiennedi added a commit that referenced this issue Apr 20, 2020

gh-1125 extend paramaters and prepare pre-run aggregations

2af4627

etiennedi added a commit that referenced this issue Apr 22, 2020

gh-1125 calculate tf-idf values as part of preparation

d865dc9

etiennedi added a commit that referenced this issue Apr 22, 2020

gh-1125 unit test tf-idf calculator

c5fe5d5

etiennedi mentioned this issue Apr 22, 2020

Provide MutliVectorForWord method weaviate/contextionary#29

Closed

etiennedi added a commit that referenced this issue Apr 22, 2020

gh-1125 calculate IG in contextual run

a801e2f

etiennedi added a commit that referenced this issue Apr 23, 2020

gh-1125 classify based on IG and TF IDF, add logging

1e97f58


etiennedi added the Classifications (ML) and Core implementation labels Apr 23, 2020

etiennedi added a commit that referenced this issue Apr 23, 2020

gh-1125 boost words by IG

e6fe30b


etiennedi added a commit that referenced this issue Apr 23, 2020

gh-1125 set defaults and use parameters

8c5487a

etiennedi added a commit that referenced this issue Apr 25, 2020

gh-1125 use latest contextionary client with MultiWfV

3d969ec

etiennedi added a commit that referenced this issue Apr 29, 2020

gh-1125 add full journey test

5121c50

etiennedi added a commit that referenced this issue Apr 29, 2020

gh-1125 update auto-generated files

ba04c7f

etiennedi added a commit that referenced this issue Apr 29, 2020

gh-1125 resolve TODOs around logging

fcb44a6

etiennedi mentioned this issue Apr 29, 2020

Feature/improved contextual classification #1131

Merged

etiennedi closed this as completed in #1131 Apr 29, 2020

This was referenced Apr 29, 2020

POC for dynamic vector noise recognition #1118

Closed

API Discussion: New Parameters for Contextual classification #1124

Closed
