
Incremental mode #3

Open
fedor57 opened this issue Sep 6, 2018 · 2 comments
fedor57 commented Sep 6, 2018

Hi, I was actually able to use the aggregator, thank you very much!

It squeezed 1.8M responses into 500K labels in 4.5 hours on 1 thread on a server, using 35 GB of memory ;) I think we can incorporate the solution, but I need to implement some enhancements to make it more useful in a production scenario. I will share my thoughts here to let you know what we think would be useful in our real situation:

  • for the MLtoRank scenario we need to constantly get new labels and aggregate new judgements. A 5-hour delay for adding an extra 1000 labels may be too long and too much electricity to burn, so we need to learn how to perform an incremental step. That may include the ability to back up all distributions and other state, prefill new cells with defaults, and perform 1-2 extra iterations.

  • the previous point may be enhanced by implementing "partial" steps that update only the rows empirically close to the changed ones. Then, after one partial step, we can perform one full step to settle down if needed.

  • we have a way to order 3 extra marks if the first 3 do not give a confident label, so we will need a confidence level for the chosen label to decide whether to make the extra order.
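A minimal sketch of the first point above, assuming the state lives in numpy arrays named as in this repo (class_marginals, error_rates, question_classes); the save/prefill helpers themselves are hypothetical, not the repo's API:

```python
import numpy as np

def save_state(path, class_marginals, error_rates, question_classes):
    # Back up all distributions and other state after a full run.
    np.savez(path, class_marginals=class_marginals,
             error_rates=error_rates, question_classes=question_classes)

def warm_start(path, n_new_questions):
    # Restore the backed-up state and prefill the rows for the newly
    # added questions with the learnt class prior, so that only 1-2
    # extra EM iterations are needed to settle them.
    state = np.load(path)
    marginals = state["class_marginals"]
    new_rows = np.tile(marginals, (n_new_questions, 1))
    question_classes = np.vstack([state["question_classes"], new_rows])
    return marginals, state["error_rates"], question_classes
```

The EM loop itself would then run as usual, starting from the restored parameters instead of a cold initialisation.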

And one extra off-topic point:

  • actually we have binary labels with an extra "grey" option. This is not truly ordinal, because the "grey" option is rare (a few percent): it is allowed in complex situations, and it can also be produced when we have many confident answers with both trues and falses. I think we can write some heuristic based on the label probability distribution, e.g. calculate P(white) * 0.3 + P(gray) + P(black) * 0.3 and compare it with P(white) and P(black).
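That heuristic could be sketched like this; the 0.3 bleed weight is the tunable assumed in the comment above, and the function name is hypothetical:

```python
def pick_label(p_white, p_gray, p_black, bleed=0.3):
    # "gray" wins when neither confident class dominates: it collects
    # its own probability plus a bleed fraction from each extreme.
    scores = {
        "white": p_white,
        "gray": p_white * bleed + p_gray + p_black * bleed,
        "black": p_black,
    }
    return max(scores, key=scores.get)
```

For a split posterior such as (0.4, 0.2, 0.4) the grey score 0.44 beats both extremes, while a confident (0.9, 0.05, 0.05) still resolves to white.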

I would be happy to hear any thoughts regarding this, thank you!

fedor57 (Author) commented Sep 6, 2018

Just to let you know: I was once involved at one of the search giants in calculating a kind of freshness PageRank over a constantly changing web graph. The algorithm accumulated weight diffs and distributed them to peers when the weight exceeded some threshold. There were also heuristics to intensify processing near new nodes with big weights.

Regarding convergence in the incremental scenario: perhaps we can back up values from previous steps and, when a value changes a lot, update the peers of the worker / task with a flag "include in the next partial iteration". Then run partial iterations, with a full one every 5 partials. I believe such a technique could produce a VERY fast Dawid-Skene algorithm implementation ;) Especially for the incremental scenario.
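The schedule described above could be sketched as follows; update_item stands in for a hypothetical per-item EM update, and the change threshold and full-pass period are assumptions to tune:

```python
import numpy as np

def incremental_loop(posteriors, update_item, n_iters=10,
                     threshold=0.01, full_every=5):
    # Items whose posterior moved more than `threshold` are flagged for
    # the next partial iteration; every `full_every`-th iteration is a
    # full pass over all items to settle things down.
    active = set(range(len(posteriors)))
    for it in range(n_iters):
        todo = range(len(posteriors)) if it % full_every == 0 else active
        next_active = set()
        for i in todo:
            old = posteriors[i].copy()
            posteriors[i] = update_item(i, posteriors)
            if np.abs(posteriors[i] - old).max() > threshold:
                next_active.add(i)  # big change: revisit next partial pass
        active = next_active
    return posteriors
```

Once an item's posterior stops moving, it drops out of the partial passes, so late iterations touch only the neighbourhood of recent changes.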

vbsinha (Collaborator) commented Oct 4, 2018

Hi,

One way to achieve the first two points would be to use an online algorithm.

  • One could use an online algorithm as described in the paper. The idea is to first estimate the true labels for all the questions available at the beginning. Then you could save the current state (class_marginals, error_rates, question_classes, counts). When you get a new batch of responses, you could load the saved variables and do an EM pass on the new batch to estimate its correct labels. This would reuse the state learnt from previous batches, which should help on the current batch. You can then save the new set of parameters and repeat. This has not yet been implemented in this code.
  • To obtain the correct response for each question, we first calculate a term proportional to the probability (confidence) that a particular label is correct, and then choose the label with the highest probability. So you can print question_classes[i, :] before this line to view the confidence that each label is correct for the i-th question.
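A minimal sketch of reading a confidence out of question_classes, as suggested in the second point; the array name follows the repo, but the helper itself is hypothetical:

```python
import numpy as np

def label_with_confidence(question_classes, i):
    # question_classes[i, :] holds (a term proportional to) the
    # probability of each label for question i; the argmax is the
    # chosen label and its entry is the confidence.
    probs = question_classes[i, :]
    label = int(np.argmax(probs))
    return label, float(probs[label])
```

If the returned confidence falls below a chosen threshold, that question becomes a candidate for ordering the 3 extra marks mentioned above.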
