Add is_scholarlyarticle feature to wikidatawiki #144
(I'm unsure whether I should add the model and model_info from the build on my local device to this commit or whether there exists a reference machine for that purpose.)
Scholarly articles have a different structure and often have few labels other than the one in the original language. This affects them more than is appropriate. Note that this effect likely cannot be seen in the current training data, which was collected at a time when there were no scholarly articles and thus contains none.
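For illustration, here is a minimal sketch of what such a check could compute. It is not the code from this pull request. It assumes the item is available as a plain dict in the standard Wikidata entity JSON layout, and the names `is_scholarly_article` and `item_with_instance_of` are hypothetical. Q13442814 ("scholarly article") and P31 ("instance of") are the Wikidata identifiers.

```python
# Hypothetical sketch; names below are illustrative, not from the PR.
SCHOLARLY_ARTICLE = "Q13442814"  # Wikidata item "scholarly article"
INSTANCE_OF = "P31"              # Wikidata property "instance of"


def is_scholarly_article(item: dict) -> bool:
    """Return True if any 'instance of' statement targets scholarly article."""
    for statement in item.get("claims", {}).get(INSTANCE_OF, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "somevalue" / "novalue" snaks carry no datavalue
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and value.get("id") == SCHOLARLY_ARTICLE:
            return True
    return False


def item_with_instance_of(qid: str) -> dict:
    """Build a tiny item dict with a single P31 statement (for demos only)."""
    return {
        "claims": {
            INSTANCE_OF: [
                {
                    "mainsnak": {
                        "snaktype": "value",
                        "property": INSTANCE_OF,
                        "datavalue": {
                            "type": "wikibase-entityid",
                            "value": {"entity-type": "item", "id": qid},
                        },
                    }
                }
            ]
        }
    }


if __name__ == "__main__":
    print(is_scholarly_article(item_with_instance_of("Q13442814")))
    print(is_scholarly_article(item_with_instance_of("Q5")))
```

A boolean like this lets the model treat the sparse labels of scholarly article items separately from sparse labels on other items, which is the motivation given above.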
Codecov Report
@@ Coverage Diff @@
## master #144 +/- ##
==========================================
+ Coverage 49.40% 49.47% +0.07%
==========================================
Files 49 49
Lines 1429 1431 +2
==========================================
+ Hits 706 708 +2
Misses 723 723
We have a reference machine to build this model on. But for now, we don't expect any improvements in model performance. I wonder if we could add some more training data to the model. Would you be willing to help recruit editors to label the quality of items? This might be a good opportunity to pull in some scholarly article items.
That work is already ongoing (https://labels.wmflabs.org/stats/wikidatawiki/95) and hopefully, it will result in better training data where we can see whether this feature makes any difference :)
Great news!
Ladsgroup
left a comment
This has my blessings. It's good to go once we have new data (it doesn't make much sense to merge it without the new data)
Let's just merge this as the model is not going to be retrained yet.
Do you folks need help building models? ores-misc-01 makes the work relatively painless. One concern with merging features like this without rebuilding is that we don't know whether they have a positive, negative, or neutral effect on model fitness. Adding features that provide no utility adds complexity. Adding features that improve signal without our knowing it might lead us to attribute that improvement to another change.