-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start the statistics item value matching #39
Conversation
Based on the current scraped data, out of ~2000 cases, it produced 74 matches that 5 of them are wrong and the rest are correct. The precision and recall is 93% and 3.4% respectively. These number will go up when we have more scraped data. Left to do: - Unit tests - Integration tests - Fix make file Bug: T248992
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ext_id_property = None | ||
for pid in potential_match['reference']['referenceMetadata']: | ||
if pid not in self.whitelisted_external_identifiers: | ||
continue | ||
ext_id_property = pid | ||
break | ||
if not ext_id_property: | ||
return [potential_match] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it will be worthwhile to change the structure of reference blob to include a pid key (same as it's supposed to include url). This has 2 advantages:
- We will not have to make this iteration every time we want to get the pid
- This class will not have to depend on th whitelisted exernal identifiers anymore, as we already filter these out in Pipe 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would love to have something like that but I would say it's not part of scope of this PR. We should file a bug, track it and pick it up soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:type items: list | ||
""" | ||
# Everything is the same, bail out. | ||
if len(list(set(items))) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use len()
directly on set()
s: https://docs.python.org/3.7/library/stdtypes.html#set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooh, thanks.
for case in list(set(items)): | ||
if (items.count(case) / len(items)) > (1 - self.maximum_noise): | ||
return case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very Clever! 💯 Nice work.
Arguments: | ||
matchers {wikidatarefisland.data_model.ValueMatchers} -- | ||
A static class with value matcher functions | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be removed, this is not a parameter of this constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ken. It's just copy pasta error.
ext_id_property = None | ||
for pid in potential_match['reference']['referenceMetadata']: | ||
if pid not in self.whitelisted_external_identifiers: | ||
continue | ||
ext_id_property = pid | ||
break | ||
if not ext_id_property: | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as the one on item_statistical_matcher_pipe.py
: #39 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I responded there, I definitely thing we should pick it up but I think it's outside of scope of this PR, we should track it and then pick it up for the next sprint. What do you think?
pid = potential_match['statement']['pid'] | ||
if pid not in self.mapping[ext_id_property]: | ||
return [] | ||
if item_id == self.mapping[ext_id_property][pid].get(str(value)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Easy breezy :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a random thought (feel free to ignore) Should we maybe use the pattern we started in Pipe 3 and use a Value class for items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, let me see what I can do about it. if it's not too complicated.
wikidatarefisland/run.py
Outdated
pipe = pipes.ItemStatisticalAnalysisPipe( | ||
whitelisted_ext_ids, | ||
config.get('minimum_repetitions_for_item_values'), | ||
config.get('maximum_noise_ratio_for_item_values') | ||
) | ||
observer_pump = pumps.ObserverPump(storage) | ||
observer_pump.run(pipe, args.input_path, '-') | ||
mapping = pipe.get_mapping() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is strange to me... we're running a pump to update the pipe class itself. I'm not fully convinced that this Analysis class really needs all of this, after all this get_mapping()
method could never really be called without the observation pump, so they don't make sense to me as separate classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see but the observer pump is stateless and makes the pipe running, otherwise the pipe itself needs to know about storage which is something I want avoid here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the observerpump need to write to disk? currently it's just reading from disk. and as this analyzer class is not really a "Pipe" per se, I don't see why it couldn't have storage as a direct dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding storage as a dependency makes the pipe to be coupled to the class and makes testing harder. Also even if this is not a "real" pipe, I would like to keep it as similar as possible to other ones for sake of consistency to make the code easier to understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coolio :) Thanks for the explanation
wikidatarefisland/run.py
Outdated
observer_pump = pumps.ObserverPump(storage) | ||
observer_pump.run(pipe, args.input_path, '-') | ||
mapping = pipe.get_mapping() | ||
pipe = pipes.ItemMappingMatcherPipe(mapping, whitelisted_ext_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reassigning the variable here can be a cause for future confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, let me fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pinging on this tiny thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why I forgot about this. Sorry :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran it and BAM! 800 matches! Fantastic work. In terms of changes, There's really one tiny nitpick, but apart from that, I think we're good to go.
Also, there is something noticed in the output, but I'm not sure if it's in the scope of this task. How do we handle matches like the one below, where the match is essentially a piece of structured data?
{
"statement": {
"pid": "P921",
"datatype": "wikibase-item",
"value": {
"entity-type": "item",
"numeric-id": 700120,
"id": "Q700120"
}
},
"itemId": "Q318030",
"reference": {
"referenceMetadata": {
"P7859": "lccn-no2003078358",
"P248": "Q76630151",
"dateRetrieved": "2020-05-13 12:47:32"
},
"extractedData": [
{
"http://schema.org/additionalName": [
"HTO",
"Haupttreuhandstelle Ost",
"Haupttreuhandstelle Ost Posen Treuhandstelle Posen",
"Deutschland (1871-1945). Haupttreuhandstelle Ost",
"Germany. Haupttreuhandstelle Ost"
],
"http://schema.org/knows": [
"http://worldcat.org/identities/np-winkler",
"http://worldcat.org/identities/lccn-n2004102147",
"http://worldcat.org/identities/lccn-no2017106555",
"http://worldcat.org/identities/viaf-35458744",
"http://worldcat.org/identities/lccn-no2008015485",
"http://worldcat.org/identities/lccn-no2019085287",
"http://worldcat.org/identities/lccn-no2017102289",
"http://worldcat.org/identities/lccn-n99260267",
"http://worldcat.org/identities/viaf-306231433",
"http://worldcat.org/identities/nc-uppsala universitet$teknisk naturvetenskapliga vetenskapsomradet"
],
"http://schema.org/name": [
"Germany Haupttreuhandstelle Ost "
],
"http://schema.org/sameAs": [
"http://en.wikipedia.org/wiki/Special:Search?search=Haupttreuhandstelle_Ost",
"http://viaf.org/viaf/139898717",
"https://www.wikidata.org/wiki/Q318030",
"http://id.loc.gov/authorities/names/no2003078358"
]
}
]
}
}
data/references.jsonl: \ | ||
data/matched_references.jsonl \ | ||
data/matched_item_references.jsonl | ||
cat $^ > $@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, thanks!
Also general question (unrelated to this PR), I noticed that make tries to run all of the scripts even if the file is already present in the data folder. Is it the normal way it should work? Am I missing a parameter here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the file exists, it should just skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"make" basically considers each step as a file. If the file exists, it considers the step as done and skip it. (The exception to this are "PHONY" steps)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly enough it doesn't work that way for me :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's strange, that's how I do it in my localhost and the labs instance (by moving files around).
Maybe that's the problem I encountered before. If A depends on B and B doesn't exists but A does. It still re-runs A (after running B) because it says "the dependency has changed" I guess it's so linuxy and systemdy :D
REFERENCE_LINE = { | ||
'statement': {}, | ||
'itemId': ITEM_ID, | ||
'reference': {'referenceMetadata': {}, | ||
'extractedData': []}} | ||
|
||
EXAMPLE_LINE = { | ||
**REFERENCE_LINE, | ||
'statement': STATEMENT_BLOB, | ||
'reference': { | ||
'referenceMetadata': {WHITELISTED_EXT_ID: 'fooid'}, | ||
'extractedData': ['foo', 'bar'] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fabulous! I'm happy to see that this file and the constants in it is also useful for other devs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing it
@pytest.mark.parametrize("mapping,given,expected", [ | ||
[given["mapping"]["simple_mapping"], given["item"]["not_item_value_type"], []], | ||
[given["mapping"]["simple_mapping"], given["item"]["one_item"], | ||
[given["item"]["one_item"]]], | ||
[given["mapping"]["non_matching_mapping_value"], given["item"]["one_item"], []], | ||
[given["mapping"]["non_matching_mapping_item"], given["item"]["one_item"], []], | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super clear! Thanks ❤️
wikidatarefisland/run.py
Outdated
observer_pump = pumps.ObserverPump(storage) | ||
observer_pump.run(pipe, args.input_path, '-') | ||
mapping = pipe.get_mapping() | ||
pipe = pipes.ItemMappingMatcherPipe(mapping, whitelisted_ext_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pinging on this tiny thing.
Currently, it turns the whole thing as string and then treat it as just a string value. It's hacky but funnily enough, it works. |
Great Work, thank you! |
|
Based on the current scraped data, out of ~2000 cases, it produced 74
matches that 5 of them are wrong and the rest are correct. The precision
and recall is 93% and 3.4% respectively. These number will go up when we
have more scraped data.
Left to do:
Bug: T248992