
Start the statistics item value matching #39

Merged: 3 commits merged into master from pipe4evar on May 13, 2020
Conversation

Ladsgroup (Contributor):

Based on the current scraped data, out of ~2000 cases, it produced 74 matches, 5 of which are wrong and the rest correct. The precision and recall are 93% and 3.4% respectively. These numbers will go up when we have more scraped data.

Left to do:

  • Unit tests
  • Integration tests
  • Fix make file

Bug: T248992

Ladsgroup changed the title from "[wIP] Start the statistics item value matching" to "[WIP] Start the statistics item value matching" on May 11, 2020
itamargiv (Member) left a comment:

All in all it looks great, a brilliant solution! I left a few comments; I have not run it locally yet, so ping me once it's out of WIP and I will run it.

Comment on lines 37 to 44
ext_id_property = None
for pid in potential_match['reference']['referenceMetadata']:
    if pid not in self.whitelisted_external_identifiers:
        continue
    ext_id_property = pid
    break
if not ext_id_property:
    return [potential_match]
itamargiv (Member):

IMHO it will be worthwhile to change the structure of the reference blob to include a pid key (the same way it's supposed to include url). This has two advantages:

  1. We will not have to do this iteration every time we want to get the pid.
  2. This class will not have to depend on the whitelisted external identifiers anymore, as we already filter these out in Pipe 1.
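
(A rough sketch of what that could look like; the 'pid' key below is hypothetical and not in the current blob, it only illustrates the suggestion:)

    # Hypothetical shape: Pipe 1 attaches the matched external-identifier
    # property directly to the reference blob it emits.
    potential_match = {
        'reference': {
            'pid': 'P7859',  # illustrative only; not produced by the current code
            'referenceMetadata': {'P7859': 'lccn-no2003078358'},
        }
    }
    # The loop above then collapses to a direct lookup, and this class no longer
    # needs to know about the whitelisted external identifiers.
    ext_id_property = potential_match['reference'].get('pid')
    assert ext_id_property == 'P7859'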

Ladsgroup (Contributor, Author):

Yes, I would love to have something like that, but I would say it's not in the scope of this PR. We should file a bug, track it, and pick it up soon.


:type items: list
"""
# Everything is the same, bail out.
if len(list(set(items))) == 1:
itamargiv (Member):

I think we can use len() directly on set()s: https://docs.python.org/3.7/library/stdtypes.html#set
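
(A tiny standalone illustration of the suggested simplification, not the project's code:)

    items = ['Q1', 'Q1', 'Q1']
    # set() already yields the distinct values, so the list() wrapper is redundant:
    assert len(list(set(items))) == 1      # current form
    assert len(set(items)) == 1            # simplified form, same result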

Ladsgroup (Contributor, Author):

oooh, thanks.

Comment on lines 90 to 92
for case in list(set(items)):
    if (items.count(case) / len(items)) > (1 - self.maximum_noise):
        return case
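
(For context, a small worked example of what this threshold does; a self-contained sketch where maximum_noise = 0.2 is an arbitrary illustrative value, not the project's configured default:)

    items = ['Q1'] * 9 + ['Q2']
    maximum_noise = 0.2
    # 'Q1' covers 9/10 = 90% of the observations, which exceeds 1 - 0.2 = 80%,
    # so it is accepted as the dominant value despite the noisy 'Q2' entry.
    dominant = next(
        (case for case in set(items)
         if items.count(case) / len(items) > (1 - maximum_noise)),
        None,
    )
    assert dominant == 'Q1'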
itamargiv (Member):

Very Clever! 💯 Nice work.

Comment on lines 10 to 13
Arguments:
matchers {wikidatarefisland.data_model.ValueMatchers} --
A static class with value matcher functions
"""
itamargiv (Member):

Needs to be removed; this is not a parameter of this constructor.

Ladsgroup (Contributor, Author):

OK, it's just a copy-paste error.

Comment on lines +30 to +37
ext_id_property = None
for pid in potential_match['reference']['referenceMetadata']:
    if pid not in self.whitelisted_external_identifiers:
        continue
    ext_id_property = pid
    break
if not ext_id_property:
    return []
itamargiv (Member), May 12, 2020:

Same comment as the one on item_statistical_matcher_pipe.py: #39 (comment)

Ladsgroup (Contributor, Author):

I responded there. I definitely think we should pick it up, but I think it's outside the scope of this PR; we should track it and then pick it up in the next sprint. What do you think?

pid = potential_match['statement']['pid']
if pid not in self.mapping[ext_id_property]:
    return []
if item_id == self.mapping[ext_id_property][pid].get(str(value)):
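
(For readers skimming the diff, the lookup above implies a nested mapping of roughly this shape; the concrete identifiers and pairings below are placeholders for illustration, not real mapping data:)

    # mapping[external_id_property][statement_pid] -> {stringified value: item id}
    mapping = {
        'P7859': {                                   # whitelisted external-identifier property
            'P921': {                                # statement property being matched
                "{'entity-type': 'item', 'numeric-id': 700120, 'id': 'Q700120'}": 'Q318030',
            },
        },
    }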
itamargiv (Member):

👍 Easy breezy :)

itamargiv (Member):

Just a random thought (feel free to ignore): should we maybe use the pattern we started in Pipe 3 and use a Value class for items?

Ladsgroup (Contributor, Author):

Yeah, let me see what I can do about it, if it's not too complicated.

Comment on lines 69 to 76
pipe = pipes.ItemStatisticalAnalysisPipe(
    whitelisted_ext_ids,
    config.get('minimum_repetitions_for_item_values'),
    config.get('maximum_noise_ratio_for_item_values')
)
observer_pump = pumps.ObserverPump(storage)
observer_pump.run(pipe, args.input_path, '-')
mapping = pipe.get_mapping()
itamargiv (Member):

This pattern is strange to me... we're running a pump to update the pipe class itself. I'm not fully convinced that this Analysis class really needs all of this; after all, this get_mapping() method could never really be called without the observation pump, so they don't make sense to me as separate classes.

Ladsgroup (Contributor, Author):

I see, but the observer pump is stateless and drives the pipe; otherwise the pipe itself needs to know about storage, which is something I want to avoid here.

itamargiv (Member):

Does the ObserverPump need to write to disk? Currently it's just reading from disk, and since this analyzer class is not really a "Pipe" per se, I don't see why it couldn't have storage as a direct dependency.

Ladsgroup (Contributor, Author):

Adding storage as a dependency couples the pipe to that class and makes testing harder. Also, even if this is not a "real" pipe, I would like to keep it as similar as possible to the other ones for the sake of consistency, to make the code easier to understand.
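
(A schematic sketch of that decoupling argument, using toy classes rather than the project's actual ObserverPump/pipe API; the flow() method name is made up for illustration:)

    class TinyAnalysisPipe:
        """Accumulates state from the lines it is fed; knows nothing about storage."""
        def __init__(self):
            self.counts = {}

        def flow(self, line):                      # hypothetical per-line entry point
            self.counts[line] = self.counts.get(line, 0) + 1

        def get_mapping(self):
            return self.counts

    class TinyObserverPump:
        """Stateless driver; only the pump knows how the input lines are read."""
        def __init__(self, read_lines):
            self.read_lines = read_lines

        def run(self, pipe):
            for line in self.read_lines():
                pipe.flow(line)

    # In a unit test the pipe can be exercised directly, with no storage involved:
    pipe = TinyAnalysisPipe()
    for line in ['Q1', 'Q1', 'Q2']:
        pipe.flow(line)
    assert pipe.get_mapping() == {'Q1': 2, 'Q2': 1}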

itamargiv (Member), May 13, 2020:

coolio :) Thanks for the explanation

observer_pump = pumps.ObserverPump(storage)
observer_pump.run(pipe, args.input_path, '-')
mapping = pipe.get_mapping()
pipe = pipes.ItemMappingMatcherPipe(mapping, whitelisted_ext_ids)
itamargiv (Member), May 12, 2020:

Reassigning the variable here can be a cause for future confusion.

Ladsgroup (Contributor, Author):

yeah, let me fix that.

itamargiv (Member):

Just pinging on this tiny thing.

Ladsgroup (Contributor, Author):

Why did I forget about this? Sorry :(

Ladsgroup changed the title from "[WIP] Start the statistics item value matching" to "Start the statistics item value matching" on May 13, 2020
itamargiv (Member) left a comment:

I ran it and BAM! 800 matches! Fantastic work. In terms of changes, there's really one tiny nitpick, but apart from that, I think we're good to go.

Also, there is something I noticed in the output, but I'm not sure if it's in the scope of this task. How do we handle matches like the one below, where the match is essentially a piece of structured data?

{
  "statement": {
    "pid": "P921",
    "datatype": "wikibase-item",
    "value": {
      "entity-type": "item",
      "numeric-id": 700120,
      "id": "Q700120"
    }
  },
  "itemId": "Q318030",
  "reference": {
    "referenceMetadata": {
      "P7859": "lccn-no2003078358",
      "P248": "Q76630151",
      "dateRetrieved": "2020-05-13 12:47:32"
    },
    "extractedData": [
      {
        "http://schema.org/additionalName": [
          "HTO",
          "Haupttreuhandstelle Ost",
          "Haupttreuhandstelle Ost Posen Treuhandstelle Posen",
          "Deutschland (1871-1945). Haupttreuhandstelle Ost",
          "Germany. Haupttreuhandstelle Ost"
        ],
        "http://schema.org/knows": [
          "http://worldcat.org/identities/np-winkler",
          "http://worldcat.org/identities/lccn-n2004102147",
          "http://worldcat.org/identities/lccn-no2017106555",
          "http://worldcat.org/identities/viaf-35458744",
          "http://worldcat.org/identities/lccn-no2008015485",
          "http://worldcat.org/identities/lccn-no2019085287",
          "http://worldcat.org/identities/lccn-no2017102289",
          "http://worldcat.org/identities/lccn-n99260267",
          "http://worldcat.org/identities/viaf-306231433",
          "http://worldcat.org/identities/nc-uppsala universitet$teknisk naturvetenskapliga vetenskapsomradet"
        ],
        "http://schema.org/name": [
          "Germany Haupttreuhandstelle Ost "
        ],
        "http://schema.org/sameAs": [
          "http://en.wikipedia.org/wiki/Special:Search?search=Haupttreuhandstelle_Ost",
          "http://viaf.org/viaf/139898717",
          "https://www.wikidata.org/wiki/Q318030",
          "http://id.loc.gov/authorities/names/no2003078358"
        ]
      }
    ]
  }
}

Comment on lines +2 to +5
data/references.jsonl: \
        data/matched_references.jsonl \
        data/matched_item_references.jsonl
	cat $^ > $@
itamargiv (Member):

Great idea, thanks!

Also, a general question (unrelated to this PR): I noticed that make tries to run all of the scripts even if the file is already present in the data folder. Is that the normal way it should work? Am I missing a parameter here?

Ladsgroup (Contributor, Author):

if the file exists, it should just skip it.

Ladsgroup (Contributor, Author):

"make" basically considers each step as a file. If the file exists, it considers the step as done and skip it. (The exception to this are "PHONY" steps)

itamargiv (Member):

Weirdly enough it doesn't work that way for me :/

Ladsgroup (Contributor, Author):

That's strange; that's how I do it on my localhost and the labs instance (by moving files around).
Maybe that's the problem I encountered before: if A depends on B, and B doesn't exist but A does, make still re-runs A (after running B) because it decides "the dependency has changed". I guess it's so linuxy and systemdy :D

Comment on lines +46 to +59
REFERENCE_LINE = {
    'statement': {},
    'itemId': ITEM_ID,
    'reference': {'referenceMetadata': {},
                  'extractedData': []}}

EXAMPLE_LINE = {
    **REFERENCE_LINE,
    'statement': STATEMENT_BLOB,
    'reference': {
        'referenceMetadata': {WHITELISTED_EXT_ID: 'fooid'},
        'extractedData': ['foo', 'bar']
    }
}
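
(In case the ** spread is unfamiliar: keys written after the unpacking override the unpacked ones, which is how EXAMPLE_LINE reuses REFERENCE_LINE while replacing 'statement' and 'reference'. A tiny standalone example:)

    base = {'a': 1, 'b': 2}
    merged = {**base, 'b': 3}      # keys listed after the unpacking win
    assert merged == {'a': 1, 'b': 3}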
itamargiv (Member):

Fabulous! I'm happy to see that this file and the constants in it are also useful for other devs.

Ladsgroup (Contributor, Author):

Thank you for doing it

Comment on lines +34 to +40
@pytest.mark.parametrize("mapping,given,expected", [
    [given["mapping"]["simple_mapping"], given["item"]["not_item_value_type"], []],
    [given["mapping"]["simple_mapping"], given["item"]["one_item"],
     [given["item"]["one_item"]]],
    [given["mapping"]["non_matching_mapping_value"], given["item"]["one_item"], []],
    [given["mapping"]["non_matching_mapping_item"], given["item"]["one_item"], []],
])
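
(For anyone less familiar with pytest: each entry in that list becomes an independent test invocation, with the three positional values bound to mapping, given and expected. A minimal standalone illustration of the same mechanism, not this test file:)

    import pytest

    @pytest.mark.parametrize("items,expected", [
        (['Q1', 'Q1'], 1),
        (['Q1', 'Q2'], 2),
    ])
    def test_distinct_count(items, expected):
        # runs once per tuple above
        assert len(set(items)) == expected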
itamargiv (Member):

Super clear! Thanks ❤️


Ladsgroup (Contributor, Author):

> How do we handle matches like the one below, where the match is essentially a piece of structured data?
> [structured-data example quoted in full above]

Currently, it turns the whole thing into a string and then treats it as just a string value. It's hacky, but funnily enough, it works.
I think this is outside the scope of this PR and ticket, because their titles are only focused on statistical matching and not any other type of matching (the "sameAs" idea is nice, we should definitely try it). I would say let's track it and then pick it up hopefully soon.
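
(For concreteness, the stopgap described here amounts to something like the following; a standalone sketch using the value from the example above, not the exact project code:)

    value = {'entity-type': 'item', 'numeric-id': 700120, 'id': 'Q700120'}
    key = str(value)
    # The structured value is flattened to its string representation and then
    # compared / looked up exactly like any plain string value would be.
    assert key == "{'entity-type': 'item', 'numeric-id': 700120, 'id': 'Q700120'}"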

itamargiv (Member):

Great work, thank you!

itamargiv merged commit 4937b2b into master on May 13, 2020
itamargiv deleted the pipe4evar branch on May 13, 2020, 18:54
Ladsgroup (Contributor, Author):


Filed: https://phabricator.wikimedia.org/T252720
