
Start the statistics item value matching #39

Merged: 3 commits merged into master from pipe4evar on May 13, 2020
Conversation

Ladsgroup (Contributor):

Based on the current scraped data, out of ~2000 cases, it produced 74 matches, 5 of which are wrong and the rest correct. The precision and recall are 93% and 3.4% respectively. These numbers will go up when we have more scraped data.

Left to do:

  • Unit tests
  • Integration tests
  • Fix make file

Bug: T248992

Ladsgroup changed the title from "[wIP] Start the statistics item value matching" to "[WIP] Start the statistics item value matching" on May 11, 2020
itamargiv (Member) left a comment:

All in all it looks great, a brilliant solution! I left a few comments; I have not run it locally yet, so ping me once it's out of WIP and I will run it.

Comment on lines 37 to 44
ext_id_property = None
for pid in potential_match['reference']['referenceMetadata']:
    if pid not in self.whitelisted_external_identifiers:
        continue
    ext_id_property = pid
    break
if not ext_id_property:
    return [potential_match]
itamargiv (Member):

IMHO it will be worthwhile to change the structure of the reference blob to include a pid key (the same way it's supposed to include url). This has two advantages:

  1. We will not have to do this iteration every time we want to get the pid.
  2. This class will not have to depend on the whitelisted external identifiers anymore, as we already filter these out in Pipe 1.
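
(A rough sketch of what that could look like; the 'pid' key below is hypothetical and not in the current blob, it only illustrates the suggestion:)

    # Hypothetical shape: Pipe 1 attaches the matched external-identifier
    # property directly to the reference blob it emits.
    potential_match = {
        'reference': {
            'pid': 'P7859',  # illustrative only; not produced by the current code
            'referenceMetadata': {'P7859': 'lccn-no2003078358'},
        }
    }
    # The loop above then collapses to a direct lookup, and this class no longer
    # needs to know about the whitelisted external identifiers.
    ext_id_property = potential_match['reference'].get('pid')
    assert ext_id_property == 'P7859'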

Ladsgroup (Contributor, Author):

Yes, I would love to have something like that, but I would say it's not in the scope of this PR. We should file a bug, track it, and pick it up soon.


:type items: list
"""
# Everything is the same, bail out.
if len(list(set(items))) == 1:
itamargiv (Member):

I think we can use len() directly on set()s: https://docs.python.org/3.7/library/stdtypes.html#set
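
(A tiny standalone illustration of the suggested simplification, not the project's code:)

    items = ['Q1', 'Q1', 'Q1']
    # set() already yields the distinct values, so the list() wrapper is redundant:
    assert len(list(set(items))) == 1      # current form
    assert len(set(items)) == 1            # simplified form, same result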

Ladsgroup (Contributor, Author):

oooh, thanks.

Comment on lines 90 to 92
for case in list(set(items)):
    if (items.count(case) / len(items)) > (1 - self.maximum_noise):
        return case
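
(For context, a small worked example of what this threshold does; a self-contained sketch where maximum_noise = 0.2 is an arbitrary illustrative value, not the project's configured default:)

    items = ['Q1'] * 9 + ['Q2']
    maximum_noise = 0.2
    # 'Q1' covers 9/10 = 90% of the observations, which exceeds 1 - 0.2 = 80%,
    # so it is accepted as the dominant value despite the noisy 'Q2' entry.
    dominant = next(
        (case for case in set(items)
         if items.count(case) / len(items) > (1 - maximum_noise)),
        None,
    )
    assert dominant == 'Q1'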
itamargiv (Member):

Very Clever! 💯 Nice work.

Comment on lines 10 to 13
Arguments:
matchers {wikidatarefisland.data_model.ValueMatchers} --
A static class with value matcher functions
"""
itamargiv (Member):

Needs to be removed; this is not a parameter of this constructor.

Ladsgroup (Contributor, Author):

OK, it's just a copy-paste error.

Comment on lines +30 to +37
ext_id_property = None
for pid in potential_match['reference']['referenceMetadata']:
    if pid not in self.whitelisted_external_identifiers:
        continue
    ext_id_property = pid
    break
if not ext_id_property:
    return []
itamargiv (Member), May 12, 2020:

Same comment as the one on item_statistical_matcher_pipe.py: #39 (comment)

Ladsgroup (Contributor, Author):

I responded there. I definitely think we should pick it up, but I think it's outside the scope of this PR; we should track it and then pick it up in the next sprint. What do you think?

pid = potential_match['statement']['pid']
if pid not in self.mapping[ext_id_property]:
    return []
if item_id == self.mapping[ext_id_property][pid].get(str(value)):
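
(For readers skimming the diff, the lookup above implies a nested mapping of roughly this shape; the concrete identifiers and pairings below are placeholders for illustration, not real mapping data:)

    # mapping[external_id_property][statement_pid] -> {stringified value: item id}
    mapping = {
        'P7859': {                                   # whitelisted external-identifier property
            'P921': {                                # statement property being matched
                "{'entity-type': 'item', 'numeric-id': 700120, 'id': 'Q700120'}": 'Q318030',
            },
        },
    }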
itamargiv (Member):

👍 Easy breezy :)

itamargiv (Member):

Just a random thought (feel free to ignore): should we maybe use the pattern we started in Pipe 3 and use a Value class for items?

Ladsgroup (Contributor, Author):

Yeah, let me see what I can do about it, if it's not too complicated.

Comment on lines 69 to 76
pipe = pipes.ItemStatisticalAnalysisPipe(
    whitelisted_ext_ids,
    config.get('minimum_repetitions_for_item_values'),
    config.get('maximum_noise_ratio_for_item_values')
)
observer_pump = pumps.ObserverPump(storage)
observer_pump.run(pipe, args.input_path, '-')
mapping = pipe.get_mapping()
itamargiv (Member):

This pattern is strange to me... we're running a pump to update the pipe class itself. I'm not fully convinced that this Analysis class really needs all of this; after all, this get_mapping() method could never really be called without the observation pump, so they don't make sense to me as separate classes.

Ladsgroup (Contributor, Author):

I see, but the observer pump is stateless and drives the pipe; otherwise the pipe itself needs to know about storage, which is something I want to avoid here.

itamargiv (Member):

Does the ObserverPump need to write to disk? Currently it's just reading from disk, and since this analyzer class is not really a "Pipe" per se, I don't see why it couldn't have storage as a direct dependency.

Ladsgroup (Contributor, Author):

Adding storage as a dependency couples the pipe to that class and makes testing harder. Also, even if this is not a "real" pipe, I would like to keep it as similar as possible to the other ones for the sake of consistency, to make the code easier to understand.
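
(A schematic sketch of that decoupling argument, using toy classes rather than the project's actual ObserverPump/pipe API; the flow() method name is made up for illustration:)

    class TinyAnalysisPipe:
        """Accumulates state from the lines it is fed; knows nothing about storage."""
        def __init__(self):
            self.counts = {}

        def flow(self, line):                      # hypothetical per-line entry point
            self.counts[line] = self.counts.get(line, 0) + 1

        def get_mapping(self):
            return self.counts

    class TinyObserverPump:
        """Stateless driver; only the pump knows how the input lines are read."""
        def __init__(self, read_lines):
            self.read_lines = read_lines

        def run(self, pipe):
            for line in self.read_lines():
                pipe.flow(line)

    # In a unit test the pipe can be exercised directly, with no storage involved:
    pipe = TinyAnalysisPipe()
    for line in ['Q1', 'Q1', 'Q2']:
        pipe.flow(line)
    assert pipe.get_mapping() == {'Q1': 2, 'Q2': 1}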

itamargiv (Member), May 13, 2020:

coolio :) Thanks for the explanation

observer_pump = pumps.ObserverPump(storage)
observer_pump.run(pipe, args.input_path, '-')
mapping = pipe.get_mapping()
pipe = pipes.ItemMappingMatcherPipe(mapping, whitelisted_ext_ids)
itamargiv (Member), May 12, 2020:

Reassigning the variable here can be a cause for future confusion.

Ladsgroup (Contributor, Author):

yeah, let me fix that.

itamargiv (Member):

Just pinging on this tiny thing.

Ladsgroup (Contributor, Author):

Why did I forget about this? Sorry :(

Ladsgroup changed the title from "[WIP] Start the statistics item value matching" to "Start the statistics item value matching" on May 13, 2020
itamargiv (Member) left a comment:

I ran it and BAM! 800 matches! Fantastic work. In terms of changes, there's really one tiny nitpick, but apart from that, I think we're good to go.

Also, there is something I noticed in the output, but I'm not sure if it's in the scope of this task. How do we handle matches like the one below, where the match is essentially a piece of structured data?

{
  "statement": {
    "pid": "P921",
    "datatype": "wikibase-item",
    "value": {
      "entity-type": "item",
      "numeric-id": 700120,
      "id": "Q700120"
    }
  },
  "itemId": "Q318030",
  "reference": {
    "referenceMetadata": {
      "P7859": "lccn-no2003078358",
      "P248": "Q76630151",
      "dateRetrieved": "2020-05-13 12:47:32"
    },
    "extractedData": [
      {
        "http://schema.org/additionalName": [
          "HTO",
          "Haupttreuhandstelle Ost",
          "Haupttreuhandstelle Ost Posen Treuhandstelle Posen",
          "Deutschland (1871-1945). Haupttreuhandstelle Ost",
          "Germany. Haupttreuhandstelle Ost"
        ],
        "http://schema.org/knows": [
          "http://worldcat.org/identities/np-winkler",
          "http://worldcat.org/identities/lccn-n2004102147",
          "http://worldcat.org/identities/lccn-no2017106555",
          "http://worldcat.org/identities/viaf-35458744",
          "http://worldcat.org/identities/lccn-no2008015485",
          "http://worldcat.org/identities/lccn-no2019085287",
          "http://worldcat.org/identities/lccn-no2017102289",
          "http://worldcat.org/identities/lccn-n99260267",
          "http://worldcat.org/identities/viaf-306231433",
          "http://worldcat.org/identities/nc-uppsala universitet$teknisk naturvetenskapliga vetenskapsomradet"
        ],
        "http://schema.org/name": [
          "Germany Haupttreuhandstelle Ost "
        ],
        "http://schema.org/sameAs": [
          "http://en.wikipedia.org/wiki/Special:Search?search=Haupttreuhandstelle_Ost",
          "http://viaf.org/viaf/139898717",
          "https://www.wikidata.org/wiki/Q318030",
          "http://id.loc.gov/authorities/names/no2003078358"
        ]
      }
    ]
  }
}

Comment on lines +2 to +5
data/references.jsonl: \
        data/matched_references.jsonl \
        data/matched_item_references.jsonl
	cat $^ > $@
itamargiv (Member):

Great idea, thanks!

Also, a general question (unrelated to this PR): I noticed that make tries to run all of the scripts even if the file is already present in the data folder. Is that the normal way it should work? Am I missing a parameter here?

Ladsgroup (Contributor, Author):

if the file exists, it should just skip it.

Ladsgroup (Contributor, Author):

"make" basically considers each step as a file. If the file exists, it considers the step as done and skip it. (The exception to this are "PHONY" steps)

itamargiv (Member):

Weirdly enough it doesn't work that way for me :/

Ladsgroup (Contributor, Author):

That's strange; that's how I do it on my localhost and the labs instance (by moving files around).
Maybe that's the problem I encountered before: if A depends on B, and B doesn't exist but A does, make still re-runs A (after running B) because it decides "the dependency has changed". I guess it's so linuxy and systemdy :D

Comment on lines +46 to +59
REFERENCE_LINE = {
    'statement': {},
    'itemId': ITEM_ID,
    'reference': {'referenceMetadata': {},
                  'extractedData': []}}

EXAMPLE_LINE = {
    **REFERENCE_LINE,
    'statement': STATEMENT_BLOB,
    'reference': {
        'referenceMetadata': {WHITELISTED_EXT_ID: 'fooid'},
        'extractedData': ['foo', 'bar']
    }
}
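
(In case the ** spread is unfamiliar: keys written after the unpacking override the unpacked ones, which is how EXAMPLE_LINE reuses REFERENCE_LINE while replacing 'statement' and 'reference'. A tiny standalone example:)

    base = {'a': 1, 'b': 2}
    merged = {**base, 'b': 3}      # keys listed after the unpacking win
    assert merged == {'a': 1, 'b': 3}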
itamargiv (Member):

Fabulous! I'm happy to see that this file and the constants in it are also useful for other devs.

Ladsgroup (Contributor, Author):

Thank you for doing it

Comment on lines +34 to +40
@pytest.mark.parametrize("mapping,given,expected", [
    [given["mapping"]["simple_mapping"], given["item"]["not_item_value_type"], []],
    [given["mapping"]["simple_mapping"], given["item"]["one_item"],
     [given["item"]["one_item"]]],
    [given["mapping"]["non_matching_mapping_value"], given["item"]["one_item"], []],
    [given["mapping"]["non_matching_mapping_item"], given["item"]["one_item"], []],
])
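
(For anyone less familiar with pytest: each entry in that list becomes an independent test invocation, with the three positional values bound to mapping, given and expected. A minimal standalone illustration of the same mechanism, not this test file:)

    import pytest

    @pytest.mark.parametrize("items,expected", [
        (['Q1', 'Q1'], 1),
        (['Q1', 'Q2'], 2),
    ])
    def test_distinct_count(items, expected):
        # runs once per tuple above
        assert len(set(items)) == expected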
itamargiv (Member):

Super clear! Thanks ❤️


Ladsgroup (Contributor, Author):

> How do we handle matches like the one below, where the match is essentially a piece of structured data?
> [structured-data example quoted in full above]

Currently, it turns the whole thing into a string and then treats it as just a string value. It's hacky, but funnily enough, it works.
I think this is outside the scope of this PR and ticket, because their titles are only focused on statistical matching and not any other type of matching (the "sameAs" idea is nice, we should definitely try it). I would say let's track it and then pick it up hopefully soon.
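
(For concreteness, the stopgap described here amounts to something like the following; a standalone sketch using the value from the example above, not the exact project code:)

    value = {'entity-type': 'item', 'numeric-id': 700120, 'id': 'Q700120'}
    key = str(value)
    # The structured value is flattened to its string representation and then
    # compared / looked up exactly like any plain string value would be.
    assert key == "{'entity-type': 'item', 'numeric-id': 700120, 'id': 'Q700120'}"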

itamargiv (Member):

Great work, thank you!

itamargiv merged commit 4937b2b into master on May 13, 2020
itamargiv deleted the pipe4evar branch on May 13, 2020, 18:54
Ladsgroup (Contributor, Author):


Filed: https://phabricator.wikimedia.org/T252720
