modified: Makefile
modified: tests/data_model/test_filters.py
modified: tests/pipes/test_item_extractor_pipe.py
modified: tests/pipes/test_scraper.py
modified: wikidatarefisland/data_model/__init__.py
modified: wikidatarefisland/data_model/filters.py
modified: wikidatarefisland/data_model/schemaorg_normalizer.py
modified: wikidatarefisland/pipes/__init__.py
modified: wikidatarefisland/pipes/item_extractor_pipe.py
modified: wikidatarefisland/pipes/scraper.py
modified: wikidatarefisland/pumps/pump.py
modified: wikidatarefisland/services/__init__.py
modified: wikidatarefisland/services/schemaorg_property_mapper.py
This way it would output non-latin text properly
tarrow left a comment
All looks good. I wondered if there is some way I should be "manually" testing that these work before I merge though?
  filtered_statements = potentially_ref_statement_filterer.filter_statements(
      all_statements)
- result_statements = map(_extract_statement, filtered_statements)
+ result_statements = reduce(_extract_statement, filtered_statements, [])
I'm not quite sure what using a reduce here does vs using a map? Can you help me understand what is the benefit?
We talked about this, and the answer is that for n filtered_statements, result_statements will now be n - m long, where m >= 0. With map, the length of result_statements always equals the length of filtered_statements.
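To make the length difference concrete, here is a minimal sketch. The two extractor functions and the `"value"` key are hypothetical stand-ins, not the project's actual `_extract_statement`; the only assumption is that the reducer appends to the accumulator only when a statement yields something usable.

```python
from functools import reduce


def extract_with_map(statement):
    # Hypothetical extractor: map always yields one output per input,
    # so an unusable statement still produces an entry (here, None).
    return statement.get("value")


def extract_with_reduce(accumulator, statement):
    # Hypothetical reducer: only append when there is something to keep,
    # so unusable statements are dropped from the result.
    value = statement.get("value")
    if value is not None:
        accumulator.append(value)
    return accumulator


filtered_statements = [{"value": "a"}, {}, {"value": "b"}]

mapped = list(map(extract_with_map, filtered_statements))
reduced = reduce(extract_with_reduce, filtered_statements, [])

print(len(filtered_statements), len(mapped), len(reduced))
```

With map, a separate filtering pass would be needed to drop the empty entries; the reduce form folds extraction and dropping into one step.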
      that are any of the passed property ids
      """
-     return lambda statement: statement.get('pid', '') not in excluded_properties
+     return lambda statement: statement.get('mainsnak', {}).get('property', '') \
This is definitely a good change. I'm still trying to work out which flavour of the dumps I saw that had a "pid" key, because the prod dumps clearly don't have one. Or did I totally make it up?
I never saw "pid" in any serialization.
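For reference, a rough sketch of how the new lookup behaves against a statement shaped like the standard Wikidata JSON serialization, where the property id sits under `mainsnak.property`. The factory name `exclude_properties` is hypothetical, and the `not in excluded_properties` tail is assumed from the old line, since the new line is cut off in the diff.

```python
def exclude_properties(excluded_properties):
    # Hypothetical factory name; returns a predicate that keeps statements
    # whose main snak property is NOT one of the excluded ids.
    return lambda statement: (
        statement.get("mainsnak", {}).get("property", "")
        not in excluded_properties
    )


# Statements in the shape used by Wikidata JSON dumps.
statements = [
    {"mainsnak": {"snaktype": "value", "property": "P31"}, "type": "statement"},
    {"mainsnak": {"snaktype": "value", "property": "P2888"}, "type": "statement"},
    {"type": "statement"},  # no mainsnak: falls back to '' and is kept
]

keep = exclude_properties(["P2888"])
kept = [s for s in statements if keep(s)]
print([s.get("mainsnak", {}).get("property") for s in kept])
```

The old `statement.get('pid', '')` lookup would have returned `''` for every statement in this shape, so nothing would ever have been excluded.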
  ))

+ @pytest.mark.skip
I generally get the idea of just skipping this test if it isn't useful but I'm a tiny bit concerned that we have to do this. I'd love to fix up the test rather than skip. To me this seems like the only test that checks the pipe obeys "the contract" from README.md
The task to track re-enabling this test is T252036 (already picked up in the sprint). We marked it as skipped because it exploded when we fixed the first pipe. Fixing it should not be too hard TBH.
Fixed Makefile dependency
Fixed issue with missing resource urls
Fix issue with null statements breaking the scraper
Fixed initial results seem to only include P2888
Fixed issue with request headers
AND
Fixed issue with normalized data structure
Fixed issue with recursive comparisons
Fixed issue with char encoding
Isort imports