Conversation
Sync with original repo
…st of entries in time frame.
…_duplicate_entries
There are supposed to be 73 fatal crashes in 2018, and without duplicates we return 72. The de-duplication changes made here appear to be correct; I think we are missing the 30th fatal crash of the year, which I cannot locate a valid link for. The order of entries returned jumps from 29 to 31. On another note, the changes I would like to make to this PR are 1) de-duplication ruins the order of cases right now, so reorder before the list is returned, and 2) to use the test scenarios instead of the test function.
…unction, test length only since order of dicts is not guaranteed.
Nicely done, but it can be even simpler! Here is what I have done:

diff --git a/scrapd/core/apd.py b/scrapd/core/apd.py
index 8848789..2d0973b 100644
--- a/scrapd/core/apd.py
+++ b/scrapd/core/apd.py
@@ -447,7 +447,7 @@ async def fetch_and_parse(session, url):
async def async_retrieve(pages=-1, from_=None, to=None):
"""Retrieve fatality data."""
- res = []
+ res = {}
page = 1
has_entries = False
no_date_within_range_count = 0
@@ -500,7 +500,8 @@ async def async_retrieve(pages=-1, from_=None, to=None):
break
# Otherwise store the results.
- res += entries_in_time_range
+ for entry in entries_in_time_range:
+ res.setdefault(entry.get(Fields.CASE), {}).update(entry)
# Stop if there is no further pages.
if not has_next(news_page):
@@ -511,6 +512,4 @@ async def async_retrieve(pages=-1, from_=None, to=None):
page += 1
- res = remove_duplicate_entries(res)
-
- return res, page
+ return list(res.values()), page

Only 4 lines changed. Let me know what you think.
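As a self-contained illustration of the diff's dict-based approach: keying results by case number and merging with `setdefault(...).update(...)` de-duplicates while letting later fields overwrite older ones. The literal `"Case"` key and sample data below are made up for illustration; scrapd would key on `Fields.CASE`.

```python
def dedup_merge_last(entries):
    """Merge entries sharing a case number; later fields take precedence."""
    res = {}
    for entry in entries:
        # setdefault creates an empty dict for a new case number, then
        # update() overlays this entry's fields, overwriting older values.
        res.setdefault(entry.get("Case"), {}).update(entry)
    return list(res.values())

entries = [
    {"Case": "18-1", "Location": "I-35"},
    {"Case": "18-2", "Location": "Lamar Blvd"},
    {"Case": "18-1", "Location": "I-35 NB", "Age": 42},
]
print(dedup_merge_last(entries))
# [{'Case': '18-1', 'Location': 'I-35 NB', 'Age': 42},
#  {'Case': '18-2', 'Location': 'Lamar Blvd'}]
```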
Thinking about it a bit more: instead, this would keep the first entry added:

for entry in entries_in_time_range:
    if entry.get(Fields.CASE) not in res:
        res[entry.get(Fields.CASE)] = entry
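For comparison, here is a runnable sketch of that keep-first policy (the `"Case"` key and sample data are illustrative, standing in for `Fields.CASE`). Note the trade-off against the merge approach in the diff above: keep-first discards any extra fields that only appear on later duplicates.

```python
def dedup_keep_first(entries):
    """Keep only the first entry seen for each case number."""
    res = {}
    for entry in entries:
        if entry.get("Case") not in res:
            res[entry.get("Case")] = entry
    return list(res.values())

entries = [
    {"Case": "18-0042", "Location": "I-35"},
    {"Case": "18-0042", "Location": "I-35 NB", "Age": 42},
]
# The second entry's "Age" field is dropped entirely; a
# setdefault(...).update(...) merge would have kept it.
print(dedup_keep_first(entries))  # [{'Case': '18-0042', 'Location': 'I-35'}]
```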
Ah, I see! If we don't break out removing duplicates as a separate function, should we just remove the test_remove_duplicate_entries test, since ensure_results will compare the entry_count over this time period anyway?
Ah yes, the test will be unused, so you can remove it as well. |
…moved extra test.
So, the flake8 linter really doesn't like async_retrieve right now lol. With the if-statement it's just too complex, so I added the entries in reverse order instead, with newer entries overwriting older ones. But this is perhaps not as clear, so we could also either ignore flake8 or break more of async_retrieve out into a separate function.
Hehe, very resourceful! I like that! 👍 However, you're making a full copy of the list in reversed order just to please flake8. This does not seem very efficient. What I would propose is:
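(The reviewer's actual proposal is not preserved in this excerpt.) One common way to satisfy flake8's complexity limit without copying the list in reverse is to move the de-duplication loop into its own small helper. The sketch below is hypothetical: the `dedup_entries` name and the literal `"Case"` key are made up, standing in for `Fields.CASE`.

```python
# Hypothetical helper: factoring the de-duplication out of async_retrieve
# keeps that coroutine's cyclomatic complexity low without reversing lists.
def dedup_entries(entries, case_field="Case"):
    """Merge entries sharing a case number; later entries overwrite older fields."""
    res = {}
    for entry in entries:
        # Later duplicates overlay earlier ones, so newer data wins.
        res.setdefault(entry.get(case_field), {}).update(entry)
    return list(res.values())

# async_retrieve could then accumulate a plain list and finish with:
#     return dedup_entries(res), page
```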
Nice job! 👍
* Added remove_duplicate_entries() function, called before returning list of entries in time frame.
* Added test to remove duplicate entries.
* Formatted remove_duplicate_entries test.
* Sort to preserve original order in de-duplication.
* Added changes to CHANGELOG.
* Changed structure of tests to align with pytest parameterize.
* Removed forgotten debugging printouts.
* Fix total number of 2018 entries, simplify remove_duplicate_entries function, test length only since order of dicts is not guaranteed.
* Fixed formatting / linting errors.
* Removed remove_duplicate_entries to contain within async_retrieve, removed extra test.
* Removed unused remove_duplicate_entries code.
* Made flake8 allow extra layer of complexity, added if-statement logic to res update.
Description
This change introduces a remove_duplicate_entries() function, which takes the final list of fatality entries as a parameter and returns the list with any entries that have duplicate case numbers removed. This relies on the case number being a unique identifier. When two entries have the same case number, they are merged into one, and for any conflicting field the value from the second (more recent) entry takes precedence.
Some of the cases appear multiple times in the current output, leading to an incorrect overall count, so this change attempts to fix that issue.
I tested it by adding a test to tests/core/test_apd (test_remove_duplicate_entries()) and by examining the output and deleted entries over several time periods. I am still working on the tests for this change and would like to ensure correctness for total fatalities in 2018.
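On the ordering concern raised earlier in the thread: since Python 3.7, dicts are guaranteed to preserve insertion order, so keying results by case number already yields entries in first-seen order without an explicit sort. A tiny illustration with made-up data:

```python
# dicts remember insertion order (guaranteed since Python 3.7), so the
# dict-based de-duplication preserves the first-seen order of case numbers
# even when a later duplicate updates an earlier entry.
res = {}
for entry in [{"Case": "B"}, {"Case": "A"}, {"Case": "B", "Age": 1}]:
    res.setdefault(entry["Case"], {}).update(entry)

print([e["Case"] for e in res.values()])  # ['B', 'A']
```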
Fixes: #48